Workflow Hacks #1 - dots. Tokyo
Transcript of Workflow Hacks #1 - dots. Tokyo
![Page 2: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/2.jpg)
Workflow Hacks! #1
![Page 3: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/3.jpg)
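Survey
• After the event, we will send a survey by email
• Questions
  • What kind of system are you currently using?
  • What problems do you want to solve with workflows?
• Respondents will be entered into a drawing for a Treasure Data hoodie!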
![Page 4: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/4.jpg)
About Me: Taro L. Saito
2007 University of Tokyo. Ph.D. XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo Genome Science Research
- Big Data Processing - Distributed Computing
2014.03~ Treasure Data, Inc. Tokyo
2015.07~ Treasure Data, Inc. Mountain View, CA
![Page 5: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/5.jpg)
![Page 6: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/6.jpg)
![Page 7: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/7.jpg)
![Page 8: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/8.jpg)
Cloud Platform for Data Analytics
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
• 50,000~ queries / day
• Processing 10 trillion records / day
• http://qiita.com/xerial/items/a9093b60062f2c613fda
[Figure: data platform diagram with the labels Enterprise Data, Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, BI]
![Page 9: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/9.jpg)
Workflow Fundamental Features
• Dependency management
  • task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
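Dependency management, state management, and error handling can be illustrated with a minimal sketch. It assumes a hypothetical three-task chain and uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the `run` helper and the state strings are illustrative names, not the API of any workflow tool listed in these slides.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task2"},
}

def execution_order(deps):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def run(deps, actions):
    """Run each task after its dependencies and record its state."""
    state = {}
    for name in execution_order(deps):
        # Skip a task when any upstream task did not succeed.
        if any(state.get(d) != "success" for d in deps[name]):
            state[name] = "skipped"
            continue
        try:
            actions[name]()
            state[name] = "success"
        except Exception as e:  # error handling: record the failure, keep going
            state[name] = f"failed: {e}"
    return state

def fail():
    raise RuntimeError("boom")

actions = {"task1": lambda: None, "task2": fail, "task3": lambda: None}
print(execution_order(dependencies))  # ['task1', 'task2', 'task3']
print(run(dependencies, actions))
```

A real workflow manager would persist the state and add scheduling, logs, and notification on top of this loop.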
![Page 10: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/10.jpg)
Workflow Tools
• Workflow Management Tools
  • Python: Luigi, Airflow, Pinball
  • For Hadoop: Oozie (XML)
  • Script-based: Makefile, Azkaban
  • Biological Science: Galaxy (Web UI), Nextflow
  • Domestic (Japan): JP1, Hinemos
• Dataflow DSL
  • Spark, Flink, DryadLINQ, TensorFlow
  • Cascading (Java -> MR), Scalding (Scala -> MR)
![Page 11: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/11.jpg)
Dataflow DSL
• Translate this data processing program
• into a cluster computing program
[Figure: dataflow A -> B -> C with functions f and g, and its partitioned form (A0, A1, A2; B0, B1, B2; C) executed as map and reduce stages]
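The translation from a logical dataflow to a partitioned program can be sketched as follows. The functions `f` and `g` here are hypothetical stand-ins (squaring and summing), and the sketch assumes `g` is associative with identity 0, which is what lets partial results from each partition be merged.

```python
from functools import reduce

def f(x):
    """Per-record transformation (A -> B); a hypothetical example."""
    return x * x

def g(acc, y):
    """Associative aggregation (B -> C); a hypothetical example."""
    return acc + y

def run_single(a):
    """The logical program: C = reduce(g, map(f, A))."""
    return reduce(g, map(f, a), 0)

def run_partitioned(partitions):
    """The same program over partitions A0, A1, ...:
    apply f and reduce within each partition, then merge the partial results."""
    partials = [reduce(g, map(f, part), 0) for part in partitions]
    return reduce(g, partials, 0)

a = list(range(10))
partitions = [a[0:3], a[3:7], a[7:]]
print(run_single(a), run_partitioned(partitions))  # 285 285
```

In a cluster setting each partition would be processed on a different node; the list comprehension stands in for that parallel step.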
![Page 12: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/12.jpg)
Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engine, by Peter Bailis
  • http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
  • Most influential interface for dataflow DSL
  • SQL-like operation
  • Functional style
• Spark
  • SparkSQL
    • 70% of Spark accesses
  • Dataset API
    • Shift to the dataframe based API
![Page 13: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/13.jpg)
Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
  • Mapping SQL stages into MapReduce program
  • SELECT page, count(*) FROM weblog GROUP BY page
[Figure: HDFS -> split -> map (A0, A1, A2) -> reduce (B0, B1, B2, B3) -> merge -> HDFS, annotated with the stages TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
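The stages of this query can be mimicked in plain Python. This is only an illustration of the map, shuffle, and reduce steps on in-memory lists, not how Hive is implemented; the function names and the sample rows are assumptions.

```python
from collections import defaultdict

def table_scan(split):
    """Map stage: emit (page, 1) for every weblog row in one input split."""
    return [(row["page"], 1) for row in split]

def shuffle(mapped, num_reducers):
    """GroupBy(hash(page)): route each key to a reducer by its hash."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for page, one in pairs:
            buckets[hash(page) % num_reducers][page].append(one)
    return buckets

def reduce_counts(bucket):
    """Reduce stage: count(*) per page."""
    return {page: sum(ones) for page, ones in bucket.items()}

def group_by_count(splits, num_reducers=2):
    mapped = [table_scan(s) for s in splits]
    result = {}
    for bucket in shuffle(mapped, num_reducers):
        # Merge: each page lands in exactly one bucket, so keys never collide.
        result.update(reduce_counts(bucket))
    return result

# Hypothetical weblog table stored as two splits.
splits = [
    [{"page": "/a"}, {"page": "/b"}, {"page": "/a"}],
    [{"page": "/c"}, {"page": "/a"}, {"page": "/c"}],
]
print(group_by_count(splits))
```

The final counts do not depend on the number of reducers, only on which rows share a page.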
![Page 14: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/14.jpg)
Workflows
[Figure: workflow graph over nodes A, B, C, D, E, F, G with functions f and g]
![Page 15: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/15.jpg)
Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
  • continuous processing
  • independent scheduling
• Incremental processing
  • Google Percolator [OSDI 2010]
  • Naiad - Differential Dataflow, Microsoft [SOSP 2013]
![Page 16: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/16.jpg)
Continuous Processing
• The Dataflow Model
  • Akidau et al., Google [VLDB 2015]
• Unbounded data processing
  • late-coming data
• Integration of
  • batch processing
  • accumulation
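A small sketch can show how late-coming data and accumulation fit together. It assumes fixed 60-second event-time windows and a hypothetical `WindowedCounter` class; it is not the API of the Dataflow Model paper or of any streaming engine.

```python
from collections import defaultdict

WINDOW_SEC = 60  # hypothetical fixed window size

def window_start(event_time):
    """Map an event time (in seconds) to the start of its window."""
    return event_time - event_time % WINDOW_SEC

class WindowedCounter:
    """Counts events per event-time window, accumulating late arrivals."""
    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, event_time):
        """Assign by event time, not arrival time, and return the updated count."""
        w = window_start(event_time)
        self.counts[w] += 1
        return w, self.counts[w]

# Arrival order; the event at t=20 arrives late, after the window [0, 60) has passed.
arrivals = [5, 30, 65, 20, 70]
counter = WindowedCounter()
updates = [counter.add(t) for t in arrivals]
print(updates)  # [(0, 1), (0, 2), (60, 1), (0, 3), (60, 2)]

# Batch processing over the same data, sorted by event time, gives the same counts.
batch = WindowedCounter()
for t in sorted(arrivals):
    batch.add(t)
```

Because results are keyed by event time and refined as data arrives, the streaming and batch runs converge to the same counts.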
![Page 17: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/17.jpg)
Cluster Computing with Dryad, M. Budiu, 2008
![Page 18: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/18.jpg)
Cluster Computing with Dryad, M. Budiu, 2008
Workflow Hacks!
![Page 19: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/19.jpg)
Airflow
![Page 20: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/20.jpg)
Airflow
• Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
  • At Silicon Valley Data Engineering Meetup
  • https://youtu.be/dgaoqOZlvEA
![Page 21: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/21.jpg)
Workflow Development
• Programmatic
  • Generate workflows by code
  • Configuration as Code
• Workflow reuse/overwrite
  • object oriented
• Parameterization
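Generating workflows by code, with parameterization, might look like the sketch below. The task names and the dictionary layout are hypothetical and do not correspond to the format of Airflow, Luigi, or any other tool; the point is that a loop over a parameter (here, a date) produces the task graph.

```python
from datetime import date, timedelta

def daily_tasks(day):
    """Build the task list for one day; the task names are hypothetical."""
    d = day.isoformat()
    return [
        {"name": f"import_{d}", "depends_on": []},
        {"name": f"aggregate_{d}", "depends_on": [f"import_{d}"]},
        {"name": f"export_{d}", "depends_on": [f"aggregate_{d}"]},
    ]

def build_workflow(start, days):
    """Generate a backfill workflow covering `days` consecutive days."""
    tasks = []
    for i in range(days):
        tasks.extend(daily_tasks(start + timedelta(days=i)))
    return tasks

workflow = build_workflow(date(2016, 1, 1), 3)
for task in workflow:
    print(task["name"], "<-", task["depends_on"])
```

Changing the start date or the number of days regenerates the whole workflow without editing any task definition by hand.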
![Page 22: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/22.jpg)
Luigi
• Workflow Management with Luigi (Qiita article, in Japanese)
  • http://qiita.com/k24d/items/fb9bed08423e6249d376
![Page 23: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/23.jpg)
Nextflow
• http://www.nextflow.io/
![Page 24: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/24.jpg)
Dataflow DSL vs Workflow DSL
• Dataflow
  • A -> B -> C -> …
  • Data dependencies
• Workflow
  • Task A -> Task B -> Task C -> …
  • Task dependencies
  • Data transfer is optional (through file or DB)
  • + Scheduling
  • + Task names
    • For monitoring, redo, etc.
![Page 25: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/25.jpg)
Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
  • Workflow reuse, extension, override
  • Parameterization
  • Function := Task, Workflow := Class
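The idea "Function := Task, Workflow := Class" can be illustrated with an analogy in Python. This is not wvlet's Scala API; the class and method names below are hypothetical and only show how inheritance gives reuse, extension, and override.

```python
class DailyReport:
    """A workflow as a class; each method is a task. Names are hypothetical."""
    def __init__(self, table):
        self.table = table  # parameterization

    def extract(self):
        return f"SELECT * FROM {self.table}"

    def summarize(self):
        # Calling another method expresses a task dependency.
        return f"summary of ({self.extract()})"

    def run(self):
        return self.summarize()

class FilteredReport(DailyReport):
    """Reuse the whole workflow and override a single task."""
    def extract(self):
        return f"SELECT * FROM {self.table} WHERE status = 200"

print(DailyReport("weblog").run())
print(FilteredReport("weblog").run())
```

Only `extract` changes in the subclass; `summarize` and `run` are inherited as-is.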
![Page 26: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/26.jpg)
Isolating DAG generation and its execution
• Alternatives to MR
  • Tez
  • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
  • Asakusa on Hadoop, Spark
[Figure: DSL generates DAG; the DAG runs on Local, Hadoop, or Spark to produce Result]
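A toy version of this separation is sketched below, assuming a hypothetical `Plan` class that only records operators, plus two interchangeable backends. A real system would hand the same plan to Hadoop or Spark; here a local interpreter and an "explain" backend stand in for them.

```python
class Plan:
    """A linear DAG of operators built by a tiny DSL; nothing runs at build time."""
    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def map(self, fn):
        return Plan(self.ops + (("map", fn),))

    def filter(self, fn):
        return Plan(self.ops + (("filter", fn),))

def run_local(plan, data):
    """Local backend: interpret the plan over an in-memory list."""
    for kind, fn in plan.ops:
        if kind == "map":
            data = [fn(x) for x in data]
        else:
            data = [x for x in data if fn(x)]
    return data

def explain(plan):
    """Another backend: describe the plan instead of executing it."""
    return " -> ".join(kind for kind, _ in plan.ops)

plan = Plan().map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
print(explain(plan))                  # map -> filter
print(run_local(plan, [1, 2, 3, 4]))  # [2, 4]
```

Because the plan is plain data, the DSL never needs to know which backend will execute it.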
![Page 27: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/27.jpg)
Stream DSL
• Add "moving stream" support to Dataflow DSL
  • "moving" streams and "resting" datasets
• Example
  • Spark Streaming
    • Spark DSL + Micro-batch for stream
  • Microsoft Azure Stream SQL
    • Windowing support for moving data
  • Norikra
    • Stream processing with SQL
• Reactive programming
  • ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
  • Back-pressure support
    • Controlling data transfer speed from receiver side
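Back-pressure can be approximated with a bounded blocking queue: when the receiver falls behind, the buffer fills and the sender is forced to wait. The sketch below uses only `queue` and `threading` from the Python standard library; it is an analogy, not the ReactiveX or Akka Streams API, and `run_stream` and `capacity` are hypothetical names.

```python
import queue
import threading

def run_stream(items, capacity=2):
    """Send items through a bounded buffer; a full buffer blocks the producer."""
    buf = queue.Queue(maxsize=capacity)  # capacity limits in-flight items
    received = []
    max_seen = 0
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in items:
            buf.put(item)  # blocks while the buffer is full (back-pressure)
        buf.put(done)

    t = threading.Thread(target=producer)
    t.start()
    while True:
        max_seen = max(max_seen, buf.qsize())
        item = buf.get()  # the receiver pulls at its own pace
        if item is done:
            break
        received.append(item)
    t.join()
    return received, max_seen

received, max_seen = run_stream(list(range(100)))
print(len(received), max_seen <= 2)  # 100 True
```

The producer never gets more than `capacity` items ahead of the receiver, so the receiver's pace controls the transfer speed.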
![Page 28: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/28.jpg)
Task Execution Retry
• Design Patterns for Retries and Idempotency (blog post, in Japanese)
  • http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
  • Process is not responding
    • network, hardware failures
  • Middleware failures
    • provisioning failures, missing components
• User failures
  • Wrong configuration
  • Programming error
![Page 29: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/29.jpg)
Retry Example
• Example: Task calling a REST API /create/xxx
  • Client: First attempt
    • Server returns 200 Success
    • But the client fails to receive the status code
  • Client retries the task
    • Gets a 409 Conflict error (entry xxx is already created)
• Solution (Application side)
  • Handle 409 error as success in the client (idempotent execution)
  • More strict approach
    • Making xxx unique for each request
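The scenario above can be reproduced with an in-memory stand-in for the server. `FakeServer` and `create_idempotent` are hypothetical names, and the lost response is simulated with a `ConnectionError`; no real HTTP library is involved.

```python
class FakeServer:
    """In-memory stand-in for the REST API; not a real HTTP server."""
    def __init__(self):
        self.entries = set()
        self.drop_next_response = True  # simulate one lost response

    def create(self, name):
        if name in self.entries:
            return 409  # Conflict: the entry already exists
        self.entries.add(name)
        if self.drop_next_response:
            self.drop_next_response = False
            raise ConnectionError("response lost after the entry was created")
        return 200

def create_idempotent(server, name, max_attempts=3):
    """Retry on connection errors and treat 409 as success."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = server.create(name)
        except ConnectionError:
            continue  # outcome unknown, so retry
        if status in (200, 409):
            return status, attempt
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after retries")

server = FakeServer()
print(create_idempotent(server, "xxx"))  # (409, 2)
print(create_idempotent(server, "yyy"))  # (200, 1)
```

Treating 409 as success is safe only if the existing entry is the one this task meant to create, which is why the stricter approach makes xxx unique for each request.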
![Page 30: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/30.jpg)
Fault Tolerance
• Presto: Distributed query engine developed by Facebook
  • Uses HTTP data transfer
  • No fault-tolerance
• 99.5% of queries finish without any failure
• For queries processing 10 billion or more rows => Drops to 85%
[Figure: query stages TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result, drawn as split, map (A0, A1, A2), reduce (B0, B1, B2, B3), and merge]
![Page 31: Workflow Hacks #1 - dots. Tokyo](https://reader031.fdocuments.net/reader031/viewer/2022030305/587323b81a28ab673e8b7f4d/html5/thumbnails/31.jpg)
Summary
• Recent workflow tools
  • Driven by Python community
    • Because of this book! (=>)
  • Airflow, Luigi, etc.
• Workflow manager
  • Handle system failures, monitoring
• Workflow development
  • DAG based DSL (dataflow, workflow, stream processing) -> Execution
  • Does not cover application logic errors
    • Idempotent execution
    • Requires splitting large tasks into smaller ones