Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh...
Transcript of Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh...
![Page 1: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/1.jpg)
Wrangler Predictable and Faster Jobs using
Fewer Resources!
Neeraja J. Yadwadkar, ���Ganesh Ananthanarayanan and Randy Katz
1
![Page 2: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/2.jpg)
Parallel Data Analytics!Job completed
Job queue
Master
Slaves 2
![Page 3: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/3.jpg)
Stragglers!Job completed
Job queue
Master
Slaves 3
![Page 4: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/4.jpg)
Defining Stragglers!
Task Execution Time
Tasks of a job
4
![Page 5: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/5.jpg)
Impact of Stragglers!Presence of Stragglers in real-world production level traces*:
Workload Stragglers
Facebook 2009 (FB2009)
Facebook 2010 (FB2010)
Cloudera’s Customer b (CC_b)
Cloudera’s Customer e (CC_e)
24 %
23 %
28 %
22 %
When replayed using SWIM+ on a 50 node EC2 cluster….
*captured for over 6 months from about 4000 machines in total
Threshold = 1.3 * median
+Chen Y., et al., The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS’11 5
![Page 6: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/6.jpg)
Impact of Stragglers!
Dolly, NSDI’13 6
![Page 7: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/7.jpg)
Speculative Execution!
Job queue
Replicating
TS: In progress
Job completed
TS !
TS !
TS !
7
![Page 8: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/8.jpg)
Existing Approaches
8
Wait
Replicate
Wasted Time in detecting stragglers
Wasted Resources LATE
OSDI’08
Speculative Execution
OSDI’04
Mantri OSDI’10 Dolly
NSDI’13
Wrangler
![Page 9: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/9.jpg)
Design Goals!
1. Identify stragglers as early as possible
2. Schedule tasks for improved job finishing times
1. To avoiding creation of stragglers 2. To avoid replication
9
Avoid Wasting
Resources
Avoid Wasting Time in detecting stragglers
![Page 10: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/10.jpg)
Defer potential stragglers!
Job Submitted Map finished Map finished Job finished Job finished
Net Gain
Map1
Map2
Map3
Reduce Deferred
Assignments
Time
10
![Page 11: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/11.jpg)
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
“Input” Features
Scheduling Decisions
11
![Page 12: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/12.jpg)
Selecting “Input” Features!
MapReduce: Dean, et. al, OSDI’04
LATE: Zaharia, et. al, OSDI’08
Mantri: Ananthanarayanan, et. al, OSDI’10
CPU - cpu_idle - cpu_system - cpu_user - …
Network - bytes_in - bytes_out - …
Memory - mem_buffers - mem_cached - mem_free - mem_shared - mem_total - …
Disk - swap_free - swap_total - disk_free - disk_total - …
Other - Jvm. memHeapUsedM - Jvm. threadsBlocked - Jvm. gcTimeMillis - … - datanode.blocks_written - … - rpc.callQueueLen - rpc.OpenConnections - …
12
![Page 13: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/13.jpg)
Selecting “Input” Features!
Key observation using Subset selection methods
Features of importance vary across nodes and across time!
• Complex task-to-node interactions • Complex task-to-task interactions • Heterogeneous clusters • Heterogeneous task requirements
Why? 13
![Page 14: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/14.jpg)
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
“Input” Features
Scheduling Decisions
14
![Page 15: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/15.jpg)
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
Scheduling Decisions
Utilization Counters
15
![Page 16: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/16.jpg)
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
Scheduling Decisions
Utilization Counters
16
![Page 17: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/17.jpg)
Unknown and dynamic causes behind stragglers
A Challenge in Model Building !
17
Desired: An ability of predicting stragglers without needing domain-specific knowledge
Approach: Classification! Techniques that learn to adapt to changing
causes automatically
![Page 18: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/18.jpg)
Classification for Predicting Stragglers!
18
{Utilization Counters} Straggler Non-Straggler (~100)
Predict:
Build a model:
{Utilization Counters, Straggler/Non-Straggler} Learning Classifier
Classifier
For every node in a cluster,
![Page 19: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/19.jpg)
Classification Technique: SVM
19
Non-Stragglers
Stragglers
feature1
feat
ure 2
Separating Hyper-plane
![Page 20: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/20.jpg)
Classification Accuracy using SVM!
20
0
20
40
60
80
100
120
FB2009 FB2010 CC_b CC_e
Pre
dict
ion
Acc
urac
y (%
) Stragglers (True Positive) Non-Stragglers (True Negative)
Is this accuracy good enough?
![Page 21: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/21.jpg)
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
Utilization Counters
Scheduling Decisions
Model Builder
Predictive Scheduler
21
![Page 22: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/22.jpg)
Predictive Scheduler (Naïve)!
22
Defer scheduling a task on a node that is predicted to create a
straggler
![Page 23: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/23.jpg)
Intuition…. !
Time
Resource Usage
Key: Better load-balancing
1. Improved job completion
2. Reduced resource consumption
23
![Page 24: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/24.jpg)
Does the Naïve approach improve job completion?!
!4.22% !5.53% !5.39%
11.34%
!5.51%
43.59%
58.87% 62.10%
43.13%
22.46%
!10%0%10%20%30%40%50%60%70%80%90%
100%
avg% 95p% 97p% 99p% 99.9p%
Percen
tage%Red
uc;o
n%
Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%
Baseline: Speculative Execution
Workload: CC_b
24
![Page 25: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/25.jpg)
Utilization Counters
25
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
Scheduling Decisions
Confident? Yes
![Page 26: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/26.jpg)
Utilization Counters
26
Model Builder
Predictive Scheduler
Worker
Master
Worker 1
Workers
Our proposal: Wrangler
Scheduling Decisions
Confident? Yes
Only confident predictions influence scheduling decisions
![Page 27: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/27.jpg)
Confidence Measure
Non-Stragglers
Stragglers
feature1
feat
ure 2
Separating Hyper-plane
Corresponds to P: confidence
threshold 27
Low-confidence zone
![Page 28: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/28.jpg)
p: Confidence Threshold!
• Value calculated via cross-validation
• Verified via empirical sensitivity analysis
(refer to paper for details)
28
![Page 29: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/29.jpg)
Evaluation!
• Aim I: Does Wrangler Improve Job Completion Times?
• Aim II: Does Wrangler Reduce Resources Consumed?
29
![Page 30: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/30.jpg)
Aim 1: Does Wrangler Improve Job Completion Times?!
!4.22% !5.53% !5.39%
11.34%
!5.51%
43.59%
58.87% 62.10%
43.13%
22.46%
!10%0%10%20%30%40%50%60%70%80%90%
100%
avg% 95p% 97p% 99p% 99.9p%
Percen
tage%Red
uc;o
n%
Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%
Workload: CC_b Baseline: Speculative Execution
Confidence measure is the key! 30
![Page 31: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/31.jpg)
Aim II: Does Wrangler Reduce Resources Consumed?!
55.09
24.77
40.15
8.24
0 10 20 30 40 50 60 70 80 90
100
FB2009 FB2010 CC_b CC_e
Percen
tage Red
uc;o
n in
Total Task-‐Second
s
Baseline: Speculative Execution 31
![Page 32: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/32.jpg)
Load-Balancing with Wrangler!
32
Few highly loaded nodes
Workload: FB2010
![Page 33: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/33.jpg)
Sophisticated schedulers exist….why Wrangler?!
• Difficult to – An9cipate the dynamically changing causes – Deal with over 100 resource u9liza9on counters – Build a generic and unbiased scheduler
Wrangler statistically learns to
predict stragglers without needing domain-specific knowledge
33
![Page 34: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/34.jpg)
Status and Future Directions!
• Status: – Improving accuracy using lesser training data – Improving job completions further
• Future Directions: – Generalization to other problems in cluster
computing environments • Fault Detection • Resource Allocation
34
![Page 35: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"](https://reader037.fdocuments.net/reader037/viewer/2022100309/6073d9dc854bfc010a423608/html5/thumbnails/35.jpg)
Conclusions!• ML techniques enable early prediction of tasks
that are likely to straggle. • Use of confidence measure makes straggler
predictions useful
• Wrangler: – Enables faster jobs – Uses fewer resources
Contact: [email protected] 35