Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh...

Post on 27-Oct-2020


Wrangler: Predictable and Faster Jobs using Fewer Resources

Neeraja J. Yadwadkar, Ganesh Ananthanarayanan and Randy Katz

Parallel Data Analytics

[Figure: job queue, master, and slaves; job completed]

Stragglers

[Figure: job queue, master, and slaves; job completed]

Defining Stragglers

[Figure: task execution time for the tasks of a job]

Impact of Stragglers

Presence of stragglers in real-world production-level traces*, when replayed using SWIM+ on a 50-node EC2 cluster:

Workload                        Stragglers
Facebook 2009 (FB2009)          24%
Facebook 2010 (FB2010)          23%
Cloudera’s Customer b (CC_b)    28%
Cloudera’s Customer e (CC_e)    22%

Threshold = 1.3 * median

*Captured for over 6 months from about 4000 machines in total.

+Chen Y., et al., The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS’11
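The threshold rule can be illustrated with a minimal sketch. It assumes the median is taken over the execution times of the tasks of a single job and that a task counts as a straggler when it runs strictly longer than the threshold; the slide states neither detail, and the function names and sample times are hypothetical.

```python
from statistics import median

STRAGGLER_FACTOR = 1.3  # from the slide: threshold = 1.3 * median


def label_stragglers(task_times):
    """Return one boolean per task: True if it ran longer than 1.3x the median.

    Assumption: the median is computed over the tasks of one job.
    """
    threshold = STRAGGLER_FACTOR * median(task_times)
    return [t > threshold for t in task_times]


def straggler_fraction(task_times):
    """Fraction of the job's tasks that are labeled stragglers."""
    labels = label_stragglers(task_times)
    return sum(labels) / len(labels)


# Hypothetical execution times (seconds) for the five tasks of one job.
times = [10.0, 11.0, 9.0, 10.0, 30.0]
labels = label_stragglers(times)
```

Here the median is 10 seconds, so the threshold is 13 seconds and only the 30-second task is labeled a straggler.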

Impact of Stragglers

Dolly, NSDI’13

Speculative Execution

[Figure: job queue; replicating; TS: in progress; job completed]

Existing Approaches

Wait: wasted time in detecting stragglers
Replicate: wasted resources

Speculative Execution (OSDI’04), LATE (OSDI’08), Mantri (OSDI’10), Dolly (NSDI’13), Wrangler

Design Goals

1.  Identify stragglers as early as possible
2.  Schedule tasks for improved job finishing times
    1.  To avoid creating stragglers
    2.  To avoid replication

Avoid wasting time in detecting stragglers
Avoid wasting resources

Defer potential stragglers

[Figure: timeline of Map1, Map2, Map3, and Reduce from job submitted through map finished to job finished, with and without deferred assignments; the difference in finish time is the net gain]

Our proposal: Wrangler

[Figure: Wrangler architecture: master, workers, Model Builder, Predictive Scheduler, “input” features, scheduling decisions]

Selecting “Input” Features

MapReduce: Dean et al., OSDI’04
LATE: Zaharia et al., OSDI’08
Mantri: Ananthanarayanan et al., OSDI’10

CPU: cpu_idle, cpu_system, cpu_user, …
Network: bytes_in, bytes_out, …
Memory: mem_buffers, mem_cached, mem_free, mem_shared, mem_total, …
Disk: swap_free, swap_total, disk_free, disk_total, …
Other: Jvm. memHeapUsedM, Jvm. threadsBlocked, Jvm. gcTimeMillis, …, datanode.blocks_written, …, rpc.callQueueLen, rpc.OpenConnections, …

Selecting “Input” Features

Key observation using subset selection methods: features of importance vary across nodes and across time.

Why?
•  Complex task-to-node interactions
•  Complex task-to-task interactions
•  Heterogeneous clusters
•  Heterogeneous task requirements


Our proposal: Wrangler

[Figure: Wrangler architecture: master, workers, Model Builder, Predictive Scheduler, utilization counters, scheduling decisions]


A Challenge in Model Building

Unknown and dynamic causes behind stragglers

Desired: the ability to predict stragglers without needing domain-specific knowledge

Approach: Classification. Techniques that learn to adapt to changing causes automatically

Classification for Predicting Stragglers

For every node in a cluster:

Build a model: {Utilization Counters, Straggler/Non-Straggler} → Learning → Classifier

Predict: {Utilization Counters (~100)} → Classifier → Straggler / Non-Straggler
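As a minimal sketch of the "build a model" input, the snippet below groups labeled observations by node so that each node gets its own training set. The tuple layout, the node names, and the +1/-1 label encoding are assumptions made for illustration; the counter names are taken from the feature list earlier in the deck, and real feature vectors would hold on the order of 100 counters.

```python
from collections import defaultdict


def build_training_sets(observations):
    """Group (counters, label) examples by node, one training set per node.

    Each observation is (node, counters, is_straggler), where counters is a
    dict of utilization counters sampled on that node (hypothetical layout).
    Labels are encoded as +1 for straggler and -1 for non-straggler.
    """
    per_node = defaultdict(list)
    for node, counters, is_straggler in observations:
        label = 1 if is_straggler else -1
        per_node[node].append((counters, label))
    return dict(per_node)


# Hypothetical observations from two workers.
observations = [
    ("worker-1", {"cpu_idle": 5.0, "mem_free": 1.0e8}, True),
    ("worker-1", {"cpu_idle": 80.0, "mem_free": 6.0e9}, False),
    ("worker-2", {"cpu_idle": 60.0, "mem_free": 4.0e9}, False),
]
training_sets = build_training_sets(observations)
```

Keeping one training set per node matches the slide's "for every node in a cluster" and the earlier observation that important features vary across nodes.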

Classification Technique: SVM

[Figure: stragglers and non-stragglers plotted in a two-feature space (feature 1, feature 2), divided by a separating hyperplane]
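The separating hyperplane in the figure can be read as a linear decision rule. The sketch below assumes a linear SVM that has already been trained; the weights, bias, and feature values are hypothetical, and the slides do not say which kernel is used.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights, bias, features):
    """Classify a feature vector as 'straggler' or 'non-straggler'."""
    if decision_value(weights, bias, features) > 0:
        return "straggler"
    return "non-straggler"


# Hypothetical model over two features scaled to [0, 1].
weights = [2.0, 1.0]
bias = -1.5

busy_node = [0.9, 0.8]   # lands on the straggler side
idle_node = [0.1, 0.2]   # lands on the non-straggler side
```

The magnitude of the decision value grows with distance from the hyperplane, which is what a confidence measure can build on.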

Classification Accuracy using SVM

[Figure: bar chart of prediction accuracy (%) for FB2009, FB2010, CC_b, and CC_e; series: Stragglers (True Positive) and Non-Stragglers (True Negative)]
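The two series in the chart are per-class accuracies. A minimal sketch of how such numbers can be computed from labeled predictions follows; the function name and the sample labels are hypothetical.

```python
def per_class_accuracy(actual, predicted):
    """Return (true positive %, true negative %) for boolean straggler labels.

    True positive %: share of actual stragglers predicted as stragglers.
    True negative %: share of actual non-stragglers predicted as non-stragglers.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one straggler and one non-straggler")
    true_pos = sum(a and p for a, p in zip(actual, predicted))
    true_neg = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return 100.0 * true_pos / positives, 100.0 * true_neg / negatives


# Hypothetical ground truth and predictions for six tasks.
actual = [True, True, False, False, False, False]
predicted = [True, False, False, False, False, True]
tp_rate, tn_rate = per_class_accuracy(actual, predicted)
```

Reporting the two rates separately matters because stragglers are the minority class, so a single overall accuracy figure could hide poor straggler detection.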

Is this accuracy good enough?

Our proposal: Wrangler

[Figure: Wrangler architecture: master, workers, Model Builder, Predictive Scheduler, utilization counters, scheduling decisions]

Predictive Scheduler (Naïve)

Defer scheduling a task on a node that is predicted to create a straggler
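A minimal sketch of the naïve rule, assuming the scheduler walks a list of free nodes and asks a per-node predictor before assigning. The function names, the node list, and the return convention (None meaning "defer") are hypothetical and not taken from the slides.

```python
def assign_task(free_nodes, predicts_straggler):
    """Return the first free node not predicted to create a straggler.

    Returns None when every free node is predicted to straggle, meaning the
    assignment is deferred to a later scheduling round.
    """
    for node in free_nodes:
        if not predicts_straggler(node):
            return node
    return None


# Hypothetical per-node predictions, e.g. produced by each node's classifier.
predictions = {"worker-1": True, "worker-2": False, "worker-3": True}
chosen = assign_task(["worker-1", "worker-2", "worker-3"], predictions.get)
deferred = assign_task(["worker-1", "worker-3"], predictions.get)
```

Every prediction is acted on here regardless of how certain the classifier is, which is what makes this version naïve.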

Intuition

[Figure: resource usage over time]

Key: Better load-balancing

1. Improved job completion
2. Reduced resource consumption

Does the Naïve approach improve job completion?

Percentage reduction in job completion time
Workload: CC_b
Baseline: Speculative Execution

                                              avg      95p      97p      99p      99.9p
Prediction without confidence measure         -4.22%   -5.53%   -5.39%   11.34%   -5.51%
Prediction with confidence measure (p=0.8)    43.59%   58.87%   62.10%   43.13%   22.46%

Our proposal: Wrangler

[Figure: Wrangler architecture: master, workers, Model Builder, Predictive Scheduler, utilization counters, scheduling decisions, with a “Confident? Yes” check before scheduling decisions]

Only confident predictions influence scheduling decisions

Confidence Measure

[Figure: stragglers and non-stragglers in a two-feature space (feature 1, feature 2) with a separating hyperplane; a low-confidence zone around the hyperplane corresponds to p, the confidence threshold]
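One way to picture the low-confidence zone is to map the SVM decision value to a probability and act only when that probability reaches p. The sketch below uses a Platt-style sigmoid for the mapping; that choice, the sigmoid parameters, and the rule that only confident straggler predictions cause a deferral are assumptions, since the slides do not say how confidence is computed. The value p = 0.8 is the one shown in the evaluation chart.

```python
import math


def straggler_probability(decision_value, a=-2.0, b=0.0):
    """Map a signed SVM decision value to P(straggler) with a sigmoid.

    a and b are hypothetical; in Platt scaling they would be fit on held-out data.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def should_defer(decision_value, p=0.8):
    """Defer only when the node is predicted to straggle with confidence >= p."""
    return straggler_probability(decision_value) >= p


far_on_straggler_side = 1.1    # probability of about 0.90: confident, defer
near_hyperplane = 0.2          # probability of about 0.60: low-confidence zone
far_on_other_side = -1.1       # probability of about 0.10: not a straggler
```

Because the sigmoid is monotonic, a threshold on the probability is equivalent to a threshold on distance from the hyperplane, which matches the band drawn in the figure.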

p: Confidence Threshold

•  Value calculated via cross-validation
•  Verified via empirical sensitivity analysis (refer to paper for details)

Evaluation

•  Aim I: Does Wrangler Improve Job Completion Times?

•  Aim II: Does Wrangler Reduce Resources Consumed?

Aim I: Does Wrangler Improve Job Completion Times?

Percentage reduction in job completion time
Workload: CC_b
Baseline: Speculative Execution

                                              avg      95p      97p      99p      99.9p
Prediction without confidence measure         -4.22%   -5.53%   -5.39%   11.34%   -5.51%
Prediction with confidence measure (p=0.8)    43.59%   58.87%   62.10%   43.13%   22.46%

Confidence measure is the key!
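The chart reports reductions at the average and at several percentiles of job completion time. A minimal sketch of that arithmetic follows, assuming a nearest-rank percentile and a reduction measured relative to the baseline; the slides do not state the percentile method, and the sample times are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of values for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(math.ceil(q / 100.0 * len(ordered)), 1)
    return ordered[rank - 1]


def percent_reduction(baseline, improved):
    """Reduction relative to the baseline; negative means the job got slower."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical job completion times in seconds.
baseline_times = [100, 120, 140, 160, 200]   # e.g. speculative execution
wrangler_times = [90, 100, 110, 120, 100]

p95_reduction = percent_reduction(
    percentile(baseline_times, 95), percentile(wrangler_times, 95)
)
```

A negative value, like most entries in the "without confidence measure" row, means completion time increased relative to the baseline.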

Aim II: Does Wrangler Reduce Resources Consumed?

Percentage reduction in total task-seconds
Baseline: Speculative Execution

Workload    Reduction
FB2009      55.09%
FB2010      24.77%
CC_b        40.15%
CC_e        8.24%
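The resource metric can be sketched as follows, assuming "total task-seconds" means the summed run time of every task attempt, including speculative copies. The slides do not define the metric, so this definition, the function name, and the sample attempts are assumptions.

```python
def total_task_seconds(attempts):
    """Sum the run time of all task attempts, given (start, end) pairs in seconds.

    Assumption: speculative copies count toward the total as well.
    """
    return sum(end - start for start, end in attempts)


# Hypothetical baseline: a normal task, a straggler, and a speculative copy
# launched for the straggler.
baseline_attempts = [(0, 10), (0, 30), (12, 22)]

# Hypothetical run in which the second task was deferred instead of
# straggling, so no speculative copy was needed.
deferred_attempts = [(0, 10), (2, 12)]

baseline_total = total_task_seconds(baseline_attempts)
deferred_total = total_task_seconds(deferred_attempts)
reduction = 100.0 * (baseline_total - deferred_total) / baseline_total
```

Under this reading, the savings come both from avoiding slow attempts and from not launching extra copies.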

Load-Balancing with Wrangler

Few highly loaded nodes

Workload: FB2010

Sophisticated schedulers exist. Why Wrangler?

•  Difficult to
   – Anticipate the dynamically changing causes
   – Deal with over 100 resource utilization counters
   – Build a generic and unbiased scheduler

Wrangler statistically learns to predict stragglers without needing domain-specific knowledge

Status and Future Directions

•  Status:
   – Improving accuracy using less training data
   – Improving job completions further

•  Future Directions:
   – Generalization to other problems in cluster computing environments
      •  Fault Detection
      •  Resource Allocation

Conclusions

•  ML techniques enable early prediction of tasks that are likely to straggle.
•  Use of a confidence measure makes straggler predictions useful.
•  Wrangler:
   – Enables faster jobs
   – Uses fewer resources

Contact: neerajay@eecs.berkeley.edu