Mesos: A Platform for Fine-Grained Resource Sharing in the Data...

32
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center Ion Stoica, UC Berkeley Joint work with: D. Borthakur, K. Elmeleegy, A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, J. Sen Sarma, S. Shenker, M. Zaharia

Transcript of Mesos: A Platform for Fine-Grained Resource Sharing in the Data...

Page 1: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Mesos: A Platform for Fine-Grained Resource Sharing in the Data

Center

Ion Stoica, UC Berkeley

Joint work with: D. Borthakur, K. Elmeleegy, A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, J. Sen Sarma,

S. Shenker, M. Zaharia

Page 2: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Challenge   Rapid innovation in cloud computing

  No single framework optimal for all applications   Running each framework on its dedicated cluster

  Expensive   Hard to share data

2

Dryad

Pregel

Cassandra Hypertable

Need to run multiple frameworks on same cluster

Page 3: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Computation Model   Each framework (e.g., Hadoop) manages and runs

one or more jobs

  A job consists of one or more tasks

  A task (e.g., map, reduce) consists of one or more processes running on same machine

  Today’s, a framework assume it controls the cluster   Implement sophisticated schedulers, e.g., Quincy, LATE, Fair

sharing, Maui, … 3

Page 4: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Our Solution: Mesos   “Operating system” for very large clusters (tens of

thousands nodes)

  Provides   Fine grained sharing of resource across frameworks   Execution environment for tasks   Isolation

  Focus of this talk: resource management and scheduling

4

Page 5: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Approach: Global Scheduler

  Advantages: can achieve optimal schedule   Disadvantages:

  Complexity hard to scale and ensure resilience   Hard to anticipate future’s frameworks’ requirements   Need to refactor existing frameworks

5

Global Scheduler

Organization policies

Resource availability

Job/tasks estimated running times

Fwk/Job requirements Task schedule

Page 6: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Our Approach: Distributed Scheduler

  Advantages:   Simple easier to scale and make resilient   Easy to port existing frameworks, support new ones

  Disadvantages:   Distributed scheduling decision not optimal

6

Master (Global

Scheduler)

Organization policies

Resource availability

Framework scheduler

Task schedule

Fwk schedule

Framework scheduler

Framework scheduler

Page 7: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Resource Offers   Resource offer: set of available resources on a node

  E.g., m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>

  Master sends resource offers to frameworks   Frameworks select which offers to accept and which

tasks to run

7

Page 8: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Mesos Architecture

8

Slave  1  

Hadoop  Executor  MPI  executor  

Slave  2  

Hadoop  Executor  

Slave  3  

Mesos  Master  

Alloca9on  Module  

Framework  Scheduler  Hadoop  JobTracker  

Framework  Scheduler  MPI  Scheduler  

Resource  

offers  Status  

Slaves  con9nuously  send  status  updates  about  resources  

Pluggable  Scheduler  to  pick  framework  to    send  an  offer  to  

Framework  Scheduler  selects  resources  and  

provides  tasks  

Framework  executors  launch  tasks  and  may  persist  across  tasks  

Launch  Hadoop  task  2  

task  1  

task  2  

task  1  

Page 9: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Outline

  Dominant Resource Fairness

  Delay scheduling – dealing with constraints

9

Page 10: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Problem

  Different types of resources

  Frameworks (users) prioritize resources differently   E.g., I/O bounded vs. CPU bounded jobs

10

How to allocate resources fairly?

Page 11: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Model   Demand vectors to express user’s requests

  D1=<2, 3, 1>, U1 tasks need 2 units of R1, 3 units of R2 and one unit of R3

  Users get resources in multiples of their demand vector   Resource offer = k x demand vector, k > 0

  Note: demand vector not needed in practice   Use actual allocation for scheduling decisions

11

Page 12: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Problem Example

  30 CPU and 30 GB RAM   U1 needs <1 CPU, 1 GB RAM> per task   U2 needs <1 CPU, 3 GB RAM> per task

12

How many tasks should each task run?

Page 13: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Strawman: Asset Fairness

  Equalize sum of allocated resources

  30 CPU and 30 GB RAM   U1 needs <1 CPU, 1 GB RAM> per task   U2 needs <1 CPU, 3 GB RAM> per task

  Allocation   U1: 12 tasks: 12 CPUs, 12 GB (∑=24)   U2: 6 tasks: 6 CPUs, 18 GB (∑=24)

13

CPU  

User  1   User  2  100%  

50%  

0%  RAM  

User 1 has less than 50% of each resource !

Page 14: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Share Guarantee Property

  No incentive to share:   User 1 is better off just taking half of cluster!

  Share guaranteed property:   Every user should get 1/n of at least one resource   “You shouldn’t be worse off than if you ran your own

cluster with 1/n of the resources”

14

Page 15: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Dominant Resource Fairness

  Dominant resource: resource the user has the largest share of, e.g.,   Total capacity: <10, 4>   Demand vector: <2, 1>

  Max-min fairness across dominant resources   If infinite demand, equalize dominant resource shares

15

dominant resource, R2

Satisfies share guarantee property

Page 16: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

DRF Example

  30 CPU and 30 GB RAM   U1 needs <2 CPU, 1 GB RAM> per task   U2 needs <1 CPU, 4 GB RAM> per task

  Allocation   U1: 10 tasks: 20 CPUs (67%), 10 GB   U2: 5 tasks: 5 CPUs, 20 GB (67%)

16

CPU  

User  1   User  2  100%  

50%  

0%  RAM  

Page 17: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Other Desirable Properties

  Strategy proofness   Pareto efficiency   Envy-freeness   Single-resource fairness   Bottleneck fairness   Population monotonicity   Resource monotonicity

17

Page 18: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Related Work

  Resource allocation a very popular topic in micro-economic theory

  Competitive Equilibrium from Equal Incomes (CEEI) [Varian74]   Give each user 1/n of every resource   Let users trade in a perfectly competitive market until

Pareto efficiency   Economists’ main method of fair allocation [Young94]

18

Page 19: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

DRF vs. CEEI   U1=<16,1> U2=<1,2>

19

CPU RAM CPU RAM

CPU RAM

user 2 user 1 100%

50%

0% CPU RAM

100%

50%

0%

DRF CEEI

DRF more fair, CEEI better utilization

Page 20: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Strategy Proofness   A framework cannot increase its share by lying

  DRF is strategy proof   Allocation is determined by dominant share   Increasing non-dominant share can only hurt

  CEEI is not strategy proof

  In general, schedulers that maximize utilization can be gamed   A user with demand matching available resources is picked

20

Page 21: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Strategy Proof: DRF vs CEEI (1/2)   U1=<16,1> U2=<1,2>

CPU RAM

user 2 user 1 100%

50%

0% CPU RAM

100%

50%

0%

DRF CEEI

21

Page 22: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Strategy Proof: DRF vs CEEI (2/2)   U1=<16,4> U2=<1,2>

  With CEEI a user can increase its dominant share by lying

CPU RAM

user 2 user 1 100%

50%

0% CPU RAM

100%

50%

0%

DRF CEEI

22

Page 23: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Property Summary

  Several other schedulers evaluated… 23

Property Asset CEEI DRF Share guarantee ✔ ✔ Strategy proofness ✔ Pareto efficiency ✔ ✔ ✔ Envy-free ✔ ✔ ✔ Single resource fairness ✔ ✔ ✔ Bottleneck res. fairness ✔ ✔ Population monotonicity ✔ ✔ Resource monotonicity

Page 24: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Dynamic Resource Sharing with DRF

24

0%  

10%  

20%  

30%  

40%  

50%  

60%  

70%  

80%  

90%  

100%  

1   31   61   91   121   151   181   211   241   271   301   331  

Share  of  Cluster  

Time  (s)  

MPI  

Hadoop  

Spark  

Page 25: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Outline

  Dominant Resource Fairness

  Delay scheduling – dealing with constraints

25

Page 26: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Fair Allocation Hurts Locality (1/2)

26

Fwk 2 Master Fwk 1

Scheduling  order  

Slave Slave Slave Slave Slave Slave 4 2

1 1 2 2 3 3 9 5 3 3 6 7 5 6 9 4 8 7 8 2 1 1

Task 2 Task 5 Task 3 Task 1 Task 7 Task 4

File 1: File 2:

Page 27: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Fair Allocation Hurts Locality (2/2)

27

Fwk 1 Master Fwk 2

Scheduling  order  

Slave Slave Slave Slave Slave Slave 4 2

1 2 2 3 9 5 3 3 6 7 5 6 9 4 8 7 8 2 1 1

Task 2 Task 5 Task 3 Task 1

File 1: File 2:

Task 1 Task 7 Task 2 Task 4 Task 3

1 3

Page 28: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Delay Scheduling

  Relax scheduling: frameworks wait for a limited time if they cannot launch local tasks

  Result: Waiting a few seconds is enough to get nearly 100% locality

28

Page 29: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Delay Scheduling Example

29

Job 1 Master Job 2

Scheduling  order  

Slave Slave Slave Slave Slave Slave 4 2

1 1 2 2 3 3 9 5 3 3 6 7 5 6 9 4 8 7 8 2 1 1

Task 2 Task 3

File 1:

File 2:

Task 8 Task 7 Task 2 Task 4 Task 6 Task 5 Task 1 Task 1 Task 3

Wait!

Idea: Wait a short time to get data-local scheduling opportunities

Page 30: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

How Long to Wait?   Model: Poisson arrival of data-local offers

  Expected response time gain if waiting time w

  Typical values   t : (avg. task length)/((tasks per node)x(file replication))   d >= (local avg. task length)

30

G(w) = (1 – e-w/t)×(d – t)

Expected time to get a local offer

Delay from running non-locally

Optimal wait time is infinity

Page 31: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Locality and Execution Time   Running 16 Hadoop instances

31

Static partitioning

Mesos, no delay sched.

Mesos, delay sched., 1sec

Mesos, delay sched., 5sec

Static partitioning

Mesos, no delay sched.

Mesos, delay sched., 1sec

Mesos, delay sched., 5sec

Page 32: Mesos: A Platform for Fine-Grained Resource Sharing in the Data …sands.sce.ntu.edu.sg/visitortalks/stoica2010.pdf · 2015-03-26 · Our Solution: Mesos “Operating system” for

Summary   Challenge: allow frameworks to share a cluster

  Mesos: “OS” for sharing clusters   Support existing frameworks   Fault-tolerant & scale to 10,000s of machines   Delay scheduling part of Hadoop distribution   Ongoing deployments at Twitter and Facebook

  Open source: github.com/mesos

  Research challenges   Fairness for constrained & multiple resource types   Distributed vs. centralized schedulers 32