Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Ion Stoica,

UC Berkeley

Joint work with: A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, S. Shenker, and M. Zaharia

Challenge

  Rapid innovation in cloud computing
  No single framework is optimal for all applications
  Running each framework on its own dedicated cluster
    Expensive
    Hard to share data

[Figure: example frameworks: Dryad, Pregel, Cassandra, Hypertable]

Need to run multiple frameworks on the same cluster

Computation Model

  Each framework (e.g., Hadoop, Torque) manages and runs one or more jobs
  A job consists of one or more tasks
  A task (e.g., map, reduce) consists of one or more processes running on the same machine
  Today, a framework assumes it controls the cluster
    Frameworks implement sophisticated schedulers, e.g., Quincy, LATE, Fair sharing, Maui, …

Computation Environment

                    Cloud computing              HPC                              Grid
Typical workload    Data intensive               Computation intensive            Computation intensive
Task granularity    Small (median < 1 min)       Large                            Large
Hardware            Commodity                    Specialized                      Commodity
Data                Distributed across nodes     Partitioned (usually, comp. &    Partitioned
                    (e.g., 10TB per node)        storage different)

Callouts: fine-grained scheduling; need data locality

Where We Want to Go

[Figure: Hadoop, Pregel and MPI on one shared cluster. Today: static partitioning (uniprogrammed). Mesos: dynamic sharing (multiprogrammed)]

Solution

  Mesos is a common resource sharing layer over which diverse frameworks can run

[Figure: without Mesos, Hadoop and MPI each run on their own nodes; with Mesos, Hadoop, MPI, … run on top of a Mesos layer that spans all nodes]

Mesos Goals

  High utilization of resources
  Support diverse frameworks (current & future)
  Scalability to 10,000s of nodes
  Reliability in the face of failures

Resulting design: small microkernel-like core that pushes scheduling logic to frameworks

Mesos Provides…

  Fine-grained sharing of resources across frameworks
    Unit of scheduling: task
  Execution environment for tasks
  Isolation
  Focus of this talk: resource management and scheduling

Approach: Global Scheduler

  Advantages: can achieve an optimal schedule
  Disadvantages:
    Complexity: hard to scale and ensure resilience
    Hard to anticipate future frameworks' requirements
    Need to refactor existing frameworks

[Figure: the global scheduler takes organization policies, resource availability, job/task estimated running times and framework/job requirements as inputs and produces the task schedule]

Our Approach: Distributed Scheduler

  Advantages:
    Simple: easier to scale and make resilient
    Easy to port existing frameworks, support new ones
  Disadvantages:
    Distributed scheduling decision not optimal

[Figure: the master (global scheduler) produces the framework schedule ("Fwk schedule") for the framework schedulers, which produce the task schedule]

Resource Offers

  Unit of allocation: resource offer
    Vector of available resources on a node
    E.g., m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>
  Master sends resource offers to frameworks
  Frameworks select which offers to accept and which tasks to run
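The offer mechanism described on this slide can be sketched in a few lines. This is an illustration only, not the Mesos API: the `Offer` class, the `pick_offers` function and the first-fit acceptance rule are hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Vector of resources currently free on one node (hypothetical type)."""
    node: str
    cpus: int
    mem_gb: int


def pick_offers(offers, task_cpus, task_mem_gb, tasks_wanted):
    """Framework-side choice: accept offers that fit a task, decline the rest.

    Returns (launches, declined), where launches maps node -> number of tasks.
    First-fit packing is an assumption made for this sketch.
    """
    launches, declined = {}, []
    for offer in offers:
        # How many whole tasks fit in this offer's resource vector
        fits = min(offer.cpus // task_cpus, offer.mem_gb // task_mem_gb)
        n = min(fits, tasks_wanted)
        if n > 0:
            launches[offer.node] = n
            tasks_wanted -= n
        else:
            declined.append(offer.node)
    return launches, declined


# The two offers from the slide: m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>
offers = [Offer("m1", 1, 1), Offer("m2", 4, 16)]
launches, declined = pick_offers(offers, task_cpus=2, task_mem_gb=4, tasks_wanted=3)
print(launches, declined)
```

Here the framework wants tasks of <2 cpu, 4GB>, so it declines m1 and places two tasks on m2. The master only proposes resources; the placement decision stays in the framework.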

Mesos Architecture

[Figure: a Mesos master with an allocation module sits between the framework schedulers (Hadoop JobTracker, MPI scheduler) and slaves 1 to 3, which run Hadoop and MPI executors and their tasks. Resource offers go from the master to the framework schedulers; status goes from the slaves to the master. Example action: launch Hadoop task 2]

  Slaves continuously send status updates about resources
  Pluggable scheduler picks the framework to send an offer to
  Framework scheduler selects resources and provides tasks
  Framework executors launch tasks and may persist across tasks

Outline

  Global scheduler: Dominant Resource Fairness
  Analytic evaluation of distributed scheduler efficiency
  Experimental results

Single Resource: Fair Sharing

  n users want to share a resource (e.g. CPU)
    Solution: allocate each 1/n of the shared resource
  Generalized by max-min fairness
    Handles the case where a user wants less than its fair share
    E.g. user 1 wants no more than 20%
  Generalized by weighted max-min fairness
    Give weights to users according to importance
    User 1 gets weight 1, user 2 weight 2

[Figure: CPU shares. Equal sharing: 33%, 33%, 33%. Max-min with user 1 capped at 20%: 20%, 40%, 40%. Weighted with weights 1 and 2: 33%, 66%]
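The three cases on this slide (equal split, a capped user, weighted users) can be reproduced with a standard water-filling procedure. The sketch below is a minimal illustration; the function name `max_min_share` and its signature are made up for this example.

```python
import math


def max_min_share(capacity, demands, weights=None):
    """Weighted max-min fair allocation of one resource (water-filling).

    demands[i] is the most user i wants (math.inf for "as much as possible").
    """
    n = len(demands)
    weights = weights or [1.0] * n
    alloc = [0.0] * n
    active = set(range(n))
    remaining = float(capacity)
    while active and remaining > 1e-12:
        total_w = sum(weights[i] for i in active)
        # Users whose remaining demand is below their weighted share of what is left
        satisfied = {i for i in active
                     if demands[i] - alloc[i] <= remaining * weights[i] / total_w}
        if not satisfied:
            # Nobody is capped: split what is left in proportion to the weights
            for i in active:
                alloc[i] += remaining * weights[i] / total_w
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= satisfied
    return alloc


inf = math.inf
print(max_min_share(100, [inf, inf, inf]))        # equal split among three users
print(max_min_share(100, [20, inf, inf]))         # user 1 wants no more than 20%
print(max_min_share(100, [inf, inf], [1, 2]))     # weights 1 and 2
```

The capped user's unused share is redistributed among the others, which is why the second case gives the remaining two users 40% each rather than 33%.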

Why Max-Min Fairness?

  Weighted Fair Sharing / Proportional Shares
    User 1 gets weight 2, user 2 weight 1
  Priorities
    Give user 1 weight 1000, user 2 weight 1
  Reservations
    Ensure user 1 gets 10% of a resource
    Give user 1 weight 10, sum weights ≤ 100
  Isolation Policy
    Users cannot affect others beyond their fair share
  Widely used:
    OS: RR, prop. sharing, lottery, Linux's cfs, …
    Networking: wfq, wf2q, sfq, drr, csfq, ...
    Datacenters: Hadoop's fair sched, capacity sched, Quincy

Properties of Max-Min Fairness

  Share guarantee
    Each user gets at least 1/n of the resource
    But will get less if her demand is less
  Strategy-proof
    Users are not better off by asking for more than they need
    Users have no reason to lie
  Max-min fairness is the only "reasonable" mechanism with these two properties

Why is Max-Min Fairness not Enough?

  Job scheduling in datacenters is not only about a single resource
    Jobs consume CPU, memory, disk, and I/O
  What are task demands today?

Heterogeneous Resource Demands

[Figure: per-task resource demands on a 2000-node Hadoop cluster at Facebook (Oct 2010)]

  Most tasks need ~ <2 CPU, 2 GB RAM>
  Some tasks are memory-intensive
  Some tasks are CPU-intensive

Problem

Single resource example
  1 resource: CPU
  User 1 wants <1 CPU> per task
  User 2 wants <3 CPU> per task

Multi-resource example
  2 resources: CPUs & mem
  User 1 wants <1 CPU, 4 GB> per task
  User 2 wants <3 CPU, 1 GB> per task
  What's a fair allocation?

[Figure: single resource: each user gets 50% of the CPU. Multi-resource: CPU and mem shares marked "?"]

A Natural Policy

  Asset Fairness
    Equalize each user's sum of resource shares
  Cluster with 70 CPUs, 70 GB RAM
    U1 needs <2 CPU, 2 GB RAM> per task
    U2 needs <1 CPU, 2 GB RAM> per task
  Asset fairness yields
    U1: 15 tasks: 30 CPUs, 30 GB (∑=60)
    U2: 20 tasks: 20 CPUs, 40 GB (∑=60)

[Figure: shares under asset fairness. CPU: user 1 43%, user 2 28%. RAM: user 1 43%, user 2 57%]

Problem: violates the share guarantee
  User 1 has < 50% of both CPUs and RAM
  Better off in a separate cluster with 50% of the resources
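The asset fairness numbers on this slide can be checked with a small brute-force search. This is a sketch under assumptions that are not in the slides: the function name `asset_fair` is made up, it handles only two users, and it treats tasks as indivisible.

```python
from fractions import Fraction


def asset_fair(total, d1, d2):
    """Brute-force asset fairness for two users.

    Searches task counts (x, y) that fit in the cluster and equalize each
    user's sum of resource shares, keeping the pair with the most tasks.
    """
    def asset(demand, tasks):
        # Sum over resources of the fraction of that resource the user holds
        return sum(Fraction(tasks * d, t) for d, t in zip(demand, total))

    best = (0, 0)
    max_x = min(t // d for d, t in zip(d1, total) if d)
    max_y = min(t // d for d, t in zip(d2, total) if d)
    for x in range(max_x + 1):
        for y in range(max_y + 1):
            fits = all(x * a + y * b <= t for a, b, t in zip(d1, d2, total))
            if fits and asset(d1, x) == asset(d2, y) and x + y > sum(best):
                best = (x, y)
    return best


# Cluster <70 CPUs, 70 GB>; U1 needs <2 CPU, 2 GB>, U2 needs <1 CPU, 2 GB>
x, y = asset_fair(total=(70, 70), d1=(2, 2), d2=(1, 2))
print(x, y)                          # task counts for U1 and U2
print(2 * x / 70, 2 * x / 70)        # U1's CPU and RAM shares, both below 0.5
```

With 15 tasks, user 1 holds 30 of 70 CPUs and 30 of 70 GB, about 43% of each, which is the share-guarantee violation the slide points out.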

Challenge

  Can we find a fair sharing policy that provides
    Share guarantee
    Strategy-proofness
  Max-min fairness for a single resource had these properties
    Can we generalize max-min fairness to multiple resources?

Cheating the Scheduler

  Users are willing to game the system to get more resources
  Real-life examples
    A cloud provider had quotas on map and reduce slots
      Some users found out that the map quota was low
      Users implemented maps in the reduce slots!
    A search company provided dedicated machines to users that could ensure a certain level of utilization (e.g. 80%)
      Users used busy-loops to inflate utilization

Dominant Resource Fairness

  A user's dominant resource is the resource the user has the biggest share of
    Example:
      Total resources: <10 CPU, 4 GB>
      User 1's allocation: <2 CPU, 1 GB>
      Dominant resource is memory as 1/4 > 2/10 (1/5)
  A user's dominant share is the fraction of the dominant resource she is allocated
    User 1's dominant share is 25% (1/4)
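The definition of dominant resource and dominant share translates directly into code. A minimal sketch, using a hypothetical helper `dominant_share` and made-up dictionary keys (`cpu`, `mem_gb`), applied to the slide's example:

```python
from fractions import Fraction


def dominant_share(allocation, total):
    """Return (resource name, share) for the resource the user holds most of."""
    shares = {r: Fraction(allocation[r], total[r]) for r in total}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]


total = {"cpu": 10, "mem_gb": 4}   # total resources: <10 CPU, 4 GB>
user1 = {"cpu": 2, "mem_gb": 1}    # user 1's allocation: <2 CPU, 1 GB>
print(dominant_share(user1, total))
```

Exact fractions are used so that the comparison 1/4 > 1/5 is not affected by floating-point rounding.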

Dominant Resource Fairness (2)

  Apply max-min fairness to dominant shares
    Equalize the dominant share of the users
  Example:
    Total resources: <9 CPU, 18 GB>
    User 1 demand: <1 CPU, 4 GB>   dom res: mem
    User 2 demand: <3 CPU, 1 GB>   dom res: CPU

[Figure: shares of CPU (9 total) and mem (18 total). User 1: 3 CPUs, 12 GB (66% of mem). User 2: 6 CPUs, 2 GB (66% of CPU)]

Online DRF Scheduler

  Whenever there are available resources and tasks to run:
    Schedule a task to the user with the smallest dominant share
  O(log n) time per decision using binary heaps
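The online rule above can be sketched with Python's `heapq` as the binary heap. This is an illustrative simulation, not the Mesos allocation module: the function name `drf_schedule` is made up, every user is assumed to have unlimited tasks, and a user whose next task no longer fits is simply dropped while the others continue.

```python
import heapq
from fractions import Fraction


def drf_schedule(total, demands):
    """Online DRF: repeatedly give one task to the user with the smallest
    dominant share, until no remaining user's next task fits.

    total:   tuple of cluster capacities, e.g. (cpus, mem_gb)
    demands: per-user tuple of per-task demands, same order as total
    Returns the number of tasks launched per user.
    """
    used = [0] * len(total)
    tasks = [0] * len(demands)
    # Heap entries are (dominant share, user index); each pop/push is O(log n)
    heap = [(Fraction(0), u) for u in range(len(demands))]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        d = demands[u]
        if any(used[r] + d[r] > total[r] for r in range(len(total))):
            continue  # next task does not fit; drop this user (sketch assumption)
        for r in range(len(total)):
            used[r] += d[r]
        tasks[u] += 1
        share = max(Fraction(tasks[u] * d[r], total[r]) for r in range(len(total)))
        heapq.heappush(heap, (share, u))
    return tasks


# Total <9 CPU, 18 GB>; user 1 demands <1 CPU, 4 GB>, user 2 demands <3 CPU, 1 GB>
print(drf_schedule(total=(9, 18), demands=[(1, 4), (3, 1)]))
```

On the example from the previous slide this launches three tasks for user 1 (3 CPUs, 12 GB) and two for user 2 (6 CPUs, 2 GB), so both end with a dominant share of 2/3.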

Why not Use Pricing?

  Approach
    Set prices for each good
    Let users buy what they want
  Problem
    How do we determine the right prices for different goods?

How Would an Economist Solve It?

  Let the market determine the prices
  Competitive Equilibrium from Equal Incomes (CEEI)
    Give each user 1/n of every resource
    Let users trade in a perfectly competitive market
  Not strategy-proof!

DRF vs CEEI

  User 1: <1 CPU, 4 GB>, User 2: <3 CPU, 1 GB>
    DRF more fair, CEEI better utilization
  User 1: <1 CPU, 4 GB>, User 2: <3 CPU, 2 GB>
    User 2 increased her share of both CPU and memory by lying!

[Figure: CPU and mem shares of user 1 and user 2 under Dominant Resource Fairness and under Competitive Equilibrium from Equal Incomes. With the first set of demands: DRF 66% and 66%; CEEI 55% (user 2, CPU) and 91% (user 1, mem). With the second set of demands: DRF 66% and 66%; CEEI 60% (user 2, CPU) and 80% (user 1, mem)]

Gaming Utilization-Optimal Schedulers

•  Cluster with <100 CPU, 100 GB>
•  2 users, each demanding <1 CPU, 2 GB> per task
•  User 1 lies and demands <2 CPU, 2 GB>
•  Utilization-Optimal scheduler prefers user 1

[Figure: CPU and mem shares of user 1 and user 2, before and after user 1 lies; bar labels: 50%, 50%, 95%]

Properties of Policies

Property                    Asset   CEEI   DRF
Share guarantee                     ✔      ✔
Strategy-proofness          ✔              ✔
Pareto efficiency           ✔       ✔      ✔
Envy-freeness               ✔       ✔      ✔
Single resource fairness    ✔       ✔      ✔
Bottleneck res. fairness            ✔      ✔
Population monotonicity     ✔              ✔
Resource monotonicity

Assuming non-zero demands, DRF is the only allocation that is strategy-proof and provides sharing incentive (Eric Friedman, Cornell)


How Well Does the Distributed Scheduler Work?

  Simplified model for workload
  Evaluate main performance metrics
    Job waiting time
    Completion time
    Utilization

[Figure: the master (scheduler) passes the framework schedule ("Fwk schedule") to the framework schedulers, which produce the task schedule; labeled inputs: job/task estimated running times, framework/job requirements]

Workload Characteristics

  Elastic vs. rigid jobs:
    Elastic: can dynamically adjust allocation during execution (e.g., Hadoop, Spark)
    Rigid: cannot run before they acquire all resources (e.g., MPI)
  Picky vs. non-picky jobs
    Picky: can only run on a subset of resources (e.g., need GPU, public IP address)
  Homogeneous vs. heterogeneous task durations

Metrics

  Job ramp-up time: time it takes a new framework to achieve its target allocation (e.g., fair share)
  Job completion time: time it takes a job to complete, assuming one job per framework
  Cluster utilization

Simplified System Model

  One job per framework
  Allocation granularity: slot
  Each framework has a target allocation, determined by allocation policy
    E.g., fair sharing, reservation, priority

Ideal System

  Job gets its target allocation instantaneously
    k = target allocation (slots)
    T = average task duration
    m = number of tasks in a job
    Job completion time ≈ (m/k) T

[Figure: # of slots in system over time; the job holds its target allocation k from the moment it is ready/starts until it ends, (m/k) T later]

Elastic Framework

  Can scale allocation up and down during job execution
    Longer completion time: takes time to ramp up to target allocation
    High utilization: can use a slot as soon as it becomes available

[Figure: # of slots in system over time, ramping up to the target allocation]

Rigid Framework

  Cannot start before achieving target allocation
    Higher completion time
    Lower utilization: cannot use slots as soon as it acquires them

[Figure: # of slots in system over time, with the target allocation and the wasted resources marked]

Analysis Result Summary

  T – mean task duration
  k – target allocation of job (in slots)
  m – number of job's tasks
  β T – job completion time in ideal system, where β ≈ m/k

                   Elastic Framework              Rigid Framework
                   Const. dist    Exp. dist       Const. dist       Exp. dist
Ramp-up time       T              (ln k) T        T                 (ln k) T
Completion time    (1/2 + β) T    (1 + β) T       (1 + β) T         (ln k + β) T
Utilization        1              1               β / (1/2 + β)     β / (ln k + β - 1)

Callouts: within T of ideal completion time; good performance only for larger jobs, and only if β >> ln k
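To get a feel for the magnitudes, the formulas in the summary table can be evaluated for concrete values. The sketch below only transcribes the table; the function name `predicted_metrics` and the sample values of T, k and m are assumptions chosen for illustration.

```python
import math


def predicted_metrics(T, k, m):
    """Evaluate the summary-table formulas for mean task duration T,
    target allocation k (slots), and m tasks per job."""
    beta = m / k          # ideal completion time is beta * T
    ln_k = math.log(k)    # natural logarithm, as in the table
    return {
        ("elastic", "const"): {"ramp_up": T, "completion": (0.5 + beta) * T,
                               "utilization": 1.0},
        ("elastic", "exp"): {"ramp_up": T * ln_k, "completion": (1 + beta) * T,
                             "utilization": 1.0},
        ("rigid", "const"): {"ramp_up": T, "completion": (1 + beta) * T,
                             "utilization": beta / (0.5 + beta)},
        ("rigid", "exp"): {"ramp_up": T * ln_k, "completion": (ln_k + beta) * T,
                           "utilization": beta / (ln_k + beta - 1)},
    }


# Illustrative values: 30-second tasks, 100 target slots, 1000 tasks (beta = 10)
metrics = predicted_metrics(T=30, k=100, m=1000)
for (kind, dist), row in metrics.items():
    print(kind, dist, {name: round(value, 3) for name, value in row.items()})
```

With β = 10 and ln k ≈ 4.6, the rigid framework with exponentially distributed task durations is the only case that falls noticeably short on utilization, which matches the condition β >> ln k stated on the slide.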

Modeling Picky Jobs

  k – number of slots with property P requested by job
  n – number of slots in system with property P
  Examples of property P:
    A slot running on a node storing a job's input block
    A slot running on a node with GPU
  Job pickiness, p = k/n

Picky Jobs: Waiting Time

  Assume only one framework requests slots with property P
    E.g., all other frameworks have achieved target allocation

Property: waiting time of a job with pickiness factor p ≈ p-percentile of task duration distribution

  Example: if p = 0.9, waiting time is equal to the 90th percentile of T
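The waiting-time property can be illustrated on a sample of task durations. This is a sketch under assumptions: the function name `picky_wait_time` is hypothetical, the nearest-rank percentile definition is a choice made here, and the durations are invented for the example rather than taken from any measurement.

```python
import math
from fractions import Fraction


def picky_wait_time(task_durations, k, n):
    """Approximate waiting time of a job with pickiness p = k/n as the
    p-percentile (nearest-rank definition) of the given task durations."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    ordered = sorted(task_durations)
    # Exact arithmetic avoids rounding surprises when computing the rank
    rank = math.ceil(Fraction(k, n) * len(ordered))
    return ordered[rank - 1]


# Invented sample of task durations in seconds (illustration only)
durations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(picky_wait_time(durations, k=9, n=10))   # pickiness p = 0.9
```

The intuition is that a job asking for k of the n slots with property P has to wait until that fraction of the tasks currently occupying those slots has finished.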

Summary

Elastic   Picky   Homogeneous   Performance   Examples
Y         N       Y             Best          MapReduce, Spark
Y         N       N             Very Good     Web framework
Y         Y       Y             Very Good     GPU-based ML frameworks
Y         Y       N             Good
N         N       Y             Good          storage services (e.g., memcached, dist. file systems)
N         N       N             Good          MPI
N         Y       Y             Good          long running services with special needs (e.g., DNS)
N         Y       N             Worst


Implementation Stats

20,000 lines of C++

Master failover using ZooKeeper

Frameworks ported: Hadoop, MPI, Torque

New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop)

Open source in Apache Incubator

Users

  Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing)
  Berkeley machine learning researchers are running several algorithms at scale on Spark
  Conviva is using Spark for data analytics
  UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Dynamic Resource Sharing

  100 node cluster

Mesos vs Static Partitioning

  Compared performance with a statically partitioned cluster where each framework gets 25% of nodes

Framework              Speedup on Mesos
Facebook Hadoop Mix    1.14×
Large Hadoop Mix       2.10×
Spark                  1.26×
Torque / MPI           0.96×

Scalability

  50,000 simulated nodes using 200 VMs
  200 frameworks with 30 second tasks
  Measured: time to launch + get status update

[Figure: task launch overhead (seconds, axis 0 to 1) versus number of nodes (axis 0 to 50,000)]

Related Work

  HPC schedulers (e.g. Torque, LSF, Sun Grid Engine)
    Coarse-grained sharing for rigid jobs (e.g. MPI)
  Virtual machine clouds
    Coarse-grained sharing similar to HPC
  Condor
    Centralized scheduler based on matchmaking
  Next-Generation Hadoop
    Redesign of Hadoop to have per-application masters
    Also aims to support non-MapReduce jobs
    Based on resource request language with locality prefs

Conclusion

  Mesos shares clusters efficiently among diverse frameworks thanks to two design elements
    Fine-grained sharing at the level of tasks
    Resource offers, a scalable mechanism for application-controlled scheduling
  Enables co-existence of current frameworks and development of new specialized ones
  In use at Twitter, UC Berkeley, Conviva and UCSF

http://www.mesosproject.org
