Mesos: A PlaPorm for Fine-‐Grained Resource Sharing in the Data ...
Transcript of Mesos: A PlaPorm for Fine-‐Grained Resource Sharing in the Data ...
Mesos: A Pla+orm for Fine-‐Grained Resource Sharing in the Data Center
Ion Stoica,
UC Berkeley
Joint work with: A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, S. Shenker, and M. Zaharia
Challenge Rapid innovaEon in cloud compuEng
No single framework opEmal for all applicaEons
Running each framework on its dedicated cluster Expensive Hard to share data
2
Dryad
Pregel
Cassandra Hypertable
Need to run multiple frameworks on same cluster
Computa?on Model Each framework (e.g., Hadoop, Torque) manages and runs one or more jobs
A job consists of one or more tasks
A task (e.g., map, reduce) consists of one or more processes running on same machine
Today’s, a framework assume it controls the cluster Implement sophisEcated schedulers, e.g., Quincy, LATE, Fair
sharing, Maui, … 3
Computa?on Environment
4
Cloud computing
HPC Grid
Typical workload
Data intensive Computation intensive
Computation intensive
Task granularity
Small (median < 1min)
Large Large
Hardware Commodity Specialized Commodity Data Distributed
across nodes (e.g., 10TB per node)
Partitioned (usually, comp. & storage different)
Partitioned
Fine grain scheduling Need data
locality
Where We Want to Go
Hadoop
Pregel
MPI Shared cluster
Today: static partitioning Mesos: dynamic sharing
uniprogrammed multiprogrammed
Solu?on
Mesos is a common resource sharing layer over which diverse frameworks can run
6
Mesos
Node Node Node Node
Hadoop MPI …
Node Node
Hadoop
Node Node
MPI
…
Mesos Goals
High u?liza?on of resources
Support diverse frameworks (current & future)
Scalability to 10,000’s of nodes
Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to
frameworks
Mesos Provides…
Fine grained sharing of resource across frameworks Unit of scheduling: task
ExecuEon environment for tasks
IsolaEon
Focus of this talk: resource management and scheduling
8
Approach: Global Scheduler
Advantages: can achieve opEmal schedule
Disadvantages: Complexity hard to scale and ensure resilience Hard to anEcipate future’s frameworks’ requirements
Need to refactor exisEng frameworks 9
Global Scheduler
Organization policies
Resource availability
Job/tasks estimated running times
Fwk/Job requirements Task schedule
Our Approach: Distributed Scheduler
Advantages: Simple easier to scale and make resilient Easy to port exisEng frameworks, support new ones
Disadvantages: Distributed scheduling decision not opEmal
10
Master (Global
Scheduler)
Organization policies
Resource availability
Framework scheduler
Task schedule
Fwk schedule
Framework scheduler
Framework scheduler
Resource Offers Unit of allocaEon: resource offer
Vector of available resources on a node E.g., m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>
Master sends resource offers to frameworks Frameworks select which offers to accept and which tasks to run
11
Mesos Architecture
12
Slave 1
Hadoop Executor MPI executor
Slave 2
Hadoop Executor
Slave 3
Mesos Master
AllocaEon Module
Framework Scheduler Hadoop JobTracker
Framework Scheduler MPI Scheduler
Resource
offers Status
Slaves conEnuously send status updates about resources
Pluggable Scheduler to pick framework to send an offer to
Framework Scheduler selects resources and
provides tasks
Framework executors launch tasks and may persist across tasks
Launch Hadoop task 2
task 1
task 2
task 1
Outline
Global scheduler: Dominant Resource Fairness
AnalyEc evaluaEon of distributed scheduler efficiency
Experimental results
13
Single Resource: Fair Sharing
n users want to share a resource (e.g. CPU) SoluEon: allocate each 1/n of the shared resource
Generalized by max-‐min fairness Handles if a user wants less than its fair share E.g. user 1 wants no more than 20%
Generalized by weighted max-‐min fairness Give weights to users according to importance User 1 gets weight 1, user 2 weight 2
14
CPU 100%
50%
0%
33%
33%
33%
100%
50%
0%
20%
40%
40%
100%
50%
0%
33%
66%
Why Max-‐Min Fairness? Weighted Fair Sharing / Propor>onal Shares
User 1 gets weight 2, user 2 weight 1 Priori>es
Give user 1 weight 1000, user 2 weight 1 Reverva>ons
Ensure user 1 gets 10% of a resource Give user 1 weight 10, sum weights ≤ 100
Isola>on Policy Users cannot affect others beyond their fair share
Widely used: OS: RR, prop. sharing, logery, Linux’s cfs, … Networking: wfq, wf2q, sfq, drr, csfq, ... Datacenters: Hadoop’s fair sched, capacity sched, Qunincy
15
Proper?es of Max-‐Min Fairness
Share guarantee Each user gets at least 1/n of the resource But will get less if her demand is less
Strategy-‐proof Users are not beger off by asking for more than they need Users have no reason to lie
Max-‐min fairness is the only ”reasonable” mechanism with these two properEes
16
Why is Max-‐Min Fairness not Enough?
Job scheduling in datacenters is not only about a single resource Jobs consume CPU, memory, disk, and I/O
What are task demands today?
17
Heterogeneous Resource Demands
18
Most task need ~ <2 CPU, 2 GB RAM>
Some tasks are memory-‐intensive
Some tasks are CPU-‐intensive
2000-node Hadoop Cluster at Facebook (Oct 2010)!
Problem
Single resource example 1 resource: CPU User 1 wants <1 CPU> per task
User 2 wants <3 CPU> per task
Mul;-‐resource example 2 resources: CPUs & mem
User 1 wants <1 CPU, 4 GB> per task User 2 wants <3 CPU, 1 GB> per task
What’s a fair alloca>on? 19
CPU
100%
50%
0%
CPU
100%
50%
0% mem
? ?
50%
50%
Asset Fairness Equalize each user’s sum of resource shares
Cluster with 70 CPUs, 70 GB RAM U1 needs <2 CPU, 2 GB RAM> per task U2 needs <1 CPU, 2 GB RAM> per task
Asset fairness yields U1: 15 tasks: 30 CPUs, 30 GB (∑=60) U2: 20 tasks: 20 CPUs, 40 GB (∑=60)
A Natural Policy
CPU
User 1 User 2 100%
50%
0% RAM
43%
57%
43%
28%
Problem: violate share guarantee User 1 has < 50% of both CPUs and RAM
Be_er off in a separate cluster with 50% of the resources
Challenge
Can we find a fair sharing policy that provides Share guarantee Strategy-‐proofness
Max-‐min fairness for a single resource had these properEes Can we generalize max-‐min fairness to mulEple resources?
21
Chea?ng the Scheduler
Users willing to game the system to get more resources
Real-‐life examples A cloud provider had quotas on map and reduce slots
Some users found out that the map-‐quota was low
Users implemented maps in the reduce slots!
A search company provided dedicated machines to users that could ensure certain level of uElizaEon (e.g. 80%)
Users used busy-‐loops to inflate u?liza?on
22
Dominant Resource Fairness
A user’s dominant resource is the resource user has the biggest share of Example: Total resources: <10 CPU, 4 GB> User 1’s allocaEon: <2 CPU, 1 GB> Dominant resource is memory as 1/4 > 2/10 (1/5)
A user’s dominant share is the fracEon of the dominant resource she is allocated User 1’s dominant share is 25% (1/4)
23
Dominant Resource Fairness (2) Apply max-‐min fairness to dominant shares Equalize the dominant share of the users
Example: Total resources: <9 CPU, 18 GB> User 1 demand: <1 CPU, 4 GB> dom res: mem User 2 demand: <3 CPU, 1 GB> dom res: CPU
24
User 1
User 2
100%
50%
0% CPU
(9 total) mem
(18 total)
3 CPUs 12 GB
6 CPUs 2 GB
66%
66%
Online DRF Scheduler
O(log n) Eme per decision using binary heaps
25
Whenever there are available resources and tasks to run:
Schedule a task to the user with smallest dominant share
Why not Use Pricing?
Approach Set prices for each good Let users buy what they want
Problem How do we determine the right prices for different goods?
26
How Would an Economist Solve It?
Let the market determine the prices
Compe;;ve Equilibrium from Equal Incomes (CEEI) Give each user 1/n of every resource
Let users trade in a perfectly compeEEve market
Not strategy-‐proof!
27
DRF vs CEEI User 1: <1 CPU, 4 GB> User 2: <3 CPU, 1 GB>
DRF more fair, CEEI better utilization
User 1: <1 CPU, 4 GB> User 2: <3 CPU, 2 GB> User 2 increased her share of both CPU and memory by lying!
28
CPU mem
user 2
user 1
100%
50%
0%
CPU mem
100%
50%
0%
Dominant Resource Fairness
Compe??ve Equilibrium from Equal Incomes
66%
66%
55%
91%
CPU mem
100%
50%
0%
CPU mem
100%
50%
0%
Dominant Resource Fairness
Compe??ve Equilibrium from Equal Incomes
66%
66%
60%
80%
Gaming U?liza?on-‐Op?mal Schedulers
29
• Cluster with <100 CPU, 100 GB> • 2 users, each demanding <1 CPU, 2 GB> per task
• User 1 lies and demands <2 CPU, 2 GB> • U?liza?on-‐Op?mal scheduler prefers user 1
User 1 User 2
CPU
100%
50%
0% mem
100%
50%
0% CPU mem
50%
50%
95%
Proper?es of Policies
30
Property Asset CEEI DRF Share guarantee ✔ ✔ Strategy-proofness ✔ ✔ Pareto efficiency ✔ ✔ ✔ Envy-freeness ✔ ✔ ✔ Single resource fairness ✔ ✔ ✔ Bottleneck res. fairness ✔ ✔ Population monotonicity ✔ ✔ Resource monotonicity
Assuming non-zero demands, DRF is the only allocation that is strategy proof and provides sharing incentive (Eric Friedman, Cornell)
Outline
Global scheduler: Dominant Resource Fairness
AnalyEc evaluaEon of distributed scheduler efficiency
Experimental results
31
How well does Distributed Scheduler Work?
Simplified model for workload
Evaluate main performance metrics Job waiEng Eme CompleEon Eme
UElizaEon 32
Master (scheduler)
Organization policies
Resource availability
Framework scheduler
Task schedule
Fwk schedule
Framework scheduler
Framework scheduler
Job/tasks estimated running times
Fwk/Job requirements
Workload Characteris?cs
Elas?c vs. rigid jobs: Elas?c: can dynamically adjust allocaEon during execuEon
(e.g., Hadoop, Spark) Rigid: cannot run before they acquire all resources (e.g.,
MPI)
Picky vs. non-‐picky jobs Picky: can only run on a subset of resources (e.g., need
GPU, public IP address)
Homogeneous vs. heterogeneous task duraEons 33
Metrics
Job ramp-‐up ?me: Eme it takes a new framework to achieve its target allocaEon (e.g., fair share);
Job comple?on ?me: Eme it takes a job to complete, assuming one job per framework;
Cluster u?liza?on
34
Simplified System Model
One job per framework
AllocaEon granularity: slot
Each framework has a target alloca?on, determined by allocaEon policy E.g., fair sharing, reservaEon, priority
35
Ideal System
Job gets its target allocaEon instantaneously k = target allocaEon (slots) T = average task duraEon m = number of tasks in a job Job compleEon Eme ≈ (m/k) T
36 job ready/starts job ends
# of slots in system
time
(m/k) T
k
target allocation
Elas?c Framework
Can scale allocaEon up and down during job execuEon Longer compleEon Eme: take Eme to ramp-‐up to target allocaEon High uElizaEon: can use a slot as soon as it becomes available
37 job ready/starts job ends
target allocation
# slots in system
time
Rigid Framework
Cannot start before achieving target allocaEon Higher compleEon Eme Lower uElizaEon: cannot use slots as soon as it acquire
them
38 job ready/starts job ends
target allocation
# slots in system
time
wasted resources
Analysis Result Summary
T – mean task duraEon k – target allocaEon of job (in slots) m – number of job’s tasks β T – job compleEon Eme in ideal system, where β ≈
m/k
39
Elastic Framework Rigid Framework Const. dist Exp. dist Const. dist Exp. dist
Ramp-up time
T (ln k) T T (ln k) T
Completion time
(1/2 + β) T (1 + β) T (1 + β) T (ln k + β) T
Utilization 1 1 β / (1/2 + β) β /(ln k + β-1)
Within T of ideal completion time
Good performance only for larger jobs, and only if β >> ln k
Modeling Picky Jobs
k – number of slots with property P requested by job
n – number of slots in system with property P
Examples of property P: A slot running on a node storing a job’s input block
A slot running on a node with GPU
Job pickiness, p = k/n 40
Picky Jobs: Wai?ng Time
Assume only one framework request slots with property P E.g., all other frameworks have achieved target allocaEon
41
Property: waiting time of a job with pickiness factor p ≈ p-percentile of task duration distribution
Example: if p = 0.9, waiting time is equal to 90-th percentile of T
Summary
42
Elastic Picky Homogeneous Performance Examples Y N Y Best MapReduce, Spark Y N N Very Good Web framework Y Y Y Very Good GPU-based ML
frameworks Y Y N Good N N Y Good storage services
(e.g., memcached, dist. file systems)
N N N Good MPI N Y Y Good long running services
with special needs (e.g., DNS)
N Y N Worst
Outline
Global scheduler: Dominant Resource Fairness
AnalyEc evaluaEon of distributed scheduler efficiency
Experimental results
43
Implementa?on Stats
20,000 lines of C++
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized framework: Spark, for iteraEve jobs (up to 20× faster than Hadoop)
Open source in Apache Incubator
Users
Twi_er uses Mesos on > 100 nodes to run ~12 producEon services (mostly stream processing)
Berkeley machine learning researchers are running several algorithms at scale on Spark
Conviva is using Spark for data analyEcs
UCSF medical researchers are using Mesos to run Hadoop and eventually non-‐Hadoop apps
45
Dynamic Resource Sharing 100 node cluster
46
Mesos vs Static Partitioning
Compared performance with staEcally parEEoned cluster where each framework gets 25% of nodes
47
Framework Speedup on Mesos
Facebook Hadoop Mix 1.14×
Large Hadoop Mix 2.10×
Spark 1.26×
Torque / MPI 0.96×
Scalability 50,000 simulated nodes using 200 VMs
200 frameworks with 30 second tasks Measured: Eme to launch + get status update
48
0
0.25
0.5
0.75
1
0 10000 20000 30000 40000 50000
Task
Lau
nch
Ove
rhea
d (s
econ
ds)
Number of Nodes
Related Work HPC schedulers (e.g. Torque, LSF, Sun Grid Engine)
Coarse-‐grained sharing for rigid jobs (e.g. MPI)
Virtual machine clouds Coarse-‐grained sharing similar to HPC
Condor Centralized scheduler based on matchmaking
Next-‐GeneraEon Hadoop Redesign of Hadoop to have per-‐applicaEon masters
Also aims to support non-‐MapReduce jobs
Based on resource request language with locality prefs
49
Conclusion
Mesos shares clusters efficiently among diverse frameworks thanks to two design elements Fine-‐grained sharing at the level of tasks
Resource offers, a scalable mechanism for applicaEon-‐controlled scheduling
Enable co-‐existence of current frameworks and development of new specialized ones
In use at Twiger, UC Berkeley, Conviva and UCSF
50
h_p://www.mesosproject.org