
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

University of California, Berkeley

Abstract

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our experimental results show that Mesos can achieve near-optimal locality when sharing the cluster among diverse frameworks, can scale up to 50,000 nodes, and is resilient to node failures.

1 Introduction

Clusters of commodity servers have become a major computing platform, powering both large Internet services and a growing number of data-intensive scientific applications. Driven by these applications, researchers and practitioners have been developing a diverse array of cluster computing frameworks to simplify programming the cluster. Prominent examples include MapReduce [24], Dryad [32], MapReduce Online [23] (which supports streaming jobs), Pregel [36] (a specialized framework for graph computations), and others [35, 17, 30].

It seems clear that new cluster computing frameworks¹ will continue to emerge, and that no framework will be optimal for all applications. Therefore, organizations will want to run multiple frameworks in the same cluster, picking the best one for each application. Sharing a cluster between frameworks improves utilization and allows applications to share access to large datasets that may be too costly to replicate.

¹ By framework we mean a software system that manages and executes one or more jobs on a cluster.

The solutions of choice to share a cluster today are either to statically partition the cluster and run one framework per partition, or to allocate a set of VMs to each framework. Unfortunately, these solutions achieve neither high utilization nor efficient data sharing. The main problem is the mismatch between the allocation granularities of these solutions and of existing frameworks. Many frameworks, such as Hadoop and Dryad, employ a fine-grained resource sharing model, where nodes are subdivided into "slots" and jobs are composed of short tasks that are matched to slots [33, 45]. The short duration of tasks and the ability to run multiple tasks per node allow jobs to achieve high data locality, as each job will quickly get a chance to run on nodes storing its input data. Short tasks also allow frameworks to achieve high utilization, as jobs can rapidly scale when new nodes become available. Unfortunately, because these frameworks are developed independently, there is no way to perform fine-grained sharing across frameworks, making it difficult to share clusters and data efficiently between them.

In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources.

The main design question that Mesos must address is how to match resources with tasks. This is challenging for several reasons. First, a solution will need to support a wide array of both current and future frameworks, each of which will have different scheduling needs based on its programming model, communication pattern, task dependencies, and data placement. Second, the solution must be highly scalable, as modern clusters contain tens of thousands of nodes and have hundreds of jobs with millions of tasks active at a time. Third, the scheduling system must be fault-tolerant and highly available, as all the applications in the cluster depend on it.

One approach would be for Mesos to implement a centralized scheduler that takes as input framework requirements, resource availability, and organizational policies, and computes a global schedule for all tasks. While this approach can optimize scheduling across frameworks, it faces several challenges. The first is complexity. The scheduler would need to provide a sufficiently expressive API to capture all frameworks' requirements, and to solve an on-line optimization problem for millions of tasks. Even if such a scheduler were feasible, this complexity would have a negative impact on its scalability and resilience. Second, as new frameworks and new scheduling policies for current frameworks are constantly being developed [43, 45, 47, 34], it is not clear whether it is even possible to have a full specification of framework requirements. Third, many existing frameworks implement their own sophisticated scheduling [33, 45], and moving this functionality to a global scheduler would require expensive refactoring.

Instead, Mesos takes a different approach: delegating control over scheduling to the frameworks. This is accomplished through a new abstraction, called a resource offer, which encapsulates a bundle of resources that a framework can allocate on a cluster node to run tasks. Mesos decides how many resources to offer each framework, based on an organizational policy such as fair sharing, while frameworks decide which resources to accept and which tasks to run on them. While this decentralized scheduling model may not always lead to globally optimal scheduling, we have found that it performs surprisingly well in practice, allowing frameworks to meet goals such as data locality nearly perfectly. In addition, resource offers are simple and efficient to implement, allowing Mesos to be highly scalable and robust to failures.

Mesos's flexible fine-grained sharing model also has other advantages. First, even organizations that only use one framework can use Mesos to run multiple instances of that framework in the same cluster, or multiple versions of the framework. Our contacts at Yahoo! and Facebook indicate that this would be a compelling way to isolate production and experimental Hadoop workloads and to roll out new versions of Hadoop [?, 12].

Second, by providing a means of sharing resources across frameworks, Mesos allows framework developers to build specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and provide better support for each problem domain.

We have implemented Mesos in 10,000 lines of C++. The system scales to 50,000 (emulated) nodes and uses ZooKeeper [4] for fault tolerance. To evaluate Mesos, we have ported three cluster computing systems to run over it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations, and shown that Spark can outperform Hadoop by 10x in iterative machine learning workloads.

Figure 1: CDF of job and task durations in Facebook's Hadoop data warehouse (data from [45]).

This paper is organized as follows. Section 2 details the data center environment that Mesos is designed for. Section 3 presents the architecture of Mesos. Section 4 analyzes our distributed scheduling model and characterizes the environments it works well in. We present our implementation of Mesos in Section 5 and evaluate it in Section 6. Section 7 surveys related work. Finally, we conclude in Section 8.

2 Target Environment

As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5, 7]. Facebook loads logs from its web services into a 1200-node Hadoop cluster, where they are used for applications such as business intelligence, spam detection, and ad optimization. In addition to "production" jobs that run periodically, the cluster is used for many experimental jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface to Hadoop called Hive [3]. Most jobs are short (the median being 84s long), and the jobs are composed of fine-grained map and reduce tasks (the median task being 23s), as shown in Figure 1.

To meet the performance requirements of these jobs, Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to make scheduling decisions at the level of map and reduce tasks and to optimize data locality [45]. Unfortunately, this means that the cluster can only run Hadoop jobs. If a user wishes to write a new ad targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI cluster and import terabytes of data into it.² Mesos aims to enable fine-grained sharing between multiple cluster computing frameworks, while giving these frameworks enough control to achieve placement goals such as data locality.

² This problem is not hypothetical; our contacts at Yahoo! and Facebook report that users want to run MPI and MapReduce Online [13, 12].


Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). The Hadoop and MPI schedulers register with a Mesos master (backed by standby masters coordinated through a ZooKeeper quorum), and Mesos slaves on each node run Hadoop and MPI executors that execute the frameworks' tasks.

3 Architecture

We begin our description of Mesos by presenting our design philosophy. We then describe the components of Mesos, our resource allocation mechanisms, and how Mesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

The Mesos goal is to provide a scalable and resilient core for enabling various frameworks to efficiently share a cluster. Because cluster frameworks are both highly diverse and rapidly evolving, our overriding design philosophy has been to define a minimal interface that enables efficient resource sharing across frameworks, while delegating control over task scheduling and resource management to the frameworks. Pushing control to the frameworks has two benefits. First, it allows frameworks to implement diverse approaches to various problems in the cluster (e.g., achieving data locality, dealing with faults), and to evolve these solutions independently. Second, it keeps Mesos simple and minimizes the rate of change required of the system, which makes it easier to make Mesos scalable and robust.

Although Mesos provides a low-level interface, we expect higher-level libraries implementing common functionality (such as fault tolerance) to be built on top of it. These libraries would be analogous to library OSes in the exokernel [26]. Putting this functionality in libraries rather than in Mesos allows Mesos to remain small and flexible, and lets the libraries evolve independently.

3.2 Overview

Figure 2 shows the main components of Mesos. Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves.

The master implements fine-grained sharing across frameworks using resource offers. Each resource offer contains a list of free resources on multiple slaves. The master decides how many resources to offer to each framework according to a given organizational policy, such as fair sharing or strict priority. To support a diverse set of policies, the master employs a modular architecture that makes it easy to add new allocation modules via a plugin mechanism. To make the master fault-tolerant we use ZooKeeper [4] to implement the failover mechanism (see Section 3.6).

Figure 3: Resource offer example.

A framework running on top of Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks. While the master determines how many resources are offered to each framework, the frameworks' schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes to Mesos a description of the tasks it wants to run on them. In turn, Mesos launches the tasks on the corresponding slaves.

Figure 3 shows an example of how a framework gets scheduled to run a task. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation policy module, which tells it that framework 1 should be offered all available resources. In step (2) the master sends a resource offer describing what is available on slave 1 to framework 1. In step (3), the framework's scheduler replies to the master with information about two tasks to run on the slave, using ⟨2 CPUs, 1 GB RAM⟩ for the first task and ⟨1 CPU, 2 GB RAM⟩ for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
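The exchange in Figure 3 can be sketched in code. The fragment below is a minimal, hypothetical rendering of a framework scheduler's offer callback; the type and field names (Offer, TaskDescription, onResourceOffer) are illustrative stand-ins, not the actual Mesos API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the entities in Figure 3; field names are
// illustrative, not the real Mesos API.
struct Offer {
  std::string slave_id;  // e.g. "s1"
  double cpus;           // e.g. 4
  double mem_gb;         // e.g. 4
};

struct TaskDescription {
  std::string name;
  std::string slave_id;
  double cpus;
  double mem_gb;
};

// Step (3) of Figure 3: the framework's scheduler inspects the offer and
// replies with the tasks it wants to launch on the offered resources.
std::vector<TaskDescription> onResourceOffer(const Offer& offer) {
  std::vector<TaskDescription> tasks;
  if (offer.cpus >= 3 && offer.mem_gb >= 3) {
    // Accept part of the offer: <2 CPUs, 1 GB> and <1 CPU, 2 GB>.
    tasks.push_back({"task1", offer.slave_id, 2, 1});
    tasks.push_back({"task2", offer.slave_id, 1, 2});
  }
  // Anything not claimed here (1 CPU and 1 GB in the example) stays free
  // and may be offered to another framework by the master.
  return tasks;
}
```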

While the thin interface provided by Mesos allows it to scale and allows the frameworks to evolve independently, one question remains: how can the constraints of a framework be satisfied without Mesos knowing about these constraints? In particular, how can a framework achieve data locality without Mesos knowing which nodes store the data required by the framework? Mesos answers these questions by simply giving frameworks the ability to reject offers. A framework will reject the offers that do not satisfy its constraints and accept the ones that do. In particular, we have found that a simple policy called delay scheduling [45], in which frameworks wait for a limited time to acquire nodes storing the input data, yields nearly optimal data locality. We report these results in Section 6.3.
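To make the rejection-based approach concrete, here is a minimal sketch of a delay-scheduling policy in the spirit of [45]. It assumes the framework itself tracks its preferred (data-local) nodes and a wait threshold D; the names and structure are illustrative, not the scheduler code of any real framework.

```cpp
#include <set>
#include <string>

// Simplified delay scheduling: reject offers on non-preferred nodes until
// we have waited longer than a threshold D, then accept anywhere.
// 'preferred' holds the nodes storing the job's input data, which the
// framework (not Mesos) is assumed to know.
struct DelayScheduler {
  std::set<std::string> preferred;  // nodes holding our input data
  double waited_secs = 0;           // time spent waiting so far
  double D = 5.0;                   // maximum time to hold out for locality

  bool acceptOffer(const std::string& node, double secs_since_last_offer) {
    waited_secs += secs_since_last_offer;
    if (preferred.count(node) > 0) {
      waited_secs = 0;             // got a local node; reset the wait
      return true;
    }
    return waited_secs > D;        // give up on locality after D seconds
  }
};
```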

In the remainder of this section, we describe how Mesos performs two key functions: resource offer allocation (performed by allocation modules in the master) and resource isolation (performed by slaves). We then describe the main elements that make the resource offer mechanism robust and efficient.

3.3 Resource Allocation

Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. In normal operation, Mesos takes advantage of the fact that most tasks are short, and only reallocates resources when tasks finish. This usually happens frequently enough that new frameworks acquire their share quickly. For example, if a framework's share is 10% of the cluster, it needs to wait on average 10% of the mean task length to receive its share. If resources are not freed quickly enough, the allocation module also has the ability to revoke (kill) tasks.

So far, we have implemented two simple allocation modules: one that performs fair sharing [28] and one that implements strict priorities. Similar policies are used by current schedulers for Hadoop and Dryad [33, 45].
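As an illustration of the pluggable design, the sketch below shows what an allocation-module interface and a fair-sharing policy might look like. The interface is hypothetical (Mesos's actual module API is not shown in this paper), and the policy simply picks whichever framework currently holds the fewest CPUs, ignoring weights and memory.

```cpp
#include <limits>
#include <map>
#include <string>

// Hypothetical allocation-module interface: given the current allocations,
// decide which framework should receive the next resource offer.
class AllocationModule {
 public:
  virtual ~AllocationModule() = default;
  virtual std::string nextFramework(
      const std::map<std::string, double>& cpusAllocated) = 0;
};

// Fair sharing: offer freed resources to the framework that currently
// holds the smallest allocation.
class FairSharing : public AllocationModule {
 public:
  std::string nextFramework(
      const std::map<std::string, double>& cpusAllocated) override {
    std::string chosen;
    double minAlloc = std::numeric_limits<double>::max();
    for (const auto& [framework, cpus] : cpusAllocated) {
      if (cpus < minAlloc) {
        minAlloc = cpus;
        chosen = framework;
      }
    }
    return chosen;
  }
};
```

A strict-priority module would implement the same interface but always return the highest-priority framework that still has unsatisfied demand.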

3.3.1 Revocation

As described earlier, in an environment with fine-grained tasks, Mesos can reallocate resources quickly by simply waiting for tasks to finish. However, if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, Mesos can also revoke (kill) tasks. Before killing a task, Mesos gives its framework a grace period to clean it up. Mesos asks the respective executor to kill the task, but kills the entire executor and all its tasks if it does not respond to the request. We leave it up to the allocation module to implement the policy for revoking tasks, but describe two related mechanisms here.

First, while killing a task has a low impact on many frameworks (e.g., MapReduce or stateless web servers), it is harmful for frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework, a quantity of resources that the framework may hold without losing tasks. Frameworks read their guaranteed allocations through an API call. Allocation modules are responsible for ensuring that the guaranteed allocations they provide can all be met concurrently. For now, we have kept the semantics of guaranteed allocations simple: if a framework is below its guaranteed allocation, none of its tasks should be killed, and if it is above, any of its tasks may be killed. However, if this model is found to be too simple, it is also possible to let frameworks specify priorities for their tasks, so that the allocation module can try to kill only low-priority tasks.
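Under these simple semantics, revocation eligibility reduces to a comparison against the guaranteed allocation. The sketch below assumes allocations are tracked as plain CPU counts, which is a simplification of what an allocation module would actually track.

```cpp
// Revocation eligibility under the simple guaranteed-allocation semantics:
// a framework's tasks may be killed only while it holds more resources
// than its guaranteed allocation.
struct FrameworkState {
  double allocatedCpus;   // resources currently held
  double guaranteedCpus;  // guaranteed allocation exposed by the module
};

bool mayRevokeTasks(const FrameworkState& fw) {
  // At or below the guarantee: none of its tasks should be killed.
  // Above the guarantee: any of its tasks may be killed.
  return fw.allocatedCpus > fw.guaranteedCpus;
}
```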

Second, to decide when to trigger revocation, allocation modules must know which frameworks would use more resources if they were offered them. Frameworks indicate their interest in offers through an API call.

3.4 Isolation

Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.

In our current implementation, we use operating system container technologies, specifically Linux containers [10] and Solaris projects [16], to achieve isolation. These technologies can limit the CPU, physical memory, virtual memory, network bandwidth, and (in new Linux kernels) IO bandwidth usage of a process tree. In addition, they support dynamic reconfiguration of a container's resource limits, which is necessary for Mesos to be able to add and remove resources from an executor as it starts and finishes tasks. In the future, it would also be attractive to use virtual machines as containers. However, we have not yet done this because current VM technologies add non-negligible overhead in data-intensive workloads and have limited support for dynamic reconfiguration.
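To give a flavor of the dynamic reconfiguration involved, the sketch below adjusts a container's memory limit by rewriting the cgroup (v1) limit file that Linux containers are built on. The path layout shown is the standard cgroup v1 one, but whether Mesos manipulates limits this way is an assumption; treat this purely as an illustration.

```cpp
#include <fstream>
#include <string>

// Illustrative only: raise or lower the memory limit of an executor's
// Linux container by rewriting its cgroup (v1) memory limit file.
// Assumes the executor's processes live under /sys/fs/cgroup/memory/<name>/.
bool setMemoryLimitBytes(const std::string& containerName, long long bytes) {
  std::ofstream limit("/sys/fs/cgroup/memory/" + containerName +
                      "/memory.limit_in_bytes");
  if (!limit.is_open()) return false;
  limit << bytes;  // the kernel applies the new limit to the process tree
  return static_cast<bool>(limit);
}
```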

3.5 Making Resource Offers Scalable and Robust

Because task scheduling in Mesos is a distributed process in which the master and framework schedulers communicate, it needs to be efficient and robust to failures. Mesos includes three mechanisms to help with this goal.

First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection process and avoid communication by providing filters to the master. We support two types of filters: "only offer nodes from list L" and "only offer nodes with at least R resources free". A resource that fails a filter is treated exactly like a rejected resource. By default, any resources rejected during an offer have a temporary 5 second filter placed on them, to minimize the programming burden on developers who do not wish to manually set filters.
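The two filter types can be expressed as a small predicate over a slave's free resources; the field names below are illustrative, not the Mesos filter API.

```cpp
#include <set>
#include <string>

// Sketch of the two filter types: "only offer nodes from list L" and
// "only offer nodes with at least R resources free". A resource that
// fails the filter is treated like a rejected resource.
struct Filter {
  std::set<std::string> allowedNodes;  // empty set means "any node"
  double minCpusFree = 0;
  double minMemGbFree = 0;

  bool passes(const std::string& node,
              double cpusFree, double memGbFree) const {
    if (!allowedNodes.empty() && allowedNodes.count(node) == 0) return false;
    return cpusFree >= minCpusFree && memGbFree >= minMemGbFree;
  }
};
```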

Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a framework towards its share of the cluster for the purpose of allocation. This is a strong incentive for frameworks to respond to offers quickly and to filter out resources that they cannot use, so that they can get offers for more suitable resources faster.

Third, if a framework has not responded to an offer for a sufficiently long time, Mesos rescinds the offer and re-offers the resources to other frameworks.

We also note that even without the use of filters, Mesos can make tens of thousands of resource offers per second, because the scheduling algorithm it must perform (fair sharing) is highly efficient.

3.6 Fault Tolerance

Since all the frameworks depend on the master, it is critical to make the master fault-tolerant. To achieve this we use two techniques. First, we have designed the master to be soft state, i.e., the master can completely reconstruct its internal state from the periodic messages it gets from the slaves and from the framework schedulers. Second, we have implemented a hot-standby design, where the master is shadowed by several backups that are ready to take over when the master fails. Upon master failure, we use ZooKeeper [4] to select a new master from the existing backups, and direct all slaves and framework schedulers to this master. Subsequently, the new master will reconstruct its internal state from the messages it receives from slaves and framework schedulers.
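The soft-state design can be summarized by what a newly elected master does. Leader election itself is delegated to ZooKeeper and is not shown; the message type below is an illustrative stand-in for the periodic reports that slaves resend.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative re-registration message; the real wire format differs.
struct SlaveReport {
  std::string slaveId;
  std::vector<std::string> runningTaskIds;
  double cpusFree;
  double memGbFree;
};

// Soft-state master: after ZooKeeper elects this process as the new master,
// it rebuilds its entire view of the cluster from the messages that slaves
// and framework schedulers resend to it.
class Master {
 public:
  void onSlaveReregistered(const SlaveReport& report) {
    slaves_[report.slaveId] = report;  // state is rebuilt incrementally
  }

 private:
  std::map<std::string, SlaveReport> slaves_;
};
```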

Aside from handling master failures, Mesos reports task, slave, and executor failures to frameworks' schedulers. Frameworks can then react to failures using the policies of their choice.

Finally, to deal with scheduler failures, Mesos allows a framework to register multiple schedulers such that when one fails, another one is notified by the Mesos master to take over. Frameworks must use their own mechanisms to share state between their schedulers.

4 Mesos Behavior

In this section, we study Mesos's behavior for different workloads. Our main goal here is not to develop a detailed and exact model of the system, but to provide a coarse understanding of its behavior.

In short, we find that Mesos performs very well when frameworks can scale up and down elastically, task durations are homogeneous, and frameworks prefer all nodes equally (Section 4.2). When different frameworks prefer different nodes, we show that Mesos can emulate a centralized scheduler that uses fair sharing across frameworks (Section 4.2.3). In addition, we show that Mesos can handle heterogeneous task durations without impacting the performance of frameworks with short tasks (§4.3). We also discuss how frameworks are incentivized to improve their performance under Mesos, and argue that these incentives also improve overall cluster utilization (§4.4). We conclude this section with some limitations of Mesos's distributed scheduling model (§4.5).

4.1 Definitions, Metrics and Assumptions

In our discussion, we consider three metrics:

• Framework ramp-up time: the time it takes a new framework to achieve its allocation (e.g., fair share);

• Job completion time: the time it takes a job to complete, assuming one job per framework;

• System utilization: total cluster utilization.

We characterize workloads along four attributes:

• Scale up: Frameworks can elastically increase their allocation to take advantage of free resources.

• Scale down: Frameworks can relinquish resources without significantly impacting their performance.

• Minimum allocation: Frameworks require a certain minimum number of slots before they can start using their slots.

• Task distribution: The distribution of the task durations. We consider both homogeneous and heterogeneous distributions.

We also differentiate between two types of resources: mandatory and preferred. We say that a resource is mandatory if a framework must acquire it in order to run. For example, we consider a machine with a public IP address a mandatory resource if there are frameworks that need to be externally accessible. In contrast, we say that a resource is preferred if a framework will perform "better" using it, but can also run using other resources. For example, a framework may prefer using a node that locally stores its data, but it can remotely access the data from other nodes if it must.

We assume the amount of mandatory resources requested by a framework never exceeds its guaranteed share. This ensures that frameworks will not get deadlocked waiting for the mandatory resources to become available. For simplicity, we also assume that all tasks run on identical slices of machines, called slots, and that each framework runs a single job.

We consider two types of frameworks: elastic and rigid. An elastic framework can scale its resources up and down, i.e., it can start using slots as soon as it acquires them, and can release slots as soon as its tasks finish. In contrast, a rigid framework can start running its jobs only after it has allocated all its slots, i.e., a rigid framework is a framework whose minimum allocation is equal to its full allocation.

4.2 Homogeneous Tasks

In the rest of this section we consider a task duration distribution with mean Ts. For simplicity, we present results for two distributions: constant task sizes and exponentially distributed task sizes. We consider a cluster with n slots and a framework, f, that is entitled to k slots. We assume the framework runs a job which requires βkTs computation time. Thus, assuming the framework has k slots, it takes the job βTs time to finish. When computing the completion time of a job we assume that the last tasks of the job running on the framework's k slots finish at the same time. Thus, we relax the assumptions about the duration distribution for the tasks of framework f. This relaxation does not impact any of the other metrics, i.e., ramp-up time and utilization.

Table 1 summarizes the job completion times and the utilization for the two types of frameworks and for the two types of task length distributions. In short, we find that elastic jobs perform very well. The expected completion time of an elastic job is within Ts (i.e., the mean duration of a task) of the completion time of the job when it gets all its slots instantaneously. Furthermore, elastic jobs fully utilize their allocated slots, since they can use every slot as soon as they get it. Rigid frameworks with constant task durations achieve similar completion times, but provide worse utilization, as these jobs cannot start before they get their full allocations. Finally, under an exponential distribution, the completion time of a rigid job can be delayed by as much as Ts ln k, where k is the framework's allocation. Next, we discuss each case in detail.

4.2.1 Elastic Frameworks

An elastic framework can opportunistically use any slot offered by Mesos, and can relinquish slots without significantly impacting the performance of its jobs. We assume there are exactly k slots in the system that framework f prefers, and that f waits for these slots to become available to reach its allocation.

Framework ramp-up time. If task durations are constant, it will take framework f at most Ts time to acquire k slots. This is simply because during a Ts interval, every slot will become available, which will enable Mesos to offer the framework all its k preferred slots.

If the duration distribution is exponential, the expected ramp-up time is Ts ln k. The framework needs to wait on average Ts/k to acquire the first slot from the set of its k preferred slots, Ts/(k − 1) to acquire the second slot from the remaining k − 1 slots in the set, and Ts to acquire the last slot. Thus, the ramp-up time of f is

Ts × (1 + 1/2 + ... + 1/k) ≈ Ts ln k.   (1)
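Written out, the sum in Eq. (1) is a partial harmonic series, which is what the Ts ln k approximation bounds:

```latex
\mathbb{E}[\text{ramp-up}]
  = \frac{T_s}{k} + \frac{T_s}{k-1} + \cdots + \frac{T_s}{1}
  = T_s \sum_{i=1}^{k} \frac{1}{i}
  \approx T_s \ln k
```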

Job completion time. Recall that βTs is the completion time of the job in an ideal scenario in which the framework acquires all its k slots instantaneously. If task durations are constant, the completion time is on average (1/2 + β)Ts. To show this, assume the starting and the ending times of the tasks are uniformly distributed, i.e., during the ramp-up phase, f acquires one slot every Ts/k on average. Thus, the framework's job can use roughly Tsk/2 computation time during the first Ts interval. Once the framework acquires its k slots, it will take the job (βkTs − Tsk/2)/k = (β − 1/2)Ts time to complete. As a result, the job completion time is Ts + (β − 1/2)Ts = (1/2 + β)Ts (see Table 1).

In the case of the exponential distribution, the expected completion time of the job is Ts(1 + β) (see [?] for derivation).

System utilization. As long as frameworks can scale up and down and there is enough demand in the system, the cluster will be fully utilized.

4.2.2 Rigid Frameworks

Some frameworks may not be able to start running jobs unless they reach a minimum allocation. One example is MPI, where all tasks must start a computation in sync. In this section we consider the worst case where the minimum allocation constraint equals the framework's full allocation, i.e., k slots.

Job completion time. While in this case the ramp-up time remains unchanged, the job completion time will change because the framework cannot use any slot before reaching its full allocation. If the task duration distribution is constant, the completion time is simply Ts(1 + β), as the framework doesn't use any slot during the first Ts interval, i.e., until it acquires all k slots. If the distribution is exponential, the completion time becomes Ts(ln k + β), as it takes the framework Ts ln k to ramp up (see Eq. 1).

System utilization. Wasting allocated slots also has a negative impact on utilization. If task durations are constant, and the framework acquires a slot every Ts/k on average, the framework will waste roughly Tsk/2 computation time during the ramp-up phase. Once it acquires all its slots, the framework will use βkTs to complete its job. The utilization achieved by the framework in this case is then βkTs/(kTs/2 + βkTs) = β/(1/2 + β). If the distribution is exponential, the utilization is β/(ln k − 1 + β) (see [?] for full derivation).
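For the constant-duration case, the utilization entry in Table 1 is exactly the ratio of useful work to useful work plus ramp-up waste:

```latex
\text{Utilization}_{\text{const}}
  = \frac{\beta k T_s}{\tfrac{1}{2} k T_s + \beta k T_s}
  = \frac{\beta}{\tfrac{1}{2} + \beta}
```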

4.2.3 Placement Preferences

So far, we have assumed that frameworks have no slot preferences. In practice, different frameworks prefer different nodes and their preferences may change over time. In this section, we consider the case where frameworks have different preferred slots.

                  Elastic Framework                      Rigid Framework
                  Constant dist.    Exponential dist.    Constant dist.    Exponential dist.
Completion time   (1/2 + β)Ts       (1 + β)Ts            (1 + β)Ts         (ln k + β)Ts
Utilization       1                 1                    β/(1/2 + β)       β/(ln k − 1 + β)

Table 1: The job completion time and utilization for both elastic and rigid frameworks, and for both constant task durations and task durations that follow an exponential distribution. The framework starts with zero slots. k represents the number of slots the framework is entitled to under the given allocation, and βTs represents the time it takes a job to complete assuming the framework gets all k slots at once.

The natural question is how well Mesos will work in this case when compared to a centralized scheduler that has full information about framework preferences. We consider two cases: (a) there exists a system configuration in which each framework gets all its preferred slots and achieves its full allocation, and (b) there is no such configuration, i.e., the demand for preferred slots exceeds the supply.

In the first case, it is easy to see that, irrespective of the initial configuration, the system will converge to the state where each framework allocates its preferred slots after at most one Ts interval. This is simply because during a Ts interval all slots become available, and as a result each framework will be offered its preferred slots.

In the second case, there is no configuration in which all frameworks can satisfy their preferences. The key question in this case is how to allocate the preferred slots across the frameworks demanding them. In particular, assume there are x slots preferred by m frameworks, where framework i requests ri such slots, and Σ_{i=1}^{m} ri > x. While many allocation policies are possible, here we consider the weighted fair allocation policy where the weight associated with a framework is its intended allocation, si. In other words, assuming that each framework has enough demand, framework i will get x × si / (Σ_{j=1}^{m} sj).

The challenge with Mesos is that the scheduler does not know the preferences of each framework. Fortunately, it turns out that there is an easy way to achieve the fair allocation of the preferred slots described above: simply offer slots to frameworks proportionally to their intended allocations. In particular, when a slot becomes available, Mesos can offer that slot to framework i with probability si / (Σ_{j=1}^{n} sj), where n is the total number of frameworks in the system. Note that this scheme is similar to lottery scheduling [42]. Furthermore, note that since each framework i receives roughly si slots during a time interval Ts, the analysis of the ramp-up and completion times in Section 4.2 still holds.
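A minimal sketch of this lottery-like offer rule, assuming the intended allocations si are available to the master as plain weights:

```cpp
#include <random>
#include <string>
#include <vector>

// Lottery-style offers: when a slot frees up, pick framework i with
// probability s_i / (s_1 + ... + s_n), where s_i is its intended allocation.
std::string pickFrameworkForSlot(const std::vector<std::string>& frameworks,
                                 const std::vector<double>& intendedShares,
                                 std::mt19937& rng) {
  std::discrete_distribution<size_t> lottery(intendedShares.begin(),
                                             intendedShares.end());
  return frameworks[lottery(rng)];
}
```

Over a Ts interval this gives framework i roughly si of the freed slots in expectation, which is what the analysis above relies on.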

4.3 Heterogeneous Tasks

So far we have assumed that frameworks have homogeneous task duration distributions. In this section, we discuss heterogeneous tasks, in particular, tasks that are either short or long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks. Such a heterogeneous workload can hurt frameworks with short jobs. Indeed, in the worst case, all nodes required by a short job might be filled with long tasks, so the job may need to wait a long time (relative to its running time) to acquire resources on these nodes.

The master can alleviate this by implementing an allocation policy that limits the number of slots on each node that can be used by long tasks, e.g., no more than 50% of the slots on each node can run long tasks. This ensures there are enough short tasks on each node whose slots become available with high frequency, giving frameworks better opportunities to quickly acquire a slot on one of their preferred nodes. Note that the master does not need to know whether a task is short or long. By simply using different timeouts to revoke short and long slots, the master incentivizes frameworks to run long tasks on long slots only. Otherwise, if a framework runs a long task on a short slot, its performance may suffer, as the slot will most likely be revoked before the task finishes.
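A minimal sketch of this per-node cap, using the 50% figure from the example above; the structure and names are illustrative only.

```cpp
// Limit the fraction of slots on a node that may be used by long tasks,
// so that short-task slots keep freeing up frequently on every node.
struct NodeSlots {
  int totalSlots;
  int longSlotsInUse;
};

bool mayOfferAsLongSlot(const NodeSlots& node, double longSlotCap = 0.5) {
  // With longSlotCap = 0.5, at most half the slots may run long tasks;
  // long slots are the ones given the longer revocation timeout.
  return node.longSlotsInUse < node.totalSlots * longSlotCap;
}
```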

4.4 Framework Incentives

Mesos implements a decentralized scheduling approach, where each framework decides which offers to accept or reject. As with any decentralized system, it is important to understand the incentives of the various entities in the system. In this section, we discuss the incentives of a framework to improve the response times of its jobs.

Short tasks: A framework is incentivized to use short tasks for two reasons. First, it will be able to allocate any slot; in contrast, frameworks with long tasks are restricted to a subset of slots. Second, using small tasks minimizes the wasted work if the framework loses a task, either due to revocation or simply due to failures.

No minimum allocation: The ability of a framework to use resources as soon as it allocates them, instead of waiting to reach a given minimum allocation, would allow the framework to start (and complete) its jobs earlier.

Scale down: The ability to scale down allows a framework to opportunistically grab available resources, as it can later release them with little negative impact.

Do not accept unknown resources: Frameworks are incentivized not to accept resources that they cannot use, because most allocation policies will account for all the resources that a framework owns when deciding which framework to offer resources to next.

We note that these incentives are all well aligned with our goal of improving utilization. When frameworks use short tasks, Mesos can reallocate resources quickly between them, reducing the need for wasted work due to revocation. If frameworks have no minimum allocation and can scale up and down, they will opportunistically utilize all the resources they can obtain. Finally, if frameworks do not accept resources that they do not understand, they will leave them for frameworks that do.

4.5 Limitations of Distributed Scheduling

Although we have shown that distributed scheduling works well in a range of workloads relevant to current cluster environments, like any decentralized approach, it can perform worse than a centralized scheduler. We have identified three limitations of the distributed model:

Fragmentation: When tasks have heterogeneous resource demands, a distributed collection of frameworks may not be able to optimize bin packing as well as a centralized scheduler.

There is another possible bad outcome if allocation modules reallocate resources in a naive manner: when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed up by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on that slave until this minimum amount is free.
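The minimum-offer-size safeguard amounts to a simple check inside a hypothetical allocation module, sketched here:

```cpp
// To keep frameworks with large per-task requirements from starving, an
// allocation module can withhold offers on a slave until at least
// minOfferCpus / minOfferMemGb are free there.
struct SlaveFree {
  double cpus;
  double memGb;
};

bool shouldOffer(const SlaveFree& freeRes,
                 double minOfferCpus, double minOfferMemGb) {
  return freeRes.cpus >= minOfferCpus && freeRes.memGb >= minOfferMemGb;
}
```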

Note that the wasted space due to both suboptimal bin packing and fragmentation is bounded by the ratio between the largest task size and the node size. Therefore, clusters running "larger" nodes (e.g., multicore nodes) and "smaller" tasks within those nodes (e.g., having a cap on task resources) will be able to achieve high utilization even with a distributed scheduling model.

Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks' performance, only a single global allocation of the cluster resources performs well. We argue such scenarios are rare in practice. In the model discussed in this section, where frameworks only have preferences over placement, we showed that allocations approximate those of optimal schedulers.

Framework complexity: Using resource offers may make framework scheduling more complex. We argue, however, that this difficulty is not in fact onerous. First, whether using Mesos or a centralized scheduler, frameworks need to know their preferences; in a centralized scheduler, the framework would need to express them to the scheduler, whereas in Mesos, it needs to use them to decide which offers to accept. Second, many scheduling policies for existing frameworks are online algorithms, because frameworks cannot predict task times and must be able to handle node failures and slow nodes. These policies are easily implemented using the resource offer mechanism.

5 Implementation

We have implemented Mesos in about 10,000 lines of C++. The system runs on Linux, Solaris, and Mac OS X.

Mesos applications can be programmed in C, C++, Java, Ruby, and Python. We use SWIG [15] to generate interface bindings for the latter three languages.

To reduce the complexity of our implementation, we use a C++ library called libprocess [8] that provides an actor-based programming model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc.). We also leverage Apache ZooKeeper [4] to perform leader election, as described in Section 3.6. Finally, our current frameworks use HDFS [2] to share data.

Our implementation can use Linux containers [10] or Solaris projects [16] to isolate applications. We currently isolate CPU cores and memory.³

³ Support for network and IO isolation was recently added to the Linux kernel [9] and we plan to extend our implementation to isolate these resources too.

We have implemented five frameworks on top of Mesos. First, we have ported three existing cluster systems to Mesos: Hadoop [2], the Torque resource scheduler [40], and the MPICH2 implementation of MPI [21]. None of these ports required changing these frameworks' APIs, so all of them can run unmodified user programs. In addition, we built a specialized framework called Spark, which we discuss in Section 5.3. Finally, to show the applicability of running front-end frameworks on Mesos, we have developed a framework that manages an elastic web server farm, which we discuss in Section ??.

5.1 Hadoop Port

Porting Hadoop to run on Mesos required relatively few modifications, because Hadoop concepts such as map and reduce tasks correspond cleanly to Mesos abstractions. In addition, the Hadoop "master", known as the JobTracker, and Hadoop "slaves", known as TaskTrackers, naturally fit into the Mesos model as a framework scheduler and executor.

We first modified the JobTracker to schedule MapReduce tasks using resource offers. Normally, the JobTracker schedules tasks in response to heartbeat messages sent by TaskTrackers every few seconds, reporting the number of free slots in which "map" and "reduce" tasks can be run. The JobTracker then assigns (preferably data-local) map and reduce tasks to TaskTrackers. To port Hadoop to Mesos, we reuse the heartbeat mechanism as described, but dynamically change the number of possible slots on the TaskTrackers. When the JobTracker receives a resource offer, it decides if it wants to run any map or reduce tasks on the slaves included in the offer, using delay scheduling [45]. If it does, it creates a new Mesos task for the slave that signals the TaskTracker (i.e., the Mesos executor) to increase its number of total slots. The next time the TaskTracker sends a heartbeat, the JobTracker can assign the runnable map or reduce task to the new empty slot. When a TaskTracker finishes running a map or reduce task, the TaskTracker decrements its slot count (so the JobTracker doesn't keep sending it tasks) and reports the Mesos task as finished to allow other frameworks to use those resources.

We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks, which would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node.

In total, we added 1,100 lines of new code to Hadoop.

5.2 Torque and MPI Ports

We have ported the Torque cluster resource manager to run as a framework on Mesos. The framework consists of a Mesos framework scheduler and framework executor, written in 360 lines of Python code, that launch and manage different components of Torque. In addition, we modified 3 lines of Torque source code to allow it to elastically scale up and down on Mesos depending on the jobs in its queue.

After registering with the Mesos master, the framework scheduler configures and launches a Torque server and then periodically monitors the server's job queue. While the queue is empty, the framework scheduler releases all tasks (down to an optional minimum, which we set to 0) and refuses all resource offers it receives from Mesos. Once a job gets added to Torque's queue (using the standard qsub command), the framework scheduler begins accepting new resource offers. As long as there are jobs in Torque's queue, the framework scheduler accepts offers as necessary to satisfy the constraints of as many jobs in the queue as possible. On each node where offers are accepted, Mesos launches the framework executor, which in turn starts a Torque backend daemon and registers it with the Torque server. When enough Torque backend daemons have registered, the Torque server will launch the next job in its queue.
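The accept/refuse behavior of the Torque framework scheduler can be outlined as follows. The real port is 360 lines of Python; this C++ fragment, with made-up names, only captures the decision of whether to take an offer while jobs are queued.

```cpp
#include <vector>

// Conceptual outline of the Torque framework scheduler's offer handling:
// refuse everything while Torque's queue is empty, and accept offers (each
// starting a Torque backend daemon) while queued jobs still need nodes.
struct TorqueJob { int nodesNeeded; };

class TorqueScheduler {
 public:
  // Returns true if the offer should be accepted and a backend launched.
  bool onOffer(const std::vector<TorqueJob>& queue, int backendsRunning) {
    if (queue.empty()) return false;        // release/refuse when idle
    int nodesWanted = 0;
    for (const TorqueJob& job : queue) nodesWanted += job.nodesNeeded;
    return backendsRunning < nodesWanted;   // accept until jobs are covered
  }
};
```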

Because jobs that run on Torque may not necessarily be resilient to failures, Torque avoids having its tasks revoked by never accepting resources beyond its guaranteed allocation.

In addition to the Torque framework, we also created a Mesos MPI "wrapper" framework, written in 200 lines of Python code, for running MPI jobs directly on Mesos.

Figure 4: Data flow of a logistic regression job in Dryad vs. Spark. Solid lines show data flow within the framework. Dashed lines show reads from a distributed file system. Spark reuses processes across iterations, only loading data once.

5.3 Spark Framework

To show the value of simple but specialized frameworks, we built Spark, a new framework for iterative jobs that was motivated by discussions with machine learning researchers at our institution.

One iterative algorithm used frequently in machine learning is logistic regression [11]. An implementation of logistic regression in Hadoop must run each iteration as a separate MapReduce job, because each iteration depends on values computed in the previous round. In this case, every iteration must re-read the input file from disk into memory. In Dryad, the whole job can be expressed as a data flow DAG as shown in Figure 4a, but the data must still be reloaded from disk into memory at each iteration. Reusing the data in memory between iterations in Dryad would require cyclic data flow.

Spark's execution is shown in Figure 4b. Spark uses the long-lived nature of Mesos executors to cache a slice of the data set in memory at each executor, and then run multiple iterations on this cached data. This caching is achieved in a fault-tolerant manner: if a node is lost, Spark remembers how to recompute its slice of the data.

Spark leverages Scala to provide a language-integrated syntax similar to DryadLINQ [44]: users invoke parallel operations by applying a function on a special "distributed dataset" object, and the body of the function is captured as a closure to run as a set of tasks in Mesos. Spark then schedules these tasks to run on executors that already have the appropriate data cached, using delay scheduling. By building on top of Mesos, Spark's implementation only required 1300 lines of code.
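The pattern Spark exploits, loading each executor's slice of the data once and then iterating over the in-memory copy, can be illustrated with a conceptual sketch. This is not Spark's Scala API; the C++ below is a hypothetical stand-in (with an arbitrary 0.1 learning rate) that shows why only the small parameter vector needs to move between iterations.

```cpp
#include <cmath>
#include <vector>

// Conceptual analogue of Spark's execution for logistic regression: an
// executor caches its slice of the input in memory once, then every
// iteration reuses that slice; only the parameter vector w is updated
// and shipped between iterations.
struct Point { std::vector<double> x; double label; };  // label in {-1, +1}

std::vector<double> runIterations(const std::vector<Point>& cachedSlice,
                                  std::vector<double> w, int iterations) {
  for (int it = 0; it < iterations; ++it) {
    std::vector<double> grad(w.size(), 0.0);
    for (const Point& p : cachedSlice) {       // read from memory, not disk
      double dot = 0;
      for (size_t j = 0; j < w.size(); ++j) dot += w[j] * p.x[j];
      double coeff = -p.label / (1.0 + std::exp(p.label * dot));
      for (size_t j = 0; j < w.size(); ++j) grad[j] += coeff * p.x[j];
    }
    for (size_t j = 0; j < w.size(); ++j) w[j] -= 0.1 * grad[j];  // step
  }
  return w;  // only w crosses the network between iterations
}
```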

Due to lack of space, we have limited our discussion of Spark in this paper and refer the reader to [46] for details.

6 Evaluation

We evaluated Mesos through a series of experiments on the Amazon Elastic Compute Cloud (EC2). We begin with a macrobenchmark that evaluates how the system shares resources between four workloads, and go on to present a series of smaller experiments designed to evaluate overhead, decentralized scheduling, our specialized framework (Spark), scalability, and failure recovery.

6.1 Macrobenchmark

To evaluate the primary goal of Mesos, which is enabling diverse frameworks to efficiently share a cluster, we ran a macrobenchmark consisting of a mix of four workloads:

• A Hadoop instance running a mix of small and large jobs based on the workload at Facebook.

• A Hadoop instance running a set of large batch jobs.

• Spark running a series of machine learning jobs.

• Torque running a series of MPI jobs.

We compared a scenario where the workloads ran as four frameworks on a 96-node Mesos cluster using fair sharing to a scenario where they were each given a static partition of the cluster (24 nodes), and measured job response times and resource utilization in both cases. We used EC2 nodes with 4 CPU cores and 15 GB of RAM.

We begin by describing the four workloads in more detail, and then present our results.

6.1.1 Macrobenchmark Workloads

Facebook Hadoop Mix. Our Hadoop job mix was based on the distribution of job sizes and inter-arrival times at Facebook, reported in [45]. The workload consists of 100 jobs submitted at fixed times over a 25-minute period, with a mean inter-arrival time of 14s. Most of the jobs are small (1-12 tasks), but there are also large jobs of up to 400 tasks.⁴ The jobs themselves were from the Hive benchmark [?], which contains four types of queries: text search, a simple selection, an aggregation, and a join that gets translated into multiple MapReduce steps. We grouped the jobs into eight bins of job type and size (listed in Table 2) so that we could compare performance in each bin. We also set the framework scheduler to perform fair sharing between its jobs, as this policy is used at Facebook.

⁴ We scaled down the largest jobs in [45] to have the workload fit a quarter of our cluster size.

Large Hadoop Mix. To emulate batch workloads that need to run continuously, such as web crawling, we had a second instance of Hadoop run a series of IO-intensive 2400-task text search jobs. A script launched ten of these jobs, submitting each one after the previous one finished.

Spark. We ran five instances of an iterative machine learning job on Spark. These were launched by a script that waited 2 minutes after each job finished to submit the next. The job we used was alternating least squares, a collaborative filtering algorithm. This job is fairly CPU-intensive but also benefits from caching its input data on each node, and needs to broadcast updated parameters to all nodes running its tasks on each iteration.

Bin   Job Type      Map Tasks   Reduce Tasks   # Jobs Run
1     selection     1           NA             38
2     text search   2           NA             18
3     aggregation   10          2              14
4     selection     50          NA             12
5     aggregation   100         10             6
6     selection     200         NA             6
7     text search   400         NA             4
8     join          400         30             2

Table 2: Job types for each bin in our Facebook Hadoop mix.

Figure 5: Framework shares over the duration of the macrobenchmark. We observe that (a) by pooling resources among the frameworks, Mesos lets each workload scale up to fill gaps in the demand of others and (b) fine-grained sharing allows frameworks to scale up and down rapidly (in tens of seconds).

Torque / MPI. Our Torque framework ran eight instances of the tachyon raytracing job [?] that is part of the SPEC MPI2007 benchmark. Six of the jobs ran small problem sizes and two ran large ones. Both types used 24 parallel tasks. We submitted these jobs at fixed times to both clusters. The tachyon job is CPU-intensive.

6.1.2 Macrobenchmark Results

A successful result for Mesos would show two things: that Mesos achieves higher utilization than static partitioning, and that jobs finish at least as fast in the shared cluster as they do in dedicated ones, and possibly faster due to gaps in the demand of other frameworks. Our results show both effects, as detailed below.

We show the fraction of CPU cores allocated to eachframework by Mesos over time in Figure 5. We see thatMesos enables each framework to scale up during peri-ods when other frameworks have low demands, and thuskeeps cluster nodes busier. For example, at time 350,when both Spark and the Facebook Hadoop frameworkhave no running jobs and Torque is using 1/8 of the clus-ter, the large-job Hadoop framework scales up to 7/8 of

10

Page 11: Mesos: A Platform for Fine-Grained Resource Sharing in the ...bnrg.eecs.berkeley.edu/~randy/Papers/RHKPubs07-11/189.pdf · Mesos: A Platform for Fine-Grained Resource Sharing in the

[Figure 6 appears here: share of cluster (fraction of CPUs) vs. time (s) for (a) Facebook Hadoop Mix, (b) Large Hadoop Mix, (c) Spark, and (d) Torque / MPI, each comparing Dedicated Cluster and Mesos.]

Figure 6: Comparison of cluster shares (fraction of CPUs) over time for each of the four frameworks, in the Mesos and static partitioning macrobenchmark scenarios. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and thus finish jobs faster. Note that the time axes for the plots are different (e.g., the large Hadoop mix takes 3200s when running alone).

[Figure 7 appears here: CPU Utilization (%) and Memory Utilization (%) vs. time (s), Mesos vs. Static.]

Figure 7: In the macrobenchmark, average CPU and memory utilization over time across all nodes in the Mesos scenario vs. all nodes used in the four static partitioned scenarios.

In addition, we see that resources are reallocated rapidly (e.g. when a Facebook Hadoop job starts around time 360) due to the fine-grained nature of tasks. Finally, higher allocation of nodes also translates into higher CPU and memory utilization, as shown in Figure 7. The shared Mesos cluster has on average 10% higher CPU utilization and 18% higher memory utilization than the statically partitioned cluster.

A second question is how much better jobs perform under Mesos than on dedicated clusters. We present this data in two ways. First, Figure 6 compares the resource allocation over time of each framework in the shared and dedicated clusters. Shaded areas show the allocation in the dedicated cluster, while solid lines show the share on Mesos. We see that the fine-grained frameworks (Hadoop and Spark) take advantage of Mesos to scale up beyond 1/4 of the cluster when global demand allows this, and consequently finish bursts of submitted jobs faster in Mesos.

Framework             Sum of Exec Times on      Sum of Exec Times      Speedup
                      Dedicated Cluster (s)     on Mesos (s)
Facebook Hadoop Mix   7235                      6319                   1.14
Large Hadoop Mix      3143                      1494                   2.10
Spark                 1684                      1338                   1.26
Torque / MPI          3210                      3352                   0.96

Table 3: Aggregate performance of each framework in the macrobenchmark (sum of running times of all the jobs in the framework). The speedup column shows the relative gain on Mesos.

At the same time, Torque achieves roughly similar allocations and job durations under Mesos (with some differences explained later).

Second, Tables 3 and 4 show a breakdown of job performance for each framework. In Table 3, we compare the aggregate performance of each framework, defined as the sum of job running times, in the static partitioning and Mesos scenarios. We see that the Hadoop and Spark jobs as a whole finish faster on Mesos, while Torque is slightly slower. The framework that gains the most is the large-job Hadoop mix, which almost always has tasks to run and fills in the gaps in demand of the other frameworks; this framework performs 2x better on Mesos.

Table 4 breaks down the results further by job type. We observe two notable trends. First, in the Facebook Hadoop mix, the smaller jobs perform worse on Mesos. This is due to an interaction between the fair sharing performed by the Hadoop framework (among its jobs) and the fair sharing performed by Mesos (among frameworks): during periods of time when Hadoop has more than 1/4 of the cluster, if any jobs are submitted to the other frameworks, there is a delay before Hadoop gets a new resource offer, because any freed-up resources go to the framework farthest below its share.



Framework             Job Type          Time on Dedicated    Avg. Speedup
                                        Cluster (s)          on Mesos
Facebook Hadoop Mix   selection (1)     24                   0.84
                      text search (2)   31                   0.90
                      aggregation (3)   82                   0.94
                      selection (4)     65                   1.40
                      aggregation (5)   192                  1.26
                      selection (6)     136                  1.71
                      text search (7)   137                  2.14
                      join (8)          662                  1.35
Large Hadoop Mix      text search       314                  2.21
Spark                 ALS               337                  1.36
Torque / MPI          small tachyon     261                  0.91
                      large tachyon     822                  0.88

Table 4: Performance of each job type in the macrobenchmark. Bins for the Facebook Hadoop mix are in parentheses.

As a result, any small job submitted during this time is delayed for a long time relative to its length. In contrast, when running alone, Hadoop can assign resources to the new job as soon as any of its tasks finishes. This problem with hierarchical fair sharing is also seen in networks [?], and could be mitigated either by running the small jobs on a separate framework or changing the allocation policy in Mesos (e.g. to use lottery scheduling instead of always offering resources to the framework farthest below its share).
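To make the proposed mitigation concrete, the following minimal sketch (not the Mesos implementation; the framework records and field names are hypothetical) contrasts the deterministic "farthest below its share" policy with a lottery-style alternative in the spirit of [42], where the recipient of each offer is drawn at random with probability proportional to its weight.

    import random

    # Hypothetical bookkeeping: framework name, configured weight, current cluster fraction.
    frameworks = [
        {"name": "facebook-hadoop", "weight": 1.0, "allocated": 0.40},
        {"name": "large-hadoop",    "weight": 1.0, "allocated": 0.35},
        {"name": "spark",           "weight": 1.0, "allocated": 0.15},
        {"name": "torque",          "weight": 1.0, "allocated": 0.10},
    ]

    def pick_farthest_below_share(fws):
        # Deterministic policy: the framework with the largest deficit wins every offer.
        return min(fws, key=lambda fw: fw["allocated"] / fw["weight"])

    def pick_lottery(fws, rng=random.Random(42)):
        # Lottery policy: tickets proportional to weight, so even a framework currently
        # above its fair share still receives offers occasionally.
        return rng.choices(fws, weights=[fw["weight"] for fw in fws], k=1)[0]

    print("deterministic:", pick_farthest_below_share(frameworks)["name"])
    print("lottery:      ", pick_lottery(frameworks)["name"])

Under the lottery policy, a small Hadoop job submitted while Hadoop is above its share would still win an offer within a few allocation rounds in expectation, rather than waiting until Hadoop drops back below its share.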

Lastly, Torque is the only framework that performed worse, on average, on Mesos. The large tachyon jobs took on average 2 minutes longer, while the small ones took 20s longer. Some of this delay is due to Torque having to wait to launch 24 tasks on Mesos before starting each job, but the average time this takes is 12s. We believe that the rest of the delay may be due to stragglers (slow nodes). In our standalone Torque run, we noticed two jobs each took about 60s longer to run than the others (Figure 6d). We discovered that both of these jobs were using a node that performed slower on single-node benchmarks than the others (in fact, Linux reported a 40% lower bogomips value on this node). Because tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node. Unfortunately, when we ran the shared cluster scenario, we did not collect the data necessary to validate this hypothesis.

6.2 Overhead

To measure the overhead Mesos imposes when a single framework uses the cluster, we ran two benchmarks using MPI and Hadoop on an EC2 cluster with 50 nodes, each with 2 CPU cores and 6.5 GB RAM. We used the High-Performance LINPACK [19] benchmark for MPI and a WordCount job for Hadoop, and ran each job three times. The MPI job took on average 50.9s without Mesos and 51.8s with Mesos, while the Hadoop job took 160s without Mesos and 166s with Mesos. In both cases, the overhead of using Mesos was less than 4%.

[Figure 8 appears here: data locality (% local map tasks) and job running time (s) for static partitioning, Mesos with no delay scheduling, Mesos with 1s delay scheduling, and Mesos with 5s delay scheduling.]

Figure 8: Data locality and average job durations for 16 Hadoop instances using static partitioning, Mesos, or Mesos with delay scheduling.

6.3 Data Locality through Fine-Grained Sharing and Resource Offers

In this experiment, we evaluated how Mesos' resource offer mechanism enables frameworks to control their tasks' placement, and in particular, data locality. We ran 16 instances of Hadoop using 93 EC2 nodes, each with 4 CPU cores and 15 GB RAM. Each instance ran a map-only scan job that searched a 100 GB file spread throughout the cluster on a shared HDFS file system and outputted 1% of the records. We tested four scenarios: giving each Hadoop instance its own 5-6 node static partition of the cluster (to emulate organizations that use coarse-grained cluster sharing systems), and running all instances on Mesos using either no delay scheduling, 1s delay scheduling, or 5s delay scheduling.

Figure 8 shows averaged measurements from the 16 Hadoop instances across three runs of each scenario. Using static partitioning yields very low data locality (18%) because the Hadoop instances are forced to fetch data from nodes outside their partition. In contrast, running the Hadoop instances on Mesos improves data locality, even without delay scheduling, because each Hadoop instance has tasks on more nodes of the cluster (there are 4 tasks per node), and can therefore access more blocks locally. Adding a 1-second delay brings locality above 90%, and a 5-second delay achieves 95% locality, which is competitive with running one Hadoop instance alone on the whole cluster. As expected, job performance improves with data locality: jobs run 1.7x faster in the 5s delay scenario than with static partitioning.
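The decision a framework makes when responding to an offer under delay scheduling can be summarized by the following minimal sketch. It illustrates the technique from [45] under assumed data structures (a hypothetical scheduler object and task descriptors carrying preferred hosts); it is not the actual Hadoop port.

    import time

    class DelayScheduler:
        """Launch a data-local task if one exists; otherwise decline offers for up to
        `delay_secs` before settling for a non-local slot."""

        def __init__(self, delay_secs):
            self.delay = delay_secs
            self.waiting_since = None  # when we first skipped a non-local offer

        def on_offer(self, offered_host, pending_tasks):
            """Return the task to launch on this offer, or None to decline it.
            pending_tasks: dicts with a 'preferred_hosts' set (input block locations)."""
            for task in pending_tasks:
                if offered_host in task["preferred_hosts"]:
                    self.waiting_since = None
                    return task                      # data-local: launch immediately

            now = time.monotonic()
            if self.waiting_since is None:
                self.waiting_since = now             # start the locality wait
            if pending_tasks and now - self.waiting_since >= self.delay:
                self.waiting_since = None
                return pending_tasks[0]              # give up on locality for this task
            return None                              # decline; hope for a local offer

    # Example: with a 5s delay, the first non-local offer is declined.
    sched = DelayScheduler(delay_secs=5.0)
    tasks = [{"id": 1, "preferred_hosts": {"node-07", "node-21"}}]
    print(sched.on_offer("node-03", tasks))   # None
    print(sched.on_offer("node-07", tasks))   # the data-local task

Setting the delay to 0, 1s, or 5s corresponds to the three Mesos configurations in Figure 8, trading a small wait per task for a large gain in the fraction of local map tasks.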

6.4 Spark Framework

We evaluated the benefit of running iterative jobs using the specialized Spark framework we developed on top of Mesos (Section 5.3) over the general-purpose Hadoop framework. We used a logistic regression job implemented in Hadoop by machine learning researchers in our lab, and wrote a second version of the job using Spark. We ran each version separately on 20 EC2 nodes, each with 4 CPU cores and 15 GB RAM.



[Figure 9 appears here: running time (s) vs. number of iterations for Hadoop and Spark.]

Figure 9: Hadoop and Spark logistic regression running times.

Each experiment used a 29 GB data file and varied the number of logistic regression iterations from 1 to 30 (see Figure 9).

With Hadoop, each iteration takes 127s on average, because it runs as a separate MapReduce job. In contrast, with Spark, the first iteration takes 174s, but subsequent iterations only take about 6 seconds, leading to a speedup of up to 10x for 30 iterations. This happens because the cost of reading the data from disk and parsing it is much higher than the cost of evaluating the gradient function computed by the job on each iteration. Hadoop incurs the read/parsing cost on each iteration, while Spark reuses cached blocks of parsed data and only incurs this cost once. The longer time for the first iteration in Spark is due to the use of slower text parsing routines.
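This cost structure can be illustrated with a minimal single-machine sketch (not the actual Hadoop or Spark job; the record format, file path, and step size are hypothetical): the Hadoop-style loop re-reads and re-parses the input on every iteration, while the Spark-style loop parses once and iterates over the in-memory points.

    import math, random

    def parse(line):
        # Hypothetical record format: whitespace-separated features, label (+1/-1) last.
        *features, label = map(float, line.split())
        return features, label

    def gradient(w, point):
        x, y = point
        dot = sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
        return [scale * xi for xi in x]

    def step(w, points, lr=0.1):
        g = [0.0] * len(w)
        for p in points:
            for i, gi in enumerate(gradient(w, p)):
                g[i] += gi
        return [wi - lr * gi for wi, gi in zip(w, g)]

    def hadoop_style(path, iters, dim):
        # Re-read and re-parse the input on every iteration (the dominant cost).
        w = [random.random() for _ in range(dim)]
        for _ in range(iters):
            with open(path) as f:
                points = [parse(line) for line in f]
            w = step(w, points)
        return w

    def spark_style(path, iters, dim):
        # Read and parse once, then iterate over the cached in-memory points.
        with open(path) as f:
            points = [parse(line) for line in f]
        w = [random.random() for _ in range(dim)]
        for _ in range(iters):
            w = step(w, points)
        return w

In the cluster setting, Spark additionally partitions the points across nodes and caches each partition in the memory of the node that read it, which is why the per-iteration cost drops to roughly the cost of the gradient computation alone.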

6.5 Mesos Scalability

To evaluate Mesos' scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM. We used one EC2 node for the master and the rest of the nodes to run slaves. During the experiment, each of 200 frameworks running throughout the cluster continuously launches tasks, starting one task on each slave that it receives a resource offer for. Each task sleeps for a period of time based on a normal distribution with a mean of 30 seconds and standard deviation of 10s, and then ends. Each slave runs up to two tasks at a time.
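As a rough sketch of the synthetic load (not the actual test harness; the slot accounting and names are illustrative), each emulated task simply sleeps for a duration drawn from N(30s, 10s), clipped at zero, and each emulated slave exposes two task slots:

    import random
    import threading
    import time

    TASKS_PER_SLAVE = 2                      # each emulated slave runs at most two tasks
    task_slots = threading.Semaphore(TASKS_PER_SLAVE)

    def synthetic_task(rng):
        # Sleep for a duration drawn from a normal distribution (mean 30s, stddev 10s).
        duration = max(0.0, rng.gauss(30.0, 10.0))
        with task_slots:                     # occupy one of the slave's two slots
            time.sleep(duration)

    def launch_task(rng=random.Random()):
        threading.Thread(target=synthetic_task, args=(rng,), daemon=True).start()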

Once the cluster reached steady state (i.e., the 200 frameworks achieved their fair shares and all resources were allocated), we launched a test framework that runs a single 10-second task and measured how long this framework took to finish. This allowed us to calculate the extra delay incurred over 10s due to having to register with the master, wait for a resource offer, accept it, wait for the master to process the response and launch the task on a slave, and wait for Mesos to report the task as finished.

We plot this extra delay in Figure 10, showing averages of 5 runs. We observe that the overhead remains small (less than one second) even at 50,000 nodes. In particular, this overhead is much smaller than the average task and job lengths in data center workloads (see Section 2). Because Mesos was also keeping the cluster fully allocated, this indicates that the master kept up with the load placed on it.

[Figure 10 appears here: task launch overhead (seconds) vs. number of nodes.]

Figure 10: Mesos master’s scalability versus number of slaves.

Unfortunately, the EC2 virtualized environment limited scalability beyond 50,000 slaves, because at 50,000 slaves the master was processing 100,000 packets per second (in+out), which has been shown to be the current achievable limit on EC2 [14].

6.6 Failure Recovery

To evaluate recovery from master failures, we conducted an experiment similar to the scalability experiment, running 200 to 4000 slave daemons on 62 EC2 nodes with 4 cores and 15 GB RAM. We ran 200 frameworks that each launched 20-second tasks, and two Mesos masters connected to a 5-node ZooKeeper quorum using a default tick interval of 2 seconds. We synchronized the two masters' clocks using NTP and measured the mean time to recovery (MTTR) after killing the active master. The MTTR is the time for all of the slaves and frameworks to connect to the second master. In all cases, the MTTR was between 4 and 8 seconds, with 95% confidence intervals of up to 3s on either side. Due to lack of space, we refer the reader to [?] for a graph of the results.
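For intuition, a hot-standby master setup can be approximated with the standard ZooKeeper leader-election recipe. The sketch below uses the Python kazoo client purely as an illustration; the real masters are not implemented this way, and the hosts, paths, and function names are hypothetical.

    from kazoo.client import KazooClient

    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181"  # illustrative 5-node quorum
    ELECTION_PATH = "/mesos/masters"

    def run_master(advertise_addr):
        """Each master registers an ephemeral sequential znode; the lowest sequence
        number is the active master. If it dies, its znode disappears once the
        ZooKeeper session expires and the standby with the next number takes over."""
        zk = KazooClient(hosts=ZK_HOSTS)
        zk.start()
        zk.ensure_path(ELECTION_PATH)
        me = zk.create(ELECTION_PATH + "/master-", advertise_addr.encode(),
                       ephemeral=True, sequence=True)

        def maybe_become_leader(event=None):
            children = sorted(zk.get_children(ELECTION_PATH, watch=maybe_become_leader))
            if me.endswith(children[0]):
                print("acting as leading master:", advertise_addr)
            else:
                print("standing by; current leader znode:", children[0])

        maybe_become_leader()
        return zk

    def current_master(zk):
        # Slaves and frameworks resolve the leader by reading the lowest-numbered znode.
        children = sorted(zk.get_children(ELECTION_PATH))
        data, _ = zk.get(ELECTION_PATH + "/" + children[0])
        return data.decode()

Failover time in such a scheme is dominated by the ZooKeeper session timeout (a function of the tick interval) plus the time for slaves and frameworks to look up and reconnect to the new leading master.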

7 Related Work

HPC and Grid schedulers. The high performance computing (HPC) community has long been managing clusters [40, 48, 27, 20]. Their target environment typically consists of specialized hardware, such as Infiniband, SANs, and parallel filesystems. Thus jobs do not need to be scheduled local to their data. Furthermore, each job is tightly coupled, often using barriers or message passing. Thus, each job is monolithic, rather than composed of smaller fine-grained tasks. Consequently, a job does not dynamically grow or shrink its resource demands across machines during its lifetime. Moreover, fault-tolerance is achieved through checkpointing, rather than recomputing fine-grained tasks. For these reasons, HPC schedulers use centralized scheduling, and require jobs to declare the required resources at job submission time. Jobs are then given coarse-grained allocations of the cluster. Unlike the Mesos approach, this does not allow frameworks to locally access data distributed over the cluster. Furthermore, jobs cannot grow and shrink dynamically as their allocations change. In addition to supporting fine-grained sharing, Mesos can run HPC schedulers, such as Torque, as frameworks, which can then schedule HPC workloads appropriately.



Grid computing has mostly focused on the problem of making diverse virtual organizations share geographically distributed and separately administered resources in a secure and interoperable way. Mesos could well be used within a virtual organization, which is part of a larger grid that, for example, runs Globus Toolkit.

Public and Private Clouds. Virtual machine clouds, such as Amazon EC2 [1] and Eucalyptus [38], share common goals with Mesos, such as isolating frameworks while providing a low-level abstraction (VMs). However, they differ from Mesos in several important ways. First, these systems share resources in a coarse-grained manner, where a user may hold onto a virtual machine for an arbitrary amount of time. This makes fine-grained sharing of data difficult. Second, these systems generally do not let applications specify placement needs beyond the size of virtual machine they require. In contrast, Mesos allows frameworks to be highly selective about which resources they acquire through resource offers.

Quincy. Quincy [33] is a fair scheduler for Dryad. It uses a centralized scheduling algorithm based on min-cost flow optimization and takes into account both fairness and locality constraints. Resources are reassigned by killing tasks (similar to Mesos resource killing) when the output of the min-cost flow algorithm changes. Quincy assumes that tasks have identical resource requirements, and the authors note that Quincy's min-cost flow formulation of the scheduling problem is difficult to extend to tasks with multi-dimensional resource demands. In contrast, Mesos aims to support multiple cluster computing frameworks, including frameworks with scheduling preferences other than data locality. Mesos uses a decentralized two-level scheduling approach that lets frameworks decide where they run, and supports heterogeneous task resource requirements. We have shown that frameworks can still achieve near-perfect data locality using delay scheduling.

Specification Language Approach. Some systems, such as Condor and Clustera [41, 25], go to great lengths to match users and jobs to available resources. Clustera provides multi-user support, and uses a heuristic incorporating user priorities, data locality, and starvation to match work to idle nodes. However, Clustera requires each job to explicitly list its data locality needs, and does not support placement constraints other than data locality. Condor uses the ClassAds language [39] to match node properties to job needs. This approach of using a resource specification language always has the problem that certain preferences might currently not be possible to express. For example, delay scheduling is hard to express with the current languages. In contrast, the two-level scheduling and resource offer architecture in Mesos gives jobs control in deciding where and when to run tasks.

8 Conclusion

We have presented Mesos, a resource sharing layer that allows diverse cluster computing frameworks to efficiently share resources. Mesos is built around two design elements: a fine-grained resource sharing model at the level of tasks within a job, and a decentralized scheduling mechanism called resource offers that lets applications choose which resources to use. Together, these elements let Mesos achieve high utilization, respond rapidly to workload changes, and cater to frameworks with diverse needs, while remaining simple and scalable. We have shown that existing frameworks can effectively share resources with Mesos, that new specialized frameworks, such as Spark, can provide major performance gains, and that Mesos's simple architecture allows the system to be fault tolerant and to scale to 50,000 nodes.

We have recently open-sourced Mesos5 and started working with two companies (Twitter and Facebook) to test Mesos in their clusters. We have also begun using Mesos to manage a 40-node cluster in our lab. Future work will report on lessons from these deployments.

References

[1] Amazon EC2. http://aws.amazon.com/ec2/.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Hive. http://hadoop.apache.org/hive/.
[4] Apache ZooKeeper. http://hadoop.apache.org/zookeeper/.
[5] Hadoop and Hive at Facebook. http://www.slideshare.net/dzhou/facebook-hadoop-data-applications.
[6] HAProxy Homepage. http://haproxy.1wt.eu/.
[7] Hive – A Petabyte Scale Data Warehouse using Hadoop. http://www.facebook.com/note.php?note_id=89508453919#/note.php?note_id=89508453919.
[8] LibProcess Homepage. http://www.eecs.berkeley.edu/~benh/libprocess.

[9] Linux 2.6.33 release notes. http://kernelnewbies.org/Linux_2_6_33.
[10] Linux containers (LXC) overview document. http://lxc.sourceforge.net/lxc.html.
[11] Logistic Regression Wikipedia Page. http://en.wikipedia.org/wiki/Logistic_regression.
[12] Personal communication with Dhruba Borthakur from the Facebook data infrastructure team.
[13] Personal communication with Khaled Elmeleegy from Yahoo! Research.
[14] RightScale webpage. http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud.
[15] Simplified wrapper interface generator. http://www.swig.org.
[16] Solaris Resource Management. http://docs.sun.com/app/docs/doc/817-1592.
[17] Twister. http://www.iterativemapreduce.org.
[18] M. J. Accetta, R. V. Baron, W. J. Bolosky, D. B. Golub, R. F. Rashid, A. Tevanian, and M. Young. Mach: A new kernel foundation for UNIX development. In USENIX Summer, pages 93–113, 1986.

5 Mesos is available at http://www.github.com/mesos.



[19] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. In Supercomputing '90: Proceedings of the 1990 Conference on Supercomputing, pages 2–11, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[20] D. Angulo, M. Bone, C. Grubbs, and G. von Laszewski. Workflow management through Cobalt. In International Workshop on Grid Computing Environments–Open Access (2006), Tallahassee, FL, November 2006.
[21] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 25, Washington, DC, USA, 2003. IEEE Computer Society.
[22] D. D. Clark. The design philosophy of the DARPA internet protocols. In SIGCOMM, pages 106–114, 1988.
[23] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein. MapReduce Online. In 7th Symposium on Networked Systems Design and Implementation. USENIX, May 2010.
[24] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[25] D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28–41, 2008.
[26] D. R. Engler, M. F. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In SOSP, pages 251–266, 1995.
[27] W. Gentzsch. Sun Grid Engine: towards creating a compute power grid. pages 35–36, 2001.
[28] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of heterogeneous resources in datacenters. Technical Report UCB/EECS-2010-55, EECS Department, University of California, Berkeley, May 2010.
[29] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1):68–73, 2009.
[30] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. Technical Report MSR-TR-2009-180, Microsoft Research, December 2009.
[31] E. C. Hendricks and T. C. Hartmann. Evolution of a virtual machine subsystem. IBM Systems Journal, 18(1):111–142, 1979.
[32] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys 07, pages 59–72, 2007.
[33] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP 2009, November 2009.
[34] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In HOTOS '09: Proceedings of the Twelfth Workshop on Hot Topics in Operating Systems. USENIX, May 2009.
[35] D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Programming bulk-incremental dataflows. Technical report, University of California San Diego, http://cseweb.ucsd.edu/~dlogothetis/docs/bips_techreport.pdf, 2009.
[36] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In PODC '09, pages 6–6, New York, NY, USA, 2009. ACM.
[37] D. Mosberger and T. Jin. httperf – a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.
[38] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of Cloud Computing and Its Applications [Online], Chicago, Illinois, October 2008.
[39] R. Raman, M. Solomon, M. Livny, and A. Roy. The ClassAds language. pages 255–270, 2004.
[40] G. Staples. TORQUE resource manager. In SC, page 8, 2006.
[41] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation - Practice and Experience, 17(2-4):323–356, 2005.
[42] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: flexible proportional-share resource management. In OSDI '94: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, pages 1+, Berkeley, CA, USA, 1994. USENIX Association.
[43] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 247–260, New York, NY, USA, 2009. ACM.
[44] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1–14, 2008.
[45] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 2010, April 2010.
[46] M. Zaharia, N. M. M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley, May 2010.
[47] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. OSDI 2008, Dec. 2008.
[48] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Workshop on Cluster Computing, Tallahassee, FL, December 1992.
