Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Benjamin Hindman – @benh
Apache Mesos Design Decisions
mesos.apache.org
@ApacheMesos

this is not a talk about YARN

at least not explicitly!

this talk is about Mesos!

a little history: Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica

our motivation
increase performance and
utilization of clusters

our intuition
① static partitioning considered
harmful

static partitioning considered harmful
datacenter

static partitioning considered harmful
faster! higher utilization!

our intuition
② build new frameworks

“Map/Reduce is a big hammer, but not everything is a nail!”

Apache Mesos is a distributed system for running and building other distributed systems

Mesos is a cluster manager

Mesos is a resource manager

Mesos is a resource negotiator

Mesos replaces static partitioning of resources to frameworks withdynamic resource allocation

Mesos is a distributed system with a master/slave architecture
masters
slaves

frameworks register with the Mesos master in order to run jobs/tasks
masters
slaves
frameworks

Mesos @Twitter in early 2010
goal: run long-running services elastically on Mesos

Apache Aurora (incubating)
masters
Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc!

masters
Storm, Jenkins, …

a lot of interesting design decisions along the way

many appear (IMHO) in YARN too

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++


frameworks get allocated resources from the masters
masters
framework
resources are allocated via resource offers
a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks
offer: hostname, 4 CPUs, 4 GB RAM

frameworks use these resources to decide what tasks to run
masters
framework
a task can use a subset of an offer
task: 3 CPUs, 2 GB RAM
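The offer→task flow above can be sketched in a few lines (a toy model; field names like `cpus`/`mem` and the `schedule` function are illustrative, not the real Mesos scheduler API):

```python
# A toy sketch of a framework scheduler reacting to resource offers.
# Field names ("cpus", "mem") and schedule() are illustrative, not the
# real Mesos scheduler API.

def schedule(offers, pending_tasks):
    """Match pending tasks to offers; a task may use a subset of an offer."""
    launched = []
    for offer in offers:
        cpus, mem = offer["cpus"], offer["mem"]
        for task in list(pending_tasks):
            if task["cpus"] <= cpus and task["mem"] <= mem:
                launched.append((offer["hostname"], task["name"]))
                cpus -= task["cpus"]  # consume part of the offer
                mem -= task["mem"]
                pending_tasks.remove(task)
    return launched

offers = [{"hostname": "host1", "cpus": 4, "mem": 4096}]
tasks = [{"name": "t1", "cpus": 3, "mem": 2048},
         {"name": "t2", "cpus": 3, "mem": 2048}]
print(schedule(offers, tasks))  # only t1 fits: [('host1', 't1')]
```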

Mesos challenged the status quo of cluster managers

cluster manager status quo
cluster manager
application
specification
the specification includes as much information as possible to assist the cluster manager in scheduling and execution
the application waits for the task to be executed, then receives the result

problems with specifications
① hard to specify certain desires or constraints
② hard to update specifications dynamically as tasks execute and finish/fail

an alternative model
masters
framework
request: 3 CPUs, 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the required resources

question: what should Mesos do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately

an alternative model
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM
the framework uses the offers to perform its own scheduling

an analogue:non-blocking sockets
kernel
application
write(s, buffer, size);

an analogue:non-blocking sockets
kernel
application
42 of 100 bytes written!
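The analogy can be made concrete with a toy partial-write loop (`FakeSocket` is a stand-in for a non-blocking kernel socket, not a real socket API):

```python
# Sketch of the non-blocking-socket analogy: write() may accept only part
# of the buffer ("the best it can, immediately"), so the caller loops on
# what remains. Simulated with a fake socket that takes at most 42 bytes
# per call.

class FakeSocket:
    def __init__(self, max_per_call=42):
        self.max_per_call = max_per_call
        self.received = b""

    def write(self, buffer):
        accepted = buffer[:self.max_per_call]  # partial write
        self.received += accepted
        return len(accepted)

def write_all(sock, buffer):
    written = 0
    while written < len(buffer):
        written += sock.write(buffer[written:])
    return written

sock = FakeSocket()
assert write_all(sock, b"x" * 100) == 100
assert sock.received == b"x" * 100
```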

resource offers address asynchrony in resource allocation

IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request

requests are complementary (but not necessary)

offers represent the currently available resources a framework can use

question: should resources within offers be disjoint?

masters
framework1 framework2
offer: hostname, 4 CPUs, 4 GB RAM
offer: hostname, 4 CPUs, 4 GB RAM

concurrency control
optimistic ↔ pessimistic
optimistic: all offers overlap with one another, causing frameworks to “compete” first-come-first-served
pessimistic: offers made to different frameworks are disjoint

Mesos semantics:assume overlapping offers

design comparison: Google’s Omega

the Omega model
database
framework
snapshot
a framework gets a snapshot of the cluster state from a database (note, does not make a request!)

the Omega model
database
framework
transaction
a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks)
failed transactions occur when another framework has already acquired sought resources
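A minimal sketch of the transaction model, assuming a simple version check stands in for Omega's shared-state database (all names here are illustrative):

```python
# Sketch of Omega-style optimistic concurrency: frameworks read a
# snapshot (with a version), then try to commit a transaction; the
# commit fails if another framework changed the state first.

class CellState:
    def __init__(self, free_cpus):
        self.version = 0
        self.free_cpus = free_cpus

    def snapshot(self):
        # a framework gets the cluster state, not an offer or a request
        return (self.version, self.free_cpus)

    def commit(self, seen_version, cpus_wanted):
        # fail if the state changed since the snapshot, or if the
        # sought resources have already been acquired
        if seen_version != self.version or cpus_wanted > self.free_cpus:
            return False
        self.free_cpus -= cpus_wanted
        self.version += 1
        return True

state = CellState(free_cpus=4)
v1, _ = state.snapshot()
v2, _ = state.snapshot()
assert state.commit(v1, 3) is True    # first framework wins
assert state.commit(v2, 3) is False   # second sees a stale snapshot
```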

isomorphism?

observation: snapshots are optimistic offers

Omega and Mesos
database
framework
snapshot
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM

Omega and Mesos
database
framework
transaction
masters
framework
task: 3 CPUs, 2 GB RAM

thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?
optimistic ↔ pessimistic

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning
datacenter

static partitioning across teams
teams: promotions, trends, recommendations
fairly shared!

goal: fairly share the resources without static partitioning

partition utilizations (per team)
promotions: 45% CPU, 100% RAM
trends: 75% CPU, 100% RAM
recommendations: 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

dominant resource bottlenecks (per team)
promotions: 45% CPU, 100% RAM → bottleneck: RAM
trends: 75% CPU, 100% RAM → bottleneck: RAM
recommendations: 100% CPU, 50% RAM → bottleneck: CPU

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
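The DRF idea can be sketched with the progressive-filling loop from the DRF paper (a toy model, not Mesos's actual allocator; the 9-CPU/18-GB example is the one used in the paper):

```python
# A toy sketch of Dominant Resource Fairness (DRF): repeatedly launch a
# task for the framework whose dominant share -- its largest fraction of
# any single resource -- is currently smallest.

def dominant_share(alloc, total):
    """A framework's share of the resource it uses most of."""
    return max(alloc[r] / total[r] for r in total)

def drf_allocate(total, demands, rounds):
    used = {r: 0 for r in total}
    alloc = {f: {r: 0 for r in total} for f in demands}
    launched = {f: 0 for f in demands}
    for _ in range(rounds):
        # offer to the framework with the lowest dominant share
        f = min(demands, key=lambda g: dominant_share(alloc[g], total))
        task = demands[f]
        if any(used[r] + task[r] > total[r] for r in total):
            break  # no room left for that framework's task shape
        for r in total:
            used[r] += task[r]
            alloc[f][r] += task[r]
        launched[f] += 1
    return launched

# The example from the DRF paper: 9 CPUs and 18 GB RAM total; framework A
# needs <1 CPU, 4 GB> per task, framework B needs <3 CPUs, 1 GB> per task.
total = {"cpus": 9, "mem": 18}
demands = {"A": {"cpus": 1, "mem": 4}, "B": {"cpus": 3, "mem": 1}}
print(drf_allocate(total, demands, rounds=100))  # A gets 3 tasks, B gets 2
```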

DRF in Mesos
masters
framework
① frameworks specify a role when they register (i.e., the team to charge for the resources)
② master calculates each role’s dominant resource (dynamically) and allocates appropriately

Step 4: Profit (statistical multiplexing)
$

in practice, fair sharing is insufficient

weighted fair sharing
teams: promotions (weight 0.17), trends (weight 0.5), recommendations (weight 0.33)

Mesos implements weighted DRF
masters
masters can be configured with weights per role
resource allocation decisions incorporate the weights to determine dominant fair shares

in practice, weighted fair sharing is still insufficient

a non-cooperative framework (i.e., has long tasks or is buggy) can get allocated too many resources

Mesos provides reservations
slaves can be configured with resource reservations for particular roles (dynamic, time based, and percentage based reservations are in development)
resource offers include the reservation role (if any)
masters
framework (trends)
offer: hostname, 4 CPUs, 4 GB RAM, role: trends

reservations (chart): promotions 40%, trends 20%, recommendations 40%; used 10%, unused 30%
reservations provide guarantees, but at the cost of utilization

revocable resources
masters
framework (promotions)
reserved resources that are unused can be allocated to frameworks from different roles but those resources may be revoked at any time
offer: hostname, 4 CPUs, 4 GB RAM, role: trends

preemption via revocation
… my tasks will not be killed unless I’m using revocable resources!

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
causes: machine failure, process failure (bugs!), upgrades


masters
① framework failover
framework
framework re-registers with master and resumes operation
all tasks keep running across framework failover!
framework


masters
② master failover
framework
after a new master is elected all frameworks and slaves connect to the new master
all tasks keep running across master failover!


slave
③ slave failover
mesos-slave
task task
the mesos-slave process can exit and restart while its tasks keep running
③ slave failover @twitter: important for large in-memory services, expensive to restart

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

execution
masters
framework
task: 3 CPUs, 2 GB RAM
frameworks launch fine-grained tasks for execution
if necessary, a framework can provide an executor to handle the execution of a task

slave
executor
mesos-slave
executor
task task
tasks can be added to and removed from a running executor

goal: isolation

slave
isolation
mesos-slave
executor
task task
containers

executor + task design means containers can have changing resource allocations


making the task first-class gives us true fine-grained resource sharing

requirement: fast task launching (i.e., milliseconds or less)

virtual machines: an anti-pattern

operating-system virtualization
containers (zones and projects)
control groups (cgroups), namespaces

isolation support
tight integration with cgroups
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in development)
filesystem (using LVM, in development)
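For the CPU upper bound, a sketch of the cgroups CFS bandwidth arithmetic (these are the `cpu.cfs_quota_us`/`cpu.cfs_period_us` knobs of the cgroups-v1 cpu controller; the helper function is illustrative):

```python
# Sketch of how a CPU upper bound maps onto cgroups CFS bandwidth control:
# a task allocated N CPUs gets cpu.cfs_quota_us = N * cpu.cfs_period_us,
# i.e., at most N periods' worth of CPU time per period, across all cores.

CFS_PERIOD_US = 100000  # default kernel period: 100ms

def cfs_quota_us(cpus, period_us=CFS_PERIOD_US):
    """Quota (in microseconds per period) for an allocation of `cpus` CPUs."""
    return int(cpus * period_us)

# 0.5 CPUs -> 50ms of CPU time per 100ms period; 4 CPUs -> 400ms per 100ms.
assert cfs_quota_us(0.5) == 50000
assert cfs_quota_us(4) == 400000
```

This is the hard cap behind "determinism trumps utilization": the task is throttled once it exhausts its quota, even if the machine is otherwise idle.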

statistics too
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
used @twitter for capacity planning (and oversubscription in development)

CPU upper bounds?
in practice, determinism trumps utilization

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

requirements:
① performance
② maintainability (static typing)
③ interfaces to low-level OS (for isolation, etc.)
④ interoperability with other languages (for library bindings)

garbage collection: a performance anti-pattern

consequences:
① antiquated libraries (especially around concurrency and networking)
② nascent community

github.com/3rdparty/libprocess
concurrency via futures/actors, networking via message passing

github.com/3rdparty/stout
monads in C++, safe and understandable utilities

but …

scalability simulations to 50,000+ slaves

@twitter we run multiple Mesos clusters each with 3500+ nodes

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

final remarks

frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)

write your next distributed system with Mesos!

port a framework to Mesos: write a “wrapper”
~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other Mesos features)
see http://github.com/mesos/hadoop

Thank You!
mesos.apache.org
mesos.apache.org/blog
@ApacheMesos




stateless master: to make master failover fast, we chose to make the master stateless
state is stored in the leaves, at the frameworks and the slaves
makes sense for frameworks that don’t want to store state (i.e., can’t actually failover)
consequences: slaves are fairly complicated (they need to checkpoint), and frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
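The reconciliation consequence can be sketched as follows (a toy model; the status names mirror Mesos task states, but the `reconcile` function and report shapes are illustrative, not the real API):

```python
# Sketch of reconciliation with a stateless master: after failover, a
# framework re-sends the tasks it believes exist, and the master answers
# with each task's actual state as reported by re-registered slaves.

def reconcile(framework_tasks, slave_reports):
    """Return the authoritative status for each task the framework asks about."""
    known = {}
    for report in slave_reports:  # state the re-registered slaves report
        known.update(report)
    results = {}
    for task_id in framework_tasks:
        # tasks no slave knows about are presumed lost
        results[task_id] = known.get(task_id, "TASK_LOST")
    return results

slave_reports = [{"t1": "TASK_RUNNING"}, {"t2": "TASK_FINISHED"}]
print(reconcile(["t1", "t2", "t3"], slave_reports))
# {'t1': 'TASK_RUNNING', 't2': 'TASK_FINISHED', 't3': 'TASK_LOST'}
```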




Apache Mesos is a distributed system for running and building other distributed systems

origins: a Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
mesos.apache.org/documentation

ecosystem
mesos developers
operators
framework developers

a tour of mesos from different perspectives of the ecosystem

the operator

the operator: people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.)
tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker)
“ops” at most companies (SREs at Twitter)
the static partitioners

for the operator, Mesos is a cluster manager

for the operator, Mesos is a resource manager

for the operator, Mesos is a resource negotiator

for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

for the operator, Mesos is a distributed system with a master/slave architecture
masters
slaves

frameworks/applications register with the Mesos master in order to run jobs/tasks
masters
slaves

frameworks can be required to authenticate as a principal*
masters
SASL
SASL
CRAM-MD5 secret mechanism (Kerberos in development)
framework
masters initialized with secrets

Mesos is highly-available and fault-tolerant

the framework developer

the framework developer
…

Mesos uses Apache ZooKeeper for coordination
masters, slaves
Apache ZooKeeper

increase utilization with revocable resources and preemption
masters
framework1, framework2, framework3
offer: hostname, 4 CPUs, 4 GB RAM, role: -
reservations (chart): shares shift from 61% / 24% / 15% to 64% / 25% / 11% across framework1, framework2, framework3

optimistic vs. pessimistic

authorization*: principals can be used for:
authorizing allocation roles
authorizing operating system users (for execution)

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies


I’d love to answer some questions with the help of my data!

I think I’ll try Hadoop.

your datacenter

+ Hadoop

happy?

Not exactly …

… Hadoop is a big hammer, but not everything is a nail!

I’ve got some iterative algorithms, I want to try Spark!

datacenter management

static partitioning

static partitioning considered harmful

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

Hadoop …
(map/reduce)
(distributed file system)

HDFS

Could we just give Spark its own HDFS cluster too?

HDFS x 2
tee incoming data (2 copies)
periodic copy/sync

That sounds annoying … let’s not do that. Can we do any better though?

HDFS

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

During the day I’d rather give more machines to Spark, but at night I’d rather give more machines to Hadoop!

datacenter management


static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

datacenter management

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

datacenter management





I don’t want to deal with this!

the datacenter … rather than think about the datacenter like this …

… is a computer: think about it like this …

datacenter computer
applications
resources
filesystem

mesos
applications
resources
filesystem
kernel


mesos
frameworks
resources
filesystem
kernel

Step 1: filesystem

Step 2: mesos
run a “master” (or multiple for high availability)
run “slaves” on the rest of the machines

Step 3: frameworks

Step 4: profit
$
Step 4: profit (statistical multiplexing)
$
reduces CapEx and OpEx!
reduces latency!

Step 4: profit (utilize)
$

Step 4: profit (failures)
$

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies


mesos
frameworks
resources
filesystem
kernel

mesos
frameworks
resources
kernel

resource allocation

reservations: can reserve resources per slave to provide guaranteed resources
requires human participation (ops) to determine which roles should have which resources reserved
kind of like thread affinity, but across many machines (and not just for CPUs)

resource allocation

resource allocation
(1) allocate reserved resources to frameworks authorized for a particular role
(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights

preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

mesos
frameworks
kernel

framework ≈ distributed system

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
coordinate execution

frameworks are execution coordinators

frameworks are execution schedulers

end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
i.e., frameworks want to coordinate their tasks’ execution, and they should be able to

framework anatomy
frameworks
scheduling API

scheduling
i’d like to run some tasks!
here are some resource offers!

resource offers
an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks
schedulers pick which resources to use to run their tasks
foo.bar.com: 4 CPUs, 4 GB RAM

“two-level scheduling”
mesos: controls resource allocations to schedulers
schedulers: make decisions about what to run given allocated resources

concurrency control
the same resources may be offered to different frameworks
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)

tasks: the “threads” of the framework, a consumer of resources (cpu, memory, etc.)
either a concrete command line or an opaque description (which requires an executor)

tasks
here are some resources!
launch these tasks!

status updates
task status update!

more scheduling
i’d like to run some tasks!

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

high-availability

high-availability (master)
task status update!
i’d like to run some tasks!

high-availability (framework)

high-availability (slave)

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

resource isolation
leverage Linux control groups (cgroups)
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in progress)
filesystem (LVM, in progress)

resource statistics
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
per task/executor statistics are collected (for all fork/exec’ed processes too!)
can help with capacity planning

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

security: Twitter recently added SASL support; the default mechanism is CRAM-MD5, and Kerberos will be supported in the short term

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)

framework commonality
as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier such as launching tasks, doing failure detection, etc, why re-implement them each time!?

case study: chronos
distributed cron with dependencies
developed at airbnb
~3k lines of Scala!
distributed, highly available, and fault-tolerant without any network programming!
http://github.com/airbnb/chronos

analytics

analytics + services

case study: aurora
“run 200 of these, somewhere, forever”
developed at Twitter
highly available (uses the mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)
http://github.com/twitter/aurora

frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)

write your next distributed system with mesos!

port a framework to mesos: write a “wrapper” scheduler
~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)
see http://github.com/mesos/hadoop

conclusions: datacenter management is a pain

conclusions: mesos makes running frameworks on your datacenter easier, as well as increasing utilization and performance while reducing CapEx and OpEx!

conclusions: rather than build your next distributed system from scratch, consider using mesos

conclusions: you can share your datacenter between analytics and online services!

Questions?
mesos.apache.org
@ApacheMesos

aurora

framework commonality
run processes simultaneously (distributed)
handle process failures (fault-tolerance)
optimize execution (elasticity, scheduling)

primitives
scheduler – distributed system “master” or “coordinator”
(executor – lower-level control of task execution, optional)
requests/offers – resource allocations
tasks – “threads” of the distributed system
…

scheduler
Apache Hadoop
Chronos

scheduler
(1) brokers for resources
(2) launches tasks
(3) handles task termination

brokering for resources
(1) make resource requests: 2 CPUs, 1 GB RAM, slave: *
(2) respond to resource offers: 4 CPUs, 4 GB RAM, slave: foo.bar.com

offers: non-blocking resource allocation
exist to answer the question:
“what should mesos do if it can’t satisfy a request?”
(1) wait until it can
(2) offer the best allocation it can immediately


resource allocation
Apache Hadoop
Chronos
request
allocator: dominant resource fairness, resource reservations
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)
offer

“two-level scheduling”
mesos: controls resource allocations to framework schedulers
schedulers: make decisions about what to run given allocated resources

end-to-end principle
“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

tasks: either a concrete command line or an opaque description (which requires a framework executor to execute)
a consumer of resources

task operations
launching/killing
health monitoring/reporting (failure detection)
resource usage monitoring (statistics)

resource isolation
cgroup per executor or task (if no executor)
resource controls adjusted dynamically as tasks come and go!

case study: chronos
distributed cron with dependencies
built at airbnb by @flo

before chronos
single point of failure (and AWS was unreliable)
resource starved (not scalable)

chronos requirements
fault tolerance
distributed (elastically take advantage of resources)
retries (make sure a command eventually finishes)
dependencies

chronos
leverages the primitives of mesos
~3k lines of scala
highly available (uses Mesos state)
distributed / elastic
no actual network programming!

after chronos

after chronos + hadoop

case study: aurora
“run 200 of these, somewhere, forever”
built at Twitter

before aurora
static partitioning of machines to services
hardware outages caused site outages
puppet + monit
ops couldn’t scale as fast as engineers

aurora
highly available (uses mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)

after aurora
power loss to 19 racks, no lost services!
more than 400 engineers running services
largest cluster has >2500 machines

Mesos
(diagram: a Mesos cluster of nodes running Hadoop, Spark, MPI, Storm, Chronos, … side by side)

Step 4: Profit (statistical multiplexing)
$