Benjamin Hindman – @benh
Apache Mesos: Design Decisions
mesos.apache.org
@ApacheMesos
this is not a talk about YARN
at least not explicitly!
this talk is about Mesos!
a little history: Mesos started as a research project at Berkeley in early 2009, created by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
our motivation
increase performance and
utilization of clusters
our intuition
① static partitioning considered harmful
static partitioning considered harmful
datacenter
faster!
higher utilization!
our intuition
② build new frameworks
“Map/Reduce is a big hammer, but not everything is a nail!”
Apache Mesos is a distributed system for running and building other distributed systems
Mesos is a cluster manager
Mesos is a resource manager
Mesos is a resource negotiator
Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
Mesos is a distributed system with a master/slave architecture
masters
slaves
frameworks register with the Mesos master in order to run jobs/tasks
masters
slaves
frameworks
Mesos @Twitter in early 2010
goal: run long-running services elastically on Mesos
Apache Aurora (incubating)
masters
Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc!
masters
Storm, Jenkins, …
a lot of interesting design decisions along the way
many appear (IMHO)in YARN too
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
frameworks get allocated resources from the masters
masters
framework
resources are allocated via resource offers
a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks
offer: hostname, 4 CPUs, 4 GB RAM
frameworks use these resources to decide what tasks to run
masters
framework
a task can use a subset of an offer
task: 3 CPUs, 2 GB RAM
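A hedged sketch of the subset rule (plain dicts standing in for the real Mesos resource types; names are illustrative): a task fits if each of its resources is covered by the offer, and whatever is left can run further tasks.

```python
# Illustrative only: resource vectors as plain dicts, not Mesos types.
def fits(offer, task):
    """True if the task's resources are a subset of the offer's."""
    return all(offer.get(r, 0) >= amount for r, amount in task.items())

def remaining(offer, task):
    """Resources left in the offer after launching the task."""
    return {r: avail - task.get(r, 0) for r, avail in offer.items()}

offer = {"cpus": 4, "mem_gb": 4}   # one offer per host
task = {"cpus": 3, "mem_gb": 2}    # a task can use a subset of an offer
```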
Mesos challenged the status quo of cluster managers
cluster manager status quo
cluster manager
application
specification
the specification includes as much information as possible to assist the cluster manager in scheduling and execution
cluster manager status quo
cluster manager
application: wait for task to be executed
cluster manager status quo
cluster manager
application
result
problems with specifications
① hard to specify certain desires or constraints
② hard to update specifications dynamically as tasks execute and finish/fail
an alternative model
masters
framework
request: 3 CPUs, 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the required resources
question: what should Mesos do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately
an alternative model
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM (×4, one offer per host)
an alternative model
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM
framework uses the offers to perform its own scheduling
an analogue: non-blocking sockets
kernel
application
write(s, buffer, size);
an analogue: non-blocking sockets
kernel
application
42 of 100 bytes written!
resource offers address asynchrony in resource allocation
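The analogy can be sketched as follows: just as a non-blocking write may return 42 of 100 bytes, the master can answer a request with the best it can offer right now instead of blocking until the full request is satisfiable. This is a hypothetical helper, not Mesos code.

```python
def best_offer(request, available):
    """Immediately offer min(requested, available) of each resource,
    like a non-blocking write returning a partial byte count."""
    return {r: min(amount, available.get(r, 0))
            for r, amount in request.items()}

# request 3 CPUs / 2 GB when only 2 CPUs / 4 GB are free:
partial = best_offer({"cpus": 3, "mem_gb": 2}, {"cpus": 2, "mem_gb": 4})
```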
IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request
requests are complementary (but not necessary)
offers represent the currently available resources a framework can use
question: should resources within offers be disjoint?
masters
framework1 framework2
offer: hostname, 4 CPUs, 4 GB RAM
offer: hostname, 4 CPUs, 4 GB RAM
concurrency control
optimistic ↔ pessimistic
concurrency control
optimistic ↔ pessimistic
all offers overlap with one another, thus causing frameworks to “compete” first-come-first-served
concurrency control
optimistic ↔ pessimistic
offers made to different frameworks are disjoint
Mesos semantics: assume overlapping offers
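Overlapping offers can be sketched with a toy model (hypothetical classes, not the Mesos implementation): every framework is offered the same free resources, and the first launch wins; a later launch against a stale offer simply fails.

```python
class OptimisticMaster:
    """Toy model of optimistic concurrency via overlapping offers."""
    def __init__(self, cpus):
        self.free_cpus = cpus

    def make_offers(self, frameworks):
        # overlapping offers: everyone is offered everything free
        return {f: self.free_cpus for f in frameworks}

    def launch(self, cpus):
        # first-come-first-served: launches against stale offers fail
        if cpus <= self.free_cpus:
            self.free_cpus -= cpus
            return True
        return False
```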
design comparison: Google’s Omega
the Omega model
database
framework
snapshot
a framework gets a snapshot of the cluster state from a database (note, does not make a request!)
the Omega model
database
framework
transaction
a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks)
failed transactions occur when another framework has already acquired sought resources
isomorphism?
observation:snapshots are optimistic offers
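The shared-state idea can be sketched as a version-checked transaction; this is a toy model of the Omega paper's approach, not Omega's actual interface.

```python
class CellState:
    """Toy shared cluster state with optimistic transactions."""
    def __init__(self, cpus):
        self.cpus = cpus
        self.version = 0

    def snapshot(self):
        # a snapshot is, in effect, an optimistic offer of everything
        return {"cpus": self.cpus, "version": self.version}

    def transact(self, snap, cpus):
        # commit only if nobody changed the state since the snapshot
        if snap["version"] != self.version or cpus > self.cpus:
            return False   # conflict: another framework already won
        self.cpus -= cpus
        self.version += 1
        return True
```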
Omega and Mesos
database
framework
snapshot
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM
Omega and Mesos
database
framework
transaction
masters
framework
task: 3 CPUs, 2 GB RAM
thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?
optimistic ↔ pessimistic
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)
DRF, born of static partitioning
datacenter
static partitioning across teams
teams: promotions, trends, recommendations
fairly shared!
static partitioning across teams
goal: fairly share the resources without static partitioning
partition utilizations
promotions: 45% CPU, 100% RAM
trends: 75% CPU, 100% RAM
recommendations: 100% CPU, 50% RAM
observation: a dominant resource bottlenecks each team from running any more jobs/tasks
dominant resource bottlenecks
promotions: 45% CPU, 100% RAM → bottleneck RAM
trends: 75% CPU, 100% RAM → bottleneck RAM
recommendations: 100% CPU, 50% RAM → bottleneck CPU
insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!
… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
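The insight above can be sketched as a progressive-filling DRF allocator (a simplified model of the algorithm, not Mesos's actual allocator): repeatedly grant one task to the framework whose dominant share, its largest share of any single resource, is currently smallest.

```python
def drf_allocate(total, demands):
    """total: cluster capacity; demands: per-task resource needs per
    framework. Returns the number of tasks granted to each framework."""
    free = dict(total)
    used = {f: {r: 0.0 for r in total} for f in demands}
    tasks = {f: 0 for f in demands}
    active = list(demands)
    while active:
        # the framework with the smallest dominant share goes next
        f = min(active, key=lambda g: max(used[g][r] / total[r]
                                          for r in total))
        need = demands[f]
        if all(need[r] <= free[r] for r in total):
            for r in total:
                free[r] -= need[r]
                used[f][r] += need[r]
            tasks[f] += 1
        else:
            active.remove(f)   # can't fit another task for f
    return tasks
```

With a 9-CPU/18-GB cluster and per-task demands of ⟨1 CPU, 4 GB⟩ and ⟨3 CPUs, 1 GB⟩, both frameworks converge to a 2/3 dominant share (3 and 2 tasks respectively).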
DRF in Mesos
masters
framework ① frameworks specify a role when they register (i.e., the team to charge for the resources)
② master calculates each role’s dominant resource (dynamically) and allocates appropriately
Step 4: Profit (statistical multiplexing) $
in practice, fair sharing is insufficient
weighted fair sharing
teams: promotions, trends, recommendations
weights: 0.17, 0.5, 0.33
Mesos implements weighted DRF
masters
masters can be configured with weights per role
resource allocation decisions incorporate the weights to determine dominant fair shares
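A plausible sketch of how weights enter the comparison (the exact formula in Mesos's allocator may differ): divide each role's dominant share by its weight before comparing, so a role with double the weight can hold double the dominant share before losing allocation priority.

```python
def weighted_dominant_share(used, total, weight):
    """Dominant share scaled by the role's configured weight."""
    return max(used[r] / total[r] for r in total) / weight

total = {"cpus": 10, "mem_gb": 20}
# a weight-0.5 role using 4 CPUs vs a weight-0.25 role using 2 CPUs:
heavy = weighted_dominant_share({"cpus": 4, "mem_gb": 2}, total, 0.5)
light = weighted_dominant_share({"cpus": 2, "mem_gb": 2}, total, 0.25)
# both come out equal: double the weight tolerates double the usage
```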
in practice, weighted fair sharing is still insufficient
a non-cooperative framework (e.g., one with long tasks, or a buggy one) can get allocated too many resources
Mesos provides reservations
slaves can be configured with resource reservations for particular roles (dynamic, time based, and percentage based reservations are in development)
resource offers include the reservation role (if any)
masters
framework (trends)
offer: hostname, 4 CPUs, 4 GB RAM, role: trends
reservations: promotions 40%, trends 20%, recommendations 40% (10% used, 30% unused)
reservations provide guarantees, but at the cost of utilization
revocable resources
masters
framework (promotions)
reserved resources that are unused can be allocated to frameworks from different roles but those resources may be revoked at any time
offer: hostname, 4 CPUs, 4 GB RAM, role: trends
preemption via revocation
… my tasks will not be killed unless I’m using revocable resources!
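The guarantee can be sketched with a toy model (hypothetical classes; Mesos's actual accounting is richer): tasks launched against another role's reservation are marked revocable, and only those are killed when the reserving role wants its resources back.

```python
class ReservedSlave:
    """Toy model of a slave whose resources are reserved for one role."""
    def __init__(self, cpus, reserved_role):
        self.cpus = cpus
        self.reserved_role = reserved_role
        self.tasks = []

    def free(self):
        return self.cpus - sum(t["cpus"] for t in self.tasks)

    def launch(self, role, cpus):
        if cpus > self.free():
            return False
        # resources used outside the reservation are revocable
        self.tasks.append({"role": role, "cpus": cpus,
                           "revocable": role != self.reserved_role})
        return True

    def revoke_for(self, cpus):
        """Kill revocable tasks until `cpus` are free; guaranteed
        tasks are never touched."""
        killed = []
        for t in list(self.tasks):
            if self.free() >= cpus:
                break
            if t["revocable"]:
                self.tasks.remove(t)
                killed.append(t)
        return killed
```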
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
machine failure
process failure (bugs!)
upgrades
high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
machine failure
process failure (bugs!)
upgrades
masters
① framework failover
framework
framework re-registers with master and resumes operation
all tasks keep running across framework failover!
framework
high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
machine failure
process failure (bugs!)
upgrades
masters
② master failover
framework
after a new master is elected all frameworks and slaves connect to the new master
all tasks keep running across master failover!
high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
machine failure
process failure (bugs!)
upgrades
slave
③ slave failover
mesos-slave
task task
slave
③ slave failover
mesos-slave
task task
slave
③ slave failover
task task
slave
③ slave failover
mesos-slave
task task
slave
③ slave failover
mesos-slave
task task
slave
③ slave failover @twitter
mesos-slave
(large in-memory services, expensive to restart)
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
execution
masters
framework
task: 3 CPUs, 2 GB RAM
frameworks launch fine-grained tasks for execution
if necessary, a framework can provide an executor to handle the execution of a task
slave
executor
mesos-slave
executor
task
task
slave
executor
mesos-slave
executor
task
task
task
slave
executor
mesos-slave
executor task
goal: isolation
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
containers
executor + task design means containers can have changing resource allocations
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
slave
isolation
mesos-slave
executor
task
task
making the task first-class gives us truly fine-grained resource sharing
requirement: fast task launching (i.e., milliseconds or less)
virtual machines: an anti-pattern
operating-system virtualization
containers (zones and projects)
control groups (cgroups), namespaces
isolation support
tight integration with cgroups
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in development)
filesystem (using LVM, in development)
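As a rough sketch of the arithmetic involved (cgroups-v1 conventions; the exact control files Mesos writes may differ): `cpu.shares` provides the lower bound, the CFS quota/period pair the upper bound, and `memory.limit_in_bytes` the hard memory cap.

```python
CFS_PERIOD_US = 100_000  # default CFS scheduling period (100 ms)

def cgroup_limits(cpus, mem_mb):
    """Map a task's resource vector onto cgroup-v1 control values."""
    return {
        "cpu.shares": int(cpus * 1024),                 # lower bound
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": int(cpus * CFS_PERIOD_US),  # upper bound
        "memory.limit_in_bytes": mem_mb * 1024 * 1024,  # hard cap
    }
```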
statistics too
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
used @twitter for capacity planning (and oversubscription in development)
CPU upper bounds?
in practice, determinism trumps utilization
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
requirements:
① performance
② maintainability (static typing)
③ interfaces to low-level OS (for isolation, etc)
④ interoperability with other languages (for library bindings)
garbage collection: a performance anti-pattern
consequences:
① antiquated libraries (especially around concurrency and networking)
② nascent community
github.com/3rdparty/libprocess
concurrency via futures/actors, networking via message passing
github.com/3rdparty/stout
monads in C++,safe and understandable utilities
but …
scalability simulations to 50,000+ slaves
@twitter we run multiple Mesos clusters each with 3500+ nodes
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
final remarks
frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)
write your next distributed system with Mesos!
port a framework to Mesos: write a “wrapper”
~100 lines of code to write a wrapper (the more lines you write, the more you can take advantage of elasticity and other Mesos features)
see http://github.com/mesos/hadoop
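The shape of such a wrapper can be sketched as below. The class and driver methods here are illustrative stand-ins, not the real Mesos binding API, which defines analogous callbacks (registered, resourceOffers, statusUpdate).

```python
class WrapperScheduler:
    """Hypothetical skeleton of a framework 'wrapper' scheduler."""
    def __init__(self, command, cpus, mem_mb):
        self.command = command
        self.cpus, self.mem_mb = cpus, mem_mb

    def registered(self, framework_id):
        print("registered as", framework_id)

    def resource_offers(self, driver, offers):
        # accept any offer big enough for one task; decline the rest
        for offer in offers:
            if offer["cpus"] >= self.cpus and offer["mem_mb"] >= self.mem_mb:
                driver.launch(offer["id"],
                              {"command": self.command,
                               "cpus": self.cpus, "mem_mb": self.mem_mb})
            else:
                driver.decline(offer["id"])

    def status_update(self, update):
        print("task", update["task_id"], "is", update["state"])
```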
Thank You!
mesos.apache.org
mesos.apache.org/blog
@ApacheMesos
master
② master failover
framework
after a new master is elected all frameworks and slaves connect to the new master
all tasks keep running across master failover!
stateless master
to make master failover fast, we chose to make the master stateless
state is stored in the leaves, at the frameworks and the slaves
makes sense for frameworks that don’t want to store state (i.e., can’t actually failover)
consequences: slaves are fairly complicated (need to checkpoint), frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
Apache Mesos is a distributed systemfor running and building other distributed systems
originsBerkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
mesos.apache.org/documentation
ecosystem
mesos developers
operators
framework developers
a tour of mesos from different perspectives of the ecosystem
the operator
the operator: people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc)
Tools: virtual machines, Chef, Puppet (emerging: PAAS, Docker)
“ops” at most companies (SREs at Twitter)
the static partitioners
for the operator, Mesos is a cluster manager
for the operator, Mesos is a resource manager
for the operator, Mesos is a resource negotiator
for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
for the operator, Mesos is a distributed system with a master/slave architecture
masters
slaves
frameworks/applications register with the Mesos master in order to run jobs/tasks
masters
slaves
frameworks can be required to authenticate as a principal*
masters
SASL
SASL
CRAM-MD5 secret mechanism(Kerberos in development)
framework
masters initialized with secrets
Mesos is highly available and fault-tolerant
the framework developer
the framework developer
…
Mesos uses Apache ZooKeeperfor coordination
masters, slaves
Apache ZooKeeper
increase utilization with revocable resources and preemption
masters
framework1
hostname: 4 CPUs, 4 GB RAM, role: -
framework2 framework3
61%, 24%, 15%
reservations
framework1
framework2
framework3
64%, 25%, 11%
reservations
framework1
framework2
framework3
optimistic vs pessimistic
authorization*
principals can be used for:
authorizing allocation roles
authorizing operating system users (for execution)
authorization
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
I’d love to answer some questions with the help of my data!
I think I’ll try Hadoop.
your datacenter
+ Hadoop
happy?
Not exactly … Hadoop is a big hammer, but not everything is a nail!
I’ve got some iterative algorithms, I want to try Spark!
datacenter management
static partitioning
static partitioning considered harmful
static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
Hadoop …
(map/reduce)
(distributed file system)
HDFS
Could we just give Spark its own HDFS cluster too?
HDFS x 2
HDFS ×2: tee incoming data (2 copies)
periodic copy/sync
That sounds annoying … let’s not do that. Can we do any better though?
HDFS
static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
During the day I’d rather give more machines to Spark, but at night I’d rather give more machines to Hadoop!
datacenter management
static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
datacenter management
static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
datacenter management
I don’t want to deal with this!
the datacenter … rather than think about the datacenter like this …
… is a computer: think about it like this …
datacenter computer
applications
resources
filesystem
mesos
applications
resources
filesystem
kernel
mesos
applications
resources
filesystem
kernel
mesos
frameworks
resources
filesystem
kernel
Step 1: filesystem
Step 2: mesos: run a “master” (or multiple for high availability)
Step 2: mesos: run “slaves” on the rest of the machines
Step 3: frameworks
Step 4: profit (statistical multiplexing) $
reduces CapEx and OpEx!
reduces latency!
Step 4: profit (utilize) $
Step 4: profit (failures) $
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
mesos
frameworks
resources
filesystem
kernel
mesos
frameworks
resources
kernel
resource allocation
reservations
can reserve resources per slave to provide guaranteed resources
requires human participation (ops) to determine which roles should be reserved which resources
kind of like thread affinity, but across many machines (and not just for CPUs)
resource allocation
(1) allocate reserved resources to frameworks authorized for a particular role
(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights
preemption: if a framework runs tasks outside of its reservation, those tasks can be preempted (i.e., killed and their resources revoked) in favor of a framework running a task within its reservation
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
mesos
frameworks
kernel
framework ≈ distributed system
framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
coordinate execution
frameworks are execution coordinators
frameworks are execution schedulers
end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
i.e., frameworks want to coordinate their tasks’ execution, and they should be able to
framework anatomy
frameworks
scheduling API
scheduling
scheduling
i’d like to run some tasks!
scheduling
here are some resource offers!
resource offers
an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks
schedulers pick which resources to use to run their tasks
foo.bar.com: 4 CPUs, 4 GB RAM
“two-level scheduling”
mesos: controls resource allocations to schedulers
schedulers: make decisions about what to run given allocated resources
concurrency control
the same resources may be offered to different frameworks
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)
tasks
the “threads” of the framework, a consumer of resources (cpu, memory, etc)
either a concrete command line or an opaque description (which requires an executor)
tasks
here are some resources!
tasks
launch these tasks!
tasks
tasks
status updates
status updates
status updates
task status update!
status updates
status updates
status updates
task status update!
more scheduling
more scheduling
i’d like to run some tasks!
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
high-availability
high-availability (master)
high-availability (master): task status update!
high-availability (master): i’d like to run some tasks!
high-availability (master)
high-availability (framework)
high-availability (slave)
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
resource isolation
leverage Linux control groups (cgroups)
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in progress)
filesystem (LVM, in progress)
resource statistics
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
per task/executor statistics are collected (for all fork/exec’ed processes too!)
can help with capacity planning
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
security
Twitter recently added SASL support; the default mechanism is CRAM-MD5, with Kerberos support planned in the short term
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
framework commonality
as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier, such as launching tasks and doing failure detection. Why re-implement them each time!?
case study: chronos
distributed cron with dependencies
developed at airbnb
~3k lines of Scala!
distributed, highly available, and fault tolerant without any network programming!
http://github.com/airbnb/chronos
analytics
analytics + services
case study: aurora
“run 200 of these, somewhere, forever”
developed at Twitter
highly available (uses the mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)
http://github.com/twitter/aurora
frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)
write your next distributed system with mesos!
port a framework to mesos: write a “wrapper” scheduler
~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)
see http://github.com/mesos/hadoop
conclusions
datacenter management is a pain
conclusions
mesos makes running frameworks on your datacenter easier as well as increasing utilization and performance while reducing CapEx and OpEx!
conclusions
rather than build your next distributed system from scratch, consider using mesos
conclusions
you can share your datacenter between analytics and online services!
Questions?
mesos.apache.org
@ApacheMesos
aurora
framework commonality
run processes simultaneously (distributed)
handle process failures (fault-tolerance)
optimize execution (elasticity, scheduling)
primitives
scheduler – distributed system “master” or “coordinator”
(executor – lower-level control of task execution, optional)
requests/offers – resource allocations
tasks – “threads” of the distributed system
…
scheduler
Apache Hadoop
Chronos
scheduler
(1) brokers for resources
(2) launches tasks
(3) handles task termination
brokering for resources
(1) make resource requests: 2 CPUs, 1 GB RAM, slave *
(2) respond to resource offers: 4 CPUs, 4 GB RAM, slave foo.bar.com
offers: non-blocking resource allocation
exist to answer the question:
“what should mesos do if it can’t satisfy a request?”
(1) wait until it can
(2) offer the best allocation it can immediately
resource allocation
Apache Hadoop
Chronos
request
allocator: dominant resource fairness, resource reservations
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)
offer
“two-level scheduling”
mesos: controls resource allocations to framework schedulers
schedulers: make decisions about what to run given allocated resources
end-to-end principle
“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
tasks
either a concrete command line or an opaque description (which requires a framework executor to execute)
a consumer of resources
task operations
launching/killing
health monitoring/reporting (failure detection)
resource usage monitoring (statistics)
resource isolation
cgroup per executor or task (if no executor)
resource controls adjusted dynamically as tasks come and go!
case study: chronos
distributed cron with dependencies
built at airbnb by @flo
before chronos
single point of failure (and AWS was unreliable)
resource starved (not scalable)
chronos requirements
fault tolerance
distributed (elastically take advantage of resources)
retries (make sure a command eventually finishes)
dependencies
chronos
leverages the primitives of mesos
~3k lines of scala
highly available (uses Mesos state)
distributed / elastic
no actual network programming!
after chronos
after chronos + hadoop
case study: aurora“run 200 of these, somewhere, forever”
built at Twitter
before aurora
static partitioning of machines to services
hardware outages caused site outages
puppet + monit
ops couldn’t scale as fast as engineers
aurora
highly available (uses mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)
after aurora
power loss to 19 racks, no lost services!
more than 400 engineers running services
largest cluster has >2500 machines
[diagram: a Mesos cluster whose nodes are shared by Hadoop, Spark, MPI, Storm, and Chronos frameworks]