Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Benjamin Hindman – @benh
Apache Mesos Design Decisions
mesos.apache.org
@ApacheMesos

this is not a talk about YARN

at least not explicitly!

this talk is about Mesos!

a little history: Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica

our motivation
increase performance and
utilization of clusters

our intuition
① static partitioning considered
harmful

static partitioning considered harmful
datacenter

static partitioning considered harmful
faster! higher utilization!

our intuition
② build new frameworks

“Map/Reduce is a big hammer, but not everything is a nail!”

Apache Mesos is a distributed system for running and building other distributed systems

Mesos is a cluster manager

Mesos is a resource manager

Mesos is a resource negotiator

Mesos replaces static partitioning of resources to frameworks withdynamic resource allocation

Mesos is a distributed system with a master/slave architecture
masters
slaves

frameworks register with the Mesos master in order to run jobs/tasks
masters
slaves
frameworks

Mesos @Twitter in early 2010
goal: run long-running services elastically on Mesos

Apache Aurora (incubating)
masters
Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc!

masters
Storm, Jenkins, …

a lot of interesting design decisions along the way

many appear (IMHO) in YARN too

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++


frameworks get allocated resources from the masters
masters
framework
resources are allocated via resource offers
a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks
offer: hostname, 4 CPUs, 4 GB RAM

frameworks use these resources to decide what tasks to run
masters
framework
a task can use a subset of an offer
task: 3 CPUs, 2 GB RAM
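The offer→task flow above can be sketched in a few lines (a toy model; field names like `cpus`/`mem` and the `schedule` function are illustrative, not the real Mesos scheduler API):

```python
# A toy sketch of a framework scheduler reacting to resource offers.
# Field names ("cpus", "mem") and schedule() are illustrative, not the
# real Mesos scheduler API.

def schedule(offers, pending_tasks):
    """Match pending tasks to offers; a task may use a subset of an offer."""
    launched = []
    for offer in offers:
        cpus, mem = offer["cpus"], offer["mem"]
        for task in list(pending_tasks):
            if task["cpus"] <= cpus and task["mem"] <= mem:
                launched.append((offer["hostname"], task["name"]))
                cpus -= task["cpus"]  # consume part of the offer
                mem -= task["mem"]
                pending_tasks.remove(task)
    return launched

offers = [{"hostname": "host1", "cpus": 4, "mem": 4096}]
tasks = [{"name": "t1", "cpus": 3, "mem": 2048},
         {"name": "t2", "cpus": 3, "mem": 2048}]
print(schedule(offers, tasks))  # only t1 fits: [('host1', 't1')]
```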

Mesos challenged the status quo of cluster managers

cluster manager status quo
cluster manager
application
specification
the specification includes as much information as possible to assist the cluster manager in scheduling and execution
the application waits for the task to be executed, then receives the result

problems with specifications
① hard to specify certain desires or constraints
② hard to update specifications dynamically as tasks execute and finish/fail

an alternative model
masters
framework
request: 3 CPUs, 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the required resources

question: what should Mesos do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately

an alternative model
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM
the framework uses the offers to perform its own scheduling

an analogue:non-blocking sockets
kernel
application
write(s, buffer, size);

an analogue:non-blocking sockets
kernel
application
42 of 100 bytes written!
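The analogy can be made concrete with a toy partial-write loop (`FakeSocket` is a stand-in for a non-blocking kernel socket, not a real socket API):

```python
# Sketch of the non-blocking-socket analogy: write() may accept only part
# of the buffer ("the best it can, immediately"), so the caller loops on
# what remains. Simulated with a fake socket that takes at most 42 bytes
# per call.

class FakeSocket:
    def __init__(self, max_per_call=42):
        self.max_per_call = max_per_call
        self.received = b""

    def write(self, buffer):
        accepted = buffer[:self.max_per_call]  # partial write
        self.received += accepted
        return len(accepted)

def write_all(sock, buffer):
    written = 0
    while written < len(buffer):
        written += sock.write(buffer[written:])
    return written

sock = FakeSocket()
assert write_all(sock, b"x" * 100) == 100
assert sock.received == b"x" * 100
```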

resource offers address asynchrony in resource allocation

IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request

requests are complementary (but not necessary)

offers represent the currently available resources a framework can use

question: should resources within offers be disjoint?

masters
framework1 framework2
offer: hostname, 4 CPUs, 4 GB RAM
offer: hostname, 4 CPUs, 4 GB RAM

concurrency control
optimistic ↔ pessimistic
optimistic: all offers overlap with one another, causing frameworks to “compete” first-come-first-served
pessimistic: offers made to different frameworks are disjoint

Mesos semantics:assume overlapping offers

design comparison: Google’s Omega

the Omega model
database
framework
snapshot
a framework gets a snapshot of the cluster state from a database (note, does not make a request!)

the Omega model
database
framework
transaction
a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks)
failed transactions occur when another framework has already acquired sought resources
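A minimal sketch of the transaction model, assuming a simple version check stands in for Omega's shared-state database (all names here are illustrative):

```python
# Sketch of Omega-style optimistic concurrency: frameworks read a
# snapshot (with a version), then try to commit a transaction; the
# commit fails if another framework changed the state first.

class CellState:
    def __init__(self, free_cpus):
        self.version = 0
        self.free_cpus = free_cpus

    def snapshot(self):
        # a framework gets the cluster state, not an offer or a request
        return (self.version, self.free_cpus)

    def commit(self, seen_version, cpus_wanted):
        # fail if the state changed since the snapshot, or if the
        # sought resources have already been acquired
        if seen_version != self.version or cpus_wanted > self.free_cpus:
            return False
        self.free_cpus -= cpus_wanted
        self.version += 1
        return True

state = CellState(free_cpus=4)
v1, _ = state.snapshot()
v2, _ = state.snapshot()
assert state.commit(v1, 3) is True    # first framework wins
assert state.commit(v2, 3) is False   # second sees a stale snapshot
```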

isomorphism?

observation: snapshots are optimistic offers

Omega and Mesos
database
framework
snapshot
masters
framework
offer: hostname, 4 CPUs, 4 GB RAM

Omega and Mesos
database
framework
transaction
masters
framework
task: 3 CPUs, 2 GB RAM

thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?
optimistic ↔ pessimistic

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning
datacenter

static partitioning across teams
teams: promotions, trends, recommendations
fairly shared!

goal: fairly share the resources without static partitioning

partition utilizations (per team)
promotions: 45% CPU, 100% RAM
trends: 75% CPU, 100% RAM
recommendations: 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

dominant resource bottlenecks (per team)
promotions: 45% CPU, 100% RAM → bottleneck: RAM
trends: 75% CPU, 100% RAM → bottleneck: RAM
recommendations: 100% CPU, 50% RAM → bottleneck: CPU

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
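The DRF idea can be sketched with the progressive-filling loop from the DRF paper (a toy model, not Mesos's actual allocator; the 9-CPU/18-GB example is the one used in the paper):

```python
# A toy sketch of Dominant Resource Fairness (DRF): repeatedly launch a
# task for the framework whose dominant share -- its largest fraction of
# any single resource -- is currently smallest.

def dominant_share(alloc, total):
    """A framework's share of the resource it uses most of."""
    return max(alloc[r] / total[r] for r in total)

def drf_allocate(total, demands, rounds):
    used = {r: 0 for r in total}
    alloc = {f: {r: 0 for r in total} for f in demands}
    launched = {f: 0 for f in demands}
    for _ in range(rounds):
        # offer to the framework with the lowest dominant share
        f = min(demands, key=lambda g: dominant_share(alloc[g], total))
        task = demands[f]
        if any(used[r] + task[r] > total[r] for r in total):
            break  # no room left for that framework's task shape
        for r in total:
            used[r] += task[r]
            alloc[f][r] += task[r]
        launched[f] += 1
    return launched

# The example from the DRF paper: 9 CPUs and 18 GB RAM total; framework A
# needs <1 CPU, 4 GB> per task, framework B needs <3 CPUs, 1 GB> per task.
total = {"cpus": 9, "mem": 18}
demands = {"A": {"cpus": 1, "mem": 4}, "B": {"cpus": 3, "mem": 1}}
print(drf_allocate(total, demands, rounds=100))  # A gets 3 tasks, B gets 2
```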

DRF in Mesos
masters
framework
① frameworks specify a role when they register (i.e., the team to charge for the resources)
② master calculates each role’s dominant resource (dynamically) and allocates appropriately

Step 4: Profit (statistical multiplexing)
$

in practice, fair sharing is insufficient

weighted fair sharing
teams: promotions (weight 0.17), trends (weight 0.5), recommendations (weight 0.33)

Mesos implements weighted DRF
masters
masters can be configured with weights per role
resource allocation decisions incorporate the weights to determine dominant fair shares

in practice, weighted fair sharing is still insufficient

a non-cooperative framework (i.e., has long tasks or is buggy) can get allocated too many resources

Mesos provides reservations
slaves can be configured with resource reservations for particular roles (dynamic, time based, and percentage based reservations are in development)
resource offers include the reservation role (if any)
masters
framework (trends)
offer: hostname, 4 CPUs, 4 GB RAM, role: trends

reservations (chart): promotions 40%, trends 20%, recommendations 40%; used 10%, unused 30%
reservations provide guarantees, but at the cost of utilization

revocable resources
masters
framework (promotions)
reserved resources that are unused can be allocated to frameworks from different roles but those resources may be revoked at any time
offer: hostname, 4 CPUs, 4 GB RAM, role: trends

preemption via revocation
… my tasks will not be killed unless I’m using revocable resources!

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

high-availability and fault-tolerance: a prerequisite @twitter
① framework failover
② master failover
③ slave failover
causes: machine failure, process failure (bugs!), upgrades


masters
① framework failover
framework
framework re-registers with master and resumes operation
all tasks keep running across framework failover!
framework


masters
② master failover
framework
after a new master is elected all frameworks and slaves connect to the new master
all tasks keep running across master failover!


slave
③ slave failover
mesos-slave
task task
the mesos-slave process can exit and restart while its tasks keep running
③ slave failover @twitter: important for large in-memory services, expensive to restart

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

execution
masters
framework
task: 3 CPUs, 2 GB RAM
frameworks launch fine-grained tasks for execution
if necessary, a framework can provide an executor to handle the execution of a task

slave
executor
mesos-slave
executor
task task
tasks can be added to and removed from a running executor

goal: isolation

slave
isolation
mesos-slave
executor
task task
containers

executor + task design means containers can have changing resource allocations


making the task first-class gives us true fine-grained resource sharing

requirement: fast task launching (i.e., milliseconds or less)

virtual machines: an anti-pattern

operating-system virtualization
containers (zones and projects)
control groups (cgroups), namespaces

isolation support
tight integration with cgroups
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in development)
filesystem (using LVM, in development)
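For the CPU upper bound, a sketch of the cgroups CFS bandwidth arithmetic (these are the `cpu.cfs_quota_us`/`cpu.cfs_period_us` knobs of the cgroups-v1 cpu controller; the helper function is illustrative):

```python
# Sketch of how a CPU upper bound maps onto cgroups CFS bandwidth control:
# a task allocated N CPUs gets cpu.cfs_quota_us = N * cpu.cfs_period_us,
# i.e., at most N periods' worth of CPU time per period, across all cores.

CFS_PERIOD_US = 100000  # default kernel period: 100ms

def cfs_quota_us(cpus, period_us=CFS_PERIOD_US):
    """Quota (in microseconds per period) for an allocation of `cpus` CPUs."""
    return int(cpus * period_us)

# 0.5 CPUs -> 50ms of CPU time per 100ms period; 4 CPUs -> 400ms per 100ms.
assert cfs_quota_us(0.5) == 50000
assert cfs_quota_us(4) == 400000
```

This is the hard cap behind "determinism trumps utilization": the task is throttled once it exhausts its quota, even if the machine is otherwise idle.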

statistics too
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
used @twitter for capacity planning (and oversubscription in development)

CPU upper bounds?
in practice, determinism trumps utilization

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

requirements:
① performance
② maintainability (static typing)
③ interfaces to low-level OS (for isolation, etc.)
④ interoperability with other languages (for library bindings)

garbage collection: a performance anti-pattern

consequences:
① antiquated libraries (especially around concurrency and networking)
② nascent community

github.com/3rdparty/libprocess
concurrency via futures/actors, networking via message passing

github.com/3rdparty/stout
monads in C++, safe and understandable utilities

but …

scalability simulations to 50,000+ slaves

@twitter we run multiple Mesos clusters each with 3500+ nodes

design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

final remarks

frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)

write your next distributed system with Mesos!

port a framework to Mesos: write a “wrapper”
~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other Mesos features)
see http://github.com/mesos/hadoop

Thank You!
mesos.apache.org
mesos.apache.org/blog
@ApacheMesos




stateless master: to make master failover fast, we chose to make the master stateless
state is stored in the leaves, at the frameworks and the slaves
makes sense for frameworks that don’t want to store state (i.e., can’t actually failover)
consequences: slaves are fairly complicated (they need to checkpoint), and frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
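The reconciliation consequence can be sketched as follows (a toy model; the status names mirror Mesos task states, but the `reconcile` function and report shapes are illustrative, not the real API):

```python
# Sketch of reconciliation with a stateless master: after failover, a
# framework re-sends the tasks it believes exist, and the master answers
# with each task's actual state as reported by re-registered slaves.

def reconcile(framework_tasks, slave_reports):
    """Return the authoritative status for each task the framework asks about."""
    known = {}
    for report in slave_reports:  # state the re-registered slaves report
        known.update(report)
    results = {}
    for task_id in framework_tasks:
        # tasks no slave knows about are presumed lost
        results[task_id] = known.get(task_id, "TASK_LOST")
    return results

slave_reports = [{"t1": "TASK_RUNNING"}, {"t2": "TASK_FINISHED"}]
print(reconcile(["t1", "t2", "t3"], slave_reports))
# {'t1': 'TASK_RUNNING', 't2': 'TASK_FINISHED', 't3': 'TASK_LOST'}
```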




Apache Mesos is a distributed system for running and building other distributed systems

origins: a Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
mesos.apache.org/documentation

ecosystem
mesos developers
operators
framework developers

a tour of mesos from different perspectives of the ecosystem

the operator

the operator: people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.)
tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker)
“ops” at most companies (SREs at Twitter)
the static partitioners

for the operator, Mesos is a cluster manager

for the operator, Mesos is a resource manager

for the operator, Mesos is a resource negotiator

for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

for the operator, Mesos is a distributed system with a master/slave architecture
masters
slaves

frameworks/applications register with the Mesos master in order to run jobs/tasks
masters
slaves

frameworks can be required to authenticate as a principal*
masters
SASL
SASL
CRAM-MD5 secret mechanism (Kerberos in development)
framework
masters initialized with secrets

Mesos is highly-available and fault-tolerant

the framework developer

the framework developer
…

Mesos uses Apache ZooKeeper for coordination
masters, slaves
Apache ZooKeeper

increase utilization with revocable resources and preemption
masters
framework1, framework2, framework3
offer: hostname, 4 CPUs, 4 GB RAM, role: -
reservations (chart): shares shift from 61% / 24% / 15% to 64% / 25% / 11% across framework1, framework2, framework3

optimistic vs. pessimistic

authorization*: principals can be used for:
authorizing allocation roles
authorizing operating system users (for execution)

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies


I’d love to answer some questions with the help of my data!

I think I’ll try Hadoop.

your datacenter

+ Hadoop

happy?

Not exactly …

… Hadoop is a big hammer, but not everything is a nail!

I’ve got some iterative algorithms, I want to try Spark!

datacenter management

static partitioning

static partitioning considered harmful

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

Hadoop …
(map/reduce)
(distributed file system)

HDFS

Could we just give Spark its own HDFS cluster too?

HDFS x 2
tee incoming data (2 copies)
periodic copy/sync

That sounds annoying … let’s not do that. Can we do any better though?

HDFS

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

During the day I’d rather give more machines to Spark, but at night I’d rather give more machines to Hadoop!

datacenter management


static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

datacenter management

static partitioning considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

datacenter management





I don’t want to deal with this!

the datacenter … rather than think about the datacenter like this …

… is a computer: think about it like this …

datacenter computer
applications
resources
filesystem

mesos
applications
resources
filesystem
kernel


mesos
frameworks
resources
filesystem
kernel

Step 1: filesystem

Step 2: mesos
run a “master” (or multiple for high availability)
run “slaves” on the rest of the machines

Step 3: frameworks

Step 4: profit
$
Step 4: profit (statistical multiplexing)
$
reduces CapEx and OpEx!
reduces latency!

Step 4: profit (utilize)
$

Step 4: profit (failures)
$

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies


mesos
frameworks
resources
filesystem
kernel

mesos
frameworks
resources
kernel

resource allocation

reservations: can reserve resources per slave to provide guaranteed resources
requires human participation (ops) to determine which roles should have which resources reserved
kind of like thread affinity, but across many machines (and not just for CPUs)

resource allocation

resource allocation
(1) allocate reserved resources to frameworks authorized for a particular role
(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights

preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

mesos
frameworks
kernel

framework ≈ distributed system

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
coordinate execution

frameworks are execution coordinators

frameworks are execution schedulers

end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
i.e., frameworks want to coordinate their tasks’ execution, and they should be able to

framework anatomy
frameworks
scheduling API

scheduling
i’d like to run some tasks!
here are some resource offers!

resource offers
an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks
schedulers pick which resources to use to run their tasks
foo.bar.com: 4 CPUs, 4 GB RAM

“two-level scheduling”
mesos: controls resource allocations to schedulers
schedulers: make decisions about what to run given allocated resources

concurrency control
the same resources may be offered to different frameworks
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)

tasks: the “threads” of the framework, a consumer of resources (cpu, memory, etc.)
either a concrete command line or an opaque description (which requires an executor)

tasks
here are some resources!
launch these tasks!

status updates
task status update!

more scheduling
i’d like to run some tasks!

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

high-availability

high-availability (master)
task status update!
i’d like to run some tasks!

high-availability (framework)

high-availability (slave)

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

resource isolation
leverage Linux control groups (cgroups)
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in progress)
filesystem (LVM, in progress)

resource statistics
rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)
per task/executor statistics are collected (for all fork/exec’ed processes too!)
can help with capacity planning

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

security: Twitter recently added SASL support; the default mechanism is CRAM-MD5, and Kerberos will be supported in the short term

agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies

framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)

framework commonality
as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier such as launching tasks, doing failure detection, etc, why re-implement them each time!?

case study: chronos
distributed cron with dependencies
developed at airbnb
~3k lines of Scala!
distributed, highly available, and fault-tolerant without any network programming!
http://github.com/airbnb/chronos

analytics

analytics + services

case study: aurora
“run 200 of these, somewhere, forever”
developed at Twitter
highly available (uses the mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)
http://github.com/twitter/aurora

frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)

write your next distributed system with mesos!

port a framework to mesos: write a “wrapper” scheduler
~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)
see http://github.com/mesos/hadoop

conclusions: datacenter management is a pain

conclusions: mesos makes running frameworks on your datacenter easier, as well as increasing utilization and performance while reducing CapEx and OpEx!

conclusions: rather than build your next distributed system from scratch, consider using mesos

conclusions: you can share your datacenter between analytics and online services!

Questions?
mesos.apache.org
@ApacheMesos

aurora

framework commonality
run processes simultaneously (distributed)
handle process failures (fault-tolerance)
optimize execution (elasticity, scheduling)

primitives
scheduler – distributed system “master” or “coordinator”
(executor – lower-level control of task execution, optional)
requests/offers – resource allocations
tasks – “threads” of the distributed system
…

scheduler
Apache Hadoop
Chronos

scheduler
(1) brokers for resources
(2) launches tasks
(3) handles task termination

brokering for resources
(1) make resource requests: 2 CPUs, 1 GB RAM, slave: *
(2) respond to resource offers: 4 CPUs, 4 GB RAM, slave: foo.bar.com

offers: non-blocking resource allocation
exist to answer the question:
“what should mesos do if it can’t satisfy a request?”
(1) wait until it can
(2) offer the best allocation it can immediately


resource allocation
Apache Hadoop
Chronos
request
allocator: dominant resource fairness, resource reservations
pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)
offer

“two-level scheduling”
mesos: controls resource allocations to framework schedulers
schedulers: make decisions about what to run given allocated resources

end-to-end principle
“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

tasks: either a concrete command line or an opaque description (which requires a framework executor to execute)
a consumer of resources

task operations
launching/killing
health monitoring/reporting (failure detection)
resource usage monitoring (statistics)

resource isolation
cgroup per executor or task (if no executor)
resource controls adjusted dynamically as tasks come and go!

case study: chronos
distributed cron with dependencies
built at airbnb by @flo

before chronos
single point of failure (and AWS was unreliable)
resource starved (not scalable)

chronos requirements
fault tolerance
distributed (elastically take advantage of resources)
retries (make sure a command eventually finishes)
dependencies

chronos
leverages the primitives of mesos
~3k lines of scala
highly available (uses Mesos state)
distributed / elastic
no actual network programming!

after chronos

after chronos + hadoop

case study: aurora
“run 200 of these, somewhere, forever”
built at Twitter

before aurora
static partitioning of machines to services
hardware outages caused site outages
puppet + monit
ops couldn’t scale as fast as engineers

aurora
highly available (uses mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see Twitter commons)

after aurora
power loss to 19 racks, no lost services!
more than 400 engineers running services
largest cluster has >2500 machines

Mesos
(diagram: a Mesos cluster of nodes running Hadoop, Spark, MPI, Storm, Chronos, … side by side)

Step 4: Profit (statistical multiplexing)
$