Page 1: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Benjamin Hindman – @benh

Apache Mesos Design Decisions

mesos.apache.org

@ApacheMesos

Page 2:

this is not a talk about YARN

Page 3:

at least not explicitly!

Page 4:

this talk is about Mesos!

Page 5:

a little history

Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

Page 6:

our motivation

increase performance and utilization of clusters

Page 7:

our intuition

① static partitioning considered harmful

Pages 8-13:

static partitioning considered harmful

datacenter

faster!

higher utilization!

Page 14:

our intuition

② build new frameworks

Page 15:

“Map/Reduce is a big hammer, but not everything is a nail!”

Page 16:

Apache Mesos is a distributed system for running and building other distributed systems

Page 17:

Mesos is a cluster manager

Page 18:

Mesos is a resource manager

Page 19:

Mesos is a resource negotiator

Page 20:

Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

Page 21:

Mesos is a distributed system with a master/slave architecture

masters

slaves

Page 22:

frameworks register with the Mesos master in order to run jobs/tasks

masters

slaves

frameworks

Page 23:

Mesos @Twitter in early 2010

goal: run long-running services elastically on Mesos

Page 24:

Apache Aurora (incubating)

masters

Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc!

Page 25:

masters

Storm, Jenkins, …

Page 26:

a lot of interesting design decisions along the way

Page 27:

many appear (IMHO) in YARN too

Pages 28-29:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Page 30:

frameworks get allocated resources from the masters

masters

framework

resources are allocated via resource offers

a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks

offer: hostname, 4 CPUs, 4 GB RAM

Page 31:

frameworks use these resources to decide what tasks to run

masters

framework

a task can use a subset of an offer

task: 3 CPUs, 2 GB RAM
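
The offer and task on these two slides can be sketched in a few lines. This is only an illustration: the `Resources`, `Offer`, and `place` names are hypothetical and are not the Mesos API; the numbers are the ones shown on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: float
    mem_gb: float

    def fits_in(self, other: "Resources") -> bool:
        # A task fits when it needs no more of any resource than is offered.
        return self.cpus <= other.cpus and self.mem_gb <= other.mem_gb

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)


@dataclass(frozen=True)
class Offer:
    hostname: str          # one offer per host
    resources: Resources   # snapshot of what is available on that host


def place(offer: Offer, task: Resources):
    """Return the resources left over if the task fits in the offer, else None."""
    if not task.fits_in(offer.resources):
        return None
    return offer.resources.minus(task)


offer = Offer("host1", Resources(cpus=4, mem_gb=4))
task = Resources(cpus=3, mem_gb=2)
leftover = place(offer, task)  # the task uses only a subset of the offer
```

The leftover (1 CPU, 2 GB here) is what the framework could still use for another task or hand back.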

Page 32:

Mesos challenged the status quo of cluster managers

Page 33:

cluster manager status quo

cluster manager

application

specification

the specification includes as much information as possible to assist the cluster manager in scheduling and execution

Page 34:

cluster manager status quo

cluster manager

application

wait for task to be executed

Page 35:

cluster manager status quo

cluster manager

application

result

Page 36:

problems with specifications

① hard to specify certain desires or constraints

② hard to update specifications dynamically as tasks execute and finish/fail

Page 37:

an alternative model

masters

framework

request: 3 CPUs, 2 GB RAM

a request is a purposely simplified subset of a specification, mainly including the required resources

Pages 38-41:

question: what should Mesos do if it can’t satisfy a request?

① wait until it can …

② offer the best it can immediately

Pages 42-44:

an alternative model

masters

framework

offer: hostname, 4 CPUs, 4 GB RAM (×4)

framework uses the offers to perform its own scheduling

Page 45:

an analogue: non-blocking sockets

kernel

application

write(s, buffer, size);

Page 46:

an analogue: non-blocking sockets

kernel

application

42 of 100 bytes written!
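
The analogy can be tried directly. The sketch below uses a local socket pair and a deliberately oversized payload (both are assumptions made for illustration): the kernel accepts what it can right away and reports the count, and the application decides what to do with the rest.

```python
import socket


def try_send(sock, data):
    """Attempt one non-blocking send and return how many bytes the kernel accepted."""
    try:
        return sock.send(data)
    except BlockingIOError:
        # The kernel buffer is full: nothing was accepted this time.
        return 0


a, b = socket.socketpair()
a.setblocking(False)

payload = b"x" * (8 * 1024 * 1024)  # larger than a typical socket buffer
sent = try_send(a, payload)          # usually only part of the payload
remaining = payload[sent:]           # the application keeps this for later

a.close()
b.close()
```

A resource offer plays the role of the partial write: Mesos hands over what is available now instead of blocking until the whole request can be met.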

Page 47:

resource offers address asynchrony in resource allocation

Page 48:

IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request

Page 49:

requests are complementary (but not necessary)

Page 50:

offers represent the currently available resources a framework can use

Page 51:

question: should resources within offers be disjoint?

Page 52:

masters

framework1, framework2

offer: hostname, 4 CPUs, 4 GB RAM

offer: hostname, 4 CPUs, 4 GB RAM

Pages 53-55:

concurrency control

optimistic: all offers overlap with one another, thus causing frameworks to “compete” first-come-first-served

pessimistic: offers made to different frameworks are disjoint

Page 56:

Mesos semantics: assume overlapping offers

Page 57:

design comparison: Google’s Omega

Page 58:

the Omega model

database

framework

snapshot

a framework gets a snapshot of the cluster state from a database (note, does not make a request!)

Page 59:

the Omega model

database

framework

transaction

a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks)

failed transactions occur when another framework has already acquired sought resources
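
A toy version of this snapshot-then-transaction cycle is sketched below. The `ClusterState` class and its per-host version counter are assumptions chosen to keep the example small; they are not a description of how Omega actually detects conflicts.

```python
class ClusterState:
    """Toy shared database: free CPUs per host plus a version counter per host."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)
        self.version = {host: 0 for host in free_cpus}

    def snapshot(self):
        # A framework reads the state; nothing is locked or reserved.
        return dict(self.free_cpus), dict(self.version)

    def commit(self, host, cpus, seen_version):
        """Acquire cpus on host only if the host is unchanged since the snapshot."""
        if self.version[host] != seen_version or self.free_cpus[host] < cpus:
            return False  # another framework got there first
        self.free_cpus[host] -= cpus
        self.version[host] += 1
        return True


state = ClusterState({"host1": 4})

# Two frameworks take overlapping (optimistic) snapshots of the same host.
free_a, ver_a = state.snapshot()
free_b, ver_b = state.snapshot()

ok_a = state.commit("host1", 3, ver_a["host1"])  # first transaction succeeds
ok_b = state.commit("host1", 3, ver_b["host1"])  # second one is stale and fails
```

The failed framework would take a fresh snapshot and retry, which is the first-come-first-served competition described for overlapping offers.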

Page 60:

isomorphism?

Page 61:

observation: snapshots are optimistic offers

Page 62:

Omega and Mesos

Omega: database, framework, snapshot

Mesos: masters, framework, offer (hostname, 4 CPUs, 4 GB RAM)

Page 63:

Omega and Mesos

Omega: database, framework, transaction

Mesos: masters, framework, task (3 CPUs, 2 GB RAM)

Page 64:

thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?

(spectrum: optimistic to pessimistic)

Page 65:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Page 66:

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

Page 67:

DRF, born of static partitioning

datacenter

Page 68:

static partitioning across teams

team: promotions, trends, recommendations

Page 69:

static partitioning across teams

team: promotions, trends, recommendations

fairly shared!

Page 70:

goal: fairly share the resources without static partitioning

Page 71:

partition utilizations

team             utilization
promotions       45% CPU, 100% RAM
trends           75% CPU, 100% RAM
recommendations  100% CPU, 50% RAM

Page 72:

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

Page 73:

dominant resource bottlenecks

team             utilization         bottleneck
promotions       45% CPU, 100% RAM   RAM
trends           75% CPU, 100% RAM   RAM
recommendations  100% CPU, 50% RAM   CPU

Page 74:

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

Page 75:

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!

Pages 76-77:

DRF in Mesos

masters

framework

① frameworks specify a role when they register (i.e., the team to charge for the resources)

② master calculates each role’s dominant resource (dynamically) and allocates appropriately
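
Step ② can be sketched as progressive filling: keep giving one task's worth of resources to the role with the lowest dominant share until that role's next task no longer fits. The cluster size (9 CPUs, 18 GB) and the per-task demands below are illustrative values that do not appear in the slides, and the `drf` function is a simplified sketch rather than the Mesos allocator.

```python
from fractions import Fraction


def drf(total, demands):
    """Progressive filling: give one task at a time to the role with the lowest dominant share."""
    used = [0] * len(total)
    alloc = {role: [0] * len(total) for role in demands}
    tasks = {role: 0 for role in demands}

    def dominant_share(role):
        # A role's dominant share is its largest share of any single resource.
        return max(Fraction(a, t) for a, t in zip(alloc[role], total))

    while True:
        role = min(demands, key=lambda r: (dominant_share(r), r))
        need = demands[role]
        if any(u + n > t for u, n, t in zip(used, need, total)):
            return tasks  # the neediest role no longer fits, so stop
        for i, n in enumerate(need):
            used[i] += n
            alloc[role][i] += n
        tasks[role] += 1


# Resources are (CPUs, GB of RAM). Role A is RAM-heavy, role B is CPU-heavy.
result = drf(total=(9, 18), demands={"A": (1, 4), "B": (3, 1)})
```

With these inputs A ends up with three tasks and B with two, and both hold a dominant share of 2/3: A of the RAM, B of the CPUs.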

Page 78:

Step 4: Profit (statistical multiplexing)

$

Page 79:

in practice, fair sharing is insufficient

Pages 80-81:

weighted fair sharing

team             weight
promotions       0.17
trends           0.5
recommendations  0.33

Page 82:

Mesos implements weighted DRF

masters

masters can be configured with weights per role

resource allocation decisions incorporate the weights to determine dominant fair shares
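
One common way to fold weights into DRF is to divide each role's dominant share by its weight and serve the role with the smallest result. The sketch below assumes that rule; the 100 CPU / 100 GB cluster and the current allocations are hypothetical, and only the weights come from the slide.

```python
def weighted_dominant_share(allocated, total, weight):
    """Dominant share scaled by the role's weight (assumed weighting rule)."""
    dominant = max(a / t for a, t in zip(allocated, total))
    return dominant / weight


weights = {"promotions": 0.17, "trends": 0.5, "recommendations": 0.33}

total = (100, 100)  # hypothetical cluster: 100 CPUs, 100 GB RAM
allocated = {       # hypothetical current allocations as (CPUs, GB)
    "promotions": (17, 10),
    "trends": (20, 25),
    "recommendations": (33, 5),
}

shares = {
    role: weighted_dominant_share(allocated[role], total, weights[role])
    for role in weights
}
next_role = min(shares, key=shares.get)  # role furthest below its weighted share
```

Here trends holds only a quarter of the RAM against a weight of 0.5, so it is the next role to receive resources.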

Page 83:

in practice, weighted fair sharing is still insufficient

Page 84:

a non-cooperative framework (e.g., one that has long tasks or is buggy) can get allocated too many resources

Page 85:

Mesos provides reservations

slaves can be configured with resource reservations for particular roles (dynamic, time based, and percentage based reservations are in development)

resource offers include the reservation role (if any)

masters

framework (trends)

offer: hostname, 4 CPUs, 4 GB RAM, role: trends

Page 86:

reservations

[pie chart labels: promotions 40%; trends 20%; used 10%; unused 30%; recommendations 40%]

reservations provide guarantees, but at the cost of utilization

Page 87:

revocable resources

masters

framework (promotions)

reserved resources that are unused can be allocated to frameworks from different roles but those resources may be revoked at any time

offer: hostname, 4 CPUs, 4 GB RAM, role: trends

Page 88:

preemption via revocation

… my tasks will not be killed unless I’m using revocable resources!
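
The lend-and-revoke rule can be sketched with a single reserved slot. `Slot`, `lend`, and `reclaim` are hypothetical names used only for this illustration; the roles are the ones on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    reserved_for: str               # role that holds the reservation
    used_by: Optional[str] = None   # role currently running on the slot
    revocable: bool = False         # True when the user is a borrower


def lend(slot, borrower):
    """Offer an idle reserved slot to a role; other roles get it as revocable."""
    if slot.used_by is not None:
        return False
    slot.used_by = borrower
    slot.revocable = borrower != slot.reserved_for
    return True


def reclaim(slot):
    """The owning role wants its reservation back: only revocable use is preempted."""
    if slot.used_by is not None and slot.revocable:
        evicted = slot.used_by
        slot.used_by, slot.revocable = slot.reserved_for, False
        return evicted
    return None


slot = Slot(reserved_for="trends")
lend(slot, "promotions")       # promotions borrows the idle reservation
evicted = reclaim(slot)        # trends takes it back, promotions is preempted
second = reclaim(slot)         # trends itself is never preempted
```

This mirrors the guarantee on the slide: a task is only killed if it was running on revocable resources.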

Page 89:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Pages 90-91:

high-availability and fault-tolerance a prerequisite @twitter

① framework failover

② master failover

③ slave failover

machine failure

process failure (bugs!)

upgrades

Page 92:

masters

① framework failover

framework

framework re-registers with master and resumes operation

all tasks keep running across framework failover!

framework


Page 94:

masters

② master failover

framework

after a new master is elected all frameworks and slaves connect to the new master

all tasks keep running across master failover!


Pages 96-100:

③ slave failover

slave

mesos-slave

task, task

[figure build: the mesos-slave process goes away and comes back while both tasks stay on the slave]

Page 101:

③ slave failover @twitter

slave

mesos-slave

(large in-memory services, expensive to restart)

Page 102:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Page 103:

execution

masters

framework

task: 3 CPUs, 2 GB RAM

frameworks launch fine-grained tasks for execution

if necessary, a framework can provide an executor to handle the execution of a task

Pages 104-106:

slave

mesos-slave

executor, executor

task, task, task

[figure build: executors on a slave running a varying number of tasks]

Page 107:

goal: isolation

Pages 108-109:

isolation

slave

mesos-slave

executor

task, task

containers

Page 110:

executor + task design means containers can have changing resource allocations
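
One way to picture a changing allocation: the container's limits are recomputed as the sum of whatever tasks the executor is currently running. The function below is a hypothetical sketch with made-up task sizes, not Mesos code.

```python
def container_limits(tasks, executor_overhead=(0.0, 0.0)):
    """Sum (cpus, mem_gb) over the executor itself and its running tasks."""
    cpus = executor_overhead[0] + sum(t[0] for t in tasks)
    mem_gb = executor_overhead[1] + sum(t[1] for t in tasks)
    return cpus, mem_gb


running = [(1.0, 1.0)]                # one task: 1 CPU, 1 GB
before = container_limits(running)

running.append((2.0, 0.5))            # a second task is launched in the same executor
after = container_limits(running)     # the container grows

running.pop(0)                        # the first task finishes
final = container_limits(running)     # the container shrinks again
```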

Pages 111-117:

isolation

slave

mesos-slave

executor

task, task

[figure build repeated over seven slides; only these labels survive]

Page 118:

making the task first-class gives us true fine-grained resource sharing

Page 119:

requirement: fast task launching (i.e., milliseconds or less)

Page 120:

virtual machines: an anti-pattern

Page 121:

operating-system virtualization

containers (zones and projects)

control groups (cgroups)

namespaces

Page 122:

isolation support

tight integration with cgroups

CPU (upper and lower bounds)

memory

network I/O (traffic controller, in development)

filesystem (using LVM, in development)
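
For the CPU bounds, a rough sketch of how a CPU count could map onto the cgroup v1 CPU controller files: `cpu.shares` acts as the lower bound (a relative weight that matters under contention) and `cpu.cfs_quota_us` over `cpu.cfs_period_us` as the upper bound (a hard cap). The constants, 1024 shares per CPU and a 100 ms period, are assumptions and may not match what Mesos configures.

```python
CPU_SHARES_PER_CPU = 1024   # assumed weight for one CPU
CFS_PERIOD_US = 100_000     # assumed CFS period: 100 ms


def cpu_cgroup_settings(cpus):
    """Translate an allocated CPU count into cgroup v1 CPU controller values."""
    return {
        # Lower bound: relative weight, only enforced when CPUs are contended.
        "cpu.shares": int(cpus * CPU_SHARES_PER_CPU),
        "cpu.cfs_period_us": CFS_PERIOD_US,
        # Upper bound: CPU time the container may use per period.
        "cpu.cfs_quota_us": int(cpus * CFS_PERIOD_US),
    }


settings = cpu_cgroup_settings(3)  # the 3-CPU task from the earlier slides
```

Without the quota a container can burst into idle CPUs; with it, the task gets the same ceiling on every run.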

Page 123:

statistics too

rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)

used @twitter for capacity planning (and oversubscription in development)

Page 124:

CPU upper bounds?

in practice, determinism trumps utilization

Page 125:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Page 126:

requirements:

① performance

② maintainability (static typing)

③ interfaces to low-level OS (for isolation, etc.)

④ interoperability with other languages (for library bindings)

Page 127:

garbage collection: a performance anti-pattern

Page 128:

consequences:

① antiquated libraries (especially around concurrency and networking)

② nascent community

Page 129:

github.com/3rdparty/libprocess

concurrency via futures/actors, networking via message passing

Page 130:

github.com/3rdparty/stout

monads in C++, safe and understandable utilities

Page 131:

but …

Page 132:

scalability simulations to 50,000+ slaves

Page 133:

@twitter we run multiple Mesos clusters each with 3500+ nodes

Page 134:

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

Page 135:

final remarks

Page 136:

frameworks

• Hadoop (github.com/mesos/hadoop)

• Spark (github.com/mesos/spark)

• DPark (github.com/douban/dpark)

• Storm (github.com/nathanmarz/storm)

• Chronos (github.com/airbnb/chronos)

• MPICH2 (in mesos git repository)

• Marathon (github.com/mesosphere/marathon)

• Aurora (github.com/twitter/aurora)

Page 137:

write your next distributed system with Mesos!

Page 138:

port a framework to Mesos: write a “wrapper”

~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other Mesos features)

see http://github.com/mesos/hadoop

Page 139:

Thank You!

mesos.apache.org

mesos.apache.org/blog

@ApacheMesos


Pages 143-144:

stateless master (master failover)

to make master failover fast, we chose to make the master stateless

state is stored in the leaves, at the frameworks and the slaves

makes sense for frameworks that don’t want to store state (i.e., can’t actually failover)

consequences: slaves are fairly complicated (need to checkpoint), frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
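
Because the state lives at the leaves, a newly elected master can rebuild its view from what the slaves report when they reconnect. The sketch below assumes a hypothetical report format (each slave lists its running tasks and their frameworks); it is not the Mesos re-registration protocol.

```python
def rebuild_master_state(slave_reports):
    """Reconstruct task -> (slave, framework) from what re-registering slaves report."""
    tasks = {}
    for slave_id, running in slave_reports.items():
        for task_id, framework_id in running:
            tasks[task_id] = (slave_id, framework_id)
    return tasks


# Hypothetical reports sent by two slaves after a new master is elected.
reports = {
    "slave1": [("t1", "aurora"), ("t2", "storm")],
    "slave2": [("t3", "aurora")],
}
state = rebuild_master_state(reports)
```

Nothing has to be loaded from the old master, which is why tasks can keep running while the failover happens.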


Page 148:

origins

Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

mesos.apache.org/documentation

Page 149:

ecosystem

mesos developers

operators

framework developers

Page 150:

a tour of mesos from different perspectives of the ecosystem

Page 151:

the operator

Page 152:

the operator

People who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.)

Tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker)

“ops” at most companies (SREs at Twitter)

the static partitioners

Page 153:

for the operator, Mesos is a cluster manager

Page 154:

for the operator, Mesos is a resource manager

Page 155:

for the operator, Mesos is a resource negotiator

Page 156:

for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

Page 157:

for the operator, Mesos is a distributed system with a master/slave architecture

masters

slaves

Page 158:

frameworks/applications register with the Mesos master in order to run jobs/tasks

masters

slaves

Page 159:

frameworks can be required to authenticate as a principal*

masters

framework

SASL

CRAM-MD5 secret mechanism (Kerberos in development)

masters initialized with secrets
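
The CRAM-MD5 exchange itself is small: the server sends a challenge, and the client answers with its name followed by the hex HMAC-MD5 of the challenge keyed with the shared secret. The sketch below shows only that computation using the Python standard library; the principal, secret, and challenge are made-up values, and how Mesos wires this into SASL is not shown.

```python
import hashlib
import hmac


def cram_md5_response(principal, secret, challenge):
    """Client side: principal, a space, then hex HMAC-MD5(secret, challenge)."""
    digest = hmac.new(secret, challenge, hashlib.md5).hexdigest()
    return principal.encode() + b" " + digest.encode()


def verify(principal, secret, challenge, response):
    """Server side: recompute with the stored secret and compare in constant time."""
    expected = cram_md5_response(principal, secret, challenge)
    return hmac.compare_digest(expected, response)


# Hypothetical values: the master already knows the secret for this principal.
challenge = b"<12345.67890@master.example>"
response = cram_md5_response("framework1", b"s3cret", challenge)

accepted = verify("framework1", b"s3cret", challenge, response)
rejected = verify("framework1", b"wrong", challenge, response)
```

The secret never crosses the wire, only a digest tied to a one-time challenge.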

Page 160:

Mesos is highly-available and fault-tolerant

Pages 161-162:

the framework developer

Page 163:

Mesos uses Apache ZooKeeper for coordination

masters

slaves

Apache ZooKeeper

Page 164: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

increase utilization with revocable resources and preemption

[Figure: the masters offer "hostname: 4 CPUs, 4 GB RAM, role: -" to framework1, framework2, and framework3. Two "reservations" charts split resources among the three frameworks: 61% / 24% / 15% and 64% / 25% / 11%.]

Page 165: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

optimistic vs pessimistic

what to say here …

Page 166: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

authorization*

principals can be used for:

authorizing allocation roles

authorizing operating system users (for execution)

Page 167: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

authorization

Page 168: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 170: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

I’d love to answer some questions with the help

of my data!

Page 171: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

I think I’ll try Hadoop.

Page 172: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

your datacenter

Page 173: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

+ Hadoop

Page 174: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

happy?

Page 175: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Not exactly …

Page 176: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

… Hadoop is a big hammer, but not

everything is a nail!

Page 177: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

I’ve got some iterative algorithms, I want to try

Spark!

Page 178: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 179: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 180: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 181: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning

Page 182: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning

Page 183: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning considered harmful

Page 184: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning considered harmful

(1) hard to share data

(2) hard to scale elastically (to exploit statistical multiplexing)

(3) hard to fully utilize machines

(4) hard to deal with failures

Page 186: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Hadoop …

(map/reduce)

(distributed file system)

Page 187: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 188: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 189: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 190: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Could we just give Spark its own HDFS cluster

too?

Page 191: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS x 2

Page 192: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS x 2

Page 193: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS x 2

Page 194: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS x 2

tee incoming data (2 copies)

Page 195: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS x 2

tee incoming data (2 copies)

periodic copy/sync

Page 196: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

That sounds annoying … let’s not do that. Can we do any better though?

Page 197: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 198: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 199: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 200: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

HDFS

Page 201: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning considered harmful

(1) hard to share data

(2) hard to scale elastically (to exploit statistical multiplexing)

(3) hard to fully utilize machines

(4) hard to deal with failures

Page 202: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

During the day I’d rather give more machines to Spark but at night I’d

rather give more machines to Hadoop!

Page 203: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 204: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 205: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 206: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 207: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.
Page 208: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning considered harmful

(1) hard to share data

(2) hard to scale elastically (to exploit statistical multiplexing)

(3) hard to fully utilize machines

(4) hard to deal with failures

Page 209: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 210: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 211: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 212: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

static partitioning considered harmful

(1) hard to share data

(2) hard to scale elastically (to exploit statistical multiplexing)

(3) hard to fully utilize machines

(4) hard to deal with failures

Page 213: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 214: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 215: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter management

Page 216: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.
Page 217: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.
Page 218: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.
Page 219: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.
Page 220: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

I don’t want to deal with this!

Page 221: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

the datacenter …

rather than think about the datacenter like this …

Page 222: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

… is a computer

think about it like this …

Page 223: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

datacenter computer

applications

resources

filesystem

Page 224: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

applications

resources

filesystem

kernel

Page 225: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

applications

resources

filesystem

kernel

Page 226: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

frameworks

resources

filesystem

kernel

Page 227: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 1: filesystem

Page 228: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 2: mesos

run a “master” (or multiple for high availability)

Page 229: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 2: mesos

run “slaves” on the rest of the machines

Page 230: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 3: frameworks

Page 243: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit

$

Page 244: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit (statistical multiplexing)

$

Page 249: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit (statistical multiplexing)

$

reduces CapEx and OpEx!

Page 250: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit (statistical multiplexing)

$

reduces latency!

Page 251: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit (utilize)

$

Page 257: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: profit (failures)

$

Page 260: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 262: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

frameworks

resources

filesystem

kernel

Page 263: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

frameworks

resources

kernel

Page 264: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Page 265: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Page 266: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

reservations

can reserve resources per slave to provide guaranteed resources

requires human participation (ops) to determine which resources should be reserved for which roles

kind of like thread affinity, but across many machines (and not just for CPUs)

Page 267: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Page 268: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Page 269: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

(1) allocate reserved resources to frameworks authorized for a particular role

(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights
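The two steps above can be sketched as a toy allocator. This is a simplification under stated assumptions, not the Mesos allocator: it tracks a single resource (CPUs), treats each framework as its own role, and ignores per-slave placement. The function name `allocate` and its parameters are hypothetical.

```python
def allocate(total_cpus, reservations, demands, weights):
    """Return CPUs granted per framework.

    reservations: framework -> CPUs reserved for its role (assumed <= total_cpus)
    demands:      framework -> CPUs the framework wants
    weights:      framework -> relative weight for the shared pool
    """
    # Step 1: each framework first draws on its own reservation.
    granted = {f: min(demands[f], reservations.get(f, 0)) for f in demands}

    # Unused reserved CPUs and unreserved CPUs together form the shared pool.
    pool = total_cpus - sum(granted.values())

    # Step 2: split the pool by weight among frameworks that still want more,
    # re-splitting whatever is left when a framework is fully satisfied.
    unmet = {f: demands[f] - granted[f] for f in demands if demands[f] > granted[f]}
    while pool > 1e-9 and unmet:
        total_weight = sum(weights[f] for f in unmet)
        share = {f: pool * weights[f] / total_weight for f in unmet}
        for f in list(unmet):
            take = min(share[f], unmet[f])
            granted[f] += take
            pool -= take
            unmet[f] -= take
            if unmet[f] <= 1e-9:
                del unmet[f]
    return granted
```

For example, with 10 CPUs, a 4-CPU reservation for framework "a" that only wants 2, and two equally weighted frameworks "b" and "c" that each want 10, "a" keeps 2 and the remaining 8 (including the 2 unused reserved CPUs) are split evenly.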

Page 270: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

preemption

if a framework runs tasks outside of its reservations, those tasks can be preempted (i.e., the task killed and the resources revoked) to make room for a framework running a task within its reservation

Page 271: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 272: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

mesos

frameworks

kernel

Page 273: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework ≈ distributed system

Page 274: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework commonality

run processes/tasks simultaneously (distributed)

handle process failures (fault-tolerant)

optimize performance (elastic)

Page 275: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework commonality

run processes/tasks simultaneously (distributed)

handle process failures (fault-tolerant)

optimize performance (elastic)

coordinate execution

Page 276: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

frameworks are execution coordinators

Page 277: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

frameworks are execution coordinators

Page 278: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

frameworks are execution schedulers

Page 279: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

end-to-end principle

“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

i.e., frameworks want to coordinate their tasks’ execution and they should be able to

Page 280: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework anatomy

frameworks

Page 281: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework anatomy

frameworks

scheduling API

Page 282: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

scheduling

Page 283: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

scheduling

i’d like to run some tasks!

Page 284: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

scheduling

here are some resource offers!

Page 285: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource offers

an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks

schedulers pick which resources to use to run their tasks

foo.bar.com: 4 CPUs, 4 GB RAM

Page 286: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

“two-level scheduling”

mesos: controls resource allocations to schedulers

schedulers: make decisions about what to run given allocated resources
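A rough sketch of the second level, assuming a scheduler that greedily packs its pending tasks into whatever offer it receives. The `Offer` and `Task` types and `place_tasks` are hypothetical illustrations, not the Mesos scheduler API; the first level (deciding which framework gets the offer) is left to mesos.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    hostname: str
    cpus: float
    mem_gb: float


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


def place_tasks(offer: Offer, pending: list[Task]) -> list[Task]:
    """Framework-level decision: pick which pending tasks fit in this offer."""
    launched = []
    cpus, mem = offer.cpus, offer.mem_gb
    for task in list(pending):  # iterate over a copy; pending is mutated below
        if task.cpus <= cpus and task.mem_gb <= mem:
            launched.append(task)
            pending.remove(task)
            cpus -= task.cpus
            mem -= task.mem_gb
    return launched
```

Given the offer "foo.bar.com: 4 CPUs, 4 GB RAM", a scheduler with tasks needing (2 CPUs, 1 GB), (2 CPUs, 2 GB), and (1 CPU, 1 GB) would launch the first two and keep the third pending for a later offer.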

Page 287: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

concurrency control

the same resources may be offered to different frameworks

Page 288: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

concurrency control

the same resources may be offered to different frameworks

pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)
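The two ends of that spectrum can be contrasted with a toy model. This is only an illustration of the idea, assuming a single CPU pool and a first-accept-wins rule for conflicts; `ToyMaster` and its methods are hypothetical and do not describe how Mesos resolves overlapping offers.

```python
class ToyMaster:
    """Offers CPUs pessimistically (no overlap) or optimistically (full overlap)."""

    def __init__(self, cpus: int, optimistic: bool):
        self.free = cpus
        self.optimistic = optimistic
        self.outstanding: dict[str, int] = {}  # framework -> CPUs currently offered

    def offer(self, framework: str) -> int:
        # Pessimistic: CPUs already offered to others are withheld until answered.
        # Optimistic: every framework sees all free CPUs at once.
        offered_elsewhere = sum(c for f, c in self.outstanding.items() if f != framework)
        available = self.free if self.optimistic else self.free - offered_elsewhere
        if available > 0:
            self.outstanding[framework] = available
        return available

    def accept(self, framework: str, cpus: int) -> bool:
        offered = self.outstanding.pop(framework, 0)
        if cpus > offered or cpus > self.free:
            return False  # conflict: someone else claimed the CPUs first
        self.free -= cpus
        return True
```

The trade-off is the usual one: pessimistic offers never conflict but leave resources idle while a framework decides, whereas optimistic offers keep every framework busy at the cost of occasional rejected launches.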

Page 289: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

the “threads” of the framework, a consumer of resources (cpu, memory, etc)

either a concrete command line or an opaque description (which requires an executor)

Page 290: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

here are some resources!

Page 291: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

launch these tasks!

Page 292: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

Page 293: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

Page 294: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

Page 295: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

Page 296: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

task status update!

Page 297: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

Page 298: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

Page 299: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

status updates

task status update!

Page 300: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

more scheduling

Page 301: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

more scheduling

i’d like to run some tasks!

Page 302: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 303: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability

Page 304: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 305: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 306: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 307: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 308: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 309: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

task status update!

Page 310: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

i’d like to run some tasks!

Page 311: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (master)

Page 312: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (framework)

Page 313: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (framework)

Page 314: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (framework)

Page 315: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (framework)

Page 316: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (slave)

Page 317: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (slave)

Page 318: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

high-availability (slave)

Page 319: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 320: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource isolation

leverage Linux control groups (cgroups)

CPU (upper and lower bounds)

memory

network I/O (traffic controller, in progress)

filesystem (LVM, in progress)
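As a hedged illustration of what "upper and lower bounds" can mean with cgroups, the sketch below maps a task's CPU and memory allocation onto Linux cgroup v1 control files: `cpu.shares` acts as a relative lower bound under contention, and the CFS quota/period pair acts as a hard upper bound. The conversion factors (1024 shares per CPU, a 100 ms period, a minimum of 2 shares) are assumptions chosen for illustration, and `cgroup_settings` is a hypothetical helper, not a Mesos function.

```python
def cgroup_settings(cpus: float, mem_gb: float, period_us: int = 100_000) -> dict:
    """Translate an allocation into cgroup v1 control-file values (illustrative)."""
    return {
        # Lower bound: proportional weight when CPUs are contended.
        "cpu.shares": max(2, int(cpus * 1024)),
        # Upper bound: at most `cpus` worth of CPU time per scheduling period.
        "cpu.cfs_period_us": period_us,
        "cpu.cfs_quota_us": int(cpus * period_us),
        # Memory limit in bytes.
        "memory.limit_in_bytes": int(mem_gb * 1024**3),
    }
```

An allocation of 4 CPUs and 4 GB RAM would, under these assumptions, become 4096 shares, a 400000 µs quota per 100000 µs period, and a 4294967296-byte memory limit.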

Page 321: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource statistics

rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)

per task/executor statistics are collected (for all fork/exec’ed processes too!)

can help with capacity planning

Page 322: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 323: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

security

Twitter recently added SASL support; the default mechanism is CRAM-MD5, and Kerberos support is planned for the short term

Page 324: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

Page 325: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework commonality

run processes/tasks simultaneously (distributed)

handle process failures (fault-tolerant)

optimize performance (elastic)

Page 326: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework commonality

as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier, such as launching tasks, doing failure detection, etc. Why re-implement them each time!?

Page 327: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

case study: chronos

distributed cron with dependencies

developed at airbnb

~3k lines of Scala!

distributed, highly available, and fault tolerant without any network programming!

http://github.com/airbnb/chronos

Page 328: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

analytics

Page 329: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

analytics + services

Page 330: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

analytics + services

Page 331: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

analytics + services

Page 332: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

case study: aurora

“run 200 of these, somewhere, forever”

developed at Twitter

highly available (uses the mesos replicated log)

uses a python DSL to describe services

leverages service discovery and proxying (see Twitter commons)

http://github.com/twitter/aurora

Page 333: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

frameworks

• Hadoop (github.com/mesos/hadoop)

• Spark (github.com/mesos/spark)

• DPark (github.com/douban/dpark)

• Storm (github.com/nathanmarz/storm)

• Chronos (github.com/airbnb/chronos)

• MPICH2 (in mesos git repository)

• Marathon (github.com/mesosphere/marathon)

• Aurora (github.com/twitter/aurora)

Page 334: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

write your next distributed system with mesos!

Page 335: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

port a framework to mesos

write a “wrapper” scheduler

~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)

see http://github.com/mesos/hadoop

Page 336: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

conclusions

datacenter management is a pain

Page 337: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

conclusions

mesos makes running frameworks on your datacenter easier, increases utilization and performance, and reduces CapEx and OpEx!

Page 338: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

conclusions

rather than build your next distributed system from scratch, consider using mesos

Page 339: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

conclusions

you can share your datacenter between analytics and online services!

Page 340: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Questions?

mesos.apache.org

@ApacheMesos

Page 341: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

Page 342: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

Page 343: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

Page 344: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

Page 345: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

Page 346: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

framework commonality

run processes simultaneously (distributed)

handle process failures (fault-tolerance)

optimize execution (elasticity, scheduling)

Page 347: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

primitives

scheduler – distributed system “master” or “coordinator”

(executor – lower-level control of task execution, optional)

requests/offers – resource allocations

tasks – “threads” of the distributed system

Page 348: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

scheduler

Apache Hadoop

Chronos

Page 349: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

scheduler

(1) brokers for resources

(2) launches tasks

(3) handles task termination

Page 350: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

brokering for resources

(1) make resource requests: 2 CPUs, 1 GB RAM, slave *

(2) respond to resource offers: 4 CPUs, 4 GB RAM, slave foo.bar.com

Page 351: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

offers: non-blocking resource allocation

exist to answer the question:

“what should mesos do if it can’t satisfy a request?”

(1) wait until it can

(2) offer the best allocation it can immediately

Page 352: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

offers: non-blocking resource allocation

exist to answer the question:

“what should mesos do if it can’t satisfy a request?”

(1) wait until it can

(2) offer the best allocation it can immediately

Page 353: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Apache Hadoop

Chronos

request

Page 354: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Apache Hadoop

Chronos

request

allocator

dominant resource fairness

resource reservations

Page 355: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Apache Hadoop

Chronos

request

allocator

dominant resource fairness

resource reservations

pessimistic ↔ optimistic

Page 356: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Apache Hadoop

Chronos

request

allocator

dominant resource fairness

resource reservations

pessimistic (no overlapping offers) ↔ optimistic (all overlapping offers)

Page 357: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource allocation

Apache Hadoop

Chronos

offer

allocator

dominant resource fairness

resource reservations
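Dominant resource fairness (DRF), named on the allocator slide, can be sketched as follows: a framework's dominant share is the largest fraction it holds of any single resource, and the allocator repeatedly gives the next task to the framework with the smallest dominant share. This is a simplified sketch, not the Mesos allocator: it assumes fixed per-task demands, a single pooled cluster, and drops a framework once its next task no longer fits. The cluster size and demands in the usage note are illustrative numbers, not figures from this talk.

```python
def drf(capacity, demands):
    """Return the number of tasks each framework gets under a simple DRF loop.

    capacity: resource -> total amount in the cluster
    demands:  framework -> {resource -> amount needed per task}
    """
    used = {r: 0.0 for r in capacity}
    alloc = {f: {r: 0.0 for r in capacity} for f in demands}
    tasks = {f: 0 for f in demands}
    active = set(demands)

    def dominant_share(f):
        return max(alloc[f][r] / capacity[r] for r in capacity)

    while active:
        # Serve the framework that currently holds the smallest dominant share
        # (ties broken by name so the result is deterministic).
        f = min(sorted(active), key=dominant_share)
        need = demands[f]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
                alloc[f][r] += need[r]
            tasks[f] += 1
        else:
            active.remove(f)  # its next task no longer fits
    return tasks
```

With 9 CPUs and 18 GB of memory, a memory-heavy framework A (1 CPU, 4 GB per task) and a CPU-heavy framework B (3 CPUs, 1 GB per task) end up with 3 and 2 tasks respectively, so each holds two thirds of its own dominant resource.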

Page 358: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

“two-level scheduling”

mesos: controls resource allocations to framework schedulers

schedulers: make decisions about what to run given allocated resources

Page 359: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

end-to-end principle

“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

Page 360: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

tasks

either a concrete command line or an opaque description (which requires a framework executor to execute)

a consumer of resources
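
As a data structure, the two kinds of task might look like the sketch below. The class and field names are hypothetical and do not mirror the Mesos message definitions; the only point is that a task carries resources plus exactly one of a command line or an opaque payload.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Task:
    name: str
    resources: Dict[str, float]      # every task consumes resources
    command: Optional[str] = None    # concrete command line, or ...
    data: Optional[bytes] = None     # ... opaque description for an executor

    def __post_init__(self):
        # Exactly one of the two descriptions must be given.
        if (self.command is None) == (self.data is None):
            raise ValueError("a task needs exactly one of command or data")

    @property
    def needs_executor(self) -> bool:
        # Only the framework's own executor can interpret opaque data.
        return self.data is not None


shell_task = Task("sleep", {"cpus": 0.1}, command="sleep 10")
opaque_task = Task("map-0", {"cpus": 1.0}, data=b"\x00\x01")
print(shell_task.needs_executor, opaque_task.needs_executor)  # False True
```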

Page 361: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

task operations

launching/killing

health monitoring/reporting (failure detection)

resource usage monitoring (statistics)

Page 362: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

resource isolation

cgroup per executor or task (if no executor)

resource controls adjusted dynamically as tasks come and go!
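
"Adjusted dynamically" can be illustrated by recomputing an executor's limits every time a task starts or finishes. This is a sketch under stated assumptions: the class `ExecutorCgroup` is hypothetical, the conversion of 1024 CPU shares per CPU is assumed for illustration, and nothing is written to the cgroup filesystem; the code only computes the values that would be written.

```python
CPU_SHARES_PER_CPU = 1024  # assumed conversion for this sketch


class ExecutorCgroup:
    """Tracks the tasks of one executor and derives its cgroup limits."""

    def __init__(self):
        self.tasks = {}

    def limits(self):
        # The limits are the sum over the tasks currently running.
        cpus = sum(t["cpus"] for t in self.tasks.values())
        mem_mb = sum(t["mem_mb"] for t in self.tasks.values())
        return {
            "cpu.shares": int(cpus * CPU_SHARES_PER_CPU),
            "memory.limit_in_bytes": mem_mb * 1024 * 1024,
        }

    def launch(self, task_id, cpus, mem_mb):
        self.tasks[task_id] = {"cpus": cpus, "mem_mb": mem_mb}
        return self.limits()

    def finish(self, task_id):
        del self.tasks[task_id]
        return self.limits()


cg = ExecutorCgroup()
after_t1 = cg.launch("t1", cpus=0.5, mem_mb=256)
after_t2 = cg.launch("t2", cpus=1.5, mem_mb=512)
after_done = cg.finish("t1")
print(after_t1, after_t2, after_done, sep="\n")
```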

Page 363: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

case study: chronos

distributed cron with dependencies

built at airbnb by @flo

Page 365: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

before chronos

single point of failure (and AWS was unreliable)

resource starved (not scalable)

Page 366: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

chronos requirements

fault tolerance

distributed (elastically take advantage of resources)

retries (make sure a command eventually finishes)

dependencies
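
Two of these requirements, retries and dependencies, can be sketched in a few lines. This is not how Chronos is implemented (Chronos is written in Scala and runs jobs on Mesos); the names `run_all` and `flaky` are hypothetical, jobs are plain Python callables that return True on success, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_all(jobs, deps, max_retries=3):
    """Run each job after its dependencies, retrying failures.

    `deps` maps a job name to the set of jobs it depends on.
    Returns the number of attempts each job needed.
    """
    attempts = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 1):
            attempts[name] = attempt
            if jobs[name]():
                break
        else:
            # The retry loop ended without a successful run.
            raise RuntimeError(f"{name} failed after {max_retries} attempts")
    return attempts


def flaky(failures):
    """Build a job that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return job


deps = {"report": {"extract", "clean"}, "clean": {"extract"}}
jobs = {"extract": flaky(0), "clean": flaky(2), "report": flaky(0)}
attempts = run_all(jobs, deps)
print(attempts)  # {'extract': 1, 'clean': 3, 'report': 1}
```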

Page 367: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

chronos

leverages the primitives of mesos

~3k lines of scala

highly available (uses Mesos state)

distributed / elastic

no actual network programming!

Page 368: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

after chronos

Page 369: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

after chronos + hadoop

Page 370: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

case study: aurora

“run 200 of these, somewhere, forever”

built at Twitter
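
"Run N of these, somewhere, forever" amounts to a reconciliation loop: compare the desired instance count with what is still alive and relaunch the difference on healthy machines. The sketch below is an assumption-laden illustration, not Aurora's scheduler; the function `reconcile` and its arguments are hypothetical, and placement is plain round-robin.

```python
def reconcile(desired, running, healthy_hosts):
    """Return the (instance, host) launches needed to restore `desired` copies.

    `running` maps an instance number to the host it was last placed on.
    """
    # An instance only counts if its host is still healthy.
    alive = {i: h for i, h in running.items() if h in healthy_hosts}
    missing = [i for i in range(desired) if i not in alive]
    hosts = sorted(healthy_hosts)
    if not hosts:
        return []  # nothing to place on; try again on the next pass
    return [(i, hosts[k % len(hosts)]) for k, i in enumerate(missing)]


running = {0: "a", 1: "b", 2: "b", 3: "c", 4: "c"}
launches = reconcile(5, running, healthy_hosts={"a", "b"})  # host "c" was lost
print(launches)  # [(3, 'a'), (4, 'b')]
```

Calling this on every change in cluster state is what makes the "forever" part hold without operator intervention.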

Page 371: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

before aurora

static partitioning of machines to services

hardware outages caused site outages

puppet + monit

ops couldn’t scale as fast as engineers

Page 372: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

aurora

highly available (uses mesos replicated log)

uses a python DSL to describe services

leverages service discovery and proxying (see Twitter commons)

Page 373: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

after aurora

power loss to 19 racks, no lost services!

more than 400 engineers running services

largest cluster has >2500 machines

Page 374: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

[diagram: Mesos layer over the cluster nodes, running Hadoop, Spark, MPI, Storm, Chronos]

Page 375: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

[diagram: Mesos layer over the cluster nodes, running Hadoop, Spark, MPI]

Page 376: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

[diagram: Mesos layer over the cluster nodes, running Hadoop, Spark, MPI, Storm]

Page 377: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

[diagram: Mesos layer over the cluster nodes, running Hadoop, Spark, MPI, Storm, Chronos …]

Page 378: Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos.

Step 4: Profit (statistical multiplexing)

$
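
The gain from statistical multiplexing can be shown with a small worked example. The demand numbers below are made up for illustration: a statically partitioned cluster must be sized for the sum of each framework's peak, while a shared cluster only needs the peak of the summed demand.

```python
# Hypothetical node demand per framework over six time steps.
demand = {
    "hadoop":  [8, 8, 2, 1, 1, 6],
    "spark":   [1, 2, 8, 3, 1, 1],
    "chronos": [1, 1, 1, 6, 2, 1],
}

# Static partitioning: each framework gets its own peak, all the time.
static_nodes = sum(max(series) for series in demand.values())

# Shared cluster: only the busiest moment overall has to fit.
shared_nodes = max(sum(step) for step in zip(*demand.values()))

print(static_nodes, shared_nodes)  # 22 11
```

The saving comes from the peaks not coinciding; if every framework peaked at the same time step, the two numbers would be equal.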