GeekAustin: What’s So Exciting About Mesos?

Tobi Knaup @superguenter, Paco Nathan @pacoid. “GeekAustin: What’s So Exciting About Mesos?” Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Tuesday, 13 August 13

Transcript of GeekAustin: What’s So Exciting About Mesos?

“What’s so exciting about Mesos?”

• What is Apache Mesos?

• Case Studies

• History: How did we get here?

• Screen Shots

• Demo, Q&A

mesos.apache.org

Mesos – definitions

a common substrate for cluster computing

heterogeneous assets in your data center or cloud made available as a homogeneous set of resources

• Fault-tolerant replicated master using ZooKeeper

• Scalability to 10,000s of nodes

• Isolation between tasks with Linux Containers

• Multi-resource scheduling (memory and CPU aware)

• Java, Python, and C++ APIs for developing new parallel applications

• Web UI for viewing cluster state

• Obviates the need for virtual machines
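
The multi-resource scheduling bullet above can be illustrated with a small sketch. Mesos hands frameworks resource offers, and a task only fits an offer if every resource dimension fits. This is not the Mesos API: the `Resources` class, the `fits` and `pack_offer` helpers, and all task names and sizes are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical resource vector; Mesos tracks more dimensions than these."""
    cpus: float
    mem_mb: float


def fits(task: Resources, available: Resources) -> bool:
    """A task fits only if CPU and memory both fit."""
    return task.cpus <= available.cpus and task.mem_mb <= available.mem_mb


def pack_offer(offer: Resources, pending: list) -> list:
    """Greedily pick pending (name, Resources) tasks that fit in one offer."""
    remaining = Resources(offer.cpus, offer.mem_mb)
    launched = []
    for name, need in pending:
        if fits(need, remaining):
            launched.append(name)
            remaining.cpus -= need.cpus
            remaining.mem_mb -= need.mem_mb
    return launched


# Illustrative offer and task sizes (assumptions, not measurements).
offer = Resources(cpus=4.0, mem_mb=8192)
pending = [
    ("web", Resources(1.0, 1024)),
    ("batch", Resources(2.0, 6144)),
    ("cache", Resources(0.5, 4096)),  # CPU would fit, memory would not
    ("cron", Resources(0.5, 512)),
]
launched = pack_offer(offer, pending)
print(launched)
```

A scheduler that looked at CPU alone would have accepted the `cache` task and overcommitted memory, which is the failure the memory-and-CPU-aware check avoids.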

Mesos – background

• Available for Linux, Mac OS X, OpenSolaris

• Developed by UC Berkeley / AMP Lab, Twitter, Airbnb, Mesosphere, etc.

• Deployments at Twitter, Airbnb, InsideVault, Vimeo, UCSF, UC Berkeley, etc.

Mesos – architecture

[Diagram. Layer labels: Mesos Kernel, Frameworks, Apps, Workloads. Components shown: Chronos, Marathon; Hadoop, Spark, Storm; Rails, JBoss; Kafka, MPI; Hive, Scalding; Web Apps, Streaming, Batch. Languages: JVM, Python, C++.]

“Return of the Borg”

Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos

“We wanted people to be able to program for the data center just like they program for their laptop.”

Ben Hindman

“Return of the Borg”

Consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers…

Borg serves as the data center “secret sauce”, with Omega as its next evolution:

2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc

Industry Issues:

• Most software developers tend to think about computing resources in terms of individual hosts

• Clusters are simply considered as collections of hosts

• Typically, those machines get divided into smaller virtual machines to allow for fine-grained resource allocation

• On the one hand, this practice leads to more complexity, due to the number of systems that must be managed

• On the other hand, it results in less efficiency: the hypervisor becomes a black box which the host operating system cannot schedule intelligently

Mesos – benefits

• scale to 10,000s of nodes using a fast, event-driven C++ implementation

• maximize utilization rates, minimize latency for data updates

• combine batch, real-time, and long-lived services on the same nodes and share resources

• reshape clusters on the fly based on app history and workload requirements

• run multiple Hadoop versions, Spark, MPI, Heroku, HAProxy, etc., on the same cluster

• build new distributed frameworks without reinventing low-level facilities

• enable new kinds of apps, which combine frameworks with lower latency

• hire top talent out of Google, while providing a familiar data center environment

STATE OF THE ART - PROVISIONED VMS

[Diagram label: DATACENTER]

• Provision VMs on public cloud or physical servers

STATE OF THE ART - STATICALLY PARTITIONED SERVICES

• Use Chef/Puppet to set up & launch Hadoop

• Use Chef/Puppet to set up & launch JBoss

• Manually resize Hadoop

It is difficult to deploy new frameworks (provision, set up, install, resize)

Static partitioning leads to low utilization and prevents elasticity

ONE LARGE POOL OF RESOURCES

DATACENTER

MESOS

VALUE PROPOSITION - EASY DEVELOPMENT OF APPS

CHRONOS, SPARK, HADOOP, DPARK, MPI

JVM (JAVA, SCALA, CLOJURE, JRUBY), PYTHON, C++

MESOS

MESOSPHERE CLOUD OS STACK

Apps: HADOOP, STORM, CHRONOS, RAILS, JBOSS

OS: TELEMETRY, CAPACITY PLANNING, GUI, SECURITY, SMARTER SCHEDULING

Kernel: MESOS

Example: Balance Utilization Curves

[Figure: four panels of CPU load (0% to 100%) over time t: RAILS CPU LOAD, MEMCACHED CPU LOAD, HADOOP CPU LOAD, and COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP)]
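
The arithmetic behind balancing utilization curves can be sketched as follows. The load series are invented for illustration and are not taken from the slides; the only point is that workloads which peak at different times need less capacity when they share one pool than when each gets its own statically sized partition.

```python
# Hypothetical CPU demand (cores in use) sampled at six points in time.
# These numbers are assumptions for illustration, not data from the talk.
rails = [10, 20, 40, 30, 15, 10]
memcached = [5, 10, 20, 15, 10, 5]
hadoop = [60, 50, 20, 30, 55, 65]

# Combined load at each point in time when all three share one cluster.
combined = [r + m + h for r, m, h in zip(rails, memcached, hadoop)]

# Static partitioning must size each partition for its own peak.
sum_of_peaks = max(rails) + max(memcached) + max(hadoop)  # 125

# A shared pool only has to cover the peak of the combined curve.
combined_peak = max(combined)  # 80

average_load = sum(combined) / len(combined)
static_utilization = average_load / sum_of_peaks
shared_utilization = average_load / combined_peak

print(f"capacity needed, static partitions: {sum_of_peaks} cores")
print(f"capacity needed, shared pool:       {combined_peak} cores")
print(f"average utilization, static: {static_utilization:.0%}")
print(f"average utilization, shared: {shared_utilization:.0%}")
```

With these made-up series the shared pool needs 80 cores instead of 125, because the Hadoop load dips while the Rails and Memcached loads peak.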

Case Study: Twitter (bare metal / on-prem)

“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”

Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation

• several key services run in production: analytics, typeahead, ads, etc.

• engineers rely on Mesos to build all our new services

• instead of thinking about static machines, engineers think about resources like CPU, memory and disk

• allows services to scale and leverage a shared pool of servers across data centers efficiently

• reduces the time between prototyping and launching new services efficiently
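
Describing services by resources rather than by machines, as in the bullets above, might look like the sketch below. It is a hypothetical model and not Twitter's tooling or a Mesos API: `ServiceSpec`, `total_demand`, and `pool_can_host` are invented names, and the instance counts and sizes are assumptions. Only the service names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """A service described by what it needs, not by which hosts it runs on."""
    name: str
    instances: int
    cpus: float
    mem_mb: int
    disk_mb: int


def total_demand(services):
    """Sum the per-instance requirements over all instances of all services."""
    return {
        "cpus": sum(s.instances * s.cpus for s in services),
        "mem_mb": sum(s.instances * s.mem_mb for s in services),
        "disk_mb": sum(s.instances * s.disk_mb for s in services),
    }


def pool_can_host(pool, services):
    """True if the pool's aggregate capacity covers the aggregate demand."""
    demand = total_demand(services)
    return all(demand[key] <= pool[key] for key in demand)


# Sizes below are illustrative assumptions.
services = [
    ServiceSpec("typeahead", instances=4, cpus=0.5, mem_mb=512, disk_mb=100),
    ServiceSpec("analytics", instances=2, cpus=2.0, mem_mb=4096, disk_mb=10240),
]
pool = {"cpus": 16, "mem_mb": 32768, "disk_mb": 102400}

demand = total_demand(services)
print(demand)
print(pool_can_host(pool, services))
```

The aggregate check is deliberately coarse: it ignores how capacity is split across individual machines, so a real scheduler still has to place each instance on a host with enough free resources.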

Case Study: Airbnb (fungible cloud infra)

“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.”

Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven-company

• improves resource management and efficiency

• helps advance engineering strategy of building small teams that can move fast

• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop

• allowed Airbnb to migrate off the Elastic MapReduce service

• enables use of Hadoop along with Chronos, Spark, Storm, etc.

TWO WORLDS - ONE SUBSTRATE

Built-in / bare metal

Hypervisors

Solaris Zones

Linux CGroups

TWO WORLDS - ONE SUBSTRATE

Request / Response

Batch

Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware

This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this

Circa 1996: pre-inflection point

[Diagram. Roles: Stakeholder, Customers, Product, Engineering, BI Analysts. Systems: Web App, RDBMS. Flow labels: transactions; SQL query, result sets; Excel pivot tables, PowerPoint slide decks; strategy; requirements; optimized code.]

“throw it over the wall”

Circa 2001: post-big ecommerce successes

[Diagram. Roles: Stakeholder, Customers, Product, Engineering, UX, Algorithmic Modeling. Systems: Web Apps, Middleware, RDBMS, Logs, DW, ETL. Flow labels: customer transactions; event history; aggregation; SQL query, result sets; dashboards; recommenders + classifiers; servlets; models.]

“data products”

Circa 2013: clusters everywhere

[Diagram. Roles: Data Scientist, App Dev, Ops, Domain Expert, Prod, Eng, Customers. Systems: Web Apps, Mobile, etc.; RDBMS; Log Events; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; DW; Planner; Workflow; Data Products; History. Other labels: Use Cases Across Topologies; near time; batch; services; transactions, content; social interactions; s/w dev; data science; discovery + modeling; dashboard metrics; business process; optimized capacity; taps; introduced capability; existing SDLC.]

“optimize topologies”

Primary Sources

Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM

Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html

Current Challenge

Consider the datacenter as a computer…

We must rethink the way that we write, deploy, and manage distributed applications

Early use cases for clustered computing tend to tolerate having many separate clusters; however, more mature Enterprise use cases require ROI, hence higher utilization rates

Managing the operational costs for large, distributed apps becomes key

Mesos provides the means for this evolution
