Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Systems for Resource Management

Corso di Sistemi e Architetture per Big Data A.A. 2016/17

Valeria Cardellini

The reference Big Data stack


Resource Management

Data Storage

Data Processing

High-level Interfaces Support / Integration

Outline

•  Cluster management systems
   –  Mesos
   –  YARN
   –  Borg
   –  Omega
   –  Kubernetes

•  Resource management policies –  DRF


Motivations

•  Rapid innovation in cloud computing

•  No single framework optimal for all applications

•  Running each framework on its dedicated cluster: –  Expensive –  Hard to share data


A possible solution


•  Run multiple frameworks on a single cluster

•  How to share the (virtual) cluster resources among multiple, non-homogeneous frameworks running in virtual machines/containers?

•  The classical solution:

Static partitioning

•  Is it efficient?


What we need

•  The "Datacenter as a Computer" idea by D. Patterson
   –  Share resources to maximize their utilization
   –  Share data among frameworks
   –  Provide a unified API to the outside
   –  Hide the internal complexity of the infrastructure from applications

•  The solution: a cluster-scale resource manager that employs dynamic partitioning


Apache Mesos



•  Cluster manager that provides a common resource sharing layer over which diverse frameworks can run

•  Abstracts the entire datacenter into a single pool of computing resources, simplifying running distributed systems at scale

•  A distributed system to run distributed systems on top of it

Apache Mesos (2)


•  Designed and developed at the University of California, Berkeley
   –  Now a top-level open-source Apache project: mesos.apache.org

•  Twitter and Airbnb as first users; now supports some of the largest applications in the world

•  Cluster: a dynamically shared pool of resources (figure: dynamic vs. static partitioning)

Mesos goals

•  High utilization of resources

•  Supports diverse frameworks (current and future)

•  Scalability to 10,000s of nodes

•  Reliability in the face of failures


Mesos in the data center

•  Where does Mesos fit as an abstraction layer in the datacenter?


Computation model

•  A framework (e.g., Hadoop, Spark) manages and runs one or more jobs

•  A job consists of one or more tasks

•  A task (e.g., map, reduce) consists of one or more processes running on the same machine


What Mesos does


•  Enables fine-grained sharing of resources (CPU, RAM, …) across frameworks, at the level of tasks within a job

•  Provides common functionalities:
   –  Failure detection
   –  Task distribution
   –  Task starting
   –  Task monitoring
   –  Task killing
   –  Task cleanup

Fine-grained sharing

•  Allocation at the level of tasks within a job

•  Improves utilization, latency, and data locality


(figure: coarse-grained vs. fine-grained sharing)

Frameworks on Mesos

•  Frameworks must be aware of running on Mesos
   –  DevOps tooling: Vamp (deployment and workflow tool for container orchestration)
   –  Long-running services: Aurora (service scheduler), …
   –  Big Data processing: Hadoop, Spark, Storm, MPI, …
   –  Batch scheduling: Chronos, …
   –  Data storage: Alluxio, Cassandra, ElasticSearch, …
   –  Machine learning: TFMesos (TensorFlow in Docker on Mesos)

•  See the full list at mesos.apache.org/documentation/latest/frameworks/


Mesos: architecture


•  Master-slave architecture

•  Slaves publish available resources to master

•  Master sends resource offers to frameworks

•  Master election and service discovery via ZooKeeper

Source: Mesos: a platform for fine-grained resource sharing in the data center, NSDI'11

Mesos component: Apache ZooKeeper

•  Coordination service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

•  Used in many distributed systems, among which Mesos, Storm, and Kafka

•  Allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (znodes)
   –  File-system-like API
   –  Name space similar to a standard file system
   –  Limited amount of data stored in each znode
   –  It is not really a file system, database, key-value store, or lock service (though it can be used to build such services)

•  Provides high-throughput, low-latency, highly available, strictly ordered access to the znodes


Mesos component: ZooKeeper (2)

•  Replicated over a set of machines that maintain an in-memory image of the data tree
   –  Read requests are processed locally by the ZooKeeper server the client is connected to
   –  Write requests are forwarded to other ZooKeeper servers and agreed upon before a response is generated (primary-backup system)
   –  Uses a leader election protocol (ZAB, a Paxos-like atomic broadcast protocol) to determine which server acts as leader

•  Implements atomic broadcast
   –  Processes deliver the same messages (agreement) and deliver them in the same order (total order)
   –  Message = state update


Mesos and framework components


•  Mesos components
   –  Master
   –  Slaves (agents)

•  Framework components
   –  Scheduler: registers with the master to be offered resources
   –  Executors: launched on agent nodes to run the framework's tasks

Scheduling in Mesos


•  Scheduling mechanism based on resource offers
   –  Mesos offers available resources to frameworks
   –  Each resource offer contains a list of <agent ID, resource1: amount1, resource2: amount2, ...>
   –  Each framework chooses which resources to use and which tasks to launch

•  Two-level scheduler architecture
   –  Mesos delegates the actual scheduling of tasks to the frameworks
   –  Improves scalability: the master does not have to know the scheduling intricacies of every type of application that it supports

Mesos: resource offers

•  Resource allocation is based on Dominant Resource Fairness (DRF)


Mesos: resource offers in details


•  Slaves continuously send status updates about resources to the Master

Mesos: resource offers in details (2)


Mesos: resource offers in details (3)


•  Framework scheduler can reject offers

Mesos: resource offers in details (4)


•  Framework scheduler selects resources and provides tasks

•  Master sends tasks to slaves

Mesos: resource offers in details (5)


•  Framework executors launch tasks

Mesos: resource offers in details (6)


Mesos: resource offers in details (7)


Mesos fault tolerance

•  Task failure
•  Slave failure
•  Host or network failure
•  Master failure
•  Framework scheduler failure


Fault tolerance: task failure


Fault tolerance: task failure (2)


Fault tolerance: slave failure


Fault tolerance: slave failure (2)


Fault tolerance: host or network failure


Fault tolerance: host or network failure (2)


Fault tolerance: host or network failure (3)


Fault tolerance: master failure


Fault tolerance: master failure (2)


•  When the leading Master fails, the surviving masters use ZooKeeper to elect a new leader

Fault tolerance: master failure (3)


•  The slaves and frameworks use ZooKeeper to detect the new leader and reregister

Fault tolerance: framework scheduler failure


Fault tolerance: framework scheduler failure (2)


•  When a framework scheduler fails, another instance can reregister to the Master without interrupting any of the running tasks

Fault tolerance: framework scheduler failure (3)


Fault tolerance: framework scheduler failure (4)


Resource allocation

1.  How to assign the cluster resources to the tasks?
    –  Design alternatives:
       •  Global (monolithic) scheduler
       •  Two-level scheduler

2.  How to allocate resources of different types?


Global (monolithic) scheduler

•  Job requirements
   –  Response time
   –  Throughput
   –  Availability

•  Job execution plan
   –  Task directed acyclic graph (DAG)
   –  Inputs/outputs

•  Estimates
   –  Task duration
   –  Input sizes
   –  Transfer sizes

•  Pros:
   –  Can achieve an optimal schedule (global knowledge)

•  Cons:
   –  Complexity: hard to scale and to make resilient
   –  Hard to anticipate the requirements of future frameworks
   –  Need to refactor existing frameworks

Two-level scheduling in Mesos

•  Resource offer
   –  Vector of available resources on a node
   –  E.g., node1: <1CPU, 1GB>, node2: <4CPU, 16GB>

•  Master sends resource offers to frameworks

•  Frameworks select which offers to accept and which tasks to run

•  Pros:
   –  Simple: easier to scale and to make resilient
   –  Easy to port existing frameworks and to support new ones

•  Cons:
   –  Distributed scheduling decisions: not optimal

•  Push task placement to the frameworks

Mesos: resource allocation

•  Based on Dominant Resource Fairness (DRF) algorithm


DRF: background on fair sharing

•  Consider a single resource: fair sharing
   –  n users want to share a resource, e.g., CPU
   –  Solution: allocate each user 1/n of the shared resource

•  Generalized by max-min fairness
   –  Handles the case in which a user wants less than its fair share
   –  E.g., user 1 wants no more than 20%

•  Generalized by weighted max-min fairness
   –  Gives weights to users according to their importance
   –  E.g., user 1 gets weight 1, user 2 weight 2


Max-min fairness

•  1 resource type: CPU
•  Total resources: 20 CPUs
•  User 1 has x tasks and wants <1 CPU> per task
•  User 2 has y tasks and wants <2 CPU> per task

   max (x, y)       (maximize allocation)
   subject to
   x + 2y ≤ 20      (CPU constraint)
   x = 2y           (fairness: equal CPU shares)

•  Solution: x = 10, y = 5
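•  A small sketch (not from the slides) of computing a max-min fair allocation of a single resource by progressive filling:

def max_min_fair(capacity, demands):
    """demands: per-user demand; returns the per-user allocation."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))          # users that still want more
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)        # split what is left equally
        for i in list(active):
            give = min(share, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if demands[i] - alloc[i] <= 1e-9:  # demand satisfied: user drops out
                active.remove(i)
    return alloc

# Users demanding 2, 8, and 15 units of a 20-unit resource get ~2, 8, and 10.
print(max_min_fair(20, [2, 8, 15]))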


Why is fair sharing useful?

•  Proportional allocation
   –  Give user 1 weight 2, user 2 weight 1

•  Priorities
   –  Give user 1 weight 1000, user 2 weight 1

•  Reservations
   –  To ensure user 1 gets 10% of a resource, give user 1 weight 10 and let the weights sum to 100

•  Isolation policy
   –  Users cannot affect others beyond their fair share


Why is fair sharing useful? (2)

•  Share guarantee
   –  Each user can get at least 1/n of the resource
   –  But will get less if its demand is less

•  Strategy-proofness
   –  Users are not better off by asking for more than they need
   –  Users have no reason to lie

•  Max-min fairness is the only reasonable mechanism with these two properties

•  Many schedulers use max-min fairness
   –  OS schedulers, networking, datacenter schedulers (e.g., YARN)


Max-min fairness drawback

•  When is max-min fairness not enough?

•  When we need to schedule multiple, heterogeneous resources (CPU, memory, disk, I/O)

•  Single-resource example
   –  1 resource: CPU
   –  User 1 wants <1 CPU> per task
   –  User 2 wants <2 CPU> per task

•  Multi-resource example
   –  2 resources: CPU and memory
   –  User 1 wants <1 CPU, 4 GB> per task
   –  User 2 wants <3 CPU, 1 GB> per task

•  What is a fair allocation?

A first (wrong) solution

•  Asset fairness: give weights to resources and equalize the total value allocated to each user

•  Example: total resources are 28 CPUs and 56 GB RAM (so 1 CPU = 2 GB)
   –  User 1 has x tasks and wants <1 CPU, 2 GB> per task
   –  User 2 has y tasks and wants <1 CPU, 4 GB> per task

•  Asset fairness yields:
   max (x, y)
   x + y ≤ 28       (CPU constraint)
   2x + 4y ≤ 56     (RAM constraint)
   4x = 6y          (equal total asset value per user)

•  User 1: x = 12, i.e., <43% CPU, 43% RAM> (sum = 86%)
•  User 2: y = 8, i.e., <29% CPU, 57% RAM> (sum = 86%)
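•  A quick numerical check (not from the slides) of the asset-fairness example; each user 1 task is worth 1 CPU + 2 GB = 4 GB-equivalents, each user 2 task 1 CPU + 4 GB = 6 GB-equivalents:

CPUS, RAM = 28, 56
x, y = 12, 8                                    # allocation computed above
assert x + y <= CPUS and 2 * x + 4 * y <= RAM   # capacity constraints hold
assert 4 * x == 6 * y                           # equal asset value per user
print("user 1 shares:", x / CPUS, 2 * x / RAM)  # 0.43, 0.43 -> sum 86%
print("user 2 shares:", y / CPUS, 4 * y / RAM)  # 0.29, 0.57 -> sum 86%
# Share-guarantee violation: user 1 gets less than 50% of both resources.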


A first (wrong) solution (2)

•  Problem: asset fairness violates the share guarantee
   –  User 1 gets less than 50% of both CPU and RAM
   –  User 1 would be better off in a separate cluster with half of the resources


What Mesos needs

•  A fair sharing policy that provides:
   –  Share guarantee
   –  Strategy-proofness

•  Challenge: can we generalize max-min fairness to multiple resources?

•  Solution: Dominant Resource Fairness (DRF)


Source: Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI'11

DRF

•  Dominant resource of a user: the resource of which the user holds the largest share
   –  Example:
      •  Total resources: <8 CPU, 5 GB>
      •  User 1 allocation: <2 CPU, 1 GB>
      •  2/8 = 25% CPU and 1/5 = 20% RAM
      •  The dominant resource of user 1 is CPU (25% > 20%)

•  Dominant share of a user: the fraction of the dominant resource allocated to the user
   –  User 1's dominant share is 25%
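•  A small helper (not from the slides) that computes a user's dominant resource and dominant share:

def dominant_share(allocation, total):
    """allocation/total: dicts mapping resource name -> amount."""
    shares = {r: allocation[r] / total[r] for r in total}
    resource = max(shares, key=shares.get)      # resource with the largest share
    return resource, shares[resource]

# Reproduces the example above: user 1 holds <2 CPU, 1 GB> out of <8 CPU, 5 GB>.
print(dominant_share({"cpu": 2, "mem": 1}, {"cpu": 8, "mem": 5}))  # ('cpu', 0.25)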


DRF (2)

•  Apply max-min fairness to dominant shares: give every user an equal share of its dominant resource

•  Equalize the dominant shares of the users
   –  Total resources: <9 CPU, 18 GB>
   –  User 1 wants <1 CPU, 4 GB> per task; its dominant resource is RAM (1/9 < 4/18)
   –  User 2 wants <3 CPU, 1 GB> per task; its dominant resource is CPU (3/9 > 1/18)

   max (x, y)
   x + 3y ≤ 9           (CPU constraint)
   4x + y ≤ 18          (RAM constraint)
   (4/18) x = (3/9) y   (equal dominant shares)

•  User 1: x = 3 tasks, i.e., <33% CPU, 66% RAM>
•  User 2: y = 2 tasks, i.e., <66% CPU, 11% RAM>


Online DRF

•  Whenever there are available resources and tasks to run:
   –  Mesos presents resource offers to the framework with the smallest dominant share among all frameworks
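•  A simplified sketch (not the actual Mesos allocator) of the online DRF loop: keep offering to the user with the smallest dominant share, and stop once that user's next task no longer fits:

def drf_allocate(total, demands):
    """total: resource -> capacity; demands: user -> per-task demand dict."""
    available = dict(total)
    allocated = {u: {r: 0.0 for r in total} for u in demands}
    dom_share = {u: 0.0 for u in demands}
    tasks = {u: 0 for u in demands}
    while True:
        user = min(dom_share, key=dom_share.get)        # smallest dominant share first
        need = demands[user]
        if any(need[r] > available[r] for r in total):  # next task does not fit: stop
            return tasks, allocated
        for r in total:                                 # launch one task for this user
            available[r] -= need[r]
            allocated[user][r] += need[r]
        tasks[user] += 1
        dom_share[user] = max(allocated[user][r] / total[r] for r in total)

# Reproduces the DRF (2) example: user 1 gets 3 tasks, user 2 gets 2 tasks.
total = {"cpu": 9, "mem": 18}
demands = {"user1": {"cpu": 1, "mem": 4}, "user2": {"cpu": 3, "mem": 1}}
print(drf_allocate(total, demands)[0])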


DRF: efficiency-fairness trade-off



•  DRF can leave resources under-utilized

•  DRF schedules at the level of tasks (which can lead to sub-optimal job completion times)

•  Fairness is fundamentally at odds with overall efficiency (how to trade them off?)

Mesos and containers

•  Mesos plays well with existing container technologies (e.g., Docker) and also provides its own container technology
   –  Docker containers: tasks run inside a Docker container
   –  Mesos containers: tasks run with an array of pluggable isolators provided by Mesos

•  Also supports composing different container technologies (e.g., Docker and Mesos)

•  At which scale? 50,000 live containers


Mesos deployment and evolution

•  Mesos clusters can be deployed
   –  On nearly every IaaS cloud provider infrastructure
   –  In your own physical datacenter

•  Mesos platform evolution towards microservices

•  Mesosphere's Datacenter Operating System (DC/OS)
   –  Open-source operating system and distributed system built upon Mesos

•  Marathon
   –  Container orchestration platform for Mesos and DC/OS

Apache Hadoop YARN

•  YARN: Yet Another Resource Negotiator
   –  A framework for job scheduling and cluster resource management

•  Turns Hadoop into an analytics platform in which resource management functions are separated from the programming model
   –  Many scheduling-related functions are delegated to per-job components

•  Why this separation?
   –  Provides flexibility in the choice of programming framework
   –  The platform can support not only MapReduce but also other models (e.g., Spark, Storm)


Source: Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC'13

YARN architecture

•  Global ResourceManager (RM)
•  A set of per-application ApplicationMasters (AMs)
•  A set of NodeManagers (NMs)


YARN architecture: ResourceManager

•  One RM per cluster
   –  Central component with a global view
   –  Enables global properties: fairness, capacity, locality

•  Job requests are submitted to the RM
   –  Request-based approach
   –  To start a job (application), the RM finds a container in which to spawn the AM

•  Container
   –  Logical bundle of resources (e.g., CPU, memory)

•  The RM only handles an overall resource profile for each application
   –  Local optimization is up to the application

•  Preemption
   –  Request resources back from an application
   –  Checkpoint/snapshot instead of explicitly killing jobs, or migrate the computation to other containers

YARN architecture: ResourceManager (2)

•  No static resource partitioning

•  The RM is the resource scheduler of the system
   –  Organizes the global assignment of computing resources to application instances

•  Composed of a Scheduler and an ApplicationsManager
   –  Scheduler: responsible for allocating resources to the various running applications
      •  Pluggable scheduling policy to partition the cluster resources among the various applications
      •  Available pluggable schedulers: CapacityScheduler and FairScheduler
   –  ApplicationsManager: accepts job submissions and negotiates the first set of resources for job execution


YARN architecture: ApplicationMaster

•  One AM per application: the head of a job

•  Runs as a container itself

•  Issues resource requests to the RM to obtain containers
   –  Number of containers, resources per container, locality preferences, ...

•  Dynamically changes its resource consumption, based on the containers it receives from the RM

•  Requests are late-binding
   –  The process spawned is not bound to the request, but to the lease
   –  The conditions that caused the AM to issue the request may no longer hold when it receives its resources

•  Can run any user code, e.g., MapReduce, Spark, etc.

•  The AM determines the semantics of the success or failure of the container

YARN architecture: NodeManager

•  One NM per node
   –  The worker daemon
   –  Runs as a local agent on the execution node

•  Registers with the RM, heartbeats its status, and receives instructions

•  Reports resources to the RM: memory, CPU, ...
   –  Responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing)

•  Containers are described by a container launch context (CLC)
   –  The command necessary to create the process
   –  Environment variables, security tokens, …
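•  An illustrative sketch (hypothetical structure and values, not the actual YARN API) of the kind of information a CLC carries, written as a Python dict:

container_launch_context = {
    "commands": ["java -Xmx1g my.app.Worker 1>stdout 2>stderr"],   # process to start
    "environment": {"APP_ID": "application_1490000000000_0001",    # hypothetical IDs
                    "CLASSPATH": "./*"},
    "local_resources": {"app.jar": "hdfs:///apps/myapp/app.jar"},  # files to localize
    "security_tokens": b"<opaque delegation tokens>",              # credentials blob
}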


YARN architecture: NodeManager (2)

•  Configures the environment for task execution

•  Garbage collection

•  Auxiliary services
   –  A process may produce data that persist beyond the life of the container
   –  E.g., intermediate output data passed between map and reduce tasks


YARN: application startup

•  Submitting the application: the client passes a CLC for the AM to the RM

•  The RM launches the AM, which registers with the RM

•  The AM periodically advertises its liveness and resource requirements over the heartbeat protocol

•  Once the RM allocates a container, the AM constructs a CLC to launch the container on the corresponding NM
   –  The AM also monitors the status of the running container and stops it when the resource should be reclaimed

•  Once the AM is done with its work, it unregisters from the RM and exits cleanly


Comparing Mesos and YARN


•  Scheduler architecture: Mesos – two-level scheduler; YARN – two-level scheduler (limited)
•  Approach: Mesos – offer-based; YARN – request-based
•  Scheduling policy: Mesos – pluggable (DRF as default); YARN – pluggable
•  Information flow: Mesos – the framework receives resource offers to choose from (minimal information); YARN – the framework asks for containers with a specification and preferences (lots of information)
•  Scope: Mesos – general-purpose scheduler (including stateful services); YARN – designed and optimized for Hadoop jobs (stateless batch jobs)
•  Resource granularity: Mesos – multi-dimensional; YARN – RAM/CPU slots
•  Control plane: Mesos – multiple masters; YARN – single ResourceManager
•  Implementation language: Mesos – C++; YARN – Java
•  Isolation: Mesos – Linux cgroups (performance isolation); YARN – simple Linux processes

Cluster resource management: Hot research topic

•  Borg @ EuroSys 2015
   –  Cluster management system used by Google for the last decade, ancestor of Kubernetes

•  Firmament @ OSDI 2016 (firmament.io)
   –  Scheduler (with a Kubernetes plugin called Poseidon) that balances the high-quality placement decisions of a centralized scheduler with the speed of a distributed scheduler

•  Slicer @ OSDI 2016
   –  Google's auto-sharding service that separates out the responsibility for sharding, balancing, and re-balancing from individual application frameworks


Container orchestration

•  Container orchestration: the set of operations that allows cloud and application providers to define how to select, deploy, monitor, and dynamically control the configuration of multi-container packaged applications in the cloud
   –  The tooling used by organizations adopting containers for enterprise production to integrate and manage those containers at scale

•  Some examples
   –  Cloudify
   –  Kubernetes
   –  Docker Swarm


Container-management systems at Google

•  Application-oriented shift: "Containerization transforms the data center from being machine-oriented to being application-oriented"

•  Goal: allow container technology to operate at Google scale
   –  Everything at Google runs as a container

•  Borg -> Omega -> Kubernetes
   –  Borg and Omega: purely Google-internal systems
   –  Kubernetes: open-source


Borg

Cluster management system at Google that achieves high utilization by means of:
•  Admission control
•  Efficient task-packing
•  Over-commitment
•  Machine sharing


The user perspective

•  Users: mainly Google developers and system administrators

•  Heterogeneous workload: mainly production and batch jobs

•  Cell: a subset of cluster machines
   –  Around 10K nodes
   –  Machines in a cell are heterogeneous in many dimensions

•  Jobs and tasks
   –  A job is a group of tasks
   –  Each task maps to a set of Linux processes running in a container on a machine
      •  task = container


The user perspective (2)

•  Alloc (allocation)
   –  Reserved pool of resources on a machine in which one or more tasks can run
   –  The smallest deployable unit that can be created, scheduled, and managed
   –  Task = alloc, job = alloc set

•  Priority, quota, and admission control
   –  A job has a priority (higher-priority jobs can preempt lower-priority tasks)
   –  Quota is used to decide which jobs to admit for scheduling

•  Naming and monitoring
   –  A service's clients and other systems need to be able to find its tasks, even after they are relocated to a new machine
   –  BNS example: 50.jfoo.ubar.cc.borg.google.com
   –  Monitoring of task health and of thousands of performance metrics

Scheduling a job


job hello_world = {
  runtime = { cell = "ic" }             // which cell should run it in?
  binary = '../hello_world_webserver'   // what program to run?
  args = { port = '%port%' }
  requirements = {
    RAM = 100M
    disk = 100M
    CPU = 0.1
  }
  replicas = 10000
}

Borg architecture


•  Borgmaster
   –  Main Borgmaster process, replicated five times
   –  Scheduler

•  Borglets
   –  Manage and monitor tasks and resources
   –  The Borgmaster polls the Borglets every few seconds

Borg architecture (2)


•  Fauxmaster: a high-fidelity Borgmaster simulator
   –  Simulates previous runs from checkpoints
   –  Contains the full Borg code

•  Used for debugging, capacity planning, and evaluating new policies and algorithms

Scheduling in Borg

•  Two-phase scheduling algorithm (sketched below)
   1. Feasibility checking: find the machines on which a given job could run
   2. Scoring: pick one of the feasible machines
      –  Based on user preferences and built-in criteria:
         •  Minimize the number and priority of preempted tasks
         •  Prefer machines that already have a copy of the task's packages
         •  Spread tasks across power and failure domains
         •  Pack by mixing high- and low-priority tasks
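•  An illustrative sketch (not Borg's actual code) of two-phase scheduling: feasibility checking followed by scoring with a toy scoring function:

def schedule(task, machines):
    # Phase 1: feasibility - keep machines with enough free resources.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not feasible:
        return None                     # task stays pending (or triggers preemption)
    # Phase 2: scoring - prefer machines that already cache the task's packages
    # and that would require fewer preemptions (toy weights).
    def score(m):
        return (2.0 * (task["package"] in m["cached_packages"])
                - 1.0 * m["preemptions_needed"])
    return max(feasible, key=score)

machines = [
    {"name": "m1", "free_cpu": 4, "free_mem": 8,
     "cached_packages": {"webserver"}, "preemptions_needed": 0},
    {"name": "m2", "free_cpu": 8, "free_mem": 16,
     "cached_packages": set(), "preemptions_needed": 1},
]
print(schedule({"cpu": 2, "mem": 4, "package": "webserver"}, machines)["name"])  # m1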


Borg’s allocation policies

•  Advanced bin-packing algorithms (see the sketch after this slide)
   –  Avoid stranding of resources
   –  Bin packing problem: objects of different volumes vi must be packed into a finite number of bins, each of volume V, so as to minimize the number of bins used
   –  Bin packing is NP-hard -> many heuristics, among which best fit, first fit, and worst fit

•  Evaluation metric: cell compaction
   –  Find the smallest cell into which the workload can be packed
   –  Machines are removed randomly from a cell so as to maintain cell heterogeneity
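•  A small sketch (not from the slides) of the first-fit bin-packing heuristic mentioned above, packing task sizes into machines ("bins") of fixed capacity:

def first_fit(sizes, bin_capacity):
    bins = []                                  # each bin is a list of packed sizes
    for size in sizes:
        for b in bins:
            if sum(b) + size <= bin_capacity:  # first bin where the item fits
                b.append(size)
                break
        else:
            bins.append([size])                # no existing bin fits: open a new one
    return bins

# Six tasks with these CPU demands fit into 3 machines of capacity 10.
print(first_fit([5, 7, 5, 2, 4, 3], 10))       # [[5, 5], [7, 2], [4, 3]]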

Design choices for scalability in Borg

•  Separate scheduler

•  Separate threads to poll the Borglets

•  Functions partitioned across the five Borgmaster replicas

•  Score caching
   –  Evaluating feasibility and scoring a machine is expensive, so scores are cached

•  Equivalence classes
   –  Tasks with identical requirements are grouped and evaluated once per class

•  Relaxed randomization
   –  It is wasteful to calculate feasibility and scores for all the machines in a large cell, so the scheduler examines machines in random order until it finds enough feasible candidates

Borg ecosystem

•  Many different systems have been built in, on, and around Borg to improve upon its basic container-management services:
   –  Naming and service discovery (the Borg Name Service, or BNS)
   –  Master election, using Chubby
   –  Application-aware load balancing
   –  Horizontal (instance number) and vertical (instance size) auto-scaling
   –  Rollout tools for deployment of new binaries and configuration data
   –  Workflow tools
   –  Monitoring tools


Omega

•  Offspring of Borg

•  The state of the cluster is stored in a centralized Paxos-based, transaction-oriented store
   –  Accessed by the different parts of the cluster control plane (such as schedulers), using optimistic concurrency control to handle the occasional conflicts



Kubernetes (K8s)

•  Google’s open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure kubernetes.io kubernetes.io/docs/tutorials/kubernetes-basics/

•  Features: –  Portable: public, private, hybrid, multi-cloud –  Extensible: modular, pluggable, hookable,

composable –  Self-healing: auto-placement, auto-restart, auto-

replication, auto-scalingValeria Cardellini - SABD 2016/17

83

Borg and Kubernetes


•  Loosely inspired by Borg

•  Directly derived from Borg:
   –  Borglet => Kubelet
   –  alloc => pod
   –  Borg containers => Docker containers
   –  Declarative specifications

•  Improved over Borg:
   –  Job => labels
   –  Managed ports => IP per pod
   –  Monolithic master => micro-services

Pod

•  Pod: the basic unit in Kubernetes
   –  Group of containers that are co-scheduled
   –  A resource envelope in which one or more containers run
   –  Containers that are part of the same pod are guaranteed to be scheduled together onto the same machine and can share state via local volumes

•  Users organize pods using labels
   –  Label: arbitrary key/value pair attached to a pod
   –  E.g., role=frontend and stage=production
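•  A minimal sketch using the official kubernetes Python client (assumes a reachable cluster and a configured kubeconfig; names and image are illustrative) that creates a pod carrying the labels from the example above:

from kubernetes import client, config

config.load_kube_config()                       # use the local kubeconfig credentials
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="frontend-1",
        labels={"role": "frontend", "stage": "production"}),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="web", image="nginx:1.21")]))
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)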

IP per pod

•  In Borg, containers running on a machine share the host's IP address, so Borg has to assign the containers unique port numbers
   –  This burdens infrastructure and application developers, e.g., it requires replacing the DNS service

•  Kubernetes assigns an IP address per pod, thus aligning network identity (IP address) with application identity
   –  Easier to run off-the-shelf software on Kubernetes

Omega and Kubernetes

•  Like Omega, Kubernetes uses a shared persistent store
   –  Components watch for changes to the objects they are interested in

•  In contrast to Omega, the state is accessed through a domain-specific REST API that applies higher-level versioning, validation, semantics, and policy, in support of a more diverse array of clients
   –  Stronger focus on the experience of developers writing cluster applications

Kubernetes architecture


•  Master: cluster control plane

•  Kubelets: node agents

•  Cluster state backed by distributed storage system (etcd, a highly available distributed key-value store)

Kubernetes on Mesos

•  Mesos allows dynamic sharing of cluster resources between Kubernetes and other first-class Mesos frameworks such as HDFS, Spark, and Chronos
   –  kubernetes.io/docs/getting-started-guides/mesos/


References

•  B. Burns et al., Borg, Omega, and Kubernetes. Commun. ACM, 2016.

•  A. Ghodsi et al., Dominant resource fairness: fair allocation of multiple resource types, NSDI 2011.

•  B. Hindman et al., Mesos: a platform for fine-grained resource sharing in the data center, NSDI 2011.

•  M. Schwarzkopf et al., Omega: flexible, scalable schedulers for large compute clusters, EuroSys 2013.

•  V. K. Vavilapalli et al., Apache Hadoop YARN: yet another resource negotiator, SOCC 2013.

•  A. Verma et al., Large-scale cluster management at Google with Borg, EuroSys 2015.
