Transcript of "Enabling diverse workload scheduling in YARN" (Hadoop Summit)

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Enabling diverse workload scheduling in YARN

June, 2015

Wangda Tan, Hortonworks ([email protected])
Craig Welch, Hortonworks ([email protected])

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

About us

Wangda Tan
• Last 5+ years in the big data field: Hadoop, Open MPI, etc.
• Past
  – Pivotal (PHD team; brought Open MPI/GraphLab to YARN)
  – Alibaba (ODPS team; platform for distributed data mining)
• Now
  – Apache Hadoop committer @Hortonworks, all in YARN
  – Spending most of his time on resource-scheduling enhancements

Craig Welch
• YARN contributor

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop + YARN is the home of big data processing.

Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Our workloads vary: Service | Batch | Interactive/Real-time

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

They have different CRAZY requirements

"I wanna be fast!"

"When the cluster is busy, don't take away MY RESOURCES!"

"A huge job needs to be scheduled at a specific time."

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

We want to make them

AS HAPPY AS POSSIBLE

to run together in YARN.

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Let’s start…

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Agenda today

• Overview
• Node Label
• Resource Preemption
• Reservation System
• Pluggable behavior for the Scheduler
• Docker support
• Resource scheduling beyond memory

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Overview

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Background

• Resources are managed by a hierarchy of queues.

• One queue can have multiple applications.

• A container is the result of resource scheduling: a bundle of resources that can run one or more processes.

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

How to manage your workload by queues

• By organization
  – Marketing / Finance queues

• By workload
  – Interactive / Batch queues

• Hybrid
  – Finance-batch / Marketing-realtime queues
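
A hybrid layout like this maps directly onto CapacityScheduler configuration. Below is a minimal capacity-scheduler.xml sketch for two such queues; the queue names and capacity values are illustrative, not from the talk:

  <!-- Define two top-level queues under root -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>finance-batch,marketing-realtime</value>
  </property>
  <!-- Split cluster capacity between them (percentages must sum to 100) -->
  <property>
    <name>yarn.scheduler.capacity.root.finance-batch.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.marketing-realtime.capacity</name>
    <value>40</value>
  </property>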

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Label

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Label – Overview

• Types of node labels
  – Node partition (since 2.6)
  – Node constraints (WIP)

• Node partition (today's focus)
  – One node belongs to only one partition
  – Related to resource planning

• Node constraints
  – One node can be assigned multiple constraints
  – Not related to resource planning

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node partition – Resource planning

• Nodes belong to the "default partition" if no partition is specified.

• It's possible to specify different queue capacities on different partitions.
  – For example, the sales queue can have different capacities on the GPU partition and the default partition.

• It's possible to restrict a partition to certain queues (an ACL for the partition).
  – For example, only the sales queue can access the "large memory" partition.
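
As a concrete sketch (hostname, label, and queue names are illustrative; assumes Hadoop 2.6+): labels are created and attached with the rmadmin CLI, and queue access is granted in capacity-scheduler.xml.

  # Create a "gpu" partition and assign a node to it
  yarn rmadmin -addToClusterNodeLabels gpu
  yarn rmadmin -replaceLabelsOnNode "node1.example.com=gpu"

  <!-- Let the sales queue access the gpu partition, with 80% capacity on it -->
  <property>
    <name>yarn.scheduler.capacity.root.sales.accessible-node-labels</name>
    <value>gpu</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.sales.accessible-node-labels.gpu.capacity</name>
    <value>80</value>
  </property>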

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node partition – Exclusive vs. Non-exclusive

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Partition – Use cases & best practice

• Dedicate nodes to run important services
  – E.g., running HBase region servers using Apache Slider

• Reserve nodes with special hardware for particular organizations
  – E.g., you may want a queue dedicated to the marketing department to use 80% of the memory-heavy nodes

• Use non-exclusive node partitions for better resource utilization.

• Be careful about user limits, capacity, etc., to make sure jobs can actually be launched.

I will cover more details about implementation & usage in Thursday morning's session "YARN Node Labels" with Mayank Bansal from eBay.

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – Overview

• Each queue has a configured minimum resource (its guaranteed share).

• The preemption policy is used to enforce that guarantee:
  – When a queue is below its minimum resource and the cluster has no available resources, the preemption policy can reclaim resources from queues that are using more than their minimum.

Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – Example

• When preemption is not enabled

• When preemption is enabled

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – best practice

• Configurations to control the pace of preemption:
  – yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
  – yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
  – yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor

• Configurations to control when, or whether, preemption happens:
  – yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity (dead zone)
  – yarn.scheduler.capacity.<queue-path>.disable_preemption
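
Putting these together, a minimal yarn-site.xml sketch that turns the preemption monitor on and tunes the knobs above (the values shown are illustrative; most match the Hadoop 2.x defaults):

  <!-- Enable the scheduler monitor that runs the preemption policy -->
  <property>
    <name>yarn.resourcemanager.scheduler.monitor.enable</name>
    <value>true</value>
  </property>
  <!-- Wait 15s between requesting preemption and force-killing a container -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
    <value>15000</value>
  </property>
  <!-- Preempt at most 10% of total cluster resources per round -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
    <value>0.1</value>
  </property>
  <!-- Ignore over-capacity usage within 10% of a queue's guarantee (dead zone) -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity</name>
    <value>0.1</value>
  </property>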

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System

Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Overview

• Reserving resources ahead of time is just like reserving a table at a restaurant:
  – "I need a table for X people at Y time"
  – "One moment… Reservation confirmed, sir"
  – (After some time) "Your table is ready"

• What the Reservation System does:
  – You send a reservation request
  – The RM checks its timetable
  – The RM sends back a reservation confirmation ID
  – The RM notifies you when the resources are ready

• Enables more predictable start and run times for time-critical / resource-intensive applications
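
Turning the Reservation System on is a configuration change. A minimal sketch (queue name illustrative; assumes Hadoop 2.6+), enabling it in the RM and marking a CapacityScheduler queue as reservable:

  <property>
    <name>yarn.resourcemanager.reservation-system.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.batch.reservable</name>
    <value>true</value>
  </property>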

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Use cases

• Gang scheduling
  – Currently, YARN can only do gang scheduling from the application side (holding resources until requirements are met)
  – Resources can be wasted, and there's a risk of deadlock
  – The Reservation System lays the foundation for gang scheduling

• Workflow support
  – "I want to run jobs in stages"
  – Stage 1 at 1 AM tomorrow, needs 10k containers
  – Stage 2 after stage 1, needs 5k containers
  – Stage 3 after stage 2, needs 2k containers
  – You can submit such requests to the Reservation System!

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Result & References

• Before & after the Reservation System (reports from MSR)
  – It increased cluster utilization significantly!

• References
  – Design / discussion / report: YARN-1051
  – More detail on the example: YARN-2609

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Pluggable scheduler behavior

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Why

• Problem
  – It's difficult to share functionality between schedulers
  – Users cannot achieve the same behavior with all schedulers
  – Fixes and enhancements tend to end up in one scheduler, not all, leading to fragmentation
  – No simple mechanism exists to mix behaviors for a given feature in a single cluster

• Solution
  – Move to sharable, pluggable scheduler behavior

Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

How

• The goal: recast scheduler behavior as policies. Candidates include:
  – Resource limits for apps, users, …
  – Ordering for allocation and preemption

• With this, we can:
  – Maximize feature availability and reduce fragmentation
  – Configure different queues for different workloads in a single cluster

Flexible scheduler configuration, as simple as building with Legos!

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Ordering Policy of Capacity Scheduler

• Pluggable ordering policies for LeafQueues in the Capacity Scheduler
  – Enable different policies for ordering the assignment and preemption of containers for applications
  – Initial implementations include FIFO (the Capacity Scheduler's original behavior) and Fair
  – User limits and queue capacity limits are still respected

• Fair scheduling inside the Capacity Scheduler
  – Based on the fair-sharing logic in the FairScheduler
  – Assigns containers to applications in order of least to greatest resource usage
  – Allows many applications to make progress concurrently
  – Lets short jobs finish in reasonable time while not starving long-running jobs

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Configuration and tuning

• Rough guidelines for when to use the Fair and FIFO ordering policies:

  Workload                            | Policy
  ------------------------------------|-------
  On-demand/interactive/exploratory   | Fair
  Predictable/recurring batch         | FIFO
  Mix of the above two                | Fair

• Configuration (see the sketch below)
  – yarn.scheduler.capacity.<queue>.ordering-policy ("fifo" or "fair", default "fifo")
  – yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight (true or false)

• Tuning
  – Use max-am-resource-percent to avoid "peanut buttering" from having too many apps running at once
  – Sometimes it's necessary to separate large and small apps into different queues, or to use size-based-weight, to avoid starving large apps
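
For example, a minimal capacity-scheduler.xml sketch switching one queue to the Fair ordering policy with size-based weighting, and capping the Application Master share (the queue name and values are illustrative):

  <property>
    <name>yarn.scheduler.capacity.root.interactive.ordering-policy</name>
    <value>fair</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.ordering-policy.fair.enable-size-based-weight</name>
    <value>true</value>
  </property>
  <!-- Limit the share of queue capacity that AMs may consume -->
  <property>
    <name>yarn.scheduler.capacity.root.interactive.maximum-am-resource-percent</name>
    <value>0.2</value>
  </property>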

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support – Overview

• Containers for the cluster
  – Brings the sandboxing and dependency isolation of container technology to Hadoop
  – Containers make it simple to use Hadoop resources for a wider range of applications

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support – Status

• Done
  – (V1) An initial implementation translating Kubernetes to an Application Master launching Docker containers from the cluster met with success
  – (V2) A custom container launcher for Docker containers brought the capability more fully under the management of YARN, but a single cluster could not support both traditional YARN applications (MapReduce, etc.) and Docker concurrently

• Next phase
  – (V3) Work in progress: adding support for running Docker and traditional YARN applications side by side in a single cluster
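
For reference, Hadoop 2.6 shipped an experimental DockerContainerExecutor along these lines (whether it is exactly the talk's V2 launcher is my inference). A minimal yarn-site.xml sketch to enable it (the docker binary path is illustrative):

  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.docker-container-executor.exec-name</name>
    <value>/usr/bin/docker</value>
  </property>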

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about memory

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about Memory - CPU

• What's in a CPU
  – Some workloads are CPU-intensive; without accounting for this, nodes may end up CPU-bound, or CPU may be underutilized cluster-wide
  – CPU awareness at the scheduler level is enabled by selecting the DominantResourceCalculator
  – Dominant? "Dominant" stands for the "dominant factor", or the "bottleneck". In simplified terms, the resource type that is most constrained becomes the dominant factor for any given comparison or calculation
  – For example, if there is enough memory but not enough CPU for a resource request, the CPU component is dominant (and the answer is "No")
  – See https://www.cs.berkeley.edu/~alig/papers/drf.pdf for more detail
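
Selecting the calculator is a one-property change in capacity-scheduler.xml; a sketch (the default is DefaultResourceCalculator, which considers memory only):

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>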

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about Memory – CPU - Vcores

• What's in a CPU
  – The unit used to abstract CPU capability in YARN is the vcore
  – Vcore counts are configured per node in yarn-site.xml, typically 1:1 vcore to physical CPU
  – If some nodes' CPUs outclass other nodes', the number of vcores per physical CPU can be adjusted upward to compensate
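
A sketch of the per-node setting in yarn-site.xml (the count is illustrative; a node with faster cores might advertise proportionally more):

  <!-- Number of vcores this NodeManager offers to the scheduler -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>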

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Q & A

?