Transcript of "Enabling diverse workload scheduling in YARN" (Hadoop Summit)

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Enabling diverse workload scheduling in YARN

June, 2015

Wangda Tan, Hortonworks ([email protected])
Craig Welch, Hortonworks ([email protected])

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

About us

Wangda Tan
• Last 5+ years in the big data field: Hadoop, Open MPI, etc.
• Past
  – Pivotal (PHD team; brought Open MPI/GraphLab to YARN)
  – Alibaba (ODPS team; platform for distributed data mining)
• Now
  – Apache Hadoop committer @Hortonworks, all in YARN
  – Spending most of his time on resource-scheduling enhancements

Craig Welch
• YARN contributor

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop + YARN is the home of big data processing.

Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Our workloads vary: Service | Batch | Interactive/Real-time

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

They have different CRAZY requirements

"I wanna be fast!"

"When the cluster is busy, don't take away MY RESOURCES!"

"A huge job needs to be scheduled at a specific time."

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

We want to make them

AS HAPPY AS POSSIBLE

to run together in YARN.

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Let’s start…

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Agenda today

• Overview
• Node Label
• Resource Preemption
• Reservation System
• Pluggable behavior for the Scheduler
• Docker support
• Resource scheduling beyond memory

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Overview

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Background

• Resources are managed by a hierarchy of queues.

• One queue can have multiple applications.

• A container is the result of resource scheduling: a bundle of resources that can run one or more processes.

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

How to manage your workload by queues

• By organization
  – Marketing / Finance queues

• By workload
  – Interactive / Batch queues

• Hybrid
  – Finance-batch / Marketing-realtime queues
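
A hybrid layout like this maps directly onto CapacityScheduler configuration. Below is a minimal capacity-scheduler.xml sketch for two such queues; the queue names and capacity values are illustrative, not from the talk:

  <!-- Define two top-level queues under root -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>finance-batch,marketing-realtime</value>
  </property>
  <!-- Split cluster capacity between them (percentages must sum to 100) -->
  <property>
    <name>yarn.scheduler.capacity.root.finance-batch.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.marketing-realtime.capacity</name>
    <value>40</value>
  </property>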

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Label

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Label – Overview

• Types of node labels
  – Node partition (since 2.6)
  – Node constraints (WIP)

• Node partition (today's focus)
  – One node belongs to only one partition
  – Related to resource planning

• Node constraints
  – One node can be assigned multiple constraints
  – Not related to resource planning

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node partition – Resource planning

• Nodes belong to the "default partition" if no partition is specified.

• It's possible to specify different queue capacities on different partitions.
  – For example, the sales queue can have different capacities on the GPU partition and the default partition.

• It's possible to restrict a partition to certain queues (an ACL for the partition).
  – For example, only the sales queue can access the "large memory" partition.
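
As a concrete sketch (hostname, label, and queue names are illustrative; assumes Hadoop 2.6+): labels are created and attached with the rmadmin CLI, and queue access is granted in capacity-scheduler.xml.

  # Create a "gpu" partition and assign a node to it
  yarn rmadmin -addToClusterNodeLabels gpu
  yarn rmadmin -replaceLabelsOnNode "node1.example.com=gpu"

  <!-- Let the sales queue access the gpu partition, with 80% capacity on it -->
  <property>
    <name>yarn.scheduler.capacity.root.sales.accessible-node-labels</name>
    <value>gpu</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.sales.accessible-node-labels.gpu.capacity</name>
    <value>80</value>
  </property>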

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node partition – Exclusive vs. Non-exclusive

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Node Partition – Use cases & best practice

• Dedicate nodes to run important services
  – E.g., running HBase region servers using Apache Slider

• Reserve nodes with special hardware for particular organizations
  – E.g., you may want a queue dedicated to the marketing department to use 80% of the memory-heavy nodes

• Use non-exclusive node partitions for better resource utilization.

• Be careful about user limits, capacity, etc., to make sure jobs can actually be launched.

I will cover more details about implementation & usage in Thursday morning's session "YARN Node Labels" with Mayank Bansal from eBay.

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – Overview

• Each queue has a configured minimum resource (its guaranteed share).

• The preemption policy is used to enforce that guarantee:
  – When a queue is below its minimum resource and the cluster has no available resources, the preemption policy can reclaim resources from queues that are using more than their minimum.

Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – Example

• When preemption is not enabled

• When preemption is enabled

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Resource Preemption – best practice

• Configurations to control the pace of preemption:
  – yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
  – yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
  – yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor

• Configurations to control when, or whether, preemption happens:
  – yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity (dead zone)
  – yarn.scheduler.capacity.<queue-path>.disable_preemption
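
Putting these together, a minimal yarn-site.xml sketch that turns the preemption monitor on and tunes the knobs above (the values shown are illustrative; most match the Hadoop 2.x defaults):

  <!-- Enable the scheduler monitor that runs the preemption policy -->
  <property>
    <name>yarn.resourcemanager.scheduler.monitor.enable</name>
    <value>true</value>
  </property>
  <!-- Wait 15s between requesting preemption and force-killing a container -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
    <value>15000</value>
  </property>
  <!-- Preempt at most 10% of total cluster resources per round -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
    <value>0.1</value>
  </property>
  <!-- Ignore over-capacity usage within 10% of a queue's guarantee (dead zone) -->
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity</name>
    <value>0.1</value>
  </property>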

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System

Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Overview

• Reserving resources ahead of time is just like reserving a table at a restaurant:
  – "I need a table for X people at Y time"
  – "One moment… Reservation confirmed, sir"
  – (After some time) "Your table is ready"

• What the Reservation System does:
  – You send a reservation request
  – The RM checks its timetable
  – The RM sends back a reservation confirmation ID
  – The RM notifies you when the resources are ready

• Enables more predictable start and run times for time-critical / resource-intensive applications
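
Turning the Reservation System on is a configuration change. A minimal sketch (queue name illustrative; assumes Hadoop 2.6+), enabling it in the RM and marking a CapacityScheduler queue as reservable:

  <property>
    <name>yarn.resourcemanager.reservation-system.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.batch.reservable</name>
    <value>true</value>
  </property>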

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Use cases

• Gang scheduling
  – Currently, YARN can only do gang scheduling from the application side (holding resources until requirements are met)
  – Resources can be wasted, and there's a risk of deadlock
  – The Reservation System lays the foundation for gang scheduling

• Workflow support
  – "I want to run jobs in stages"
  – Stage 1 at 1 AM tomorrow, needs 10k containers
  – Stage 2 after stage 1, needs 5k containers
  – Stage 3 after stage 2, needs 2k containers
  – You can submit such requests to the Reservation System!

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Reservation System – Result & References

• Before & after the Reservation System (reports from MSR)
  – It increased cluster utilization significantly!

• References
  – Design / discussion / report: YARN-1051
  – More detail on the example: YARN-2609

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Pluggable scheduler behavior

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Why

• Problem
  – It's difficult to share functionality between schedulers
  – Users cannot achieve the same behavior with all schedulers
  – Fixes and enhancements tend to end up in one scheduler, not all, leading to fragmentation
  – No simple mechanism exists to mix behaviors for a given feature in a single cluster

• Solution
  – Move to sharable, pluggable scheduler behavior

Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

How

• The goal: recast scheduler behavior as policies. Candidates include:
  – Resource limits for apps, users, …
  – Ordering for allocation and preemption

• With this, we can:
  – Maximize feature availability and reduce fragmentation
  – Configure different queues for different workloads in a single cluster

Flexible scheduler configuration, as simple as building with Legos!

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Ordering Policy of Capacity Scheduler

• Pluggable ordering policies for LeafQueues in the Capacity Scheduler
  – Enable different policies for ordering the assignment and preemption of containers for applications
  – Initial implementations include FIFO (the Capacity Scheduler's original behavior) and Fair
  – User limits and queue capacity limits are still respected

• Fair scheduling inside the Capacity Scheduler
  – Based on the fair-sharing logic in the FairScheduler
  – Assigns containers to applications in order of least to greatest resource usage
  – Allows many applications to make progress concurrently
  – Lets short jobs finish in reasonable time while not starving long-running jobs

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Configuration and tuning

• Rough guidelines for when to use the Fair and FIFO ordering policies:

  Workload                            | Policy
  ------------------------------------|-------
  On-demand/interactive/exploratory   | Fair
  Predictable/recurring batch         | FIFO
  Mix of the above two                | Fair

• Configuration (see the sketch below)
  – yarn.scheduler.capacity.<queue>.ordering-policy ("fifo" or "fair", default "fifo")
  – yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight (true or false)

• Tuning
  – Use max-am-resource-percent to avoid "peanut buttering" from having too many apps running at once
  – Sometimes it's necessary to separate large and small apps into different queues, or to use size-based-weight, to avoid starving large apps
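
For example, a minimal capacity-scheduler.xml sketch switching one queue to the Fair ordering policy with size-based weighting, and capping the Application Master share (the queue name and values are illustrative):

  <property>
    <name>yarn.scheduler.capacity.root.interactive.ordering-policy</name>
    <value>fair</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.ordering-policy.fair.enable-size-based-weight</name>
    <value>true</value>
  </property>
  <!-- Limit the share of queue capacity that AMs may consume -->
  <property>
    <name>yarn.scheduler.capacity.root.interactive.maximum-am-resource-percent</name>
    <value>0.2</value>
  </property>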

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support – Overview

• Containers for the cluster
  – Brings the sandboxing and dependency isolation of container technology to Hadoop
  – Containers make it simple to use Hadoop resources for a wider range of applications

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Docker container support – Status

• Done
  – (V1) An initial implementation translating Kubernetes to an Application Master launching Docker containers from the cluster met with success
  – (V2) A custom container launcher for Docker containers brought the capability more fully under the management of YARN, but a single cluster could not support both traditional YARN applications (MapReduce, etc.) and Docker concurrently

• Next phase
  – (V3) Work in progress: adding support for running Docker and traditional YARN applications side by side in a single cluster
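
For reference, Hadoop 2.6 shipped an experimental DockerContainerExecutor along these lines (whether it is exactly the talk's V2 launcher is my inference). A minimal yarn-site.xml sketch to enable it (the docker binary path is illustrative):

  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.docker-container-executor.exec-name</name>
    <value>/usr/bin/docker</value>
  </property>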

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about memory

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about Memory - CPU

• What's in a CPU
  – Some workloads are CPU-intensive; without accounting for this, nodes may end up CPU-bound, or CPU may be underutilized cluster-wide
  – CPU awareness at the scheduler level is enabled by selecting the DominantResourceCalculator
  – Dominant? "Dominant" stands for the "dominant factor", or the "bottleneck". In simplified terms, the resource type that is most constrained becomes the dominant factor for any given comparison or calculation
  – For example, if there is enough memory but not enough CPU for a resource request, the CPU component is dominant (and the answer is "No")
  – See https://www.cs.berkeley.edu/~alig/papers/drf.pdf for more detail
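
Selecting the calculator is a one-property change in capacity-scheduler.xml; a sketch (the default is DefaultResourceCalculator, which considers memory only):

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>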

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

It’s not all about Memory – CPU - Vcores

• What's in a CPU
  – The unit used to abstract CPU capability in YARN is the vcore
  – Vcore counts are configured per node in yarn-site.xml, typically 1:1 vcore to physical CPU
  – If some nodes' CPUs outclass other nodes', the number of vcores per physical CPU can be adjusted upward to compensate
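
A sketch of the per-node setting in yarn-site.xml (the count is illustrative; a node with faster cores might advertise proportionally more):

  <!-- Number of vcores this NodeManager offers to the scheduler -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>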

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Q & A

?