Node labels in YARN
YARN Node Labels
Wangda Tan, Hortonworks ([email protected])
Mayank Bansal, eBay ([email protected])
About us
Wangda Tan
• Last 5+ years in the big data field: Hadoop, Open MPI, etc.
• Now
• Apache Hadoop committer @Hortonworks, working entirely on YARN
• Spending most of his time on resource-scheduling enhancements
• Past
• Pivotal (PHD team; brought OpenMPI/GraphLab to YARN)
• Alibaba (ODPS team; platform for distributed data mining)
Mayank Bansal
• Hadoop Architect @ eBay
• Apache Hadoop committer
• Apache Oozie PMC member and committer
• Current
• Leading Hadoop core development for YARN and MapReduce @ eBay
• Past
• Worked on schedulers / resource managers
Agenda
• Overview
• Problems
• What is a node label?
• Understand by example
• Architecture
• Case studies
• Status
• Future
Overview – Background
• Resources are managed by a hierarchy of queues.
• One queue can hold multiple applications.
• A container is the result of resource scheduling: a bundle of resources that can run one or more processes.
Overview – How to manage your workload with queues
• By organization: Marketing/Finance queues
• By workload: Interactive/Batch queues
• Hybrid: Finance-batch/Marketing-realtime queues
Problems
• No way to mark specific resources on nodes, e.g. nodes with GPUs or SSDs
• No way for an application to request nodes with specific resources
• Unable to partition a cluster based on organizations/workloads
What is Node Label?
• Group nodes with a similar profile: hardware, software, organization, or workload
• A way for an application to specify where in the cluster it should run
Node Labels
• Types of node labels
• Node partition (since 2.6)
• Node constraint (work in progress)
• Node partition
• A node belongs to exactly one partition
• Related to resource planning
• Node constraint
• A node can be assigned multiple constraints
• Not related to resource planning
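As an illustration, a node partition can be defined and mapped onto nodes with the rmadmin CLI (a sketch of the centralized-configuration flow; hostnames are placeholders, and the exclusive=… syntax requires Hadoop 2.8+):

```
# Define a partition (exclusive=false needs Hadoop 2.8+)
yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=true)"

# Assign nodes to the partition (each node belongs to at most one partition)
yarn rmadmin -replaceLabelsOnNode "node1.example.com=GPU node2.example.com=GPU"

# Verify the labels known to the cluster
yarn cluster --list-node-labels
```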
Understand by example (1)
• A real-world example of why node partitions are needed:
• Company-X has a big cluster; the Engineering, Marketing, and Sales teams each have a 33% share of it.
Understand by example (2)
• The Engineering and Marketing teams need GPU-equipped servers for visualization work, so they each spent an equal amount of money to buy machines with GPUs.
• They want to share those machines 50:50.
• The Sales team spent $0 on these new nodes, so it cannot run anything on them at all.
Understand by example (3)
• Here is the problem:
• If you create a separate YARN cluster, the ops team will be unhappy.
• If you add the new nodes to the original cluster, you cannot guarantee that the Engineering/Marketing teams get preference on them.
Understand by example (4)
• Node partitions solve this problem: add a GPU partition managed by the same YARN RM. The admin can specify different percentage shares in different partitions.
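To make the shares concrete, here is a small arithmetic sketch: a queue's guaranteed resources are the sum of its share of each partition it can access. The partition sizes are made-up numbers; only the 33/33/33 and 50/50/0 percentages come from the example.

```python
# Hypothetical partition sizes (MB); the percentages match the example.
DEFAULT_PARTITION_MB = 300_000  # the original, non-GPU nodes (assumed)
GPU_PARTITION_MB = 100_000      # the newly added GPU nodes (assumed)

# Per-queue capacity, in percent, for each partition.
CAPACITY_PCT = {
    "engineering": {"default": 33, "GPU": 50},
    "marketing":   {"default": 33, "GPU": 50},
    "sales":       {"default": 33, "GPU": 0},
}

def guaranteed_mb(queue: str) -> int:
    """Guaranteed memory = sum over partitions of share% * partition size."""
    pct = CAPACITY_PCT[queue]
    return (pct["default"] * DEFAULT_PARTITION_MB
            + pct["GPU"] * GPU_PARTITION_MB) // 100

for q in CAPACITY_PCT:
    print(f"{q}: {guaranteed_mb(q)} MB guaranteed")
```

Sales keeps its full 33% of the original nodes; only the newly added GPU capacity is split 50:50 between the teams that paid for it.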
Understand by example (5)
• Understanding non-exclusive node partitions:
• In the previous example, the “GPU” partition can be used only by the Engineering and Marketing teams.
• This is bad for resource utilization.
• The admin can specify that when the “GPU” partition has idle resources, the sales queue may use them; when Engineering/Marketing come back, resources allocated to the sales queue are preempted.
• (Available since Hadoop 2.8.)
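A sketch of how such a partition is marked non-exclusive (Hadoop 2.8+ syntax; the label name follows the example):

```
# Non-exclusive: idle GPU capacity may be borrowed by queues without
# GPU access, and is preempted back when the owning queues need it.
yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"
```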
Understand by example (6)
• Configuration for the above example (Capacity Scheduler):
yarn.scheduler.capacity.root.queues=engineering,marketing,sales
yarn.scheduler.capacity.root.engineering.capacity=33
yarn.scheduler.capacity.root.marketing.capacity=33
yarn.scheduler.capacity.root.sales.capacity=33
---------
yarn.scheduler.capacity.root.engineering.accessible-node-labels=GPU
yarn.scheduler.capacity.root.marketing.accessible-node-labels=GPU
---------
yarn.scheduler.capacity.root.engineering.accessible-node-labels.GPU.capacity=50
yarn.scheduler.capacity.root.marketing.accessible-node-labels.GPU.capacity=50
---------
(optional)
yarn.scheduler.capacity.root.engineering.default-node-label-expression=GPU
The first block is the original configuration, without node partitions.
The accessible-node-labels lines are queue ACLs for node partitions.
The GPU.capacity lines set capacities within the node partition.
(Optional) The default-node-label-expression line means applications running in the queue will run in the GPU partition by default.
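The queue configuration above takes effect only once node labels are enabled on the RM; a minimal yarn-site.xml sketch (the HDFS path is a placeholder):

```
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Where the RM persists the label store (placeholder path) -->
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>
```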
Understand by example (7)
Architecture
• Central piece: NodeLabelsManager
• Stores labels and their attributes
• Stores the node-to-labels mapping
• It can be read/written via:
• the CLI and REST API (which we call centralized configuration), OR
• each NM retrieving its own labels and sending them to the RM (which we call distributed configuration)
• The scheduler uses the NodeLabelsManager to make decisions: it receives resource requests from AMs and returns allocated containers to them.
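A sketch of the distributed-configuration variant, where each NM reports its own partition (property names as of Hadoop 2.8; check your release's yarn-default.xml):

```
<!-- yarn-site.xml on the RM -->
<property>
  <name>yarn.node-labels.configuration-type</name>
  <value>distributed</value>
</property>

<!-- yarn-site.xml on each GPU node's NM -->
<property>
  <name>yarn.nodemanager.node-labels.provider</name>
  <value>config</value>  <!-- or "script" to run a detection script -->
</property>
<property>
  <name>yarn.nodemanager.node-labels.provider.configured-node-partition</name>
  <value>GPU</value>
</property>
```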
Case study (1) – uses node labels
• Use node labels to create isolated environments for batch/interactive/low-latency workloads.
• Deploy YARN containers onto compute nodes that are optimized and accelerated for each workload:
• RDMA-enabled nodes to accelerate shuffle
• Powerful CPU nodes to accelerate compression
• It is possible to DOUBLE THE DENSITY of today's traditional Hadoop cluster with substantially better price/performance.
• Create a converged system that lets Hadoop / Vertica / Spark and other stacks share a common pool of data.
Case study (2) – uses node labels
Case study (3) – eBay clusters use node labels
• Separate machine-learning workloads from regular workloads
• Use node labels to restrict licensed software to certain machines
• Enable GPU workloads
• Separate organizational workloads
Case study (4) – Slider use cases
• HBase region servers run on nodes with SSDs (non-exclusive partition).
• HBase masters have exclusive use of their nodes.
• MapReduce jobs run on the other nodes, and they can use idle resources on the region-server nodes.
Status – Done parts of node labels
• Exclusive / non-exclusive node partition support in the Capacity Scheduler (✓)
• User limits
• Preemption
• Now all respect node partitions!
• Centralized configuration via CLI/REST API (✓)
• Distributed configuration via the Node Manager's config/script (✓)
Status - Node Labels Web UI
Status – Other Apache projects support node labels
• The following projects already support node labels:
• Spark (SPARK-6470)
• MapReduce (MAPREDUCE-6304)
• Slider (SLIDER-81)
• (via Slider)
• (via Slider)
• (via Slider)
• Ambari (AMBARI-10063)
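As an illustration of how applications opt in, a couple of submission-time settings from the projects above (a sketch; jar names, class names, and label values are placeholders):

```
# MapReduce (MAPREDUCE-6304): run the job's containers in the GPU partition
# (assumes the job's main class uses ToolRunner to parse -D options)
hadoop jar my-job.jar MyJob \
  -Dmapreduce.job.node-label-expression=GPU

# Spark on YARN (SPARK-6470)
spark-submit \
  --conf spark.yarn.am.nodeLabelExpression=GPU \
  --conf spark.yarn.executor.nodeLabelExpression=GPU \
  ...
```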
Future of node labels
• Support constraints (YARN-3409)
• Orthogonal to partitions: constraints describe attributes of a node's hardware/software and are used only for affinity.
• Some examples of constraints:
• glibc version
• JDK version
• CPU type (x86_64/i686)
• Physical or virtualized
• With this, an application can ask for resources such as:
• glibc.version >= 2.20 && JDK.version >= 8u20 && x86_64
• Support node labels in the Fair Scheduler (YARN-2497)
• Support in more projects
• Tez
• Oozie
• …
Q & A