Node labels in YARN
YARN Node Labels
Wangda Tan, Hortonworks ([email protected])
Mayank Bansal, eBay ([email protected])
About us
Wangda Tan
• Last 5+ years in the big data field: Hadoop, Open MPI, etc.
• Now
• Apache Hadoop committer @Hortonworks, working entirely on YARN
• Spending most of his time on resource-scheduling enhancements
• Past
• Pivotal (PHD team; brought OpenMPI/GraphLab to YARN)
• Alibaba (ODPS team; platform for distributed data mining)
Mayank Bansal
• Hadoop Architect @ eBay
• Apache Hadoop committer
• Apache Oozie PMC member and committer
• Current
• Leading Hadoop core development for YARN and MapReduce @ eBay
• Past
• Worked on schedulers / resource managers
Agenda
• Overview
• Problems
• What is a node label?
• Understand by example
• Architecture
• Case studies
• Status
• Future
Overview – Background
• Resources are managed by a hierarchy of queues.
• One queue can hold multiple applications.
• A container is the result of resource scheduling: a bundle of resources that can run one or more processes.
Overview – How to manage your workload with queues
• By organization: Marketing/Finance queues
• By workload: Interactive/Batch queues
• Hybrid: Finance-batch/Marketing-realtime queues
Problems
• No way to mark specific resources on nodes, e.g. nodes with GPUs or SSDs
• No way for an application to request nodes with specific resources
• Unable to partition a cluster based on organizations/workloads
What is Node Label?
• Group nodes with a similar profile: hardware, software, organization, or workload
• A way for an application to specify where in the cluster it should run
Node Labels
• Types of node labels
• Node partition (since 2.6)
• Node constraint (work in progress)
• Node partition
• A node belongs to exactly one partition
• Related to resource planning
• Node constraint
• A node can be assigned multiple constraints
• Not related to resource planning
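As an illustration, a node partition can be defined and mapped onto nodes with the rmadmin CLI (a sketch of the centralized-configuration flow; hostnames are placeholders, and the exclusive=… syntax requires Hadoop 2.8+):

```
# Define a partition (exclusive=false needs Hadoop 2.8+)
yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=true)"

# Assign nodes to the partition (each node belongs to at most one partition)
yarn rmadmin -replaceLabelsOnNode "node1.example.com=GPU node2.example.com=GPU"

# Verify the labels known to the cluster
yarn cluster --list-node-labels
```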
Understand by example (1)
• A real-world example of why node partitions are needed:
• Company-X has a big cluster; the Engineering, Marketing, and Sales teams each have a 33% share of it.
Understand by example (2)
• The Engineering and Marketing teams need GPU-equipped servers for visualization work, so they each spent an equal amount of money to buy machines with GPUs.
• They want to share those machines 50:50.
• The Sales team spent $0 on these new nodes, so it cannot run anything on them at all.
Understand by example (3)
• Here is the problem:
• If you create a separate YARN cluster, the ops team will be unhappy.
• If you add the new nodes to the original cluster, you cannot guarantee that the Engineering/Marketing teams get preference on them.
Understand by example (4)
• Node partitions solve this problem: add a GPU partition managed by the same YARN RM. The admin can specify different percentage shares in different partitions.
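To make the shares concrete, here is a small arithmetic sketch: a queue's guaranteed resources are the sum of its share of each partition it can access. The partition sizes are made-up numbers; only the 33/33/33 and 50/50/0 percentages come from the example.

```python
# Hypothetical partition sizes (MB); the percentages match the example.
DEFAULT_PARTITION_MB = 300_000  # the original, non-GPU nodes (assumed)
GPU_PARTITION_MB = 100_000      # the newly added GPU nodes (assumed)

# Per-queue capacity, in percent, for each partition.
CAPACITY_PCT = {
    "engineering": {"default": 33, "GPU": 50},
    "marketing":   {"default": 33, "GPU": 50},
    "sales":       {"default": 33, "GPU": 0},
}

def guaranteed_mb(queue: str) -> int:
    """Guaranteed memory = sum over partitions of share% * partition size."""
    pct = CAPACITY_PCT[queue]
    return (pct["default"] * DEFAULT_PARTITION_MB
            + pct["GPU"] * GPU_PARTITION_MB) // 100

for q in CAPACITY_PCT:
    print(f"{q}: {guaranteed_mb(q)} MB guaranteed")
```

Sales keeps its full 33% of the original nodes; only the newly added GPU capacity is split 50:50 between the teams that paid for it.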
Understand by example (5)
• Understanding non-exclusive node partitions:
• In the previous example, the “GPU” partition can be used only by the Engineering and Marketing teams.
• This is bad for resource utilization.
• The admin can specify that when the “GPU” partition has idle resources, the sales queue may use them; when Engineering/Marketing come back, resources allocated to the sales queue are preempted.
• (Available since Hadoop 2.8.)
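A sketch of how such a partition is marked non-exclusive (Hadoop 2.8+ syntax; the label name follows the example):

```
# Non-exclusive: idle GPU capacity may be borrowed by queues without
# GPU access, and is preempted back when the owning queues need it.
yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"
```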
Understand by example (6)
• Configuration for the above example (Capacity Scheduler):
yarn.scheduler.capacity.root.queues=engineering,marketing,sales
yarn.scheduler.capacity.root.engineering.capacity=33
yarn.scheduler.capacity.root.marketing.capacity=33
yarn.scheduler.capacity.root.sales.capacity=33
---------
yarn.scheduler.capacity.root.engineering.accessible-node-labels=GPU
yarn.scheduler.capacity.root.marketing.accessible-node-labels=GPU
---------
yarn.scheduler.capacity.root.engineering.accessible-node-labels.GPU.capacity=50
yarn.scheduler.capacity.root.marketing.accessible-node-labels.GPU.capacity=50
---------
(optional)
yarn.scheduler.capacity.root.engineering.default-node-label-expression=GPU
The first block is the original configuration, without node partitions.
The accessible-node-labels lines are queue ACLs for node partitions.
The GPU.capacity lines set capacities within the node partition.
(Optional) The default-node-label-expression line means applications running in the queue will run in the GPU partition by default.
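The queue configuration above takes effect only once node labels are enabled on the RM; a minimal yarn-site.xml sketch (the HDFS path is a placeholder):

```
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Where the RM persists the label store (placeholder path) -->
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>
```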
Understand by example (7)
Architecture
• Central piece: NodeLabelsManager
• Stores labels and their attributes
• Stores the node-to-labels mapping
• It can be read/written via:
• the CLI and REST API (which we call centralized configuration), OR
• each NM retrieving its own labels and sending them to the RM (which we call distributed configuration)
• The scheduler uses the NodeLabelsManager to make decisions: it receives resource requests from AMs and returns allocated containers to them.
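A sketch of the distributed-configuration variant, where each NM reports its own partition (property names as of Hadoop 2.8; check your release's yarn-default.xml):

```
<!-- yarn-site.xml on the RM -->
<property>
  <name>yarn.node-labels.configuration-type</name>
  <value>distributed</value>
</property>

<!-- yarn-site.xml on each GPU node's NM -->
<property>
  <name>yarn.nodemanager.node-labels.provider</name>
  <value>config</value>  <!-- or "script" to run a detection script -->
</property>
<property>
  <name>yarn.nodemanager.node-labels.provider.configured-node-partition</name>
  <value>GPU</value>
</property>
```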
Case study (1) – uses node labels
• Use node labels to create isolated environments for batch/interactive/low-latency workloads.
• Deploy YARN containers onto compute nodes that are optimized and accelerated for each workload:
• RDMA-enabled nodes to accelerate shuffle
• Powerful CPU nodes to accelerate compression
• It is possible to DOUBLE THE DENSITY of today's traditional Hadoop cluster with substantially better price/performance.
• Create a converged system that lets Hadoop / Vertica / Spark and other stacks share a common pool of data.
Case study (2) – uses node labels
Case study (3) – eBay clusters use node labels
• Separate machine-learning workloads from regular workloads
• Use node labels to restrict licensed software to certain machines
• Enable GPU workloads
• Separate organizational workloads
Case study (4) – Slider use cases
• HBase region servers run on nodes with SSDs (non-exclusive partition).
• HBase masters have exclusive use of their nodes.
• MapReduce jobs run on the other nodes, and they can use idle resources on the region-server nodes.
Status – Done parts of node labels
• Exclusive / non-exclusive node partition support in the Capacity Scheduler (✓)
• User limits
• Preemption
• Now all respect node partitions!
• Centralized configuration via CLI/REST API (✓)
• Distributed configuration via the Node Manager's config/script (✓)
Status - Node Labels Web UI
Status – Other Apache projects support node labels
• The following projects already support node labels:
• Spark (SPARK-6470)
• MapReduce (MAPREDUCE-6304)
• Slider (SLIDER-81)
• (via Slider)
• (via Slider)
• (via Slider)
• Ambari (AMBARI-10063)
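As an illustration of how applications opt in, a couple of submission-time settings from the projects above (a sketch; jar names, class names, and label values are placeholders):

```
# MapReduce (MAPREDUCE-6304): run the job's containers in the GPU partition
# (assumes the job's main class uses ToolRunner to parse -D options)
hadoop jar my-job.jar MyJob \
  -Dmapreduce.job.node-label-expression=GPU

# Spark on YARN (SPARK-6470)
spark-submit \
  --conf spark.yarn.am.nodeLabelExpression=GPU \
  --conf spark.yarn.executor.nodeLabelExpression=GPU \
  ...
```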
Future of node labels
• Support constraints (YARN-3409)
• Orthogonal to partitions: constraints describe attributes of a node's hardware/software and are used only for affinity.
• Some examples of constraints:
• glibc version
• JDK version
• CPU type (x86_64/i686)
• Physical or virtualized
• With this, an application can ask for resources such as:
• glibc.version >= 2.20 && JDK.version >= 8u20 && x86_64
• Support node labels in the Fair Scheduler (YARN-2497)
• Support in more projects
• Tez
• Oozie
• …
Q & A