Resource Management with YARN: YARN Past, Present and Future
Anubhav Dhoot, Software Engineer, Cloudera
Resource Management
[Diagram: MapReduce, Impala, and Spark running on top of YARN (dynamic resource management)]
YARN (Yet Another Resource Negotiator)
Traditional Operating System:
• Storage: File System
• Execution/Scheduling: Processes / Kernel Scheduler
Hadoop:
• Storage: Hadoop Distributed File System (HDFS)
• Execution/Scheduling: Yet Another Resource Negotiator (YARN)
Overview of Talk
• History of YARN
• Recent features
• Ongoing features
• Future
WHY YARN
Traditional Distributed Execution Engines
[Diagram: clients submit work to a central master, which schedules tasks on workers]
MapReduce v1 (MR1)
[Diagram: clients submit jobs to the JobTracker, which schedules Map and Reduce tasks on TaskTrackers]
JobTracker tracks every task in the cluster!
MR1 Utilization
[Diagram: a 4 GB node carved into fixed 1024 MB slots: two Map slots and two Reduce slots]
Fixed-size slot model forces slots large enough for the biggest task!
Running multiple frameworks…
[Diagram: three frameworks running side by side on the same cluster, each with its own master, workers, tasks, and clients]
YARN to the rescue!
• Scalability: Track only applications, not all tasks.
• Utilization: Allocate only as many resources as needed.
• Multi-tenancy: Share resources between frameworks and users.
• Physical resources – memory, CPU, disk, network
YARN Architecture
[Diagram: clients submit applications to the ResourceManager, which tracks cluster state and application state; NodeManagers host each application's ApplicationMaster and its containers]
MR1 to YARN/MR2 functionality mapping
• JobTracker is split into:
  • ResourceManager – cluster management, scheduling, and application state handling
  • ApplicationMaster – handles tasks (containers) per application (e.g. an MR job)
  • JobHistoryServer – serves MR history
• TaskTracker maps to NodeManager
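To route MR jobs to YARN, clients point the framework at YARN in mapred-site.xml; a minimal sketch (the history server host name is a placeholder):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>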
EARLY FEATURES
Handling Faults on Workers
[Diagram: when a NodeManager fails, the ResourceManager reschedules its ApplicationMaster and containers on the remaining NodeManagers]
Master Fault Tolerance: RM Recovery
[Diagram: the ResourceManager persists application state to the RM Store; on restart it reloads that state and resyncs with NodeManagers and ApplicationMasters]
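RM recovery is driven by yarn-site.xml; a minimal sketch using the ZooKeeper state store (the ZK quorum addresses are placeholders):

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>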
Master Node Fault Tolerance: High Availability (Active / Standby)
[Diagram: an Active and a Standby ResourceManager, each with an Elector coordinating through ZooKeeper, sharing a common RM Store; on failover the Standby becomes Active]
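A minimal yarn-site.xml sketch for Active/Standby HA (RM IDs, host names, and cluster ID are placeholders):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>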
Scheduler
• Inside ResourceManager
• Decides who gets to run, when, and where
• Uses "queues" to describe organization needs
• Applications are submitted to a queue
• Two schedulers out of the box:
  • Fair Scheduler
  • Capacity Scheduler
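The scheduler is pluggable; selecting the Fair Scheduler in yarn-site.xml looks like this (a sketch):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>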
Fair Scheduler Hierarchical Queues
[Diagram: queue hierarchy]
• Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share: 4 GB mem, 8 cores
  • R&D – Fair Share: 4 GB mem, 8 cores
  • Sales – Fair Share: 4 GB mem, 8 cores
    • Jim's Team – Fair Share: 2 GB mem, 4 cores
    • Bob's Team – Fair Share: 2 GB mem, 4 cores
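A hierarchy like the one above is declared in the Fair Scheduler allocation file (fair-scheduler.xml); a sketch, with queue names chosen for illustration and equal weights implied by default:

<allocations>
  <queue name="marketing"/>
  <queue name="rnd"/>
  <queue name="sales">
    <queue name="jims_team"/>
    <queue name="bobs_team"/>
  </queue>
</allocations>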
Fair Scheduler Queue Placement Policies
Rules are tried in order; the first rule that matches (and may create the queue, if allowed) places the application:

<queuePlacementPolicy>
  <rule name="specified" />
  <rule name="primaryGroup" create="false" />
  <rule name="default" />
</queuePlacementPolicy>
Multi-Resource Scheduling
● Node capacities expressed in both memory and CPU
● Memory in MB and CPU in terms of vcores
● Scheduler uses the dominant resource for making decisions
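Dominant resource fairness can be enabled cluster-wide in the allocation file (a sketch; it can also be set per queue with a schedulingPolicy element):

<allocations>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>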
Multi-Resource Scheduling
[Example, implying a cluster capacity of roughly 36 GB and 12 cores]
• Queue 1 Usage: 12 GB (33% of capacity), 3 cores (25% of capacity) → dominant resource is memory
• Queue 2 Usage: 10 GB (28% of capacity), 6 cores (50% of capacity) → dominant resource is CPU
Multi-Resource Enforcement
● YARN kills containers that use too much memory
● CGroups for limiting CPU
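CPU enforcement via CGroups is configured in yarn-site.xml by switching to the LinuxContainerExecutor with the CGroups resource handler (a sketch):

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>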
RECENTLY ADDED FEATURES
RM recovery without losing work
• Preserve running containers across RM restart
• NM no longer kills containers on resync
• AM re-registers with the RM on resync
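Work-preserving recovery has its own switch in yarn-site.xml, on top of the RM recovery settings shown earlier (a sketch):

<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>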
[Diagram: containers keep running on the NodeManagers while the ResourceManager restarts and recovers application state from the RM Store]
NM Recovery without losing work
• NM stores container and its associated state in a local store
• On restart, reconstructs state from the store
• Default implementation uses LevelDB
• Supports rolling restarts with no user impact
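NM recovery is likewise enabled in yarn-site.xml, pointing the local store at a directory that survives restarts (the path is a placeholder):

<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>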
NM Recovery without losing work
[Diagram: the NodeManager persists container state to a local state store and reconstructs it on restart, while the ResourceManager and running containers are unaffected]
Fair Scheduler Dynamic User Queues
[Diagram: queue hierarchy]
• Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share: 4 GB mem, 8 cores
  • R&D – Fair Share: 4 GB mem, 8 cores
  • Sales – Fair Share: 4 GB mem, 8 cores
[Diagram: user queues are created on the fly as users submit applications. While Moe is the only user, his queue's fair share is 4 GB mem, 8 cores; once Larry submits as well, each user queue's fair share becomes 2 GB mem, 4 cores.]
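Dynamic user queues come from the queue placement policy; for example, a nestedUserQueue rule creates a per-user queue under the user's primary-group queue (a sketch):

<queuePlacementPolicy>
  <rule name="nestedUserQueue">
    <rule name="primaryGroup" create="true" />
  </rule>
  <rule name="default" />
</queuePlacementPolicy>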
ONGOING FEATURES
Long Running Apps on Secure Clusters (YARN-896)
● Update tokens of running applications
● Reset AM failure count to allow multiple failures over a long time
● Need to access logs while the application is running
● Need a way to show progress
Application Timeline Server (YARN-321, YARN-1530)
● Currently we have a JobHistoryServer for MapReduce history
● Generic history server
● Gives information even while the job is running
Application Timeline Server
● Store and serve generic data like when containers ran, container logs
● Apps post app-specific events
  o e.g. MapReduce Attempt Succeeded/Failed
● Pluggable framework-specific UIs
● Pluggable storage backend
  o Default: LevelDB
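Enabling the timeline server is a yarn-site.xml setting plus a daemon to run; a sketch (the host name is a placeholder):

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>timeline.example.com</value>
</property>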
Disk scheduling (YARN-2139)
● Disk as a resource, in addition to CPU and memory
● Expressed as virtual disk, similar to vcore for CPU
● Dominant resource fairness can handle this on the scheduling side
● Use the CGroups blkio controller for enforcement
Reservation-based Scheduling (YARN-1051)
FUTURE FEATURES
Container Resizing (YARN-1197)
● Change a container's resource allocation
● Very useful for frameworks like Spark that schedule multiple tasks within a container
● Follows the same paths as acquiring and releasing containers
Admin labels (YARN-796)
● Admin tags nodes with labels (e.g. GPU)
● Applications can include labels in container requests
[Diagram: an ApplicationMaster asking "I want a GPU" is matched to the NodeManager labeled [GPU, beefy] rather than the one labeled [Windows]]
Container Delegation (YARN-1488)
● Problem: single process wants to run work on behalf of multiple users.
● Want to count resources used against users that use them.
● E.g. Impala or HDFS caching
Container Delegation (YARN-1488)
● Solution: let apps “delegate” their containers to other containers on the same node.
● Delegated container never runs
● Framework container gets its resources
● Framework container is responsible for fairness within itself
Questions?
Thank You!
Anubhav Dhoot, Software Engineer, Cloudera
[email protected]