Resource Management with YARN: YARN Past, Present and Future
Anubhav Dhoot, Software Engineer, Cloudera
Resource Management
[Diagram: MapReduce, Impala, and Spark running on top of YARN (dynamic resource management)]
YARN (Yet Another Resource Negotiator)
Traditional Operating System:
• Storage: File System
• Execution/Scheduling: Processes / Kernel Scheduler
Hadoop:
• Storage: Hadoop Distributed File System (HDFS)
• Execution/Scheduling: Yet Another Resource Negotiator (YARN)
Overview of Talk
• History of YARN
• Recent features
• Ongoing features
• Future
WHY YARN
Traditional Distributed Execution Engines
[Diagram: clients submit work to a central master, which schedules tasks on workers]
MapReduce v1 (MR1)
[Diagram: clients submit jobs to the JobTracker, which schedules Map and Reduce tasks on TaskTrackers]
JobTracker tracks every task in the cluster!
MR1 Utilization
[Diagram: a 4 GB node carved into fixed 1024 MB slots: two Map slots and two Reduce slots]
Fixed-size slot model forces slots large enough for the biggest task!
Running multiple frameworks…
[Diagram: three frameworks running side by side on the same cluster, each with its own master, workers, tasks, and clients]
YARN to the rescue!
• Scalability: Track only applications, not all tasks.
• Utilization: Allocate only as many resources as needed.
• Multi-tenancy: Share resources between frameworks and users.
• Physical resources – memory, CPU, disk, network
YARN Architecture
[Diagram: clients submit applications to the ResourceManager, which tracks cluster state and application state; NodeManagers host each application's ApplicationMaster and its containers]
MR1 to YARN/MR2 functionality mapping
• JobTracker is split into:
  • ResourceManager – cluster management, scheduling, and application state handling
  • ApplicationMaster – handles tasks (containers) per application (e.g. an MR job)
  • JobHistoryServer – serves MR history
• TaskTracker maps to NodeManager
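To route MR jobs to YARN, clients point the framework at YARN in mapred-site.xml; a minimal sketch (the history server host name is a placeholder):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>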
EARLY FEATURES
Handling Faults on Workers
[Diagram: when a NodeManager fails, the ResourceManager reschedules its ApplicationMaster and containers on the remaining NodeManagers]
Master Fault Tolerance: RM Recovery
[Diagram: the ResourceManager persists application state to the RM Store; on restart it reloads that state and resyncs with NodeManagers and ApplicationMasters]
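RM recovery is driven by yarn-site.xml; a minimal sketch using the ZooKeeper state store (the ZK quorum addresses are placeholders):

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>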
Master Node Fault Tolerance: High Availability (Active / Standby)
[Diagram: an Active and a Standby ResourceManager, each with an Elector coordinating through ZooKeeper, sharing a common RM Store; on failover the Standby becomes Active]
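A minimal yarn-site.xml sketch for Active/Standby HA (RM IDs, host names, and cluster ID are placeholders):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>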
Scheduler
• Inside ResourceManager
• Decides who gets to run, when, and where
• Uses "queues" to describe organization needs
• Applications are submitted to a queue
• Two schedulers out of the box:
  • Fair Scheduler
  • Capacity Scheduler
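The scheduler is pluggable; selecting the Fair Scheduler in yarn-site.xml looks like this (a sketch):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>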
Fair Scheduler Hierarchical Queues
[Diagram: queue hierarchy]
• Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share: 4 GB mem, 8 cores
  • R&D – Fair Share: 4 GB mem, 8 cores
  • Sales – Fair Share: 4 GB mem, 8 cores
    • Jim's Team – Fair Share: 2 GB mem, 4 cores
    • Bob's Team – Fair Share: 2 GB mem, 4 cores
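A hierarchy like the one above is declared in the Fair Scheduler allocation file (fair-scheduler.xml); a sketch, with queue names chosen for illustration and equal weights implied by default:

<allocations>
  <queue name="marketing"/>
  <queue name="rnd"/>
  <queue name="sales">
    <queue name="jims_team"/>
    <queue name="bobs_team"/>
  </queue>
</allocations>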
Fair Scheduler Queue Placement Policies
Rules are tried in order; the first rule that matches (and may create the queue, if allowed) places the application:

<queuePlacementPolicy>
  <rule name="specified" />
  <rule name="primaryGroup" create="false" />
  <rule name="default" />
</queuePlacementPolicy>
Multi-Resource Scheduling
● Node capacities expressed in both memory and CPU
● Memory in MB and CPU in terms of vcores
● Scheduler uses the dominant resource for making decisions
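Dominant resource fairness can be enabled cluster-wide in the allocation file (a sketch; it can also be set per queue with a schedulingPolicy element):

<allocations>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>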
Multi-Resource Scheduling
[Example, implying a cluster capacity of roughly 36 GB and 12 cores]
• Queue 1 Usage: 12 GB (33% of capacity), 3 cores (25% of capacity) → dominant resource is memory
• Queue 2 Usage: 10 GB (28% of capacity), 6 cores (50% of capacity) → dominant resource is CPU
Multi-Resource Enforcement
● YARN kills containers that use too much memory
● CGroups for limiting CPU
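CPU enforcement via CGroups is configured in yarn-site.xml by switching to the LinuxContainerExecutor with the CGroups resource handler (a sketch):

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>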
RECENTLY ADDED FEATURES
RM recovery without losing work
• Preserve running containers across RM restart
• NM no longer kills containers on resync
• AM re-registers with the RM on resync
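Work-preserving recovery has its own switch in yarn-site.xml, on top of the RM recovery settings shown earlier (a sketch):

<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>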
[Diagram: containers keep running on the NodeManagers while the ResourceManager restarts and recovers application state from the RM Store]
NM Recovery without losing work
• NM stores container and its associated state in a local store
• On restart, reconstructs state from the store
• Default implementation uses LevelDB
• Supports rolling restarts with no user impact
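NM recovery is likewise enabled in yarn-site.xml, pointing the local store at a directory that survives restarts (the path is a placeholder):

<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>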
NM Recovery without losing work
[Diagram: the NodeManager persists container state to a local state store and reconstructs it on restart, while the ResourceManager and running containers are unaffected]
Fair Scheduler Dynamic User Queues
[Diagram: queue hierarchy]
• Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share: 4 GB mem, 8 cores
  • R&D – Fair Share: 4 GB mem, 8 cores
  • Sales – Fair Share: 4 GB mem, 8 cores
[Diagram: user queues are created on the fly as users submit applications. While Moe is the only user, his queue's fair share is 4 GB mem, 8 cores; once Larry submits as well, each user queue's fair share becomes 2 GB mem, 4 cores.]
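Dynamic user queues come from the queue placement policy; for example, a nestedUserQueue rule creates a per-user queue under the user's primary-group queue (a sketch):

<queuePlacementPolicy>
  <rule name="nestedUserQueue">
    <rule name="primaryGroup" create="true" />
  </rule>
  <rule name="default" />
</queuePlacementPolicy>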
ONGOING FEATURES
Long Running Apps on Secure Clusters (YARN-896)
● Update tokens of running applications
● Reset AM failure count to allow multiple failures over a long time
● Need to access logs while the application is running
● Need a way to show progress
Application Timeline Server (YARN-321, YARN-1530)
● Currently we have a JobHistoryServer for MapReduce history
● Generic history server
● Gives information even while the job is running
Application Timeline Server
● Store and serve generic data like when containers ran, container logs
● Apps post app-specific events
  o e.g. MapReduce Attempt Succeeded/Failed
● Pluggable framework-specific UIs
● Pluggable storage backend
  o Default: LevelDB
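Enabling the timeline server is a yarn-site.xml setting plus a daemon to run; a sketch (the host name is a placeholder):

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>timeline.example.com</value>
</property>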
Disk scheduling (YARN-2139)
● Disk as a resource, in addition to CPU and memory
● Expressed as virtual disk, similar to vcore for CPU
● Dominant resource fairness can handle this on the scheduling side
● Use the CGroups blkio controller for enforcement
Reservation-based Scheduling (YARN-1051)
FUTURE FEATURES
Container Resizing (YARN-1197)
● Change a container's resource allocation
● Very useful for frameworks like Spark that schedule multiple tasks within a container
● Follows the same paths as acquiring and releasing containers
Admin labels (YARN-796)
● Admin tags nodes with labels (e.g. GPU)
● Applications can include labels in container requests
[Diagram: an ApplicationMaster asking "I want a GPU" is matched to the NodeManager labeled [GPU, beefy] rather than the one labeled [Windows]]
Container Delegation (YARN-1488)
● Problem: single process wants to run work on behalf of multiple users.
● Want to count resources used against users that use them.
● E.g. Impala or HDFS caching
Container Delegation (YARN-1488)
● Solution: let apps “delegate” their containers to other containers on the same node.
● Delegated container never runs
● Framework container gets its resources
● Framework container is responsible for fairness within itself
Questions?
Thank You!
Anubhav Dhoot, Software Engineer, Cloudera
[email protected]