ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters
Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar
Context
• GPUs are used in supercomputers
  – Some of the Top500 supercomputers use GPUs
    • Tianhe-1A: 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPUs
    • Stampede: about 6,000 nodes (Xeon E5-2680 8C, Intel Xeon Phi)
• GPUs are used in cloud computing

Need for resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs
Categories of Scheduling Objectives
• Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency
• A market-based service world is emerging: focus on the provider's profit and the user's satisfaction
  – Cloud: pay-as-you-go model
    • Amazon: different user classes (On-Demand, Free, Spot, …)
  – Recent resource managers for supercomputers (e.g., MOAB) have the notion of a service-level agreement (SLA)
Motivation: State of the Art
• Open-source batch schedulers are starting to support GPUs
  – TORQUE, SLURM
  – Users guide the mapping of jobs to heterogeneous nodes
  – Simple scheduling schemes (goals: throughput & latency)
• Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
  – [gViM HPCVirt '09] [vCUDA IPDPS '09] [rCUDA HPCS '10] [gVirtuS Euro-Par 2010] [our HPDC '11, CCGRID '12, HPDC '12]
  – Simple scheduling schemes (goals: throughput & latency)
• Proposals on market-based scheduling policies focus on homogeneous CPU clusters
  – [Irwin HPDC '04] [Sherwani Soft.Pract.Exp. '04]

Our Goal: Reconsider market-based scheduling for heterogeneous clusters including GPUs
Considerations
• The community is looking into code portability between CPU and GPU
  – OpenCL
  – PGI CUDA-x86
  – MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC
  → Opportunity to flexibly schedule a job on CPU/GPU
• In cloud environments, oversubscription is commonly used to reduce infrastructure costs
  → Opportunity to use resource sharing to improve performance by maximizing hardware utilization
Problem Formulation
• Given a CPU-GPU cluster, schedule a set of jobs on the cluster
  – to maximize the provider's profit / aggregate user satisfaction
• Exploit the portability offered by OpenCL
  – flexibly map each job onto either the CPU or the GPU
• Maximize resource utilization
  – allow sharing of a multi-core CPU or a GPU

Assumptions/Limitations
• 1 multi-core CPU and 1 GPU per node
• Single-node, single-GPU jobs
• Only space-sharing, limited to two jobs per resource
(a minimal job-model sketch follows)
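To make the formulation concrete, here is a minimal sketch of the job abstraction these slides imply; the class, field, and method names are ours for illustration, not ValuePack's, and maxValue/decay anticipate the value function introduced on the next slide:

from dataclasses import dataclass

@dataclass
class Job:
    # One single-node, single-GPU job (names are illustrative).
    job_id: int
    walltime_cpu: float  # user-supplied walltime on the multi-core CPU (s)
    walltime_gpu: float  # user-supplied walltime on the GPU (s)
    max_value: float     # importance/priority (see the value function below)
    decay: float         # urgency: value lost per second of delay

    def optimal_resource(self) -> str:
        # The optimal resource is the one with the shorter walltime.
        return "GPU" if self.walltime_gpu <= self.walltime_cpu else "CPU"

    def walltime_on(self, resource: str) -> float:
        return self.walltime_gpu if resource == "GPU" else self.walltime_cpu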
Market-based Scheduling Formulation: Value Function
• For each job, a Linear-Decay Value Function [Irwin HPDC '04]:
  Yield = maxValue − decay × delay
• Max Value → importance/priority of the job; Decay → urgency of the job
• Delay due to:
  – queuing, execution on a non-optimal resource, resource sharing

[Figure: Yield/Value as a function of execution time, annotated with Max Value, the decay rate, and a time marker T]
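A one-line rendering of the linear-decay value function, with a worked example (the numbers are made up):

def linear_decay_yield(max_value: float, decay: float, delay: float) -> float:
    # Yield = maxValue - decay * delay; yield can go negative for long
    # delays, which the sharing heuristic later uses as a trigger.
    return max_value - decay * delay

# A job worth 100 that loses 2 units of value per second of delay
# (queuing + non-optimal resource + sharing) yields 40 after a 30 s delay.
print(linear_decay_yield(100.0, 2.0, 30.0))  # -> 40.0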
Overall Scheduling Approach: Scheduling Flow
• Jobs arrive in batches
• Phase 1: Mapping
  – Jobs are enqueued into the CPU queue or the GPU queue, i.e., on their optimal resource
  – Phase 1 is oblivious of other jobs (based on optimal walltime alone)
• Phase 2: Sorting
  – Sort jobs to improve yield (inter-job scheduling considerations)
• Phase 3: Re-mapping
  – Different schemes answer: when to remap? what to remap?
• Jobs then execute on the CPU or the GPU (a skeleton of the flow follows this list)
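A skeleton of the three-phase flow, assuming the Job sketch above; `reward` and `remap` stand in for the policy pieces detailed on the next slides, and the function shape is our sketch rather than ValuePack's actual code:

def schedule_batch(jobs, cpu_queue, gpu_queue, reward, remap):
    # Phase 1: mapping. Enqueue each job on its optimal resource,
    # oblivious of other jobs (decided by optimal walltime alone).
    for job in jobs:
        queue = gpu_queue if job.optimal_resource() == "GPU" else cpu_queue
        queue.append(job)
    # Phase 2: sorting. Reorder each pending queue to improve aggregate
    # yield; higher-reward jobs run first.
    cpu_queue.sort(key=reward, reverse=True)
    gpu_queue.sort(key=reward, reverse=True)
    # Phase 3: re-mapping. A scheme-specific policy (LOR/FNOR/LNORP/CORLP)
    # decides when to move jobs to the non-optimal resource and which ones.
    remap(cpu_queue, gpu_queue)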
Phase 1: Mapping
• Users provide a walltime on both CPU and GPU
  – walltime is used as the indicator of the optimal/non-optimal resource
  – each job is mapped onto its optimal resource
• NOTE: in our experiments we assumed maxValue = optimal walltime
Phase 2: Sorting
• Sort jobs based on Reward [Irwin HPDC '04]:
  Reward_i = (PresentValue_i − OpportunityCost_i) / Walltime_i
• Present Value: f(maxValue_i, discount_rate)
  – value after discounting the risk of running a job
  – the shorter the job, the lower the risk
• Opportunity Cost: degradation in value due to the selection of one among several alternatives
  OpportunityCost_i = Walltime_i × (Σ_j decay_j − decay_i)
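A hedged sketch of the reward computation, reusing the Job sketch above; the slide only states that PresentValue is some function of maxValue and the discount rate, so the discounting form below is one plausible choice, not necessarily the paper's:

def reward(job, pending_jobs, discount_rate, resource):
    # Reward_i = (PresentValue_i - OpportunityCost_i) / Walltime_i
    walltime = job.walltime_on(resource)
    # Present value: maxValue discounted by the risk of running the job;
    # the shorter the job, the lower the risk. The exact f() is not given
    # on the slide, so this linear discounting is an assumption.
    present_value = job.max_value / (1.0 + discount_rate * walltime)
    # Opportunity cost: value that all other pending jobs lose while this
    # job occupies the resource for `walltime` seconds.
    opportunity_cost = walltime * sum(
        other.decay for other in pending_jobs if other is not job)
    return (present_value - opportunity_cost) / walltime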
Phase 3: Remapping
• When to remap:
  – Uncoordinated schemes: when a queue is empty and its resource is idle
  – Coordinated scheme: when the CPU and GPU queues are imbalanced
• What to remap:
  – Which job will have the best reward on the non-optimal resource?
  – Which job will suffer the least reward penalty?
Phase 3: Uncoordinated Schemes
1. Last Optimal Reward (LOR)
   – Remap the job with the least reward on its optimal resource
   – Idea: least reward → least risk in moving
2. First Non-Optimal Reward (FNOR)
   – Compute the reward each job could produce on the non-optimal resource
   – Remap the job with the highest reward on the non-optimal resource
   – Idea: consider the non-optimal penalty
3. Last Non-Optimal Reward Penalty (LNORP)
   – Remap the job with the least reward degradation:
     RewardDegradation_i = OptimalReward_i − NonOptimalReward_i
(a candidate-selection sketch follows this list)
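How the three uncoordinated schemes pick a candidate, as a sketch; `reward_opt` and `reward_nonopt` are hypothetical helpers returning a job's reward on its optimal and non-optimal resource:

def pick_remap_candidate(pending, scheme, reward_opt, reward_nonopt):
    if scheme == "LOR":
        # Least reward on the optimal resource: least risk in moving it.
        return min(pending, key=reward_opt)
    if scheme == "FNOR":
        # Highest reward on the non-optimal resource.
        return max(pending, key=reward_nonopt)
    if scheme == "LNORP":
        # Least degradation: OptimalReward_i - NonOptimalReward_i.
        return min(pending, key=lambda j: reward_opt(j) - reward_nonopt(j))
    raise ValueError(f"unknown scheme: {scheme}")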
Phase 3: Coordinated Scheme
Coordinated Least Penalty (CORLP)
• When to remap: imbalance between the queues
  – Imbalance is affected by the decay rates and execution times of the jobs
  – Total Queuing-Delay Decay-Rate Product:
    TQDP = Σ_j queuing_delay_j × decay_j
  – Remap if |TQDP_CPU − TQDP_GPU| > threshold
• What to remap:
  – Remap the job with the least penalty degradation
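A sketch of the CORLP trigger; `arrival_time` is an attribute we assume each pending job records:

def tqdp(queue, now):
    # Total Queuing-Delay Decay-Rate Product:
    # TQDP = sum over pending jobs j of queuing_delay_j * decay_j
    return sum((now - job.arrival_time) * job.decay for job in queue)

def should_remap(cpu_queue, gpu_queue, now, threshold):
    # Remap when the two queues are imbalanced:
    # |TQDP_CPU - TQDP_GPU| > threshold
    return abs(tqdp(cpu_queue, now) - tqdp(gpu_queue, now)) > threshold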
Resource Sharing Heuristic
• Limitation: two jobs can space-share a CPU/GPU
• Factors affecting sharing:
  – (−) slowdown incurred by jobs using half of a resource
  – (+) more resources available for other jobs
• Jobs are categorized as low, medium, or high scaling (based on models/profiling)
• When to enable sharing:
  – a large fraction of jobs in the pending queues have negative yield
• What jobs share a resource: the Scalability-DecayRate factor
  – jobs are grouped based on scalability
  – within each group, jobs are ordered by decay rate (urgency)
  – pick the top K fraction of jobs, where K is tunable (low scalability, low decay first; see the sketch after this list)
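A sketch of the Scalability-DecayRate selection; `scaling` is an assumed per-job attribute ('low'/'medium'/'high', from models or profiling):

def pick_sharing_jobs(pending, k):
    # Group jobs by scaling category, order by decay rate (urgency) within
    # each group, and pick the top K fraction: low-scalability, low-decay
    # jobs lose the least by space-sharing half a resource.
    rank = {"low": 0, "medium": 1, "high": 2}
    ordered = sorted(pending, key=lambda j: (rank[j.scaling], j.decay))
    return ordered[: int(k * len(pending))]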
Overall System Prototype

[Figure: system diagram; a master node communicates over TCP with many compute nodes, each with a multi-core CPU and a GPU]

Master Node (centralized decision making)
• Cluster-Level Scheduler
  – Scheduling Schemes & Policies
  – TCP Communicator
  – Submission Queue, Pending Queues, Execution Queues, Finished Queues

Compute Nodes (execution & sharing mechanisms)
• Node-Level Runtime on each node
  – TCP Communicator
  – CPU Execution Processes: OS-based scheduling & sharing on the multi-core CPU
  – GPU Execution Processes
  – GPU Consolidation Framework
• Assumption: shared file system
GPU Consolidation Framework

[Figure: GPU Sharing Framework, the GPU-related part of the Node-Level Runtime]

• Front-End: GPU execution processes
  – CUDA applications (app1 … appN), each linked against a CUDA Interception Library
  – intercepted CUDA calls are forwarded over the Front End - Back End Communication Channel
• Back-End: a single Virtual Context on top of the CUDA runtime, CUDA driver, and GPU
  – Back-End Server: receives the CUDA calls arriving from the front ends
  – Workload Consolidator: multiplexes applications onto CUDA streams (stream1 … streamN) and manipulates kernel configurations to allow GPU space sharing
• Simplified version of our HPDC'11 runtime (a toy sketch of the call-forwarding shape follows)
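The real framework interposes on the CUDA runtime API in native code; the toy sketch below (all names are ours) only illustrates the call-forwarding shape: a front-end stub forwards "kernel launches" to a back end that gives each application its own stream and halves grid configurations so two applications can space-share.

class ToyBackEnd:
    # One virtual context; each application gets its own 'stream', and
    # kernel configurations are manipulated (here: grid halved) so two
    # applications can space-share the GPU.
    def __init__(self):
        self.streams = {}  # app_id -> list of queued launches

    def submit(self, app_id, kernel_name, grid, block):
        shared_grid = max(1, grid // 2)  # kernel-configuration manipulation
        self.streams.setdefault(app_id, []).append(
            (kernel_name, shared_grid, block))
        return shared_grid

class InterceptionStub:
    # Stands in for the CUDA Interception Library: instead of executing a
    # call locally, forward it over the front-end/back-end channel.
    def __init__(self, backend, app_id):
        self.backend, self.app_id = backend, app_id

    def launch_kernel(self, kernel_name, grid, block):
        return self.backend.submit(self.app_id, kernel_name, grid, block)

backend = ToyBackEnd()
app1 = InterceptionStub(backend, app_id=1)
print(app1.launch_kernel("vector_add", grid=1024, block=256))  # -> 512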
Experimental Setup
• 16-node cluster
  – CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
  – GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
• 256-job workload
  – 10 benchmark programs
  – 3 configurations: small, large, very large datasets
  – various application domains: scientific computations, financial analysis, data mining, machine learning
• Baselines
  – TORQUE (always uses the optimal resource)
  – Minimum Completion Time (MCT) [Maheswaran et al., HCW '99]
Comparison with TORQUE-based Metrics: Throughput & Latency

[Figure: completion time and average latency (Comp. Time-UM, Comp. Time-BM, Ave. Lat-UM, Ave. Lat-BM), normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; completion time 10-20% better, average latency ~20% better]

• Baselines suffer from idle resources
• By privileging shorter jobs, our schemes reduce queuing delays
Results with the Average Yield Metric: Effect of Job Mix

[Figure: relative average yield vs. CPU/GPU job mix ratio (25C/75G skewed-GPU, 50C/50G uniform, 75C/25G skewed-CPU) for TORQUE, MCT, FNOR, LNORP, LOR, CORLP; up to 8.8x and up to 2.3x better than the baselines]

• Better on skewed job mixes:
  – more idle time in the case of the baseline schemes
  – more room for dynamic mapping
Results with the Average Yield Metric: Effect of Value Function

[Figure: relative average yield under linear-decay and step-decay value functions for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; up to 3.8x and up to 6.9x better than the baselines]

• Our schemes adapt to different value functions
Results with the Average Yield Metric: Effect of System Load

[Figure: relative average yield vs. total number of jobs (128, 256, 384, 512) for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; up to 8.2x better than the baselines]

• As the load increases, the yield from the baselines decreases linearly
• The proposed schemes achieve an initially increasing yield and then sustain it
Yield Improvements from Sharing: Effect of Sharing

[Figure: yield improvement (%) vs. the sharing factor K (0.1-0.6), i.e., the fraction of jobs allowed to share, for CPU-only, GPU-only, and CPU & GPU sharing; up to 23% improvement]

• Careful space sharing can help performance by freeing resources
• Excessive sharing can be detrimental to performance
Summary & Conclusion
• Value-based scheduling on CPU-GPU clusters
  – goal: improve aggregate yield
• Coordinated and uncoordinated scheduling schemes for dynamic mapping
• Automatic space sharing of resources based on heuristics
• Prototype framework for evaluating the proposed schemes
• Improvement over the state of the art
  – based on completion time & latency
  – based on average yield