

Gandiva: Introspective Cluster Scheduling for Deep Learning


Wencong Xiao♠†*, Romil Bhardwaj†*, Ramachandran Ramjee†, Muthian Sivathanu†, Nipun Kwatra†, Zhenhua Han♣†, Pratyush Patel†, Xuan Peng♦†, Hanyu Zhao●†, Quanlu Zhang†, Fan Yang†, Lidong Zhou†

♠ Beihang University, † Microsoft Research, ♣ The University of Hong Kong, ♦ Huazhong University of Science and Technology, ● Peking University

Characteristics of Deep Learning Jobs

Model Sensitivity

- DL jobs have different sensitivities to resource affinity (e.g., intra-server locality), due to network architecture or hyper-parameters (e.g., batch size)

- Other cases: inter-server locality, 1-GPU interference, NIC interference

[Figure: sensitivity to intra-server locality]

Traditional GPU Allocation

[Diagram: Job3, Job4, …, JobN wait in a job queue while the scheduler assigns each job dedicated GPUs in the cluster]

- Allocation happens reactively at job arrival and departure

- GPUs are dedicated to a job for its whole lifetime

- Jobs are queued if no qualified resources are available

Intra-job Predictability

- GPU memory usage follows a cyclic pattern aligned with mini-batch boundaries, usually with more than a 10x difference in utilization within a mini-batch (see the sketch below)

[Figure: GPU memory usage during training]
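This cyclic pattern is easy to observe from inside a training loop. A minimal PyTorch sketch (model, loader, optimizer, and loss_fn are placeholder names, not Gandiva code) that records per-iteration peak and boundary memory:

```python
import torch

def trace_memory_cycle(model, loader, optimizer, loss_fn, device="cuda"):
    """Record (peak, boundary) GPU memory per iteration to expose the cycle."""
    trace = []
    for inputs, targets in loader:
        torch.cuda.reset_peak_memory_stats(device)
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()    # activations peak during forward/backward
        optimizer.step()
        peak = torch.cuda.max_memory_allocated(device)   # mid-iteration high
        trough = torch.cuda.memory_allocated(device)     # mini-batch boundary low
        trace.append((peak, trough))
    return trace
```

The trough at each mini-batch boundary is what makes suspend/resume and migration cheap: only the persistent model and optimizer state must be saved.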

Feedback-driven exploration

- Deep learning experiments today use manual or automated (AutoML) trial-and-error techniques to find the best model

[Diagram: many candidate jobs (Job1-Job6) train concurrently on a GPU cluster to answer "Which model is better?"]

Experimental Results

Hyperparameter Search with Time-slicing

- Search across 12 dimensions: LeNet on CIFAR-10 (a toy sketch of the approach follows)

[Figure: time to find a qualified model (minutes)]
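Time-slicing speeds this search up because all candidates make progress concurrently and weak ones can be killed early. A toy illustration (not Gandiva's actual policy; train_one_slice and score are hypothetical callbacks):

```python
def time_sliced_search(configs, train_one_slice, score):
    """Toy feedback-driven search: candidates share GPUs via time-slicing;
    after each round, the weaker half is killed early."""
    live = list(configs)
    while len(live) > 1:
        for cfg in live:             # each candidate gets one time slice
            train_one_slice(cfg)
        live.sort(key=score, reverse=True)
        live = live[: len(live) // 2]
    return live[0]
```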

Cluster Experiment: Time-slicing + Migration

- 9-day trace from Microsoft servers on 100 GPUs

[Figure: CDF of job completion time]

Cluster Experiment: Time-slicing + Packing

- Mixed PyTorch jobs on 180 Tesla GPUs

[Figure: number of successful packing decisions and average GPU utilization (%)]

Low-overhead Suspend/Resume & Migration

[Figure: migration time of real workloads]


Introspective Policies

- Up to 7x faster hyperparameter search

- 27% reduction in job completion time

- 13.6x faster AutoML in shared environments

- 26% increase in cluster GPU utilization

- 4.5x faster job feedback

Scheduling Mechanisms for Deep Learning

Over-subscription

- Time-slice so that multiple jobs share a GPU, each receiving a weighted time share (see the sketch after this list)

- Pack multiple jobs in the same server if jobs have light-weight resource requirements
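A minimal sketch of weighted time-slicing on a single GPU, assuming hypothetical run_slice, suspend, and resume hooks (Gandiva's real scheduler works at cluster scope); shares converge to the weights via a deficit counter:

```python
class TimeSlicer:
    """Toy weighted time-slicer: one job runs at a time; each job's share
    of slices is proportional to its weight."""
    def __init__(self, weights):           # weights: {job_name: weight}
        self.weights = weights
        self.deficit = {j: 0.0 for j in weights}
        self.current = None

    def run(self, slices, run_slice, suspend, resume):
        total = sum(self.weights.values())
        for _ in range(slices):
            for j, w in self.weights.items():
                self.deficit[j] += w / total   # earn share each slice
            nxt = max(self.deficit, key=self.deficit.get)
            if nxt != self.current:
                if self.current is not None:
                    suspend(self.current)      # stash GPU state in host memory
                resume(nxt)
                self.current = nxt
            run_slice(nxt)                     # e.g., ~60 s of mini-batches
            self.deficit[nxt] -= 1.0
```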

Profiling for introspection

- Monitor resource utilization (e.g., GPU utilization and memory), as in the sketch below

- Non-invasive progress-rate estimation for scheduling decisions
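A sketch of the non-invasive monitoring half, assuming the standard pynvml bindings (what Gandiva actually samples server-side is not shown on the poster):

```python
from pynvml import (nvmlInit, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetMemoryInfo, nvmlDeviceGetUtilizationRates)

nvmlInit()  # one-time NVML setup

def sample_gpu(index=0):
    """Device-level sample; needs no cooperation from the job itself."""
    handle = nvmlDeviceGetHandleByIndex(index)
    util = nvmlDeviceGetUtilizationRates(handle)  # .gpu: percent busy
    mem = nvmlDeviceGetMemoryInfo(handle)         # .used/.total: bytes
    return util.gpu, mem.used, mem.total
```

Progress rate can then be estimated from the period of the cyclic memory pattern, without modifying the job.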

Suspend-Resume/Packing

- Copy GPU memory to CPU at mini-batch boundaries (see the sketch after this list)

- Restore state from CPU memory on resume

- Or, pack multiple processes to run simultaneously on a GPU
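A PyTorch sketch of the suspend/resume idea (Gandiva implements this inside the framework; here it is approximated with public APIs). Moving a module with .cpu() updates parameter tensors in place, so the optimizer's references stay valid:

```python
import torch

def suspend_to_cpu(model, optimizer):
    """At a mini-batch boundary, move all GPU state to host memory."""
    model.cpu()                                 # parameters (and grads)
    for state in optimizer.state.values():      # e.g., momentum buffers
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.cpu()
    torch.cuda.empty_cache()                    # return cached GPU memory

def resume_on_gpu(model, optimizer, device="cuda"):
    model.to(device)
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device)
```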

Migration

- Generic GPU processes: use CRIU to dump and restore process state across machines

- Checkpoint-aware processes: repurpose TF checkpointing APIs to save and restore state; pre-warm libraries for fast migration (a checkpointing sketch follows)
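A hedged sketch of the checkpoint-aware path using TensorFlow's public checkpointing API; the shared path is illustrative, and Gandiva's pre-warming of libraries on the destination machine is not shown:

```python
import tensorflow as tf

def checkpoint_for_migration(model, optimizer, prefix="/mnt/shared/job-ckpt"):
    """Source machine: save model + optimizer state at a mini-batch boundary."""
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    return ckpt.save(prefix)          # returns the concrete checkpoint path

def restore_after_migration(model, optimizer, path):
    """Destination machine: restore state and continue training."""
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    ckpt.restore(path)
```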

Grow-Shrink

- Opportunistically scale jobs out to idle GPUs

- Vacate GPUs on demand

- Depends on the job's ability to utilize additional GPUs (a job-side sketch follows)
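Grow-shrink needs cooperation from the job. One way a PyTorch job could support it (a toy sketch, not Gandiva's mechanism) is to re-wrap its model over the currently granted GPUs at a mini-batch boundary:

```python
from torch import nn

class ElasticModel:
    """Toy grow/shrink support: re-wrap the model over whatever GPUs the
    scheduler currently grants, only between mini-batches."""
    def __init__(self, base, device_ids):
        self.base = base
        self.device_ids = None
        self.on_grant_change(device_ids)

    def on_grant_change(self, device_ids):   # called between mini-batches
        if device_ids != self.device_ids:
            self.device_ids = device_ids
            self.base.to(f"cuda:{device_ids[0]}")
            self.model = nn.DataParallel(self.base, device_ids=device_ids)
```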

Runtime adjustment

- Migrate jobs at mini-batch boundaries if better resources appear

- Defragment GPUs to better compact resources for multi-GPU jobs

- Grow jobs onto more resources when available and shrink them when required (a toy policy loop follows)
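Putting the pieces together, a toy policy loop; every helper here (job.rate, cluster.migrate, and friends) is hypothetical, since the poster does not spell out Gandiva's exact policy:

```python
def introspective_step(job, cluster):
    """Toy introspective policy: act on profiled progress rates."""
    best = max(cluster.feasible_placements(job), key=job.rate)
    # Migrate only at a mini-batch boundary, and only for a clear win.
    if job.rate(best) > 1.1 * job.rate(job.placement) and job.at_boundary():
        cluster.migrate(job, best)        # also used to defragment GPUs
    if cluster.idle_gpus() and job.can_grow():
        cluster.grow(job)                 # opportunistic grow
    elif cluster.under_pressure() and job.can_shrink():
        cluster.shrink(job)               # vacate on demand
```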

[Diagram: Servers 1-4 showing a completed job, multi-GPU jobs J1 and J2, packed jobs J3, and a migration that compacts the cluster]