Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework
![Page 1: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/1.jpg)
Supporting GPU Sharing in Cloud Environments with a Transparent
Runtime Consolidation Framework
Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Laboratories America)
![Page 2: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/2.jpg)
Two Interesting Trends
• GPU, “Big player” in High Performance Computing
  – Excellent “price-performance” and “performance-per-watt” ratio
  – Heterogeneous architectures – AMD Fusion APU, Intel Sandy Bridge, NVIDIA Denver Project
  – 3 out of top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame)
• Emergence of Cloud – “Pay-as-you-go” model
  – Cluster instances, high-speed interconnects for HPC users
  – Amazon, Nimbix GPU instances
BIG FIRST STEP! But at initial stages
![Page 3: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/3.jpg)
Motivation
• Sharing is the basis of cloud, GPU no exception
  – Multiple virtual machines may share a physical node
• Modern GPUs are more expensive than multi-core CPUs
  – Fermi cards with 6 GB memory, $4000
  – Better resource utilization
• Modern GPUs expose a high degree of parallelism
  – Applications may not utilize the full potential
![Page 4: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/4.jpg)
Related Work
• vCUDA (Shi et al.)
• GViM (Gupta et al.)
• gVirtuS (Giunta et al.)
• rCUDA (Duato et al.)
Enable GPU Visibility from Virtual Machines
Limitation: Only from Single Process Context
How to share GPUs from Virtual Machines?
CUDA Compute 2.0+ Supports Task Parallelism
![Page 5: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/5.jpg)
Contributions
• A Framework for transparent GPU sharing in cloud
  – No source code changes required, feasible in cloud
  – Propose sharing through consolidation
• Solution to conceptual consolidation problem
  – New method for computing consolidation affinity scores
  – Two new molding methods
  – Overall Runtime consolidation algorithm
• Extensive evaluation with 8 benchmarks on 2 GPUs
  – At high contention, 50% improved throughput
  – Framework overheads are small
![Page 6: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/6.jpg)
Outline
• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions
![Page 8: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/8.jpg)
BACKGROUND
• GPU Architecture
• CUDA Mapping and Scheduling
![Page 9: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/9.jpg)
Background
[Figure: GPU architecture. Several SMs, each with its own shared memory (SH MEM), sit above the GPU device memory.]
• Resource requirements < max available: interleaved execution
• Resource requirements > max available: serialized execution
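The interleaved-versus-serialized rule on this slide can be illustrated with a minimal sketch. The slide does not say which resources are compared, so the two checked here (resident threads and shared memory per SM) are assumptions, as are the names `BlockDemand` and `execution_mode`. The 48 KB figure matches the shared memory per SM quoted later in the setup; the 1536-thread limit is the value commonly quoted for compute capability 2.0 devices and should be treated as an assumption too.

```python
from dataclasses import dataclass

# Assumed per-SM limits for a Fermi-class GPU (not taken from the slides,
# except the 48 KB shared memory figure).
MAX_THREADS_PER_SM = 1536
MAX_SHARED_MEM_PER_SM = 48 * 1024  # bytes


@dataclass(frozen=True)
class BlockDemand:
    """Resources one thread block would occupy on an SM (hypothetical type)."""
    threads: int
    shared_mem_bytes: int


def execution_mode(blocks):
    """Return 'interleaved' if all blocks fit on one SM together,
    otherwise 'serialized'."""
    total_threads = sum(b.threads for b in blocks)
    total_shmem = sum(b.shared_mem_bytes for b in blocks)
    fits = (total_threads <= MAX_THREADS_PER_SM
            and total_shmem <= MAX_SHARED_MEM_PER_SM)
    return "interleaved" if fits else "serialized"


# Two blocks that fit together, then two that exceed the shared memory limit.
print(execution_mode([BlockDemand(256, 0), BlockDemand(512, 16 * 1024)]))
print(execution_mode([BlockDemand(256, 40 * 1024), BlockDemand(256, 16 * 1024)]))
```

The sketch treats the boundary case (demand exactly equal to the limit) as fitting, which the slide leaves unspecified.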
![Page 11: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/11.jpg)
UNDERSTANDING CONSOLIDATION ON GPU
• Demonstrate Potential of Consolidation
• Relation between Utilization and Performance
• Preliminary experiments with consolidation
![Page 12: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/12.jpg)
GPU Utilization vs Performance
[Figure: Scalability of Applications. x-axis: execution configuration (2*256, 4*256, 8*256, 16*256, 32*256, 64*256); y-axis: scalability over 1*256 (0 to 14); series: Black Scholes, Binomial Options, PDE Solver, Image Processing. Annotations: Linear, Sub-Linear, No Significant Improvement, Good Improvement.]
![Page 13: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/13.jpg)
Consolidation with Space and Time Sharing
[Figure: App 1 and App 2 sharing four SMs, each with its own shared memory (SH MEM).]
• Cannot utilize all SMs effectively
• Better performance at large no. of blocks
![Page 15: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/15.jpg)
FRAMEWORK DESIGN
• Challenges
• gVirtuS Current Design
• Consolidation Framework & its Components
![Page 16: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/16.jpg)
Design Challenges
• Enabling GPU Sharing: Need a Virtual Process Context
• When & What to Consolidate: Need Policies and Algorithms to decide
• Overheads: Light-Weight Design
![Page 17: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/17.jpg)
gVirtuS Current Design
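[Figure: gVirtuS architecture. Guest side: VM1 and VM2, each running a CUDA app (CUDA App1, CUDA App2) on top of a Frontend Library. A Guest-Host Communication Channel connects the guests to the host side (Linux / VMM), where the gVirtuS Backend runs Backend Process 1 and Backend Process 2 on top of the CUDA Runtime and CUDA Driver, which drive GPU1 … GPUn.]
• Fork Process
• No communication between processes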
![Page 18: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/18.jpg)
Runtime Consolidation Framework
[Figure: host-side framework. Workloads arrive from the Frontend at the BackEnd Server, which queues workloads to the Dispatcher. The Dispatcher queues workloads to the Virtual Context ready queue. Other labels: Consolidation Decision Maker (Policies, Heuristics); two Virtual Contexts, each with a Workload Consolidator and a GPU; Thread.]
![Page 20: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/20.jpg)
CONSOLIDATION DECISION MAKING LAYER
• GPU Sharing Mechanisms & Resource Contention
• Two Molding Policies
• Consolidation Runtime Scheduling Algorithm
![Page 21: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/21.jpg)
Sharing Mechanisms & Resource Contention
• Sharing Mechanisms
  – Consolidation by Space Sharing
  – Consolidation by Time Sharing
• Resource Contention (basis of affinity score)
  – Large no. of threads within a block
  – Pressure on shared memory
![Page 22: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/22.jpg)
Molding Kernel Configuration
• Perform molding dynamically
• Leverage gVirtuS to intercept kernel launch
• Flexible for configuration modification
• Mold the configuration to reduce contention
• Potential increase in application latency
• However, may still improve global throughput
![Page 23: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/23.jpg)
Two Molding Policies
• Forced Space Sharing: 14 * 256 → 7 * 256
  – May resolve shared memory contention
• Time Sharing with Reduced Threads: 14 * 512 → 14 * 128
  – May reduce register pressure in the SM
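The two before/after configurations on this slide (14 * 256 to 7 * 256, and 14 * 512 to 14 * 128) can be reproduced with a small sketch. The slide only shows the example configurations, not the rule that produces them, so the rules below (halve the SM count for space sharing, divide threads by 4 for time sharing) and the function names are assumptions chosen to match those examples. The sketch also assumes the kernel still computes correct results under a changed launch configuration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def forced_space_sharing(blocks, threads, num_sms=14):
    """Cap the block count at half the SMs, leaving the other half free
    for a co-scheduled kernel (hypothetical rule)."""
    return min(blocks, num_sms // 2), threads


def time_sharing_reduced_threads(blocks, threads, factor=4):
    """Keep the block count but shrink each block by `factor`,
    never going below one warp (hypothetical rule)."""
    return blocks, max(WARP_SIZE, threads // factor)


print(forced_space_sharing(14, 256))          # (7, 256)
print(time_sharing_reduced_threads(14, 512))  # (14, 128)
```

Either molding changes how much work each thread or block must cover, which is consistent with the earlier note that molding may increase per-application latency while still improving global throughput.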
![Page 24: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/24.jpg)
Consolidation Scheduling Algorithm
• Greedy-based Scheduling Algorithm
• Schedule “N” kernels on 2 GPUs
• Input: 3-Tuple Execution Configuration list of all kernels
• Data Structure: Work Queue for each Virtual Context
Overall Algorithm
• Generate Pair-wise Affinity
• Generate Affinity for List
• Get Affinity By Molding
![Page 25: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/25.jpg)
Consolidation Scheduling Algorithm
1. Input: configuration list
2. Create work queues for virtual contexts
3. Generate pair-wise affinity
4. Find the pair with min. affinity and split the pair into different queues
5. For each remaining kernel:
   – (a1, a2) = Generate Affinity For List, with each work queue
   – (a3, a4) = Get Affinity By Molding, with each work queue
   – Find Max(a1, a2, a3, a4)
   – Push kernel into queue
6. Dispatch queues into virtual contexts
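The flow on this slide can be sketched as a greedy loop over two work queues. The slides do not define the affinity score, so the `pressure`-based score, the `mold` rule (halving threads per block), and the `MOLD_PENALTY` constant below are placeholders invented for illustration, not the authors' formulas. The sketch also assumes the 3-tuple is (blocks, threads per block, shared memory bytes per block), which the slides do not state.

```python
from itertools import combinations

NUM_SMS = 14                   # SMs per GPU, as in the experimental setup
MAX_THREADS_PER_SM = 1536      # assumed per-SM thread limit
MAX_SHMEM_PER_SM = 48 * 1024   # 48 KB shared memory per SM
MOLD_PENALTY = 0.1             # hypothetical cost for the latency molding may add


def pressure(kernels):
    """Toy contention estimate: largest fraction of per-SM threads or
    shared memory demanded if all kernels are co-resident."""
    threads = shmem = 0
    for blocks, tpb, shm in kernels:
        blocks_per_sm = -(-blocks // NUM_SMS)  # ceiling division
        threads += blocks_per_sm * tpb
        shmem += blocks_per_sm * shm
    return max(threads / MAX_THREADS_PER_SM, shmem / MAX_SHMEM_PER_SM)


def affinity(kernel, queue):
    """Higher is better: less contention when kernel joins queue."""
    return -pressure(list(queue) + [kernel])


def mold(kernel):
    """Hypothetical molding: halve threads per block, not below a warp."""
    blocks, tpb, shm = kernel
    return blocks, max(32, tpb // 2), shm


def schedule(kernels):
    """Greedy assignment of kernels to two work queues (one per GPU)."""
    queues = [[], []]
    remaining = list(kernels)
    if len(remaining) >= 2:
        # Seed the queues with the least compatible pair, one kernel each.
        i, j = min(combinations(range(len(remaining)), 2),
                   key=lambda p: affinity(remaining[p[0]], [remaining[p[1]]]))
        queues[0].append(remaining[i])
        queues[1].append(remaining[j])
        remaining = [k for n, k in enumerate(remaining) if n not in (i, j)]
    for kernel in remaining:
        candidates = []
        for q_index, queue in enumerate(queues):
            candidates.append((affinity(kernel, queue), q_index, kernel))
            molded = mold(kernel)
            candidates.append(
                (affinity(molded, queue) - MOLD_PENALTY, q_index, molded))
        # Pick the best of (a1, a2, a3, a4) and push the chosen variant.
        _, q_index, chosen = max(candidates, key=lambda c: c[0])
        queues[q_index].append(chosen)
    return queues


print(schedule([(14, 512, 0), (14, 512, 0), (14, 128, 0), (14, 128, 0)]))
```

With this toy score the two 512-thread kernels are the least compatible pair, so they seed different queues and each then receives one 128-thread kernel. Dispatching the queues to virtual contexts is left out of the sketch.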
![Page 27: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/27.jpg)
EXPERIMENTAL RESULTS
• Setup, Metric & Baselines
• Benchmarks
• Results
![Page 28: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/28.jpg)
Setup, Metric & Baselines
• Setup
  – A machine with two Intel quad-core Xeon E5520 CPUs
  – Two NVIDIA Tesla C2050 GPU cards
    • 14 streaming multiprocessors, each containing 32 cores
    • 3 GB device memory
    • 48 KB shared memory per SM
  – Virtualized with gVirtuS 2.0
• Evaluation Metric
  – Global throughput benefit obtained after consolidation of kernels
• Baselines
  – Serialized execution, based on CUDA runtime scheduling
  – Blind round-robin based consolidation (unaware of exec. configuration)
![Page 29: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/29.jpg)
Benchmarks & Goals
Benchmarks and Their Characteristics

| Benchmark | Memory characteristics | Data set description |
|---|---|---|
| Image Processing (IP) | No ShMem | 2*3584*3584 points |
| PDE Solver (PDE) | No ShMem | 2*3584*3584 points |
| BlackScholes (BS) | No ShMem | 1,000,000 options |
| Binomial Options (BO) | Low ShMem (up to 3 KB) | 256 options, 2048 steps |
| K-Means Clustering (KM) | Med ShMem (up to 16 KB) | 4194304 points |
| K-Nearest Neighbour (KNN) | Med ShMem (up to 16 KB) | 4194304 points |
| Euler (EU) | Heavy ShMem (up to 48 KB) | 10,000 nodes, 60,000 edges |
| Molecular Dynamics (MD) | Heavy ShMem (up to 48 KB) | 130,000 nodes, 16,200,000 edges |
![Page 30: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/30.jpg)
Benefits of Space and Time Sharing Mechanisms
[Figure panels: Space Sharing; Time Sharing]
• No resource contention
• Consolidation through Blind Round-Robin algorithm
• Compared against serialized execution of kernels
![Page 31: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/31.jpg)
Drawbacks of Blind Scheduling
[Figure panels: Large Number of Threads; Shared Memory Contention]
• Presence of resource contention
• No benefit from consolidation
![Page 32: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/32.jpg)
Effect of Molding
[Figure panels: Contention – Large Threads (Time Sharing with Reduced Threads); Contention – Shared Memory (Forced Space Sharing)]
![Page 33: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/33.jpg)
Effect of Affinity Scores
Kernel Configurations
• 2 kernels with 7*512
• 2 kernels with 14*256

• No affinity – Unbalanced threads per SM
• With affinity – Better thread balancing per SM
![Page 34: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/34.jpg)
Benefits at High Contention Scenario
8 Kernels on 2 GPUs
• 6 out of 8 kernels molded
• 31.5% improvement over Blind Scheduling
• 50% over serialized execution
![Page 35: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/35.jpg)
Framework Overheads
[Figure panels: No Consolidation; With Consolidation]
• No consolidation: compared to plain gVirtuS execution, overhead always less than 1%
• With consolidation: compared with manually consolidated execution, overhead always less than 4%
![Page 37: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/37.jpg)
Conclusions
• A Framework for transparent sharing of GPUs
• Use Consolidation as a mechanism for sharing GPUs
• No source code level changes
• New Affinity and Molding methods
• Runtime Consolidation Scheduling Algorithm
• At high contention, significant throughput benefits
• The overheads of the framework are small
![Page 38: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/38.jpg)
Thank you for your attention!
Questions?
![Page 39: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/39.jpg)
Impact of Large Number of Threads
![Page 40: Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816358550346895dd41286/html5/thumbnails/40.jpg)
Per Application Slowdown / Choice of Molding
[Figure panels: Application Slowdown; Choice of Molding Type]