Last time: Runtime infrastructure for hybrid (GPU-based) platforms Task scheduling
Extracting performance models at runtime
Memory management Asymmetric Distributed Shared Memory
StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst. TR-7240, INRIA, March 2010. [link]
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS’10 [pdf]
Today: Bridging runtime and language support – 'Virtualizing GPUs'
Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP'11 [pdf]
Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al., HPDC 2011 (best paper!)
Context: clouds shift to support HPC applications
Initially, tightly coupled applications were not well suited to cloud platforms.
Today: Chinese cloud with 40 Gbps InfiniBand; Amazon HPC instances; GPU instances: Amazon, Nimbix
Challenge: make GPUs a shared resource in the cloud.
Why do this? GPUs are costly resources
Multiple VMs on a node with a single GPU: increase utilization
App level: some apps might not use GPUs much; kernel level: some kernels can be collocated
1. The 'How?'
Preamble: Concurrent kernels are supported by today's GPUs. Each kernel can execute a different task. Tasks can be mapped to different streaming multiprocessors (using the thread-block configuration).
Problem: concurrent execution is limited to the set of kernels invoked within a single process context.
Past virtualization solutions: API rerouting / intercept library
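The limitation above can be made concrete with a small CUDA sketch (hypothetical kernels, assuming a device with concurrent-kernel support, i.e. Fermi or later): kernels launched in different streams of the *same* process context may overlap on different SMs, whereas kernels issued from separate processes belong to separate contexts and get serialized by the driver.

```cuda
#include <cstdio>

// Two trivial kernels standing in for independent tasks.
__global__ void taskA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void taskB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Two streams in ONE process context: the hardware may run
    // taskA and taskB concurrently on different SMs.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    taskA<<<n / 256, 256, 0, s1>>>(x, n);
    taskB<<<n / 256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();

    // The same two kernels issued from two *different* processes would
    // fall into different contexts and be time-sliced, not run
    // concurrently -- the gap a consolidation framework addresses.
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```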
2. Evaluation – The opportunity
Key assumption: under-utilization of GPUs
Sharing:
Space-sharing – kernels occupy different SMs
Time-sharing – kernels time-share the same SMs (benefit from hardware support for context switches). Note: resource conflicts may make this impossible.
Molding – change kernel configuration (different number of thread blocks / threads per block) to improve collocation
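Molding can be sketched as relaunching the same kernel under a different grid/block shape so that two collocated kernels fit on the device together. For this to be safe, the kernel must be written in grid-stride style so that any launch configuration computes the same result (the kernel and configurations below are illustrative assumptions, not the paper's code):

```cuda
#include <cstdio>

// Grid-stride kernel: correct under ANY <<<blocks, threads>>> shape,
// which is what makes molding possible.
__global__ void scale(float *x, int n, float a) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    // "Greedy" configuration: enough thread blocks to cover all elements.
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);

    // "Molded" configuration: fewer, smaller blocks, freeing SMs and
    // registers so another tenant's kernel can be collocated.
    scale<<<32, 128>>>(x, n, 2.0f);

    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

The trade-off: molding may slow the molded kernel down individually, but can raise aggregate throughput when two kernels share the GPU.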