S7281: Device Lending: Dynamic Sharing of GPUs in a...
Transcript of S7281: Device Lending: Dynamic Sharing of GPUs in a...
S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster
Jonas MarkussenPhD studentSimula Research Laboratory
• Motivation
• PCIe Overview
• Non-Transparent Bridges
• Device Lending
Outline
Distributed applications may need to access and use IO resources that are physically located inside remote hosts
Front-end
Interconnect
... ......
...Control+Signaling+
Data
ComputenodeComputenodeComputenode
…… …
• rCUDA
• CUDA-awareOpenMPI
• CustomGPUDirect RDMAimplementation
• ...
Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications
Front-end
Logicalviewofresources
…
…
…
…
…
…
...Control+Signaling+
Data
Handled in software
Local
Remote
Application Application
Localresource Remoteresourceusingmiddleware
CUDAlibrary+driver CUDA– middlewareintegration
PCIe IObus
CUDAdriver
Interconnecttransport(RDMA)
Interconnecttransport(RDMA)
Middlewareservice/daemon
Interconnect
Middlewareservice
PCIe IObus
In PCIe clusters, the same fabric is used both as local IO bus within a single node and as the interconnect between separate nodes
ExternalPCIecable
PCIe interconnectswitch RAM Memorybus
PCIe bus
PCIe interconnecthostadapter
PCIe IOdevice
CPUandchipset
Interconnectswitch
Local
Remote
Application
Localresource
CUDAlibrary+driver
PCIe IObus
PCIe IObus
PCIe-basedinterconnect
Remoteresourceovernativefabric
Application
CUDAlibrary+driver
PCIe IObus
PCIe Overview
PCIe is the dominant IO bus technology in computers today, and can also be used as a high-bandwidth low-latency interconnect
PCI-SIG.PCIExpress3.1BaseSpecification,2010.http://www.eetimes.com/document.asp?doc_id=1259778
0
5
10
15
20
25
30
35
Gen2 Gen3 Gen4
Gigabytesp
erse
cond
(GB/s)
PCIex4PCIex8PCIex16
Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric depending on the address
RAM
PCIe device PCIe device
PCIe device
CPUandchipset
• Upstream• Downstream• Peer-to-peer(shortestpath)
IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices
RAM
PCIe device PCIe device
PCIe device
CPUandchipset
Addressspace0x00000…
0xFFFFF…
IOdevice
IOdevice
IOdevice
Interruptvecs
• Memory-mappedIO(MMIO/PIO)• DirectMemoryAccess(DMA)• Message-SignaledInterrupts(MSI-X)
RAM
0xfee00xxx
Non-Transparent Bridges
RAM
Remote address space can be mapped into local address space by using PCIe Non-Transparent Bridges (NTBs)
PCIe NTBadapter
CPUandchipset
PCIe NTBadapter
CPUandchipset
Addressspace
NTB
LocalhostRemotehost
LocalRAM
Local Remote0xf000 0x9000
. . . . . .
NTBaddr mapping
RAM
Using NTBs, each node in the cluster take part in a shared address space and have their own “window” into the global address space
NTB-basedinterconnect
Globaladdr space
A B C
Addr spaceinB
Addr spaceinC
LocalRAM
LocalIOdevices
Globaladdr space
A’saddr space
LocalRAM
LocalIOdevices
C’saddr space
Exportedaddressrange
Addr spaceinA
Device Lending
A remote IO device can be “borrowed” by mapping it into local address space, making it appear locally installed in the system
RAM
NTBadapter0xe000
CPUandchipset
NTBadapter0x1000
CPUandchipset
RAM
Physicaldevice0xb000
Inserteddevice0x2000
Devicedriver
Remote Local0xb000 0x2000
. . . . . .
NTBaddr mapping
Owner Borrower
PCIe hot-plug
By intercepting DMA API calls to set up IOMMU mappings and inject reverse NTB mappings, physical location is completely transparent
RAM
NTBadapter0xe000
CPUandchipset
NTBadapter0x1000
CPUandchipset
RAM
Physicaldevice0xb000
Inserteddevice0x2000
Devicedriver
Local Remote0xf000 0x5000
. . . . . .
NTBaddr mapping
Owner Borrowerdma_addr = dma_map_page(0x9000);
IOMMU
IOV Phys0x5000 0x9000
. . . . . .
Use addr0xf000
Local
Remote
Application
Borrowed remoteresource
CUDAlibrary+driver
PCIe IObus
PCIe IObus
PCIe NTBinterconnect
Unmodifiedlocaldriver(withhot-plugsupport)
ResourceappearslocaltoOS,driver,andapp
Hardwaremappingsensurefastdatapath
WorkswithanyPCIe device(evenindividualSR-IOVfunctions)
Local
Remote
Application Application
Borrowed remoteresource Remoteresourceusingmiddleware
CUDAlibrary+driver CUDA– middlewareintegration
CUDAdriver
Interconnecttransport(RDMA)
Interconnecttransport(RDMA)
Middlewareservice/daemon
Interconnect
Middlewareservice
PCIe IObus
PCIe IObus
PCIe IObus
PCIe NTBinterconnect
Local
Remote
Application Application
Borrowed remoteresource Localresource
CUDAlibrary+driver
PCIe IObus
PCIe IObus
PCIe NTBinterconnect
CUDAlibrary+driver
PCIe IObus
0
2
4
6
8
10
12
14
4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB 1 MB 2 MB 4 MB 8 MB 16 MB
Gigabytesp
erse
cond
(GB/s)
Transfersize
bandwidthTest(Local) bandwidthTest(Borrowed) PXH830DMA(GPUDirectRDMA)
1. Nvidia CUDA8.0SamplesbandwidthTest2. GPUDirect RDMAbenchmarkusingDolphinNTBDMA
https://github.com/Dolphinics/cuda-rdma-bench
Device-to-host memory transfer
GPU: Quadro P400 Nvidia driver: Version 375.26(Centos7)
CPU: XeonE5-1630 3.7 GHz Memory: DDR42133MHz
Devicepool
Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices
TaskA TaskB TaskC
GPU
SSD SSD SSD
NIC FPGA GPU
GPU GPU GPU
CPU+chipsetRAM
CPU+chipsetRAM
CPU+chipsetRAM
NTB
NTB
NTB
SSDSSDSSD
NIC
FPGA
TaskB
TaskA
GPU GPU GPU SSD
FPGA NIC
GPU SSD
SSD
GPUGPUGPUTaskC
EIR – Efficient computer aided diagnosis framework for gastrointestinal examination
Examinationroom Examinationroom
Serverroomhttp://mlab.no/blog/2016/12/eir/
Moving forward
• Strategy-based management
• Fail-over mechanisms
• VFIO and other API integration (“SmartIO”)
• Borrowing vGPU functions
Thank you!“Device Lending in PCI Express Networks”ACM NOSSDAV 2016
“Efficient Processing of Video in a Multi Auditory Environment using Device Lending of GPUs”ACM Multimedia Systems 2016 (MMSys’16)
“PCIe Device Lending”University of Oslo 2015
DeviceLendingdemoandmoreVisit Dolphin in exhibition area (booth 625)
MyemailaddressSelected
publications