Transcript of "HyVM – Hybrid Virtual Machines", ORNL, 2009-09-30
HyVM – Hybrid Virtual Machines
Ada Gavrilovska, Karsten Schwan, Sudha Yalamanchili, and many PhD students
Project Context
Future HPC applications on heterogeneous platforms:
• off-chip heterogeneity (e.g., the new NSF Track II machine) and, next, on-chip heterogeneous cores
• rich memory hierarchies, e.g., NUMA
New execution models:
• diverse cores
• custom OS kernels
• I/O support
Accelerators and tool chains:
• CUDA, OpenCL, …
• opportunity for compiler-based optimization methods
Future Applications
Media and image processing:
• dynamic web content
• 'Snapfish-like' imaging suite
Science and gaming:
• Fusion modeling
• High-performance I/O: GTC, Pixie3D, XGC
• Libraries: LAPACK, BLAS, VSIPL, …
Financial and risk analysis:
• Black-Scholes
• Risk analysis
• Derivatives processing
Data-intensive and web:
• Critical enterprise codes
• Data mining
• Sensor data processing
Heterogeneous Platforms
• Asymmetries:
– Performance: different clock speeds and duty cycles, differing issue widths, varying cache sizes
– NUMA memory
– Toward 'islands of cores'
• Functional differences:
– Diverse accelerators (heterogeneous): Cell, GPU, communications, encryption, …
– Shared ISA (asymmetric): missing SSE version, missing floating point, additional instructions for acceleration (Larrabee, …)
HyVM Project Goal
Uniform runtime model for heterogeneous platforms: Hybrid Virtual Machines
– Uniformity: software-based platform extensions:
• Virtual Execution Unit (VEU): uniform runtime representation for program executables, targeting heterogeneous cores (see the sketch after this slide)
• Heterogeneity-awareness: system-level management methods for improved platform utilization (incl. cache and energy) and application performance (SLAs)
• Dynamic platform emulation: runtime CK compilation or re-writing for diverse accelerator targets (via LLVM)
– High performance: diverse executables:
• Commodity and custom VEU 'containers': Virtual Machines (VMs) – processes/threads on commodity cores; Special Execution Environments (e.g., NVIDIA) – Computational Kernels (CKs) on accelerators
• Runtime and adaptive {CK} optimization for parallelism
• Standards-compliant CK programming and runtime APIs (OpenCL, CUDA)
• Compiler-based optimization techniques for {CK}
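To make the VEU notion above concrete, the following is a minimal, hypothetical C++ sketch of a VEU descriptor pairing executable code with its state and a target-core tag. The type and field names are illustrative assumptions, not the project's actual implementation.

// Hypothetical sketch of a Virtual Execution Unit (VEU) descriptor:
// a uniform runtime container of executable code plus state, tagged
// with the kind of core it targets. Names are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

enum class TargetCore { CommodityCpu, SharedIsaCpu, Gpu, EmulatedGpu };

struct VirtualExecutionUnit {
    std::string name;                  // e.g., a CK or process identifier
    TargetCore target;                 // which kind of core this VEU runs on
    std::vector<std::uint8_t> code;    // executable code (native, PTX, ...)
    std::vector<std::uint8_t> state;   // serialized register/memory state
};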
HyVM Project Elements
Attaining the uniform HyVM execution model
– Leveraging virtualization technologies:
• Virtual Execution Units (VEUs)
– Finer-grain schedulable entities than VCPUs
• Specialized execution environments (SEEs) for accelerators
– GViM for efficient GPU virtualization (Niraj Tolia, HP)
– Cellule: Dilma Silva, Jimi Xenidis, Hubertus Franke (IBM)
• Montage: dynamic resource management for sets of VEUs (SLA-awareness, runtime monitoring)
– Coordinated scheduling for accelerators – VMaCS
– InTuneS – system abstractions for coordinated management
– Cache-aware 'region' scheduling + correlation scheduling for shared-ISA VEUs – addressing NUMA and asymmetric platforms
• Future hypervisors (using Xen): heterogeneity-aware
– Extended work: scalable hypervisor structures
HyVM Project Elements
– Leveraging evolving industry standards for accelerator APIs and interactions:
• Tool chain support for CUDA => LLVM/PTX translation, with future work on OpenCL – Ocelot
• Exploiting compiler methods to optimize execution regimes for sets of VEUs – Harmony
• Runtime APIs at CUDA level – GViM
– Exploring new programming and execution models:
• Datalog to GPU compilation
• Scheduling via on-line performance models and dynamic code generation for GPU vs. host ISAs – Ocelot/LLVM
• New analysis tools for debugging and performance tuning via emulation – Ocelot++
[Figure: HyVM execution model – an application expressed as a dataflow of Computational Kernels (CK1–CK5) with DataIn and DataOut stages.]
[Figure: each kernel or thread becomes a Virtual Execution Unit (VEU) packaging executable code and state; the execution model(s) and OS structures schedule VEUs onto GPU, CPU, sCPU (shared ISA), and emulated GPU cores, exporting either a homogeneous or a heterogeneous view – a spectrum of exported heterogeneity.]
[Figure: spectrum of exported heterogeneity – a 'simple guest' sees a virtual platform of ordinary VCPUs, while an 'enhanced guest' also sees sVCPUs and aVCPUs; the virtual machine monitor plus management domain mediates platform interactions and cooperative export/use of platform information over a physical platform of GPU, Cell, CPU, and sCPU (shared ISA) cores. Guest requests are met by hypervisor responses such as fault-and-migrate from a CPU onto an sCPU, remapping a virtual CPU onto a different physical CPU, or Ocelot binary translation from one accelerator (acc2) to another (acc1).]
[Figure: HyVM software stack – guest applications comprise OpenCL/CUDA/Harmony kernel(s) for heterogeneous-ISA cores plus general-purpose or shared-ISA programs (threads/processes), each represented as activity units packaging executable code, state, and communication. The guest OS scheduler, the Harmony and OpenCL runtimes, GViM's CUDA API, and Cellule map these onto aVCPUs, VCPUs, and sVCPUs; the virtual machine monitor plus management domain plus SEEs provides Ocelot binary translation, admission control and monitoring, and federated + correlation + region scheduling over GPU, CPU, sCPU (shared ISA), and emulated GPU resources.]
Technical Elements – GViM and VMaCS
Vishakha Gupta, Niraj Tolia, others
Extending Xen for efficient {CK} execution
[Figure: GViM architecture – guest VMs run GPU applications against the CUDA API through a GPU frontend in each guest's Linux kernel; the management domain (Dom0) hosts a management extension, the GPU backend, and the real GPU driver alongside traditional device drivers; the Xen hypervisor manages general-purpose multicores, the accelerator pCPU (NVIDIA GPU), and other devices.] (a hedged sketch of this split-driver forwarding follows below)
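The following is a hypothetical C++ sketch of the split-driver structure in the figure above: a guest-side frontend wraps CUDA-level calls and forwards them as requests to a Dom0 backend that owns the real GPU driver. The request format and transport are assumptions made for illustration; GViM's actual mechanism (shared pages and Xen event channels) is not shown.

// Hypothetical split-driver forwarding: guest frontend enqueues GPU requests,
// a Dom0 backend thread drains them and calls the real driver on the guest's
// behalf. All names here are illustrative, not the actual GViM code.
#include <cstdint>
#include <deque>
#include <string>

struct GpuRequest {
    std::string op;          // e.g., "malloc", "memcpy", "launch"
    std::uint64_t arg;       // size, pointer, or kernel handle
    int guestId;             // which VM issued the request
};

// Stand-in for the shared ring between guest frontend and Dom0 backend.
std::deque<GpuRequest> requestRing;

// Guest-side frontend: applications call this instead of the real driver.
void frontendSubmit(int guestId, const std::string& op, std::uint64_t arg) {
    requestRing.push_back({op, arg, guestId});
}

// Dom0 backend: a polling thread drains the ring and would invoke the real
// CUDA/GPU driver, returning results to the issuing guest.
void backendPollOnce() {
    while (!requestRing.empty()) {
        GpuRequest r = requestRing.front();
        requestRing.pop_front();
        (void)r;  // real backend would dispatch r.op to the NVIDIA driver
    }
}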
Technical Elements – GViM + VMaCS
[Figure: GViM represents VEUs on gVCPUs; VMaCS adds management extensions for coordinated scheduling. A polling domain in Dom0 runs resource-management logic with thread(s) from the backend, servicing per-domain request queues and accelerator ready queues; guest VMs issue CUDA API calls through GPU frontends that are forwarded to the GPU backend and GPU driver in Dom0, while the Xen hypervisor manages general-purpose multicores, the compute accelerator (NVIDIA GPU), and traditional devices.]
Technical Elements – GViM: GPU Guest Performance – Baseline Results
[Figure: bar chart of total CUDA execution time in msec (without host data assignment) for native Linux, Dom0, and guest VM, across the micro-benchmarks Bitonic (512), Matrix (48x48), MersenneTwister (12000), and BlackScholes (30000), with dataset values noted.]
Less than 15% slowdown in the worst case; further improvements made/underway.
VMaCS Scheduling Methods – Examples
• Round robin (RR)
– Some guest domain is chosen every period
– The poller monitors its queues for the period
– Then moves to the next domain
• XenoCredit
– Guest credits assigned at boot time
– These credits are used to calculate proportions
– Domains are polled for time proportional to their credits
(see the sketch below)
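To make the two polling policies concrete, here is a minimal, hypothetical C++ sketch of round-robin vs. credit-proportional ("XenoCredit"-style) polling. The structures, helper names, and time quanta are illustrative assumptions, not the actual VMaCS implementation.

// Hypothetical sketch of VMaCS-style polling policies: round robin gives
// every domain the same slice; the credit-based policy polls each domain
// for time proportional to its credits.
#include <chrono>
#include <queue>
#include <string>
#include <vector>

struct GuestDomain {
    std::string name;
    int credits;                       // assigned at guest boot time
    std::queue<int> requestQueue;      // pending accelerator requests
};

// Poll one domain's request queue for the given slice, forwarding any
// queued accelerator requests to the backend (stand-in: just dequeue).
void pollDomain(GuestDomain& d, std::chrono::milliseconds slice) {
    auto deadline = std::chrono::steady_clock::now() + slice;
    while (std::chrono::steady_clock::now() < deadline && !d.requestQueue.empty())
        d.requestQueue.pop();          // real code would dispatch to the GPU backend
}

// Credit-proportional policy: each domain is polled for period * credits / total.
void creditProportionalRound(std::vector<GuestDomain>& domains,
                             std::chrono::milliseconds period) {
    int totalCredits = 0;
    for (const auto& d : domains) totalCredits += d.credits;
    if (totalCredits == 0) return;
    for (auto& d : domains)
        pollDomain(d, period * d.credits / totalCredits);
}

// Round robin: every domain gets the same fixed slice each period.
void roundRobinRound(std::vector<GuestDomain>& domains,
                     std::chrono::milliseconds slice) {
    for (auto& d : domains) pollDomain(d, slice);
}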
BlackScholes: XenoCredit vs. Round Robin
• BlackScholes with 60000 options and 4096 iterations
• Host with 2 GPUs and 4 guest VMs
Motivates new methods for resource management in accelerator-based systems – now being completed
Technical Elements – Region and Correlation Scheduling
Min Lee, Vishakha Gupta, Rob Knauerhase
• Targeting asymmetric multicore platforms
• Using Intel quad-core and Nehalem server architectures
• Region scheduling – dealing with the NUCA => NUMA nature of future manycore systems + NUMAFix (Dulloor Rao)
• Correlation scheduling – dealing with asymmetric codes with shared ISA
Region Scheduling: Cache-Aware Scheduling for CMPs
[Figure: tasks A–D with their current working sets; grouping them as (A, B) and (C, D) on two shared caches leaves one cache over-utilized and the other under-utilized (bad), while grouping (A, C) and (B, D) fully utilizes both caches (good).]
• Issues
– Working set estimation – size, constituent elements (unique and shared)
• Approach
– Runtime 'region' determination to capture these working sets
Region Scheduling – Basic Idea
• Partition physical memory into regions
• Schedule regions on caches
• A core can access only the allowed regions
– System-level enforcement (using page tables)
• Private/shared regions – we focus on private regions
(see the sketch below)
[Figure: physical pages P1–P19 grouped into private and shared regions R1(2) through R7(2), each labeled Region(Size).]
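The following is a hypothetical C++ sketch of the region-scheduling idea above: physical pages are grouped into regions, regions are scheduled onto cache domains, and a core may touch only pages in regions currently scheduled on its cache. The names and granularity are illustrative; the actual system enforces access at the hypervisor level by editing page tables.

// Hypothetical region-scheduling sketch: regions of physical pages are
// mapped onto cache domains, and accesses outside the allowed regions
// would fault, letting the scheduler react.
#include <cstddef>
#include <set>
#include <vector>

using PageId = std::size_t;

struct Region {
    std::vector<PageId> pages;   // physical pages belonging to this region
    bool isPrivate = true;       // private vs. shared region
};

struct CacheDomain {
    std::set<std::size_t> scheduledRegions;  // region indices currently mapped
};

// Schedule a region onto a cache domain; a real system would update the
// page tables of the cores sharing that cache.
void scheduleRegion(CacheDomain& cache, std::size_t regionIdx) {
    cache.scheduledRegions.insert(regionIdx);
}

// Check whether a core attached to 'cache' may access 'page'.
bool accessAllowed(const CacheDomain& cache,
                   const std::vector<Region>& regions, PageId page) {
    for (std::size_t r : cache.scheduledRegions)
        for (PageId p : regions[r].pages)
            if (p == page) return true;
    return false;
}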
Correlation Scheduling: Mapping to Asymmetric Cores
The asymmetry-aware hypervisor supports asymmetric cores and NUMA memory.
[Flowchart: VCPU admission – when a new VCPU arrives, check whether any correlation information is available; if so, check the availability of the given (most correlated) PCPU: if it is not overloaded, add the VCPU to that PCPU; if it is overloaded but occupied by relocatable VCPU(s), shuffle VCPUs; otherwise move to the next most correlated PCPU, and if all are busy, find the best feasible match.]
(see the sketch below)
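The admission flow can be sketched in C++ as follows. The data structures, threshold, and helper names are illustrative assumptions, not the actual hypervisor implementation.

// Hypothetical sketch of the VCPU admission flow above: walk the PCPUs in
// correlation order, place on the first non-overloaded one, shuffle
// relocatable VCPUs if possible, otherwise fall back to a best feasible match.
#include <cstddef>
#include <vector>

struct Pcpu {
    std::size_t load = 0;            // e.g., number of runnable VCPUs
    bool hasRelocatableVcpus = false;
};

constexpr std::size_t kOverloadThreshold = 2;   // assumed threshold

// Returns the index of the PCPU the new VCPU is admitted to.
// 'correlatedOrder' lists PCPUs from most to least correlated with this VCPU
// (empty if no correlation information is available). Assumes pcpus is non-empty.
std::size_t admitVcpu(std::vector<Pcpu>& pcpus,
                      const std::vector<std::size_t>& correlatedOrder) {
    auto bestFeasibleMatch = [&] {          // fallback: least-loaded PCPU
        std::size_t best = 0;
        for (std::size_t i = 1; i < pcpus.size(); ++i)
            if (pcpus[i].load < pcpus[best].load) best = i;
        return best;
    };

    if (correlatedOrder.empty())            // no correlation information
        return bestFeasibleMatch();

    for (std::size_t idx : correlatedOrder) {
        Pcpu& p = pcpus[idx];
        if (p.load < kOverloadThreshold)    // not overloaded: place here
            return idx;
        if (p.hasRelocatableVcpus) {        // overloaded but shuffleable
            p.hasRelocatableVcpus = false;  // stand-in for shuffling VCPUs away
            return idx;
        }
        // otherwise try the next most correlated PCPU
    }
    return bestFeasibleMatch();             // all correlated PCPUs are busy
}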
Technical Elements – 'Islands of Cores'
Priyanka Tembey, Dulloor Rao, Mukil Kesavan
Motivation: many-core platforms
• Multiple execution and scheduling contexts: 'islands'
• differ in capabilities and micro-architectures
• have NUMA memory characteristics
• Interconnection variants – current platforms
• PCIe, shared caches, fast point-to-point interconnects (QPI/HyperTransport)
• Interconnects – future platforms
• complex cache structures (!)
• PGAS memory with integrated memory controllers/interconnects?
Technical Elements – Islands of Cores
• InTuneS: Priyanka Tembey
– Inter scheduler-island lightweight event channel (a hedged sketch follows below)
– Generic yet expressive definition of inter-island message semantics
– Gray-box scheduler coordination interface
• scheduler implementation-agnostic API
– Application/VM-scheduler interface for explicit SLA policy enforcement
– Application/VM monitoring for implicit observation-based policy enforcement
– Evaluation with a complex scheduler (Linux) and the hypervisor scheduler (HV in progress), initially using a communication accelerator
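As a rough illustration of what an inter-island event channel might look like, here is a hypothetical C++ sketch. The message fields and enum values are assumptions made for illustration, not the actual InTuneS interface.

// Hypothetical inter-island event channel: schedulers on different islands
// exchange small, implementation-agnostic messages (gray-box coordination).
#include <cstdint>
#include <deque>

enum class IslandMsgType : std::uint8_t {
    LoadReport,        // implicit, observation-based information
    SlaViolation,      // explicit SLA policy enforcement request
    MigrateHint        // suggestion to move work across islands
};

struct IslandMsg {
    IslandMsgType type;
    std::uint16_t srcIsland;    // scheduling context that produced the event
    std::uint16_t dstIsland;    // scheduling context that should react
    std::uint64_t payload;      // e.g., load value, VM id, or SLA metric
};

class EventChannel {
public:
    void post(const IslandMsg& m) { queue_.push_back(m); }
    bool poll(IslandMsg& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }
private:
    std::deque<IslandMsg> queue_;
};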
• Scalable hypervisor structures: Mukil Kesavan and Dulloor Rao
– Hypervisor structure should match 'islands of cores'
– e.g., consider Xen 'Dom0' bundling of functionality
⇒ offer solutions for isolation with bundled functionality
⇒ find ways to efficiently run many 'small', specialized domains, each with different tasks
Technical Elements – Creating, Optimizing, and Re-targeting VEUs via Harmony and Ocelot
Gregory Diamos, Andrew Kerr
[Figure: a Harmony application – readInputs(); computeInvariants(); for all chunks { simulateChunk(); } generateResults(); – is decomposed into kernels operating on memory chunks with explicit inputs and outputs; the Harmony run-time performs transparent scheduling and execution management of the resulting VEUs and chunks over CPUs and accelerators (each with local memory/cache, DMA engines, and FIFOs) connected by a network such as HyperTransport, QPI, or PCIe.]
• Binary-compatible VEUs across the platform
• Minimize/avoid re-tuning and porting applications as accelerators are added
• Advanced optimizations: speculation, performance prediction, kernel fusion
(see the sketch below)
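The Harmony-style execution model pictured above can be sketched in C++ as a dependency-driven dispatch loop: kernels declare the chunks they read and write, and the runtime launches a kernel once its inputs are ready, on whichever backend it chooses. All names are illustrative; this is not the actual Harmony API.

// Hypothetical Harmony-style dispatch: run kernels in dependency order,
// picking a backend (CPU or accelerator) per kernel.
#include <functional>
#include <set>
#include <string>
#include <vector>

enum class Backend { Cpu, Gpu };

struct Kernel {
    std::string name;
    std::set<std::string> inputs;     // chunks this kernel reads
    std::set<std::string> outputs;    // chunks this kernel produces
    std::function<void(Backend)> run; // backend-specific implementation
};

// Dispatch kernels as their input chunks become available.
void runHarmonyLike(std::vector<Kernel> kernels,
                    std::set<std::string> readyChunks,
                    Backend preferred) {
    bool progress = true;
    while (!kernels.empty() && progress) {
        progress = false;
        for (auto it = kernels.begin(); it != kernels.end();) {
            bool ready = true;
            for (const auto& in : it->inputs)
                if (!readyChunks.count(in)) { ready = false; break; }
            if (ready) {
                it->run(preferred);                               // execute the VEU
                readyChunks.insert(it->outputs.begin(), it->outputs.end());
                it = kernels.erase(it);
                progress = true;
            } else {
                ++it;
            }
        }
    }
}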
Scalable Portable Execution – Harmony Runtime
Current version: single node / multi-GPU. Future extensions of Harmony functionality will operate across multiple nodes. Intent:
1. Decision models for remote vs. local execution
2. Transparent deployment and execution of GPU kernels on remote GPUs
3. Inclusion of performance optimizations, including:
i) speculative kernel execution,
ii) data movement, and
iii) memory footprint vs. concurrency tradeoffs
Problem Scaling – Risk Analysis Application
[Figure: measured execution times; annotation: 'GPU interaction overhead dominates'. Harmony distributes the work across 4 CPU cores, 2 slow GPU boards, and 1 fast GPU board.]
Scaling of GPGPU Applications
Average applications can scale to 1015 cores, each with 180-element-wide SIMD units (measured using the Ocelot CUDA emulator).
Ocelot: Dynamic Execution Infrastructure
[Figure: language front ends for C and emerging environments such as Datalog (status: in progress; the Datalog front-end retarget is moving forward with LogicBlox Inc.) produce a kernel IR based on the NVIDIA virtual ISA; Ocelot consumes this IR via a CUDA JIT, an emulator (status: released), and an LLVM interface (status: in progress, Fall 2009) targeting supported ISAs (MIPS, SPARC, x86, etc.); the Harmony run time (status: single node/multi-GPU, being released) sits on top.]
Future Directions
• Why HyVMs: future applications:
– Media-rich web applications
• online 'photoshop' (with HP)
• dynamic stream customization (with Motorola)
– Financial and HPC codes
• Benchmarks (e.g., Black-Scholes – IBM, HPC – NVIDIA, Interactive – Intel)
• Petascale codes (with DOE)
• Petascale I/O (with DOE)
• From platforms to clusters to large-scale data centers:
– Distributed execution model for petascale machines
– Other execution models: data-intensive (Hadoop, System S, …)
• Future platforms:
– Power/performance tradeoffs and implications
– Memory hierarchies and their effects (on- vs. off-chip)
– Alternative memory models – e.g., PGAS
– Massively parallel HyVMs – e.g., 1000s of lightweight domains
– Tool chains: future standards (e.g., OpenCL), compiler research (other faculty)
Broader Context – CERCS Research Center
• CERCS – Center for Experimental Research in Computer Systems (www.cercs.gatech.edu)
– NSF center
– Member companies include HP, IBM, Intel, and Atlanta companies like LogicBlox and ICE, plus several others
– ORNL key partners: Steve Poole, Jeff Nichols, Barney Maccabe, Scott Klasky
• Opportunities for large-scale experimentation:
– New award for a large-scale hybrid cluster machine: NSF Track II award, Vetter PI, Schwan and Yalamanchili Co-PIs, first machine to be built Winter 2010
• provides additional funding to explore a 'distributed execution environment' on the high-end machine and for 'distributed performance emulation of multi-GPU machines'
• Project growth and impact through CERCS efforts and faculty:
– Involving other faculty:
• Santosh Pande – Glimpses – finding accelerator functions in application codes (NSF)
• Hyesoon Kim – Prospector – precise performance (and power) estimation for next-generation accelerators (with additional support from Intel – Larrabee – NVIDIA, and NSF) – IBM ArchPIC – Valentina Salapura
• Nate Clark – program analysis (Google)
• Matt Wolf – I/O acceleration for petascale machines (DOE ORNL)
– Working with technology and application partners:
• IBM: financial and HPC apps., focus on system-level resource management
• NVIDIA: student fellowships – Harmony, Ocelot – and equipment donations
• Intel: exploring asymmetric architectures
• HP: CUDA-based virtualization, coordinated scheduling, data center and web applications
• LogicBlox: accelerating Datalog applications used in retail management and insurance risk estimation
– Additional research support:
• NSF Track II award – Oct. 2009
• NSF Modeling and Simulation award (recommended), October 2009 <integrating Ocelot with QEMU>
• NSF SHF program award – Sept. 2009
• DOE and NSF: I/O for petascale machines, Jan. 2009
• Steve Poole: power/performance tradeoffs
• HP: coordinated scheduling and large-scale data center management, Aug. 2008
• Intel: IA-compatible VMs for heterogeneous many-core systems, since 2007
– Linkage to other projects:
• The Energy Stack
Publications and Resources
Project web page: http://www.cercs.gatech.edu/projects/HyVM/
Recent publications:
– Priyanka Tembey, Anish Bhatt, Dulloor Rao, Ada Gavrilovska, Karsten Schwan, "Flexible Classification on Heterogeneous Multicore Appliance Platforms", ICCCN'08, Aug. 2008.
– A. Kerr, G. Diamos, and S. Yalamanchili, "A Characterization and Analysis of PTX Kernels", to appear in the 2009 IEEE International Symposium on Workload Characterization, October 2009.
– V. Gupta, J. Xenidis, P. Tembey, K. Schwan, A. Gavrilovska, "Cellule: Lightweight Execution Environment for Virtualized Accelerators", poster session, SuperComputing'09, Nov. 2009.
– G. Diamos and S. Yalamanchili, "Kernel Level Speculation", under submission.
– V. Gupta, J. Xenidis, K. Schwan, "Cellule: Lightweight Execution Environment for Virtualized Cell B.E.", under submission.
– Min Lee and Karsten Schwan, "Region Scheduling: Virtual Machine Scheduling for Next Generation Multicore Platforms", in preparation.
– Vishakha Gupta, Rob Knauerhase, and Karsten Schwan, "Asymmetric Hypervisors", in preparation.
– Priyanka Tembey, Ada Gavrilovska, and Karsten Schwan, "InTuneS: System Abstractions for Coordinated Scheduling", in preparation.
Software:
– Ocelot: available at http://code.google.com/p/gpuocelot/, including documentation.
– Harmony programming model: http://www.ece.gatech.edu/research/labs/casl/harmony/hpdc5hot-diamos.pdf