Xen and the Art of Virtualization
Ian Pratt
University of Cambridge and Founder of XenSource Inc.
Computer Laboratory
Outline
Virtualization Overview
Xen Today : Xen 2.0 Overview
Architecture
Performance
Live VM Relocation
Xen 3.0 features (Q3 2005)
Research Roadmap
Virtualization Overview
Single OS image: Virtuozzo, Vservers, Zones
Group user processes into resource containers
Hard to get strong isolation
Full virtualization: VMware, VirtualPC, QEMU
Run multiple unmodified guest OSes
Hard to efficiently virtualize x86
Para-virtualization: UML, Xen
Run multiple guest OSes ported to special arch
Arch Xen/x86 is very close to normal x86
Virtualization in the Enterprise
Consolidate under-utilized servers to reduce CapEx and OpEx
Avoid downtime with VM Relocation
Dynamically re-balance workload to guarantee application SLAs
Enforce security policy
Xen Today : 2.0 Features
Secure isolation between VMs
Resource control and QoS
Only guest kernel needs to be ported
All user-level apps and libraries run unmodified
Linux 2.4/2.6, NetBSD, FreeBSD, Plan9
Execution performance is close to native
Supports the same hardware as Linux x86
Live Relocation of VMs between Xen nodes
Para-Virtualization in Xen
Arch xen_x86 : like x86, but Xen hypercalls required for privileged operations
Avoids binary rewriting
Minimize number of privilege transitions into Xen
Modifications relatively simple and self-contained
Modify kernel to understand virtualised env.
Wall-clock time vs. virtual processor time
• Xen provides both types of alarm timer
Expose real resource availability
• Enables OS to optimise behaviour
x86 CPU virtualization
Xen runs in ring 0 (most privileged)
Ring 1/2 for guest OS, 3 for user-space
GPF if guest attempts to use privileged instr
Xen lives in top 64MB of linear addr space
Segmentation used to protect Xen as switching page tables too slow on standard x86
Hypercalls jump to Xen in ring 0
Guest OS may install ‘fast trap’ handler
Direct user-space to guest OS system calls
MMU virtualisation: shadow vs. direct-mode
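MMU Virtualization : Shadow-Mode
[Diagram: the guest OS reads and writes its own Virtual → Pseudo-physical page tables. The VMM propagates updates into the Virtual → Machine tables used by the hardware MMU, and accessed and dirty bits are passed back.]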
MMU Virtualization : Direct-Mode
[Diagram: the guest OS reads and writes Virtual → Machine page tables, which the hardware MMU uses directly; guest writes pass through the Xen VMM.]
Para-Virtualizing the MMU
Guest OSes allocate and manage own PTs
Hypercall to change PT base
Xen must validate PT updates before use
Allows incremental updates, avoids revalidation
Validation rules applied to each PTE:
1. Guest may only map pages it owns*
2. Pagetable pages may only be mapped RO
Xen traps PTE updates and emulates, or ‘unhooks’ PTE page for bulk updates
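The two validation rules can be sketched as a check applied to each proposed page-table entry. This is an illustrative Python model, not Xen's code: the `Frame` record, the `validate_pte` function and the `frames` ownership map are hypothetical names, and the exceptions hinted at by the asterisk on rule 1 are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    owner: int                  # id of the domain that owns this machine frame
    is_pagetable: bool = False  # True if the frame currently holds a page table


def validate_pte(domain, frames, target_mfn, writable):
    """Return True if `domain` may install a PTE that maps `target_mfn`."""
    frame = frames.get(target_mfn)
    if frame is None or frame.owner != domain:
        return False  # rule 1: a guest may only map pages it owns
    if frame.is_pagetable and writable:
        return False  # rule 2: page-table pages may only be mapped read-only
    return True


# Hypothetical machine frames: domain 1 owns frames 10 and 11, domain 2 owns 20.
frames = {
    10: Frame(owner=1),
    11: Frame(owner=1, is_pagetable=True),
    20: Frame(owner=2),
}
```

Under these assumptions, domain 1 can map frame 10 writable and frame 11 read-only, but not frame 11 writable and not frame 20 at all.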
MMU Micro-Benchmarks
[Figure: bar charts, axis 0.0 to 1.1, for Page fault (µs) and Process fork (µs).]
lmbench results on Linux (L), Xen (X), VMWare Workstation (V), and UML (U)
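Queued Update Interface (Xen 1.2)
[Diagram: the guest OS reads its Virtual → Machine page tables directly; guest writes pass to the Xen VMM, which validates them before they reach the tables used by the hardware MMU.]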
Writeable Page Tables : 1 – write fault
[Diagram: the first guest write to a Virtual → Machine page-table page causes a page fault into the Xen VMM; guest reads are direct.]
Writeable Page Tables : 2 - Unhook
[Diagram: the page-table page is unhooked (marked X) from the tables used by the MMU; guest writes now go to it directly.]
Writeable Page Tables : 3 - First Use
[Diagram: the first use of the unhooked page (marked X) causes a page fault into the Xen VMM.]
Writeable Page Tables : 4 – Re-hook
[Diagram: Xen validates the page and re-hooks it into the tables used by the MMU.]
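The four steps (write fault, unhook, first use, re-hook) can be traced with a small state machine. This is a sketch under simplifying assumptions, not Xen's implementation: the `WritablePageTable` class is hypothetical, validation is reduced to an ownership check, and entries that fail validation are simply dropped here, which the slides do not specify.

```python
class WritablePageTable:
    """Toy model of one page-table page under the writable page table scheme."""

    def __init__(self, owned_frames):
        self.owned = set(owned_frames)  # frames the guest is allowed to map
        self.entries = {}               # slot -> machine frame number
        self.hooked = True              # True while the MMU can walk this page
        self.events = []                # trace of what the VMM did

    def guest_write(self, slot, mfn):
        if self.hooked:
            # Step 1: the page is mapped read-only, so the first write faults.
            self.events.append("write fault")
            # Step 2: unhook the page so the guest can write to it freely.
            self.hooked = False
            self.events.append("unhook")
        self.entries[slot] = mfn  # later writes proceed without trapping

    def translate(self, slot):
        if not self.hooked:
            # Step 3: first use of an address covered by the unhooked page.
            self.events.append("page fault")
            # Step 4: validate all entries in bulk, then re-hook the page.
            self.entries = {s: m for s, m in self.entries.items() if m in self.owned}
            self.hooked = True
            self.events.append("validate + re-hook")
        return self.entries.get(slot)


pt = WritablePageTable(owned_frames={1, 2})
pt.guest_write(0, 1)
pt.guest_write(1, 2)
pt.guest_write(2, 99)  # frame 99 is not owned, so validation rejects it
```

The point of the design is visible in the trace: three guest writes cost one write fault and one validation pass, rather than one trap per write.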
I/O Architecture
Xen IO-Spaces delegate guest OSes protected access to specified h/w devices
Virtual PCI configuration space
Virtual interrupts
Devices are virtualised and exported to other VMs via Device Channels
Safe asynchronous shared memory transport
‘Backend’ drivers export to ‘frontend’ drivers
Net: use normal bridging, routing, iptables
Block: export any blk dev e.g. sda4,loop0,vg3
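One common way to build an asynchronous shared-memory transport is a pair of fixed-size rings with producer and consumer indices, one for requests and one for responses. The slide does not specify the layout, so the `Ring` class and `backend_poll` function below are assumptions made for illustration; a real transport would also need memory barriers and event notification.

```python
class Ring:
    """Fixed-size single-producer, single-consumer ring (illustrative only)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.prod = 0  # total items ever produced
        self.cons = 0  # total items ever consumed

    def full(self):
        return self.prod - self.cons == len(self.slots)

    def empty(self):
        return self.prod == self.cons

    def push(self, item):
        if self.full():
            return False  # producer must retry later; nothing is overwritten
        self.slots[self.prod % len(self.slots)] = item
        self.prod += 1
        return True

    def pop(self):
        if self.empty():
            return None
        item = self.slots[self.cons % len(self.slots)]
        self.cons += 1
        return item


def backend_poll(requests, responses):
    """Backend driver: consume pending requests and post one response each."""
    handled = 0
    while not requests.empty() and not responses.full():
        responses.push(("done", requests.pop()))
        handled += 1
    return handled


requests, responses = Ring(4), Ring(4)
for sector in (7, 8, 9):
    requests.push(("read", sector))  # frontend queues block requests
backend_poll(requests, responses)    # backend services them when it runs
```

Because each side only advances its own index, the frontend can queue several requests before the backend runs, which is what makes the transport asynchronous.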
Xen 2.0 Architecture
[Diagram, top to bottom:
VM0: Device Manager & Control s/w; GuestOS (XenLinux); Native Device Driver; Back-End
VM1: Unmodified User Software; GuestOS (XenLinux); Native Device Driver; Back-End
VM2: Unmodified User Software; GuestOS (XenLinux); Front-End Device Drivers
VM3: Unmodified User Software; GuestOS (XenBSD); Front-End Device Drivers
Xen Virtual Machine Monitor: Control IF, Safe HW IF, Event Channel, Virtual CPU, Virtual MMU
Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)]
System Performance
[Figure: bar charts, axis 0.0 to 1.1, for SPEC INT2000 (score), Linux build time (s), OSDB-OLTP (tup/s), and SPEC WEB99 (score).]
Benchmark suite running on Linux (L), Xen (X), VMware Workstation (V), and UML (U)
TCP results
[Figure: bar charts, axis 0.0 to 1.1, for Tx, MTU 1500 (Mbps); Rx, MTU 1500 (Mbps); Tx, MTU 500 (Mbps); and Rx, MTU 500 (Mbps).]
TCP bandwidth on Linux (L), Xen (X), VMWare Workstation (V), and UML (U)
Scalability
[Figure: bar chart, axis 0 to 1000, for 2, 4, 8 and 16 instances.]
Simultaneous SPEC WEB99 Instances on Linux (L) and Xen(X)
Xen 3.0 Architecture
[Diagram, top to bottom:
VM0: Device Manager & Control s/w; GuestOS (XenLinux); Native Device Driver; Back-End
VM1: Unmodified User Software; GuestOS (XenLinux); Native Device Driver; Back-End
VM2: Unmodified User Software; GuestOS (XenLinux); Front-End Device Drivers
VM3: Unmodified User Software; Unmodified GuestOS (WinXP); Front-End Device Drivers
Xen Virtual Machine Monitor: Control IF, Safe HW IF, Event Channel, Virtual CPU, Virtual MMU
Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)
New in 3.0: VT-x, 32/64bit, AGP, ACPI, PCI, SMP]
3.0 Headline Features
AGP/DRM graphics support
Improved ACPI platform support
Support for SMP guests
x86_64 support
Intel VT-x support for unmodified guests
Enhanced control and management tools
IA64 and Power support, PAE
x86_64
Intel EM64T and AMD Opteron
Requires different approach to x86 32 bit:
Can’t use segmentation to protect Xen from guest OS kernels as no segment limits
Switch page tables between kernel and user
• Not too painful thanks to Opteron TLB flush filter
Large VA space offers other optimisations
Current design supports up to 8TB mem
SMP Guest OSes
Takes great care to get good performance while remaining secure
Paravirtualized approach yields many important benefits
Avoids many virtual IPIs
Enables ‘bad preemption’ avoidance
Auto hot plug/unplug of CPUs
SMP scheduling is a tricky problem
Strict gang scheduling leads to wasted cycles
VT-x / Pacifica
Will enable Guest OSes to be run without paravirtualization modifications
E.g. Windows XP/2003
CPU provides traps for certain privileged instrs
Shadow page tables used to provide MMU virtualization
Xen provides simple platform emulation
BIOS, Ethernet (e100), IDE and SCSI emulation
Install paravirtualized drivers after booting for high-performance IO
VM Relocation : Motivation
VM relocation enables:
High-availability
• Machine maintenance
Load balancing
• Statistical multiplexing gain
Assumptions
Networked storage
NAS: NFS, CIFS
SAN: Fibre Channel
iSCSI, network block dev
DRBD network RAID
Good connectivity
common L2 network
L3 re-routeing
Challenges
VMs have lots of state in memory
Some VMs have soft real-time requirements
E.g. web servers, databases, game servers
May be members of a cluster quorum
Minimize down-time
Performing relocation requires resources
Bound and control resources used
Relocation Strategy
Stage 0: pre-migration
VM active on host A
Destination host selected
(Block devices mirrored)
Stage 1: reservation
Initialize container on target host
Stage 2: iterative pre-copy
Copy dirty pages in successive rounds
Stage 3: stop-and-copy
Suspend VM on host A
Redirect network traffic
Synch remaining state
Stage 4: commitment
Activate on host B
VM state on host A released
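The interplay between the iterative pre-copy rounds and the final stop-and-copy can be illustrated with a toy simulation. It assumes a constant, uniform dirtying rate, which real workloads do not have; the function name `precopy_rounds`, its parameters and the `stop_threshold` cut-off are all hypothetical.

```python
def precopy_rounds(total_pages, dirty_rate, bandwidth, max_rounds=30, stop_threshold=50):
    """Simulate iterative pre-copy with rates given in pages per second.

    Returns (pages sent in each pre-copy round, pages left for stop-and-copy).
    """
    to_send = total_pages  # round 1 sends every page
    sent_per_round = []
    for _ in range(max_rounds):
        sent_per_round.append(to_send)
        # Pages dirtied while this round was in flight (integer arithmetic:
        # dirty_rate * round_duration, where round_duration = to_send / bandwidth).
        dirtied = min(total_pages, dirty_rate * to_send // bandwidth)
        made_progress = dirtied < to_send
        to_send = dirtied
        if dirtied <= stop_threshold or not made_progress:
            break  # small enough to stop the VM, or no forward progress
    return sent_per_round, to_send


rounds, final_pages = precopy_rounds(total_pages=100_000, dirty_rate=2_000, bandwidth=20_000)
```

With these made-up numbers each round shrinks by a factor of ten, so only a handful of pages remain for the stop-and-copy stage. When the dirtying rate exceeds the available bandwidth, the loop exits after one round with no progress made.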
[Figures: Pre-Copy Migration animation, showing Round 1, Round 2, and Final.]
Writable Working Set
Pages that are dirtied must be re-sent
Super hot pages
• e.g. process stacks; top of page free list
Buffer cache
Network receive / disk buffers
Dirtying rate determines VM down-time
Shorter iterations → less dirtying → …
App. ‘phase changes’ may knock us back
Writable Working Set
Set of pages written to by OS/application
Pages that are dirtied must be re-sent
Hot pages
• E.g. process stacks
• Top of free page list (works like a stack)
Buffer cache
Network receive / disk buffers
Page Dirtying Rate
[Figure: number of dirty pages vs. time into iteration.]
Dirtying rate determines VM down-time
Shorter iters → less dirtying → shorter iters
Stop and copy final pages
Application ‘phase changes’ create spikes
Writable Working Set
[Figure: Tracking the Writable Working Set of SPEC CINT2000. Number of pages vs. elapsed time (secs) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]
[Figures: PostgreSQL/OLTP down-time; CINT2000 down-time.]
Rate Limited Relocation
Dynamically adjust resources committed to performing page transfer
Dirty logging costs VM ~2-3%
CPU and network usage closely linked
E.g. first copy iteration at 100Mb/s, then increase based on observed dirtying rate
Minimize impact of relocation on server while minimizing down-time
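The rate adaptation described above might look like the following sketch. Only the 100Mb/s starting rate comes from the slide; the `headroom` added on top of the observed dirtying rate and the `max_rate` cap are assumed values chosen for illustration, as are the function names.

```python
def next_round_rate(observed_dirty_mbps, min_rate=100.0, max_rate=500.0, headroom=50.0):
    """Pick the transfer rate (Mb/s) for the next pre-copy round.

    Send slightly faster than the guest dirties memory, but never below
    min_rate and never above max_rate (both hypothetical bounds).
    """
    return max(min_rate, min(max_rate, observed_dirty_mbps + headroom))


def transfer_schedule(dirty_rates_mbps, min_rate=100.0, max_rate=500.0, headroom=50.0):
    """Rates for each round: the first round at min_rate, then adapted."""
    rates = [min_rate]
    for observed in dirty_rates_mbps:
        rates.append(next_round_rate(observed, min_rate, max_rate, headroom))
    return rates


schedule = transfer_schedule([20.0, 120.0, 900.0])
```

Starting slowly keeps the impact on the running service low while most pages are still clean; the rate only rises when the observed dirtying rate shows that a slower transfer could not converge.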
Web Server Relocation
[Figure: Effect of Migration on Web Server Transmission Rate. Throughput (Mbit/sec) vs. elapsed time (secs), sampled over 100ms and over 500ms. 512Kb files, 100 concurrent clients. Annotations: 870 Mbit/sec; 765 Mbit/sec; 694 Mbit/sec; 1st precopy, 62 secs; further iterations, 9.8 secs; 165ms total downtime.]
Iterative Progress: SPECWeb
[Figure annotation: 52s]
Iterative Progress: Quake3
Quake 3 Server relocation
[Figure: Packet interarrival time during Quake 3 migration. Packet flight time (secs) vs. elapsed time (secs). Annotations: Migration 1 downtime: 50ms; Migration 2 downtime: 48ms.]
Relocation Transparency
              Requires VMM support   Changes to OS   Able to adapt   QoS support
Transparent   yes                    none            no              harder
Assisted      yes                    minor           yes             harder
Self          no                     significant     yes             easier
Relocation Notification
Opportunity to be more co-operative
Quiesce background tasks to avoid dirtying
Doesn’t help if the foreground task is the cause of the problem…
Self-relocation allows the kernel fine-grained control over trade-off
Decrease priority of difficult processes
Extensions
Cluster load balancing
Pre-migration analysis phase
Optimization over coarse timescales
Evacuating nodes for maintenance
Move easy to migrate VMs first
Storage-system support for VM clusters
Decentralized, data replication, copy-on-write
Wide-area relocation
IPSec tunnels and CoW network mirroring
Research Roadmap
Software fault tolerance
Exploit deterministic replay
System debugging
Lightweight checkpointing and replay
VM forking
Lightweight service replication, isolation
Secure virtualization
Multi-level secure Xen
Xen Supporters
Operating System and Systems Management
Hardware Systems
Platforms & I/O
* Logos are registered trademarks of their owners
Conclusions
Xen is a complete and robust GPL VMM
Outstanding performance and scalability
Excellent resource control and protection
Live relocation makes seamless migration possible for many real-time workloads
http://xensource.com
Thanks!
The Xen project is hiring in Cambridge, Palo Alto and New York
Backup slides
Research Roadmap
Whole distributed system emulation
I/O interposition and emulation
Distributed watchpoints, replay
VM forking
Service replication, isolation
Secure virtualization
Multi-level secure Xen
XenBIOS
Closer integration with the platform / BMC
Device Virtualization
Isolated Driver VMs
Run device drivers in separate domains
Detect failure e.g.
Illegal access
Timeout
Kill domain, restart
E.g. 275ms outage from failed Ethernet driver
[Figure: plot, axis 0 to 350, vs. time (s) from 0 to 40.]
Segmentation Support
Segmentation req’d by thread libraries
Xen supports virtualised GDT and LDT
Segment must not overlap Xen 64MB area
NPTL TLS library uses 4GB segs with –ve offset!
• Emulation plus binary rewriting required
x86_64 has no support for segment limits
Forced to use paging, but only have 2 prot levels
Xen ring 0; OS and user in ring 3 w/ PT switch
• Opteron’s TLB flush filter CAM makes this fast
Device Channel Interface
Live migration for clusters
Pre-copy approach: VM continues to run
‘lift’ domain on to shadow page tables
Bitmap of dirtied pages; scan; transmit dirtied
Atomic ‘zero bitmap & make PTEs read-only’
Iterate until no forward progress, then stop VM and transfer remainder
Rewrite page tables for new MFNs; Restart
Migrate MAC or send unsolicited ARP-Reply
Downtime typically 10’s of milliseconds (though very application dependent)
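The dirty-page bitmap and the atomic "zero bitmap and make PTEs read-only" step can be modelled in a few lines. This is a sketch, not Xen's shadow page table code: the `DirtyLog` class is hypothetical, one byte per page stands in for one bit, and a lock stands in for whatever makes the real operation atomic.

```python
import threading


class DirtyLog:
    """Toy dirty-page log: a write to a read-only page is recorded once."""

    def __init__(self, num_pages):
        self.dirty = bytearray(num_pages)     # 1 = page dirtied since last scan
        self.writable = bytearray(num_pages)  # 0 = PTE is currently read-only
        self.lock = threading.Lock()

    def guest_write(self, page):
        with self.lock:
            if not self.writable[page]:
                # Write fault on a read-only PTE: log it and grant write
                # access so further writes to this page do not fault again.
                self.dirty[page] = 1
                self.writable[page] = 1

    def snapshot_and_reset(self):
        """Atomically collect dirtied pages, zero the bitmap, re-protect PTEs."""
        with self.lock:
            pages = [p for p, bit in enumerate(self.dirty) if bit]
            for p in pages:
                self.dirty[p] = 0
                self.writable[p] = 0
            return pages


log = DirtyLog(num_pages=8)
for page in (3, 5, 3):
    log.guest_write(page)
first_scan = log.snapshot_and_reset()  # pages to transmit in this round
```

Doing the scan and the reset under one lock matters: a write that lands between reading the bitmap and re-protecting the page would otherwise be lost and never re-sent.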
Scalability
Scalability principally limited by Application resource requirements
several 10’s of VMs on server-class machines
Balloon driver used to control domain memory usage by returning pages to Xen
Normal OS paging mechanisms can deflate quiescent domains to <4MB
Xen per-guest memory usage <32KB
Additional multiplexing overhead negligible
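The balloon mechanism can be pictured as page accounting between a domain and the hypervisor's free pool. The `Hypervisor` and `BalloonedDomain` classes below are hypothetical names for a sketch of that bookkeeping only; how the guest chooses which pages to give up is left to its normal paging mechanisms and is not modelled.

```python
class Hypervisor:
    def __init__(self, free_pages=0):
        self.free_pages = free_pages  # pages not assigned to any domain


class BalloonedDomain:
    def __init__(self, xen, pages, balloon=0):
        self.xen = xen
        self.pages = pages      # pages the guest OS can currently use
        self.balloon = balloon  # pages held by the balloon driver (returned to Xen)

    def inflate(self, n):
        """Balloon grows: the guest gives up n pages, which go back to Xen."""
        n = min(n, self.pages)
        self.pages -= n
        self.balloon += n
        self.xen.free_pages += n
        return n

    def deflate(self, n):
        """Balloon shrinks: the guest reclaims pages if Xen has any free."""
        n = min(n, self.balloon, self.xen.free_pages)
        self.balloon -= n
        self.pages += n
        self.xen.free_pages -= n
        return n


xen = Hypervisor()
idle = BalloonedDomain(xen, pages=1000)
busy = BalloonedDomain(xen, pages=500, balloon=500)
idle.inflate(300)              # quiescent domain returns memory to Xen
reclaimed = busy.deflate(400)  # busy domain gets only what is actually free
```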
Scalability
[Figure: bar chart, axis 0 to 1000, for 2, 4, 8 and 16 instances.]
Simultaneous SPEC WEB99 Instances on Linux (L) and Xen(X)
Resource Differentiation
[Figure: aggregate throughput relative to one instance, axis 0.0 to 2.0, for 2, 4, 8 and 8(diff) instances of OSDB-IR and OSDB-OLTP.]
Simultaneous OSDB-IR and OSDB-OLTP Instances on Xen