Live Migration @ Alibaba Cloud Computing
Kaier & Team
-
Agenda
• Live Migration Usage Model at Alicloud
• Issues and Challenges at Alicloud
• Live Migration Optimization and Enhancement
• Live Migration Practice at Alicloud
• Future Work
-
Live Migration Usage Model at Alicloud
• Usage model
  • Load balancing
  • Hardware malfunction handling
  • Software update and hotfix
  • Expired server replacement
• Benefit
  • Earn more money by improving vm density
  • Cloud maintenance is transparent to customers
-
Issues and Challenges at Alicloud
• Complicated cloud computing environment
  • Hundreds of guest OS types and versions, some unfriendly to live migration
  • Different server platforms, which block migration among different processor types
  • Some Alicloud legacy design blocks live migration, e.g., static vnc port binding
  • Diverse storage and network types
• Xen4.0.1 is not friendly to live migration
  • Hypervisor cannot be updated online
  • Qemu and pv driver bugs
  • Python xend is not good for live migration, while xl was not mature at that time
• High performance and reliability requirements
  • 100% vm survive
  • Transparent to customer
  • Hundreds-ms-level downtime and breaktime
-
Live Migration Optimization and Enhancement
• Performance optimization
  • New event-driven parallel live migration
    • Current Xen live migration is serial
    • Alicloud parallel live migration
      • Event-driven mechanism based on xenstore
      • Virtualization/storage/network components are separated, logically independent and clear
      • Parallel live migration: storage and network work in parallel w/ the xend process
  • Virtualization optimization
    • Bypass time-consuming domain destroy (Xen memory scrub issue)
    • New end handshake for qemu data transfer
    • Pre-restore most vm work ASAP at destination side
  • Storage optimization
    • Light-weight tap-ctl pause/reopen instead of tapdisk destroy/create
  • Network optimization
    • Send gratuitous arp ASAP at destination side
    • Flow cache
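The event-driven parallel design described above can be illustrated with a minimal sketch. It is not Alicloud's implementation: `EventBus` is a hypothetical stand-in for xenstore keys, built on `threading.Event`, and the event names and the rule that resume waits for all three components are assumptions made for illustration.

```python
import threading


class EventBus:
    """Hypothetical stand-in for xenstore: a named event fires once and wakes all waiters."""

    def __init__(self):
        self._events = {}
        self._lock = threading.Lock()

    def _get(self, name):
        with self._lock:
            return self._events.setdefault(name, threading.Event())

    def fire(self, name):
        self._get(name).set()

    def wait(self, name, timeout=5.0):
        if not self._get(name).wait(timeout):
            raise TimeoutError(f"event {name!r} never fired")


def run_destination(bus, log):
    """Run the virtualization, storage and network steps in parallel, then resume."""

    def component(label, done_event):
        def work():
            bus.wait("memory-restored")   # all three start from the same event
            log.append(label)
            bus.fire(done_event)
        return work

    def resume():
        # Assumption: the vm resumes only after every component reports ready.
        for name in ("virt-ready", "storage-ready", "network-ready"):
            bus.wait(name)
        log.append("vm resume")

    workers = [
        component("virt: complete restore", "virt-ready"),
        component("storage: tapdisk reopen", "storage-ready"),
        component("network: send gratuitous arp", "network-ready"),
        resume,
    ]
    threads = [threading.Thread(target=w) for w in workers]
    for t in threads:
        t.start()
    bus.fire("memory-restored")
    for t in threads:
        t.join()
    return log
```

Calling `run_destination(EventBus(), [])` returns the three component steps in whatever order the threads ran, followed by `"vm resume"`. In a serial flow the same steps would add up; here the resume waits only for the slowest one.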
-
Live Migration Optimization and Enhancement
[Figure: Xen4.0.1 live migration flow, with Source and Destination columns and the downtime and breaktime intervals marked. Step labels in the diagram:]
• Send vm configuration
• Memory log dirty
• Send N-1 iter memory
• Socket connection
• Suspend vm
• Send last iter memory
• Send vm context
• Send qemu context
• Socket server
• Destroy vm (1. very time consuming; 2. increases w/ vm mem size)
• Socket close
• Create vm
• Restore memory
• Restore last memory
• Restore vm context
• Restore qemu context
• Complete vm restore
• Reopen tapdisk
• Destroy vm devices
• Restore qemu context done
• vm Resume
• Vif front-backend connection
• ARP notify (1. send gratuitous arp; 2. update vm ARP cache)
• Wait
• End
-
Live Migration Optimization and Enhancement
[Figure: Alicloud optimized live migration flow, with Source and Destination columns and the downtime and breaktime intervals marked. Step labels in the diagram:]
• Send vm configuration
• Memory log dirty
• Send N-1 iter memory
• Socket connection
• Suspend vm
• Send last iter memory
• Send vm context
• Send qemu context
• Socket server
• Destroy vm (1. very time consuming; 2. increases w/ vm mem size)
• Socket close
• Create vm
• Restore memory
• Restore last memory
• Restore vm context
• Restore qemu context
• Destroy vm devices
• vm Resume
• Vif front-backend connection
• ARP notify (1. send gratuitous arp; 2. update vm ARP cache)
• End
• Light weight tapdisk pause
• Storage: Tapdisk reopen
• Virt: Complete restore
• Network: Send gratuitous arp
• Qemu context length pre-notification
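The "send gratuitous arp" step can be illustrated with a short sketch of the frame it puts on the wire. It follows the standard Ethernet/ARP layout; the MAC and IP values used with it are made-up examples, and the slides do not say whether Alicloud sends the request or the reply form, so the request form (opcode 1) is an assumption here.

```python
import socket
import struct


def gratuitous_arp_frame(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    Sender and target IP are both the vm's own address, so switches and
    neighbours relearn where the vm's MAC now lives.
    """
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, vm source MAC, EtherType 0x0806 (ARP).
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeroed in a request) and target IP (same as sender).
    arp_body = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_header + arp_body
```

The resulting frame is 42 bytes (14 for Ethernet, 28 for ARP). Actually transmitting it needs a raw socket and privileges on the destination host, which this sketch leaves out.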
-
Downtime and breaktime optimization
            Xen4.0.1     Xenserver   Alicloud       Alicloud improvement
downtime    2.5 ~ 17 s   4.5 ~ 5 s   210 ~ 250 ms   vs. Xen4.0.1: 10 ~ 70x; vs. Xenserver: 20x
breaktime   3 ~ 25 s     8 ~ 9 s     500 ~ 600 ms   vs. Xen4.0.1: 5 ~ 50x; vs. Xenserver: 15x
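The improvement factors can be cross-checked from the measured ranges. The sketch below computes the widest possible ratio range (best baseline against worst Alicloud value, and the reverse); the function name is hypothetical, and how the slide paired and rounded its values is not stated.

```python
def speedup_range(baseline_s, alicloud_s):
    """Return (min, max) speedup given (low, high) ranges in seconds."""
    lowest = baseline_s[0] / alicloud_s[1]    # fastest baseline vs. slowest Alicloud
    highest = baseline_s[1] / alicloud_s[0]   # slowest baseline vs. fastest Alicloud
    return lowest, highest


# Ranges taken from the table above, converted to seconds.
downtime_vs_xen = speedup_range((2.5, 17.0), (0.21, 0.25))
downtime_vs_xenserver = speedup_range((4.5, 5.0), (0.21, 0.25))
breaktime_vs_xen = speedup_range((3.0, 25.0), (0.5, 0.6))
breaktime_vs_xenserver = speedup_range((8.0, 9.0), (0.5, 0.6))
```

Downtime against Xen4.0.1 comes out between 10x and about 81x, and breaktime between 5x and 50x, so the rounded figures in the table fall inside these bounds. The Xenserver ratios span 18x to about 24x for downtime and about 13x to 18x for breaktime.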
[Chart: downtime optimization (ms), y-axis 0 to 12000 ms, for vm memory sizes 512MB, 1GB, 2GB, 4GB, 8GB and 16GB; series: Xen4.0.1, Xenserver, Alicloud]
-
Live Migration Optimization and Enhancement
• Reliability enhancement
  • Redundant storage/network design
    • Dual storage and network for live migration recovery, no single point of failure
  • Closed-loop live migration
    • Xend is not reliable under a complicated cloud computing environment
    • Alicloud closed-loop live migration
      • Alicloud controller is per server pool and owns the full live migration picture
      • Xend reports live migration events and progress to the Alicloud controller
      • Alicloud controller makes reliable failover decisions and instructs the src/dst sides
  • Target: 100% vm survive
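The closed-loop idea (both sides report progress, one controller decides the failover) can be sketched as below. The class, the event names (`"failed"`, `"restore-complete"`) and the decision rule are all hypothetical; the slides do not describe the controller's actual protocol or policy.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue migration"
    RESUME_ON_SOURCE = "resume vm on source"
    COMPLETE_ON_DESTINATION = "complete vm on destination"


class MigrationController:
    """Collects events reported by src/dst and decides where the vm should live."""

    def __init__(self):
        self.progress = {"src": [], "dst": []}

    def report(self, side, event):
        """Record an event from one side ('src' or 'dst') and return a decision."""
        self.progress[side].append(event)
        return self.decide()

    def decide(self):
        src, dst = self.progress["src"], self.progress["dst"]
        if "failed" not in src and "failed" not in dst:
            return Decision.CONTINUE
        # Assumed rule: once the destination holds a fully restored vm, finish
        # there; otherwise the only complete copy is still on the source.
        if "restore-complete" in dst:
            return Decision.COMPLETE_ON_DESTINATION
        return Decision.RESUME_ON_SOURCE
```

For example, a controller that has seen `report("src", "suspended")` followed by `report("dst", "failed")` returns `Decision.RESUME_ON_SOURCE`, because the destination never reported a complete restore. The decision needs both sides' events, which is why it sits in the controller and not in either xend.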
-
Live Migration Practice at Alicloud
• Hardware malfunction
  • Case 1: Shu-guang server fan malfunction
    • ~200 servers impacted
    • Migrate all vm to backup servers, change fans, then migrate back
    • Totally transparent to customers
    • 98.6% vm survive, failures due to wrong Xend installation
  • Case 2: memory and disk I/O failure
    • vm under UCNA/SRAO memory failure can be migrated safely, 100% success
    • Disk I/O failure handling, 100% success
• Server consolidation
  • Case 1: Si-Chuan telecom vm consolidation
    • Consolidate all vm of 150 servers onto 20 servers
    • Save servers for new business
    • 100 vm survive
-
Future Work
• Post copy
  • Downtime and breaktime increase under dirty-page pressure
  • Post copy, but w/ failover issue
  • Cooperate w/ Xen developers in the community
• TSC issue
  • RDTSC drops badly after migration to a different x86 processor
  • Cooperate w/ Intel, RDTSC scaling microcode on BDW-EP servers
  • Heterogeneous live migration is highly desired in large-scale cloud computing
-
Thanks