Live Migration @ Alibaba Cloud Computing

Kaier & Team

jinsong.liu@alibaba-inc.com

Agenda

• Live Migration Usage Model at Alicloud
• Issues and Challenges at Alicloud
• Live Migration Optimization and Enhancement
• Live Migration Practice at Alicloud
• Future Work

Live Migration Usage Model at Alicloud

• Usage model
  • Load balancing
  • Hardware malfunction handling
  • Software updates and hotfixes
  • Expired server replacement
• Benefits
  • Higher revenue through improved VM density
  • Cloud maintenance that is transparent to customers

Issues and Challenges at Alicloud

• Complicated cloud computing environment
  • Hundreds of guest OS types and versions, some of them unfriendly to live migration
  • Different server platforms, which block migration between different processor types
  • Some Alicloud legacy designs block live migration, e.g., static VNC port binding
  • Diverse storage and network types
• Xen 4.0.1 is not friendly to live migration
  • Hypervisor cannot be updated online
  • QEMU and PV driver bugs
  • Python xend is not good for live migration, while xl was not mature at that time
• High performance and reliability requirements
  • 100% VM survival
  • Transparent to customers
  • Downtime and breaktime at the level of hundreds of milliseconds

Live Migration Optimization and Enhancement

• Performance optimization
  • New event-driven parallel live migration
    • Current Xen live migration is serial
    • Alicloud parallel live migration
      • Event-driven mechanism based on xenstore
      • Virtualization, storage, and network components are separated, logically independent and clear
      • Storage and network work in parallel with the xend process
  • Virtualization optimization
    • Bypass the time-consuming domain destroy (Xen memory scrub issue)
    • New end handshake for QEMU data transfer
    • Pre-restore as much VM work as possible, as early as possible, at the destination side
  • Storage optimization
    • Lightweight tap-ctl pause/reopen instead of tapdisk destroy/create
  • Network optimization
    • Send gratuitous ARP as early as possible at the destination side
    • Flow cache
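The event-driven idea can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: `WatchStore` is a hypothetical in-memory stand-in for a watch-capable store (it is not the xenstore API), and the key name `migration/vm-suspended` and the three component names are made up for the example. It shows storage, virtualization, and network handlers all reacting to one event concurrently instead of running one after another.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait


class WatchStore:
    """Toy in-memory key/value store with watches (hypothetical, not the xenstore API)."""

    def __init__(self, workers=4):
        self._watches = defaultdict(list)
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def watch(self, key, callback):
        # Register a callback to run whenever `key` is written.
        self._watches[key].append(callback)

    def write(self, key, value):
        # Fire every watcher of this key concurrently and return their futures.
        return [self._pool.submit(cb, value) for cb in self._watches[key]]

    def close(self):
        self._pool.shutdown(wait=True)


def run_destination_side(vm_name):
    """Run the three component handlers in parallel off a single event."""
    store = WatchStore()
    done = []
    lock = threading.Lock()

    def make_handler(component):
        def handler(vm):
            # Real work would be tapdisk reopen, VM restore, or an ARP announce.
            with lock:
                done.append((component, vm))
        return handler

    for component in ("storage", "virt", "network"):
        store.watch("migration/vm-suspended", make_handler(component))

    # One write triggers all three components at once.
    futures = store.write("migration/vm-suspended", vm_name)
    wait(futures)
    store.close()
    return sorted(done)
```

Calling `run_destination_side("vm1")` returns the three `(component, vm)` pairs once every handler has finished. In a serial design the same handlers would simply be called in sequence, so the elapsed time would be closer to the sum of the steps than to the longest one.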

Live Migration Optimization and Enhancement

[Figure: Xen 4.0.1 live migration flow across the Source and Destination sides, with the downtime and breaktime intervals marked. Steps shown in the diagram:]

• Send VM configuration
• Memory log-dirty, send N-1 iterations of memory
• Socket connection
• Suspend VM
• Send last iteration of memory, send VM context
• Send QEMU context
• Socket server
• Destroy VM
  1. Very time consuming
  2. Increases with VM memory size
• Socket close
• Create VM
• Restore memory
• Restore last memory, restore VM context
• Restore QEMU context
• Complete VM restore
• Reopen tapdisk
• Destroy VM devices
• Restore QEMU context done
• VM resume
• Vif front-backend connection
• ARP notify
  1. Send gratuitous ARP
  2. Update VM ARP cache
• Wait
• End

Live Migration Optimization and Enhancement

[Figure: Alicloud optimized live migration flow across the Source and Destination sides, with the downtime and breaktime intervals marked. Steps shown in the diagram:]

• Send VM configuration
• Memory log-dirty, send N-1 iterations of memory
• Socket connection
• Suspend VM
• Lightweight tapdisk pause
• Send last iteration of memory, send VM context
• QEMU context length pre-notification
• Send QEMU context
• Socket server
• Destroy VM
  1. Very time consuming
  2. Increases with VM memory size
• Socket close
• Create VM
• Restore memory
• Restore last memory, restore VM context
• Restore QEMU context
• Storage: tapdisk reopen
• Virt: complete restore
• Network: send gratuitous ARP
• Destroy VM devices
• VM resume
• Vif front-backend connection
• ARP notify
  1. Send gratuitous ARP
  2. Update VM ARP cache
• End
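For readers unfamiliar with the "send gratuitous ARP" step, the sketch below builds such a frame by hand. It is a minimal illustration of the standard Ethernet/ARP layout, not the Alicloud implementation. The function name, the MAC address, and the IP address are all assumptions for the example, and actually transmitting the frame (which needs a raw socket and privileges) is left out.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast ARP request in which sender IP equals target IP."""
    hw = bytes.fromhex(mac.replace(":", ""))   # 6-byte hardware address
    addr = socket.inet_aton(ip)                # 4-byte IPv4 address
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth = struct.pack("!6s6sH", broadcast, hw, 0x0806)

    # ARP body: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4,
    # opcode 1 (request), then sender MAC/IP and target MAC/IP.
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4, 1,
        hw, addr,
        b"\x00" * 6, addr,
    )
    return eth + arp
```

A call such as `build_gratuitous_arp("00:16:3e:00:00:01", "10.0.0.5")` yields a 42-byte frame. Because the VM announces its own address from its new location, switches and neighbors can refresh their tables without waiting for the VM to send ordinary traffic, which is presumably why the slides move this step as early as possible.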

Downtime and breaktime optimization

             Xen4.0.1      Xenserver    Alicloud        Alicloud improvement
downtime     2.5 ~ 17 s    4.5 ~ 5 s    210 ~ 250 ms    10~70x vs. Xen4.0.1, 20x vs. Xenserver
breaktime    3 ~ 25 s      8 ~ 9 s      500 ~ 600 ms    5~50x vs. Xen4.0.1, 15x vs. Xenserver
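As a rough sanity check on the improvement factors, the sketch below divides the baseline ranges by the Alicloud ranges. How the slides actually derived their factors is not stated, so the pairing used here (smallest baseline over largest optimized value, and the reverse) is an assumption, and `speedup_bounds` is a hypothetical helper.

```python
def speedup_bounds(baseline_ms, optimized_ms):
    """Return the widest (low, high) speedup implied by two (min, max) ranges."""
    low = min(baseline_ms) / max(optimized_ms)
    high = max(baseline_ms) / min(optimized_ms)
    return low, high


# Ranges from the table, converted to milliseconds.
downtime = {"Xen4.0.1": (2500, 17000), "Xenserver": (4500, 5000), "Alicloud": (210, 250)}
breaktime = {"Xen4.0.1": (3000, 25000), "Xenserver": (8000, 9000), "Alicloud": (500, 600)}

for name, table in (("downtime", downtime), ("breaktime", breaktime)):
    for baseline in ("Xen4.0.1", "Xenserver"):
        low, high = speedup_bounds(table[baseline], table["Alicloud"])
        print(f"{name} vs. {baseline}: {low:.1f}x to {high:.1f}x")
```

The bounds computed this way bracket the rounded factors quoted in the table. The breaktime figures against Xen 4.0.1 come out at exactly 5x and 50x.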

[Figure: downtime optimization (ms). Y-axis from 0 to 12000 ms; x-axis categories 512MB, 1GB, 2GB, 4GB, 8GB, 16GB; series: Xen4.0.1, Xenserver, Alicloud.]

Live Migration Optimization and Enhancement

• Reliability enhancement
  • Redundant storage/network design
    • Dual storage and network paths for live migration recovery, no single point of failure
  • Closed-loop live migration
    • Xend is not reliable in a complicated cloud computing environment
    • Alicloud closed-loop live migration
      • One Alicloud controller per server pool, which owns the full live migration picture
      • Xend reports live migration events and progress to the Alicloud controller
      • The Alicloud controller makes reliable failover decisions and instructs the source and destination sides
  • Target: 100% VM survival
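The closed-loop design can be sketched as a small state tracker. Everything below is hypothetical: the class name `PoolController`, the event strings (`"suspended"`, `"resumed"`, `"failed"`), and the instruction strings are invented for illustration, since the slides do not describe the controller's real interface or policy. The point is only that one party sees the events from both sides and tells each side what to do.

```python
from dataclasses import dataclass, field
from enum import Enum


class Side(Enum):
    SRC = "src"
    DST = "dst"


@dataclass
class MigrationState:
    vm: str
    events: list = field(default_factory=list)


class PoolController:
    """Toy per-pool controller that collects events and issues instructions."""

    def __init__(self):
        self._migrations = {}

    def report(self, vm, side, event):
        # Each side reports progress; the controller answers with a decision.
        state = self._migrations.setdefault(vm, MigrationState(vm))
        state.events.append((side, event))
        return self.decide(vm)

    def decide(self, vm):
        events = self._migrations[vm].events
        if (Side.DST, "resumed") in events:
            # VM is running at the destination: source can release resources.
            return {Side.SRC: "cleanup", Side.DST: "keep"}
        if any(event == "failed" for _, event in events):
            # Fail over: keep the VM alive on the source, abort the destination.
            return {Side.SRC: "resume", Side.DST: "abort"}
        return {Side.SRC: "continue", Side.DST: "continue"}
```

For example, if the destination reports `"failed"` after the source reported `"suspended"`, this toy policy tells the source to resume the VM. Neither side has to guess the other's state on its own, which is the property the closed-loop bullets describe.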

Live Migration Practice at Alicloud

• Hardware malfunction
  • Case 1: Shu-guang server fan malfunction
    • ~200 servers impacted
    • Migrated all VMs to backup servers, replaced the fans, then migrated back
    • Totally transparent to customers
    • 98.6% of VMs survived; the failures were due to incorrect Xend installation
  • Case 2: memory and disk I/O failure
    • VMs under UCNA/SRAO memory failure can be migrated safely, 100% success
    • Disk I/O failure handling, 100% success
• Server consolidation
  • Case 1: Si-Chuan telecom VM consolidation

    • Consolidated all VMs from 150 servers onto 20 servers
    • Freed servers for new business
    • 100% of VMs survived

Future Work

• Post copy
  • Downtime and breaktime increase under dirty-page pressure
  • Post copy addresses this, but brings a failover issue
  • Cooperate with Xen developers in the community
• TSC issue
  • RDTSC drops badly after migration to a different x86 platform
  • Cooperate with Intel on RDTSC scaling microcode for BDW-EP servers
  • Heterogeneous live migration is highly desirable in large-scale cloud computing
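The dirty-page pressure point can be illustrated with a simple pre-copy model. This is a textbook-style simplification, not a measurement from Alicloud: it assumes a constant dirty rate and link bandwidth, counts only the final memory copy as downtime, and uses a hypothetical function name and made-up parameter values.

```python
def precopy_rounds(mem_mb, dirty_mb_per_s, link_mb_per_s,
                   stop_threshold_mb, max_rounds=30):
    """Simulate iterative pre-copy; return (rounds, final stop-and-copy seconds)."""
    to_send = float(mem_mb)
    for round_no in range(1, max_rounds + 1):
        # While this round is being sent, the guest keeps dirtying memory.
        send_time = to_send / link_mb_per_s
        to_send = min(float(mem_mb), dirty_mb_per_s * send_time)
        if to_send <= stop_threshold_mb:
            break
    # What is still dirty must be sent while the VM is suspended.
    downtime_s = to_send / link_mb_per_s
    return round_no, downtime_s


calm = precopy_rounds(4096, dirty_mb_per_s=100, link_mb_per_s=1000, stop_threshold_mb=50)
busy = precopy_rounds(4096, dirty_mb_per_s=900, link_mb_per_s=1000, stop_threshold_mb=50)
print("low dirty rate :", calm)
print("high dirty rate:", busy)
```

With a low dirty rate the loop converges in a couple of rounds and leaves little to copy during suspension. With a dirty rate close to the link bandwidth it hits the round limit with far more memory still dirty, so the suspended phase grows. Post copy avoids this by resuming at the destination before all memory has arrived, at the price of the failover concern noted above.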

Thanks