Multi-level Selective Deduplication for VM Snapshots in Cloud Storage Wei Zhang*, Hong Tang †, Hao...

Multi-level Selective Deduplication for VMSnapshots in Cloud Storage

Wei Zhang*, Hong Tang†, Hao Jiang†, Tao Yang*, Xiaogang Li†, Yue Zeng†

* University of California at Santa Barbara† Aliyun.com Inc.

Motivations

• Virtual machines on the cloud use frequent backup to improve service reliability Used in Alibaba’s Aliyun - the largest public cloud

service in China• High storage demand

Daily backup workload: hundreds of TB @ Aliyun Number of VMs per cluster: 10000+

• Large content duplicates• Limited resource for deduplication

No special hardware or dedicated machines Small CPU& memory footprint

Focus and Related Work

• Previous work Version-based incremental snapshot backup

– Inter-block/VM duplicates are not detected. Chunk-based file duduplication

– High cost for chunk lookup• Focus on

Parallel backup of a large number of virtual disks. – Large files for VM disk images.

Contributions– Cost-constrained solution with very limited computing

resource– Multi-level selective duplicate detection and parallel backup.

Requirements

• Negligible impact on existing cloud service and VM performance

Must minimize CPU and IO bandwidth consumption for backup and deduplication workload

– (e.g. <1% of total resource).

• Fast backup speed Compute backup for 10,000+ users within a few hours

each day during light cloud workload.• Fault tolerance constraint addition of data deduplication should not decrease the

degree of fault tolerance.

Design Considerations

• Design alternatives An external and dedicated backup storage

system.

A decentralized and co-hosted backup system with full deduplication

BackupCloud service

backup

Cloud service

backup

Cloud service. . .

backup

Cloud service

Design Considerations

• Decentralized architecture running on a general purpose cluster• co-hosting both elastic computing and backup

service• Multi-level deduplication Localize backup traffic and exploit data parallelism Increase fault tolerance• Selective deduplication Use minimal resource while still removing most of

redundant content and accomplishing good efficiency

Key Observations

Inner-VM data characteristicsExploit unchanged data to localize

deduplicationCross-VM data characteristics

Small common data dominates duplicates

Zipf-like distribution of VM OS/user dataSeparate consideration of OS and user

data

VM Snapshot Representation

Data blocks are variable-sized

Segments are fix-sized

Processing Flow of Multi-level Deduplication

Data Processing Steps

• Segment level checkup. Use dirty bitmap to see which segments are modified.

• Block level checkup Divide a segment into variable-sized blocks,

and compare their signatures with the parent snapshot

• Checkup from common dataset (CDS) Identify duplicate chunks from CDS

• Write new snapshot blocks Write new content chunks to stoage.

• Save recipes Save segment meta-data information

Architecture of Multi-level VM snapshot backup

Cluster node

Status& Evaluation

• Prototype system running on Alibaba’s Aliyuan cloud. Based on Xen. 100 nodes and each has 16 cores, 48G memory,

25VMs. Use <150MB per machine for backup&deduplication

• Evaluation data from Aliyuan’s production cluster 41TB. 10 snapshots per VM Segment size: 2MB. Avg. Block size: 4KB

Data Characteristics of the Benchmark

• Each VM uses 40GB storage space on average

• OS and user data disks: each takes ~50% of space

• OS data 7 main stream OS releases: Debian, Ubuntu, Redhat, CentOS, Win2003

32bit, win2003 64 bit and win2008 64 bit.• User data

From 1323 VM users

Impacts of 3-Level Deduplication

Level 1: Segment-level detection within VMLevel 2: Block-level detection within VMLevel 3: Common data block detection across-VM

Impact for Different OS Releases

Separate consideration of OS and user data

Both have Zipf-like data distributionBut popularity growth differs as the cluster size/VM users increase

Commonality among OS releases

1G common OS meta datacovers 70+%

Cumulative coverage of popular user data

Coverage is the summation of covered data block size*frequency

Space saving compared to perfect deduplication as CDS size increases

100G CDS (1GB index) -> 75% of perfect dedup

Impact of dataset-size increase

Conclusions

• Contributions: A multi-level selective deduplication scheme among VM

snapshots– Inner-VM deduplication localizes backup and exposes more

parallelism– global deduplication with a small common data set appeared in

OS and data disks

Use less than 0.5% of memory per node to meet a stringent cloud resource requirement -> accomplish 75% of what perfect deduplication does.

• Experiments Achieve 500TB/hour on a 1000-node cloud cluster Reduce bandwidth by 92% -> 40TB/hour

Multi-level Selective Deduplication for VM Snapshots in Cloud Storage Wei Zhang*, Hong Tang †, Hao...

Documents

Transcript of Multi-level Selective Deduplication for VM Snapshots in Cloud Storage Wei Zhang*, Hong Tang †, Hao...