2012 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
Primary Data Deduplication in Windows Server 2012
Sudipta Sengupta & Jim Benton, Microsoft Corporation, Redmond, WA, USA
File-based Storage Market
• Growth in file-based data: 60% growth 2001-2015 [IDC]
• Rising storage TCO despite declining disk drive cost
• Increasing access to data over low-bandwidth links: 60% of information workers (IWs) working from branch, road, or home; data access increasingly over the WAN
Source: IDC Enterprise Disk Storage Consumption Model
Major Players
• EMC/DataDomain: backup data dedup ($2.1B acquisition)
• NetApp: backup and primary data dedup
• Dell/Ocarina: extensive file-format-aware optimization ($150M acquisition)
• Permabit: backup and primary data dedup
• Oracle/ZFS: native dedup support in ZFS
• Linux/SDFS: open-source Linux dedup project

Customer/Market Response
Trends:
• Package dedup in a storage appliance offering
• Move from backup → primary → live data
Deduplication of Storage
Detect and remove duplicate data in storage systems:
• Data at rest: storage space savings
• Data on the wire: network bandwidth and disk I/O savings
High-level techniques:
• Content-based chunking; detect/store unique chunks only
• Similarity-based, differential (delta) encoding
Content-based Chunking
Calculate a Rabin fingerprint hash for each 16-byte sliding window over the data.
If the hash matches a particular pattern, declare a chunk boundary.
(Animation: the window slides across the example bit stream 1010 1010 0101 0000 0000 1010 0100 1010 1010 0101 0101 0101 0101 0011 0101; three pattern matches split it into 3 chunks.)
A sketch of this technique follows.
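To make the sliding-window mechanism concrete, here is a minimal sketch of content-defined chunking in Python. It is illustrative only: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window width, modulus, and 13-bit boundary mask (expected ~8KB average chunks) are assumptions, not the shipped parameters.

```python
WINDOW = 16               # sliding-window width in bytes, as on the slide
MASK = (1 << 13) - 1      # boundary pattern: low 13 bits of the hash are zero
BASE = 257
MOD = (1 << 61) - 1
POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window

def chunk_boundaries(data: bytes):
    """Yield the end offset of each content-defined chunk in `data`."""
    h, last = 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                     # add the new byte
        # Declare a chunk boundary when the window hash matches the pattern.
        if i + 1 - last >= WINDOW and (h & MASK) == 0:
            yield i + 1
            last = i + 1
    if last < len(data):
        yield len(data)    # final boundary is forced at end of data
```

Because boundaries depend only on window contents, an insertion early in a file shifts at most a few chunks; the remaining boundaries realign, which is what makes chunk-level dedup robust to edits.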
Deduplication of Storage (continued)
On what type of data?
• Backup data: mature market
• Primary data: relatively recent interest
Deduplication Problem Space
Primary data dedup extends backup data dedup along multiple dimensions:
• Primary storage
• Concurrent data serving
• Commodity hardware
• Low duplication locality
Key Challenge
Windows Server must run well on commodity hardware, and deduplication must be friendly to the primary data-serving workload. This introduces unique challenges in balancing server resource usage, deduplication throughput, and deduplication space savings.
Key Design Decisions
• Post-processing deduplication: preserve latency/throughput of primary data access; flexibility to schedule dedup as a background job on cold data
• Deduplication granularity and data chunking: chunk-level, variable-sized chunking with large chunk size (~80KB); modifications to Rabin-fingerprint-based chunking to achieve a more uniform chunk size distribution
• Deduplication resource usage scaling slowly with data size: reduced chunk metadata, RAM-frugal chunk hash index, data partitioning
• Crash consistent on all hardware: detect, contain, repair, and report user and metadata corruptions
Large-Scale Study of Primary Datasets
• Used to drive key design decisions
• About 7TB of data spread across 15 globally distributed servers in a large enterprise
• Data crawled and chunked at different average chunk sizes, using a Rabin-fingerprint-based variable-sized chunker
• SHA-1 hash, size, compressed size, offset in file, and file information logged for each chunk
Deduplication Granularity: Whole-file vs. Sub-file
Significant improvement in dedup space savings with sub-file dedup vs. whole-file dedup.
Numbers obtained using an average chunk size of ~64KB.
Average Chunk Size
Use a larger chunk size of ~64KB to reduce chunk metadata in the system, without sacrificing dedup savings:
• Compression compensates for the savings decrease at higher chunk sizes
• Compression is more efficient on larger chunks
(Chart, GFS-US dataset: dedup savings loss without compression vs. dedup savings preserved with compression.)
Chunk Compression
Compression/decompression can have a significant performance impact (GFS-US dataset).
Compression savings are skewed:
• 50% of unique chunks are responsible for 80% of the compression savings
• 25% of chunks don't compress at all
Solution: selective compression
• Reduces the cost of compression for a large fraction of chunks while preserving most of the compression savings
• Reduces decompression costs (less CPU pressure during heavy reads)
• Heuristics determine which chunks should not be compressed (see the sketch below)
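The slides leave the heuristics unspecified; the sketch below shows one plausible, illustrative heuristic: test-compress a small prefix sample and skip chunks whose sample barely shrinks. The sample size, threshold, and use of zlib are assumptions, not the shipped policy.

```python
import zlib

SAMPLE = 4096          # bytes of the chunk to test-compress
THRESHOLD = 0.95       # keep uncompressed unless the sample shrinks below this

def maybe_compress(chunk: bytes):
    """Return (data, is_compressed) for storage in the chunk store."""
    sample = chunk[:SAMPLE]
    if len(zlib.compress(sample)) > THRESHOLD * len(sample):
        return chunk, False        # poorly compressible: skip, save CPU
    packed = zlib.compress(chunk)
    if len(packed) >= len(chunk):
        return chunk, False        # compression did not pay off after all
    return packed, True
```

Skipping the ~25% of chunks that don't compress avoids both the compression cost at optimization time and the decompression cost on every later read of those chunks.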
Basic Version of Fingerprint-based Chunking
• Skewed chunk size distribution
  • Small chunk sizes: increased chunk metadata in the system
  • Large chunks: reduced dedup savings and benefit of caching
• Forced chunk boundaries
  • A forced boundary at max chunk size is content independent, hence may reduce dedup savings
(Chart: frequency distribution of chunk size between min and max, with a peak at the forced max-size boundary.)
Regression Chunking Algorithm
• Goal 1: obtain a uniform chunk size distribution
• Goal 2: reduce forced chunk boundaries at max size
Basic idea (see the sketch below):
• When max chunk size is reached, relax the match condition to some suffix of bit pattern P
• Match |P| - i bits of P, with decreasing priority for i = 0, 1, ..., k
• Reduces the probability of a forced boundary at max size: 2x10^-3 for k=1, 10^-14 for k=4
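A hedged sketch of the idea, reusing WINDOW, BASE, MOD, and POW from the chunking sketch earlier. The full pattern P is modeled as "low MATCH_BITS bits of the hash are zero", and each relaxed level drops one bit of P; min/max sizes and all constants are illustrative, not the shipped values.

```python
MIN_SIZE, MAX_SIZE = 32 * 1024, 128 * 1024
MATCH_BITS = 16        # |P|: full-pattern width in bits
K = 4                  # relaxed suffix levels i = 1..k

def rc_boundaries(data: bytes):
    """Regression chunking: fall back to a relaxed match at max size."""
    last, n = 0, len(data)
    while last < n:
        fallback = [None] * (K + 1)   # latest offset matching |P|-i bits
        h, cut = 0, None
        end = min(last + MAX_SIZE, n)
        for i in range(last, end):
            if i - last >= WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD
            h = (h * BASE + data[i]) % MOD
            if i + 1 - last < MIN_SIZE:
                continue
            if h % (1 << MATCH_BITS) == 0:
                cut = i + 1           # full match of P: cut immediately
                break
            for j in range(1, K + 1):          # strongest level first
                if h % (1 << (MATCH_BITS - j)) == 0:
                    fallback[j] = i + 1
                    break
        if cut is None:
            # Max size reached without a full match: regress to the
            # strongest relaxed match seen, else force a boundary.
            cut = next((p for p in fallback[1:] if p is not None), end)
        yield cut
        last = cut
```

With k=1 the probability that no relaxed match occurs before max size is about 2x10^-3, dropping to about 10^-14 at k=4, so forced, content-independent boundaries all but disappear.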
Regression Chunking: Uniform Chunk Size Distribution
(Charts, GFS-US dataset: the peak at max chunk size is reduced, and the chunk size distribution becomes more uniform.)
Regression Chunking: Dedup Savings Improvement
(Chart: dedup savings improvement on the GFS-US dataset.)
The Chunk Indexing Problem
• Chunk metadata is too big to fit in RAM: 48 bytes per chunk (32-byte SHA hash + 16-byte location)
• 20TB of unique data at an average 64KB chunk size => ~312 million chunks x 48 bytes => ~15GB of RAM
• No locality in key space for chunk hash lookups
Our approach:
• Log-structured organization of chunk metadata on secondary storage
• Specialized hash table in RAM to index chunks in the log; cuts the RAM footprint by 8x
• Prefetch cache for chunk metadata
Chunk Index Architecture
• Chunk metadata organized in a log-structured manner on secondary storage (disk or flash), alongside the chunk store
• Chunk metadata indexed in RAM using a specialized, space-efficient hash table
• RAM write buffer for chunk mappings in the currently open session
• Prefetch cache in RAM for chunk metadata, exploiting the sequential predictability of chunk lookups
A sketch of this lookup path follows.
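A hedged sketch of the lookup path implied by this architecture: full 48-byte metadata records live in an append-only log, RAM keeps only a compact signature-to-offset table, and each lookup that goes to the log prefetches a run of neighboring entries. Names, sizes, and the cache policy are illustrative assumptions, not the shipped on-disk format.

```python
from collections import OrderedDict

ENTRY = 48        # per-chunk record: 32-byte hash + 16-byte location
PREFETCH = 64     # records pulled into the cache per log read

class ChunkIndex:
    def __init__(self, log_file):
        self.log = log_file          # append-only metadata log (disk/flash)
        self.table = {}              # 2-byte signature -> list of log offsets
        self.cache = OrderedDict()   # chunk hash -> location (prefetch cache)

    @staticmethod
    def _sig(chunk_hash: bytes) -> int:
        return int.from_bytes(chunk_hash[:2], "big")   # compact signature

    def insert(self, chunk_hash: bytes, location: bytes):
        off = self.log.seek(0, 2)                      # append to log tail
        self.log.write(chunk_hash + location.ljust(16, b"\0"))
        self.table.setdefault(self._sig(chunk_hash), []).append(off)

    def lookup(self, chunk_hash: bytes):
        if chunk_hash in self.cache:
            return self.cache[chunk_hash]   # prefetch-cache hit: no disk I/O
        for off in self.table.get(self._sig(chunk_hash), []):
            self.log.seek(off)
            run = self.log.read(ENTRY * PREFETCH)      # one read, many records
            for i in range(0, len(run) - ENTRY + 1, ENTRY):
                h, loc = run[i:i + 32], run[i + 32:i + 48]
                self.cache[h] = loc                    # warm cache with neighbors
                if len(self.cache) > 4096:
                    self.cache.popitem(last=False)     # evict oldest entries
            if chunk_hash in self.cache:
                return self.cache[chunk_hash]          # full hash confirmed
        return None                                    # genuinely new chunk
```

The per-chunk RAM cost drops from the 48-byte record to a few bytes of signature plus log offset, which is where the ~8x footprint reduction claimed above comes from.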
Data Partitioning and Reconciliation
Two-phase deduplication:
1. Divide the data into disjoint partitions, and perform deduplication within each partition
2. Reconcile duplicates across partitions
Data Partitioning: Divide-and-Conquer
• Partition data, e.g., by file type (extension)
• Deduplicate each partition separately
• Selective and/or delayed reconciliation of partitions
A sketch of the partitioning step follows.
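A minimal sketch of the partitioning step, keyed on file extension; the grouping key is the only assumption here.

```python
import os
from collections import defaultdict

def partition_by_type(paths):
    """Group files into disjoint partitions by extension."""
    parts = defaultdict(list)
    for path in paths:
        parts[os.path.splitext(path)[1].lower()].append(path)
    return parts
```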
Efficient Partitioning Strategies
How close are dedup savings within partitions to those of global dedup?
• Partitioning by file type: dedup savings almost as good as with global dedup
• Partitioning by file path: partition by directory sub-trees (each partition ≤ 10% of total bytes); not as effective as partitioning by file type for dedup savings
• Partitioning by system/volume
Conclusion: dedup is amenable to partitioned processing.
Data Partitioning and Reconciliation
Reconciliation algorithm: an iterative procedure that grows the set of reconciled partitions by considering some number of unreconciled partitions at a time.
Reconciliation of Data Partitions
• Compare the unreconciled partitions (indexed in RAM) with each partition in the reconciled set
• k = number of unreconciled partitions considered per iteration; provides a trade-off between memory usage and reconciliation speed (see the sketch below)
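A minimal sketch of the iteration, assuming partitions are available as iterables of chunk hashes (the data layout is illustrative): the k unreconciled partitions are indexed in RAM, then every reconciled partition is streamed against that index to find cross-partition duplicates.

```python
def reconcile(partitions, k):
    """Iteratively grow the reconciled set, k unreconciled partitions at a time."""
    reconciled = []        # partitions already folded into the global set
    duplicates = []        # (partition_number, chunk_hash) duplicate sightings
    for start in range(0, len(partitions), k):
        batch = partitions[start:start + k]
        # Index only the k unreconciled partitions in RAM: larger k means
        # more memory but fewer streaming passes (the trade-off above).
        index = {}
        for pnum, part in enumerate(batch, start):
            for h in part:
                index.setdefault(h, pnum)
        # Stream every reconciled partition against the in-RAM index.
        for part in reconciled:
            for h in part:
                if h in index:
                    duplicates.append((index[h], h))
        reconciled.extend(batch)
    return duplicates
```

For example, `reconcile([["a", "b"], ["b", "c"], ["c", "d"]], k=1)` reports the cross-partition duplicates "b" and "c" while only ever holding one partition's hashes in the index.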
System Overview
• Deduplication pipeline: post-processing; data chunking; index lookups and insertions; chunk store insertions
• Data path: dedup filter; chunk cache; file stub transactional update
• Maintenance jobs: garbage collection (in the chunk store); data scrubbing
Deduplication & On-disk Structures
Phase I - Identify the duplicate data:
1. Scan files according to policy
2. Chunk files intelligently to maximize recurring chunks
3. Identify common data chunks
Phase II - Optimize the target files:
4. Single-instance the data chunks, stored in file stream order
5. Create stream map metadata
6. Truncate the original file data stream
(Diagram: a stream map referencing data chunks in the chunk store; see the sketch below.)
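A minimal sketch of the relationship these phases produce, with illustrative names: after optimization, a file's data stream is replaced by a stream map, an ordered list of chunk references, and reads rehydrate the file by walking the map in stream order.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChunkRef:
    container_id: int   # chunk container holding the chunk
    offset: int         # chunk position inside the container
    length: int         # chunk size in bytes

@dataclass
class StreamMap:
    refs: List[ChunkRef]   # single-instanced chunks, in file stream order

def rehydrate(stream_map: StreamMap,
              read_chunk: Callable[[ChunkRef], bytes]) -> bytes:
    """Rebuild the original file contents by walking the stream map."""
    return b"".join(read_chunk(ref) for ref in stream_map.refs)
```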
Chunk Store File Layout for Efficient Deduplication & Rehydration
Container files:
• Scale: avoid the performance cost of a large number of small files
• Simplicity: self-contained volume
Optimization performance:
• Efficient append-only chunk insertion (no explicit ref-counts)
• Efficient hash index load via sequential read
Rehydration performance:
• Storing chunks in stream order reduces fragmentation
• Reliable chunk IDs reduce seeks during rehydration
Deduplicated File Write Path: Partial Recall
(Diagram: pre-write vs. post-write file layout, and the write flow to the optimized file.)
1. Hold the user write IO
2. Consult the deduped chunk view
3. Compute fixed recall boundaries
4. Read the underlying chunks
5. Write recall-aligned data and flush the file
6. Write the reparse buffer with the recall bitmap
7. Release the user write (no explicit file flush)
• Reduces latency for small writes to large files (e.g., the OS image patching scenario)
• Incremental dedup will later optimize the new data
• Recall granularity grows with file size
• GC cleans up unreferenced chunks (chunk "D" in the example)
A sketch of the recall-boundary bookkeeping follows.
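A hedged sketch of steps 3 and 6 above: widening a user write to fixed recall boundaries and marking the recalled units in the reparse-point recall bitmap. The granularity constant and bitmap layout are illustrative; in the real system, recall granularity grows with file size.

```python
RECALL_GRANULARITY = 64 * 1024     # illustrative fixed recall unit

def recall_range(offset: int, length: int):
    """Widen a user write to fixed recall boundaries (step 3)."""
    start = (offset // RECALL_GRANULARITY) * RECALL_GRANULARITY
    end = -(-(offset + length) // RECALL_GRANULARITY) * RECALL_GRANULARITY
    return start, end

def apply_partial_recall(bitmap: bytearray, offset: int, length: int):
    """Mark the recalled units in the reparse-point bitmap (step 6)."""
    start, end = recall_range(offset, length)
    for unit in range(start // RECALL_GRANULARITY, end // RECALL_GRANULARITY):
        bitmap[unit // 8] |= 1 << (unit % 8)
    return start, end               # region to fill from underlying chunks

# e.g. a 4KB write at offset 100KB recalls the 64KB unit starting at 64KB
bm = bytearray(16)
print(apply_partial_recall(bm, 100 * 1024, 4 * 1024))  # -> (65536, 131072)
```

Later reads consult the bitmap: units marked recalled are served from the file's own data stream, everything else still rehydrates from the chunk store.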
Crash Consistency State Diagram: Partial Recall
Data Scrubbing: Analysis
Assumptions: 1TB volume containing 250,000 files with a 35% dedup ratio.
Resulting dedup on-disk structures: 10 x 100MB stream map containers, 625 x 1GB chunk containers.
Risks: silent data corruption, latent sector errors.

Resulting user data loss per corrupted on-disk structure:
• Container header
  • Stream map container: 10,000s of files
  • Chunk container: 100s of files
• Container body
  • Stream map container: ≤ 1 file
  • Chunk container: most chunks 1-99 files; popular chunks 100-1,000s of files
• Other metadata (delete log, hotspot table): no impact

Invest in early detection, containment, resiliency, repair, and reporting.
Focus on where dedup adds significant risk over the file system (the high-impact entries above).
Data Scrubbing: Approach
Checksum & redundancy:
• Checksum all data and metadata
• Extra copies of container headers
• Mirror the most popular chunks
Detection, containment & resiliency:
• Checksum validation on data access, optimization, and jobs
• Log internal corruption events for the scrubbing job to handle
• Retry on checksum mismatch (transient errors)
• Fail the operation, or skip the container and continue if possible
• Never GC or delete data in a corrupted container
Scrubbing & repair:
• Regular validation of critical data and metadata
• Repair from a redundant copy if available: redundant header, hotspot chunks, Storage Spaces mirror, future chunks
Reporting:
• Informative, actionable reporting at the file level (no chunks, offsets, or containers)
• Report corruption repairs when able
• Advise restore of specific file(s) from backup
Performance: Deduplication Throughput
Quad-core Intel Xeon 2.27GHz machine, 12GB RAM; GFS-US dataset.
Four scenarios, from combinations of:
• Index type (pure in-memory vs. memory/disk)
• Data partitioning (off or on)
Results:
• Deduplication throughput of 25-30 MB/s (single-thread performance)
• Only about a 10% decrease from the baseline to the least-memory case
• Three orders of magnitude higher than typical data ingestion rates of 0.03 MB/s (Leung et al.)
Performance: Resource Usage
• RAM frugality: index memory usage reduced 24x vs. baseline
• Low CPU utilization: 30-40% of one core; leaves enough room for serving the primary workload on modern multi-core file servers
• Low disk usage: median disk queue depth is zero in all cases; at the 75th percentile it increases by 2-3, the impact of index lookups going to disk and/or reconciliation
Performance: Data Access
Scenarios:
1. 2% to 3% on office file access
2. 0.5-1.5x on VHD copy
3. 1.3x on VHD update
(Chart: data access performance for the three scenarios.)
Dedup Among Top Windows Server 2012 Features Being Talked About
"... game-changers like deduplication ..."
"Windows Server 8 includes storage deduplication. Microsoft demonstrations of the technology reduced the disk footprint of a VDI server by some 96 percent. Microsoft's approach is also different from that of a virtualization system like VMware, insofar as it's not restricted to virtualization workloads. File servers too can benefit from the technology. The result should be both more robust and more effective ..."
Coverage: "Top 10: New Features in Windows Server 8"; "Why Windows 8 server is a game-changer"; "Windows Server 8: built for the cloud, built for virtualization"
Summary
Primary data deduplication in Windows Server 2012:
• Extends the problem frontier in multiple dimensions w.r.t. backup data deduplication
• Design decisions driven by large-scale data analysis: 7TB of data across 15 globally distributed servers in a large enterprise
• Primary-workload friendly
• Deduplication processing resource usage scales slowly with data size: data chunking and compression, chunk indexing, data partitioning and reconciliation
• Robust and reliable on commodity hardware: detect, contain, repair, and report user and metadata corruptions
Appendix Slides
Optimized Data View
Before dedup: a data volume with 10TB of data.
After dedup, with 80% (5:1) deduplication savings: optimized files (stubs), the chunk store, and non-optimized files.
80% savings also leads to smaller/faster backup/restore, archive, and migration.
What do you do with the savings (8TB of space)?
• Add more data
• Buy less hardware
• Increase availability by adding redundancy (RAID or Windows Storage Spaces)
Core Deduplication Architecture
1. Pipeline: coordinates deduplication & compression of files in policy/scope
2. Hash Index: indexes strong signatures for each unique chunk in the dataset
3. Chunk Store: stores chunks & stream map metadata
4. Mini-Filter: processes all IO directed to optimized files
Low RAM Usage: Cuckoo Hashing
• Achieves high hash table load factors while keeping lookup times fast
• Collisions resolved using cuckoo hashing: a key can be in one of K candidate positions
• Later-inserted keys can relocate earlier keys to their other candidate positions
• K candidate positions for key x are obtained using K hash functions h1(x), ..., hK(x); in practice, two hash functions can simulate K via hi(x) = g1(x) + i*g2(x)
• The system uses K=16 and targets a 90% hash table load factor
A sketch of the insertion path follows.
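A minimal sketch of that insertion path: K candidate slots derived from two hash functions via hi(x) = g1(x) + i*g2(x), with displacement of a resident key when all candidates are full. The table size, kick limit, and hash choice are illustrative assumptions.

```python
import hashlib, random

K = 16                 # candidate positions per key (as in the system)
SIZE = 1 << 20         # table slots (illustrative)

def _g(x: bytes, salt: bytes) -> int:
    return int.from_bytes(hashlib.sha1(salt + x).digest()[:8], "big")

def candidates(x: bytes):
    g1, g2 = _g(x, b"1"), _g(x, b"2") | 1   # odd g2: distinct slots mod 2^n
    return [(g1 + i * g2) % SIZE for i in range(K)]

def insert(table: list, x: bytes, max_kicks: int = 500) -> bool:
    for _ in range(max_kicks):
        slots = candidates(x)
        for s in slots:
            if table[s] is None:       # an empty candidate slot: done
                table[s] = x
                return True
        # All K candidates occupied: displace a resident key to one of
        # its own alternates and retry the insert with the evicted key.
        s = random.choice(slots)
        table[s], x = x, table[s]
    return False                        # load too high; caller would resize

table = [None] * SIZE
insert(table, b"\x8a" * 20)            # e.g. a 20-byte SHA-1 chunk hash
```

With K=16 candidates per key, this scheme sustains the ~90% load factor the slide targets while a lookup probes at most K slots.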
Low RAM Usage: Compact Key Signatures
• Compact key signatures stored in the hash table: a 2-byte key signature (vs. the 20-byte SHA-1 hash)
• A key x stored at its candidate position i derives its signature from hi(x)
• False flash read probability < 0.01%
• Total of 6-10 bytes per entry (including a 4-8 byte flash pointer)
• Related work on key-value stores on flash media: MicroHash, FlashDB, FAWN, BufferHash
Performance: Parallelizability
Parallel processing across datasets and CPU cores/disks (disk-diverse datasets):
• One session per volume in the current implementation; one CPU core, and one process and thread, allocated per dedup session
• No cross-dependencies between deduplication sessions (each session uses a separate index)
• Aggregate dedup throughput scales as expected with the number of cores (provided sufficient RAM is available)
Workload scheduler:
• Assigns jobs (deduplication, GC, scrubbing) to CPU cores
• Allocates memory per job
• Tracks job activity (cancels jobs on memory or CPU pressure)
Garbage Collection
Cleans up unreferenced chunks arising from:
• User-deleted data
• A crash after chunk store insertion but before the metadata update
Identifying zero-ref-count chunks (see the sketch below):
• Create a Bloom filter (BF) over all referenced chunks
• Shallow GC: check which deletion-candidate chunks exist in the BF
• Deep GC: build logical maps per chunk store container
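A hedged sketch of the shallow GC pass: build a Bloom filter over every chunk referenced by live stream maps, then test deletion candidates against it. A Bloom filter has no false negatives, so a "not present" answer proves a chunk is unreferenced and safe to reclaim; a false positive merely leaves some garbage for deep GC to find. Filter sizing is illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits: int = 1 << 24, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: bytes):
        d = hashlib.sha256(item).digest()
        for i in range(self.hashes):          # 4 positions from one digest
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def shallow_gc(referenced_chunks, candidate_chunks):
    """Return the candidates that are provably unreferenced."""
    bf = BloomFilter()
    for h in referenced_chunks:       # one pass over live stream maps
        bf.add(h)
    # No false negatives: "not in bf" guarantees the chunk is reclaimable.
    return [h for h in candidate_chunks if h not in bf]
```

This keeps shallow GC cheap (one streaming pass plus a compact in-RAM filter), deferring the expensive per-container logical maps to the deep GC pass.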