Dutch T. Meyer William Bolosky - USENIX · 2019. 2. 25. · A study of practical deduplication...
A study of practical deduplication
Dutch T. Meyer
University of British Columbia
Microsoft Research Intern
William Bolosky
Microsoft Research
Why study deduplication?
$0.046 per GB
9ms per seek
When do we exploit duplicates?
It Depends.
• How much can you get back from deduping?
• How does fragmenting files affect performance?
• How often will you access the data?
Outline
• Intro
• Methodology
• “There’s more here than dedup” teaser
(intermission)
• Deduplication Background
• Deduplication Analysis
• Conclusion
Methodology
MD5(name) | Metadata | MD5(data)
Once per week for 4 weeks.
~875 file systems
~40TB
~200M Files
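The record layout on this slide (a hash of the name, some metadata, a hash of the data) can be illustrated with a short sketch. This is not the authors' scanning tool: the function name `scan_tree`, the choice of metadata fields, and the 1 MiB read size are all assumptions made for illustration.

```python
import hashlib
import os

def scan_tree(root):
    """Yield one record per regular file under root (hypothetical layout)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            data_hash = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB blocks so large files are never held in memory.
                for block in iter(lambda: f.read(1 << 20), b""):
                    data_hash.update(block)
            yield {
                "name_md5": hashlib.md5(os.fsencode(name)).hexdigest(),
                "size": st.st_size,      # assumed metadata field
                "mtime": st.st_mtime,    # assumed metadata field
                "data_md5": data_hash.hexdigest(),
            }
```

Hashing the name rather than storing it keeps file names out of the collected records while still letting identical names be matched across machines.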
There’s more here than dedup!
• We update and extend filesystem metadata findings from 2000 and 2004
• File system complexity is growing
• Read the paper to answer questions like:
Are my files bigger now than they used to be?
Teaser: Histogram of file size
[Figure: histogram of file size (bytes) in power-of-two bins, x-axis from 0 to 128M, y-axis from 0% to 14%, with series for 2009, 2004, and 2000. Callouts: "4K" and "Since 1981!"]
There’s more here than dedup!
How fragmented are my files?
Teaser: Layout and Organization
• High linearity: only 4% of files fragmented in practice
– Most Windows machines defrag weekly
• One quarter of fragmented files have at least 170 fragments
Intermission
• Intro
• Methodology
• “There’s more here than dedup” teaser
(intermission)
• Deduplication Background
• Deduplication Analysis
• Conclusion
Dedup Background
foo: 01101010….. ….110010101
bar: 01101010….. ….110010101
Whole file Deduplication
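As a rough sketch of whole-file deduplication: two files with identical contents (like foo and bar above) are stored once. The function name `whole_file_dedup`, the in-memory `dict` input, and the use of MD5 as the content hash are assumptions for illustration, not a description of any particular system.

```python
import hashlib

def whole_file_dedup(files):
    """files: dict mapping name -> bytes. Returns (total_bytes, stored_bytes)."""
    seen = set()          # content hashes already stored
    total = stored = 0
    for data in files.values():
        total += len(data)
        digest = hashlib.md5(data).digest()
        if digest not in seen:
            # First time this exact content appears: it must be stored.
            seen.add(digest)
            stored += len(data)
    return total, stored
```

A single differing byte anywhere in a file changes its hash, so the whole file is stored again. That is the limitation the chunk-based methods on the following slides address.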
Dedup Background
foo: 01101010….. ….110010101
bar: 01101010….. ….110010101
Fixed Chunk Deduplication
[Figure: both files divided into fixed-size chunks. Chunk labels on the slide: 01101010….., 01101010….., ….110010101, ….1100101011]
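A minimal sketch of fixed-chunk deduplication, assuming files are cut at fixed byte offsets and each chunk is identified by its MD5 hash. The name `fixed_chunk_dedup` and the default chunk size are illustrative choices only.

```python
import hashlib

def fixed_chunk_dedup(files, chunk_size=8192):
    """files: dict mapping name -> bytes. Returns (total_bytes, stored_bytes)."""
    seen = set()          # hashes of chunks already stored
    total = stored = 0
    for data in files.values():
        # Cut at offsets 0, chunk_size, 2*chunk_size, ...; the last chunk may be short.
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            total += len(chunk)
            digest = hashlib.md5(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total, stored
```

Because boundaries sit at fixed offsets, inserting a single byte at the front of a file shifts every later chunk, and none of the shifted chunks match the originals.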
Dedup Background
foo: 01101010….. ….110010101
bar: 01101010….. ….110010101
Rabin Fingerprinting
[Figure: chunk boundaries chosen from the file contents. Bit strings on the slide: 1, 110101101010010100, 101101010…..]
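The idea behind content-defined chunking can be sketched as follows. Note the substitution: this uses a simple polynomial rolling hash as a stand-in for true Rabin fingerprints (which use polynomial arithmetic over GF(2)). The constants `BASE` and `MOD`, the function name, the window size, and the mask are all assumptions for illustration.

```python
BASE = 257
MOD = (1 << 31) - 1

def content_defined_chunks(data, window=16, mask=0x3F):
    """Split data wherever the hash of the last `window` bytes has all of
    the bits selected by `mask` equal to zero."""
    top = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            # Remove the contribution of the byte that slides out of the window.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Only cut once the window is full, so the decision depends on content alone.
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])    # trailing bytes after the last boundary
    return chunks
```

Since each boundary depends only on the preceding `window` bytes, inserting a byte at the front of a file moves the boundaries along with the content: every chunk after the first boundary is unchanged and can still be deduplicated. A smaller `mask` yields smaller average chunks, which corresponds to the "Average Chunk Size" parameter.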
The Deduplication Space
| Algorithm | Parameters | Cost | Deduplication effectiveness |
|---|---|---|---|
| Whole-file | | Low | Lowest |
| Fixed Chunk | Chunk Size | Seeks, CPU, Complexity | Middle |
| Rabin fingerprints | Average Chunk Size | Seeks, More CPU, More Complexity | Highest |
What is the relative deduplication rate of the algorithms?
Dedup by method and chunk size
[Figure: space deduplicated (0% to 100%) versus chunk size (64K, 32K, 16K, 8K) for Whole File, Fixed-Chunk, and Rabin]
What if I was doing full weekly backups?
Backup dedup over 4 weeks
[Figure: deduplicated space (0% to 90%) for Whole File, Whole File + Sparse, and 8K Rabin]
How does the number of filesystems influence deduplication?
Dedup by filesystem count
[Figure: space deduplicated (0% to 100%) versus deduplication domain size in file systems (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, Whole Set) for Whole File, 64KB Fixed, 8KB Fixed, 64KB Rabin, and 8KB Rabin]
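One way to read "deduplication domain size" is that duplicates are only detected among file systems pooled into the same group. The sketch below computes the fraction of space deduplicated under that reading. Grouping consecutive file systems is an assumption made here for simplicity (the study's grouping method is not given on the slide), and the function name and input shape are hypothetical.

```python
def dedup_by_domain_size(filesystems, domain_size):
    """filesystems: list of lists of (chunk_hash, length) pairs.
    Returns the fraction of bytes saved when each group of `domain_size`
    consecutive file systems is deduplicated independently."""
    total = stored = 0
    for g in range(0, len(filesystems), domain_size):
        seen = set()                      # chunk hashes seen within this domain only
        for fs in filesystems[g:g + domain_size]:
            for chunk_hash, length in fs:
                total += length
                if chunk_hash not in seen:
                    seen.add(chunk_hash)
                    stored += length
    return 1 - stored / total if total else 0.0
```

With two file systems that share one of their two equal-sized chunks, a domain size of 1 saves nothing, while a domain size of 2 saves a quarter of the space.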
So what is filling up all this space?
Bytes by containing file size
[Figure: percentage of total bytes (0% to 12%) versus containing file size (bytes) in power-of-2 bins from 1K to 256G, with series for 2000, 2004, and 2009]
What types of files take up disk space?
Disk consumption by file type
[Figure: disk consumption (0% to 60%) by file extension for 2000, 2004, and 2009. Extensions labeled on the slide: dll, pdb, vhd, exe, lib, pst, pch, wma, mp3, cab, chm, iso, and ø]
Which of these types deduplicate well?
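Whole-file duplicates
| Extension | % of Duplicate Space | Mean File Size (bytes) | % of Total Space |
|---|---|---|---|
| dll | 20% | 521K | 10% |
| lib | 11% | 1080K | 7% |
| pdb | 11% | 2M | 7% |
| <none> | 7% | 277K | 13% |
| exe | 6% | 572K | 4% |
| cab | 4% | 4M | 2% |
| msp | 3% | 15M | 2% |
| msi | 3% | 5M | 1% |
| iso | 2% | 436M | 2% |
| <a guid> | 1% | 604K | <1% |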
What files make up the 20% difference between whole-file dedup with sparse-file support and more aggressive deduplication?
Where does fine granularity help?
[Figure: percentage of difference vs. whole file + sparse (0% to 70%) by file extension for 8K Fixed and 8K Rabin. Extensions labeled on the slide: vhd, pch, lib, dll, obj, pdb, wma, iso, pst, ø, avhd, mo3, wim]
Last plea to read the whole paper
• ~4x more results in paper!
• Real world filesystem analysis is hard
– Eight machine-months of query processing
– Requires careful simplifying assumptions
– Requires heavy optimization
Conclusion
• The benefit of fine-grained dedup is < 20%
– Potentially just a fraction of that.
• Fragmentation is a manageable problem
• Read the paper for more metadata results
We’re releasing this dataset