Open Source Data Deduplication

Nick Webb | [email protected] | www.redwireservices.com | @RedWireServices | (206) 829-8621 | Last updated 8/10/2011

Description

Data deduplication is a hot topic in storage and saves significant disk space for many environments, with some trade-offs. We'll discuss what deduplication is and where the Open Source solutions stand versus commercial offerings. The presentation leans toward the practical: where attendees can use it in their real-world projects (what works, what doesn't, whether you should use it in production, etc.).

Transcript of Open Source Data Deduplication


Open Source Data Deduplication

Nick Webb | nickw@redwireservices.com | www.redwireservices.com | @RedWireServices

(206) 829-8621

Last updated 8/10/2011

Introduction

– What is deduplication? Different kinds
– Why do you want it?
– How does it work?
– Advantages and drawbacks
– Commercial implementations
– Open Source implementations: performance, reliability, and stability of each

What is Data Deduplication

Wikipedia:

"Data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is able to reduce the required storage capacity since only the unique data is stored."

Depending on the type of deduplication, redundant files may be reduced, or even portions of files or other data that are similar can be removed.

Why Dedupe

– Save disk space and money (fewer disks)
– Fewer disks = less power, cooling, and space
– Improve write performance (of duplicate data)
– Be efficient – don't re-copy or store previously stored data

Where does it Work Well?

Secondary storage:
– Backups/archives
– Online backups with limited bandwidth/replication
– Save disk space – additional full backups take little space

Virtual machines (primary & secondary)

File shares

Not a Fit

– Random data
– Video, pictures, music
– Encrypted files – many vendors dedupe, then encrypt

Types

– Source / Target
– Global
– Fixed / Sliding Block (see the sketch below)
– File Based (SIS)
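Of these, the fixed vs. sliding block distinction matters most for how well shifted or edited data dedupes. A minimal sketch of both in Python (assumed, illustrative code, not from the talk; the rolling condition is a toy stand-in for a real rolling hash such as Rabin fingerprints):

    def fixed_chunks(data, size=8 * 1024):
        """Fixed blocks: inserting one byte near the front shifts every later boundary."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def sliding_chunks(data, window=48, mask=0x7FF, min_size=2 * 1024, max_size=64 * 1024):
        """Content-defined (sliding) blocks: boundaries follow the data itself,
        so a small insert only disturbs the chunks around it."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling += byte
            if i - start >= window:
                rolling -= data[i - window]        # keep a sum over the last `window` bytes
            size = i - start + 1
            if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])   # data-dependent boundary
                start, rolling = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks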

Drawbacks

– Slow writes, slower reads
– High CPU/memory utilization (a dedicated server is a must)
– Increases risk of data loss / corruption
– Collision risk: 1.3x10^-49 chance per PB (256-bit hash & 8KB blocks)
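For context, the usual estimate behind a figure like that is the birthday bound over the number of stored blocks; a rough sketch with the slide's parameters (the slide's exact constant may come from a slightly different model):

    p \approx \frac{n(n-1)}{2 \cdot 2^{b}}, \qquad n = \frac{1\,\text{PB}}{8\,\text{KB}} \approx 1.2 \times 10^{11} \text{ blocks}, \quad b = 256

Either way the result is vanishingly small next to ordinary hardware corruption rates.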

How Does it Work

Without Dedupe

With Dedupe
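The original slides illustrate the with/without comparison with diagrams that did not survive this transcript. As a rough stand-in, here is a minimal sketch of the idea in Python: fixed 8KB blocks indexed by their SHA-256 hash (names are illustrative, not from any of the tools discussed):

    import hashlib

    BLOCK_SIZE = 8 * 1024  # 8KB fixed blocks, matching the collision example above

    class ToyDedupeStore:
        """Content-addressed block store: each unique block is kept exactly once."""
        def __init__(self):
            self.blocks = {}   # sha256 digest -> block bytes (physical storage)
            self.files = {}    # file name -> list of digests (logical view / references)

        def write(self, name, data):
            refs = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(digest, block)   # a duplicate block is not stored again
                refs.append(digest)
            self.files[name] = refs

        def read(self, name):
            return b"".join(self.blocks[d] for d in self.files[name])

Without dedupe every written block lands on disk; with dedupe a repeated block only adds a reference, which is the whole picture in one loop.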

Block Reclamation

In general, blocks are not removed/freed when a file is removed.

We must periodically check blocks for references; a block with no references can be deleted, freeing allocated space.

The process can be expensive, so it is scheduled during off-peak hours.
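Continuing the toy store sketched earlier, a reclamation pass is essentially mark-and-sweep over block references (again illustrative; SDFS and LessFS each have their own mechanisms and schedules):

    def reclaim(store):
        """Free blocks that no file references any more; typically run off-peak."""
        live = {digest for refs in store.files.values() for digest in refs}
        for digest in list(store.blocks):
            if digest not in live:
                del store.blocks[digest]   # physical space is reclaimed here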

Commercial Implementations

Just about every backup vendor: Symantec, CommVault

Cloud: Asigra, Barracuda, Dropbox (global), JungleDisk, Mozy

NAS/SAN/backup targets: NEC HydraStor, DataDomain/EMC Avamar, Quantum, NetApp

Open Source Implementations

FUSE based: Lessfs, SDFS (OpenDedupe)

Others: ZFS, btrfs (off-line only)

Limited (file-based SIS): BackupPC (reliable), rdiff-backup

How Good is it

Many see 10-20x deduplication, meaning 10-20 times more logical object storage than physical.

Especially true in backup or virtual environments.
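In other words (standard definition, not specific to any product):

    \text{dedupe ratio} = \frac{\text{logical bytes written}}{\text{physical bytes stored}}, \qquad \text{space saved} = 1 - \frac{1}{\text{ratio}}

so a 10x ratio is roughly 90% savings and 20x is roughly 95%.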

SDFS / OpenDedupe – www.opendedup.org

– Java 7 based, platform agnostic
– Uses FUSE
– S3 storage support
– Snapshots
– Inline or batch mode deduplication
– Supposedly fast (290MB/s+ on great HW)
– Support for global/clustered dedupe
– Probably the most mature OSS dedupe (IMHO)

SDFS

SDFS Install & Go

Install Java, then:

rpm -Uvh SDFS-1.0.7-2.x86_64.rpm

sudo mkfs.sdfs --volume-name=sdfs_128k \
  --io-max-file-write-buffers=32 \
  --volume-capacity=550GB \
  --io-chunk-size=128 \
  --chunk-store-data-location=/mnt/data

sudo modprobe fuse

sudo mount.sdfs -v sdfs_128k -m /mnt/dedupe

SDFS

Pro:
– Works when configured properly
– Appears to be multithreaded

Con:
– Slow; resource intensive (CPU/memory)
– Fragile; easy to mess up options, leading to crashes with little user feedback
– Standard POSIX utilities do not show accurate data (e.g. df); must use getfattr -d <mount point> and calculate bytes → GB/TB and free space yourself
– Slow with the 4k blocks recommended for VMs
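For the manual conversion step, a generic helper like this works (the actual attribute names come from the getfattr output and are not reproduced here):

    def human(n_bytes):
        """Render a raw byte count (e.g. a value reported via getfattr) as KB/MB/GB/TB."""
        for unit in ("B", "KB", "MB", "GB", "TB"):
            if n_bytes < 1024 or unit == "TB":
                return f"{n_bytes:.2f} {unit}"
            n_bytes /= 1024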

LessFS – www.lessfs.com

– Written in C = less CPU overhead
– Have to build it yourself (./configure && make && make install)
– Has replication, encryption
– Uses FUSE

LessFS Install

wget http://…/lessfs-1.4.2.tar.gz
tar zxvf *.tar.gz
wget http://…/db-4.8.30.tar.gz
yum install buildstuff…
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag

LessFS Go

sudo vi /etc/lessfs.cfg
  BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta
  META_PATH=/mnt/meta/mta
  BLKSIZE=4096        # only 4k supported on CentOS 5
  ENCRYPT_DATA=on
  ENCRYPT_META=off

mklessfs -c /etc/lessfs.cfg
lessfs /etc/lessfs.cfg /mnt/dedupe

LessFS

Pro:
– Does inline compression by default as well
– Reasonable VM compression with 128k blocks

Con:
– Fragile
– Stats FS info is hard to see (per-file accounting, no totals)
– Kernel >= 2.6.26 required for blocks > 4k (RHEL 6 only)
– Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS: Tried it, and empirically it was a drag, but I have no hard data (got roughly 3x dedupe with identical full backups of VMs).

At least it's stable…

Kick the Tires

Test data set: ~330GB of data
– 22GB of documents, pictures, music
– Virtual machines:
  – 220GB Windows 2003 Server with SQL data
  – 2003 AD DC, ~60GB
  – 2003 Server, ~8GB
  – Two OpenSolaris VMs, 15 & 27GB
  – 3GB Windows 2000 VM
  – 15GB XP Pro VM

Kick the Tires

Test environment: AWS High-CPU Extra Large instance, ~7GB of RAM, eight cores at ~2.5GHz each, ext4

Compression Performance

First round (all "unique" data).

If another copy were put in (like another full backup), we should expect ~100% reduction for that non-unique data (1x dedupe per run).

FS              | Home Data | Home Reduction | VM Data | VM Reduction | Combined | Total Reduction | MB/s
SDFS 4k         | 21GB      | 4.50%          | 109GB   | 64%          | 128GB    | 61%             | 16
lessfs 4k (est) | 24GB      | -9%            | N/A     | 51%          | N/A      | 50%             | 4
SDFS 128k       | 21GB      | 4.50%          | 255GB   | 16%          | 276GB    | 15%             | 40
lessfs 128k     | 21GB      | 4.50%          | 130GB   | 57%          | 183GB    | 44%             | 24
tar.gz --fast   | 21GB      | 4.50%          | 178GB   | 41%          | 199GB    | 39%             | 35
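The "1x dedupe per run" expectation above is easy to demonstrate with the toy store sketched earlier: a second identical full adds references but no new physical blocks (illustrative only):

    import random

    random.seed(0)
    full = bytes(random.getrandbits(8) for _ in range(100 * BLOCK_SIZE))  # stand-in for a full backup image

    store = ToyDedupeStore()
    store.write("full-monday", full)
    unique_before = len(store.blocks)            # 100 unique physical blocks
    store.write("full-tuesday", full)            # identical second full
    assert len(store.blocks) == unique_before    # no new physical blocks: ~100% reduction for the copy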

Compression ("Unique" Data)

Chart: Home, VM, and Total reduction percentages for SDFS 4k, lessfs 4k (est), SDFS 128k, lessfs 128k, and tar.gz --fast (the same figures as the table above; y-axis -10% to 70%).

Write Performance (don't trust this)

Chart: write throughput (MB/s, 0-40 scale) for raw, SDFS 4k, lessfs 4k, SDFS 128k, lessfs 128k, and tar.gz --fast.

Kick the Tires Part 2

Test data set: two ~204GB full backup archives from a popular commercial vendor.

Test environment: VirtualBox VM, 2GB RAM, 2 cores, 2x 7200RPM SATA drives (meta & data separated for LessFS); physical CPU: quad-core Xeon.

Write Performance

Chart: write throughput (MB/s, 0-40 scale) for raw, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, and LessFS 128k re-write.

Compression (Backup Data)

Chart: reduction with 128K blocks – SDFS: 7%, LessFS: 34%, raw: 0%.

Load (SDFS 128k)

Chart: system load during the SDFS 128k run.

Open Source Dedupe

Pro:
– Free
– Can be stable if well managed

Con:
– Not in repos yet
– Efforts behind them seem very limited (one dev each)
– No/poor documentation

The Future

Eventual commodity: btrfs
– Dedupe planned (off-line only)

Conclusion / Recommendations

Dedupe is great if it works and it meets your performance and storage requirements.

OSS dedupe has a way to go; SDFS/OpenDedupe is the best OSS option right now. JungleDisk is good and cheap, but not OSS.

About Red Wire Services

If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project.

Learn more at www.RedWireServices.com

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including:

– Preserving Your Digital Legacy
– Getting Started with Your Small Business Disaster Recovery Plan
– Archive Storage for SMBs

If you are interested in having Nick speak to your group, please call (206) 829-8621 or email info@redwireservices.com.

Page 2: Open Source Data Deduplication

Introduction

What is Deduplication Different kinds Why do you want it How does it work Advantages Drawbacks Commercial Implementations Open Source implementations performance

reliability and stability of each

What is Data Deduplication

Wikipedia

data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data typically to improve storage utilization In the deduplication process duplicate data is deleted leaving only one copy of the data to be stored along with references to the unique copy of data Deduplication is able to reduce the required storage capacity since only the unique data is stored

Depending on the type of deduplication redundant files may be reduced or even portions of files or other data that are similar can also be removed

Why Dedupe

Save disk space and money (less disks) Less disks = less power cooling and space Improve write performance (of duplicate data) Be efficient ndash donrsquot re-copy or store previously

stored data

Where does it Work Well Secondary Storage

BackupsArchives Online backups with limited

bandwidthreplication Save disk space ndash additional

full backups take little space

Virtual Machines (Primary amp Secondary)

File Shares

Not a Fit

Random data Video Pictures Music Encrypted files

ndash many vendors dedupe then encrypt

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 3: Open Source Data Deduplication

What is Data Deduplication

Wikipedia

data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data typically to improve storage utilization In the deduplication process duplicate data is deleted leaving only one copy of the data to be stored along with references to the unique copy of data Deduplication is able to reduce the required storage capacity since only the unique data is stored

Depending on the type of deduplication redundant files may be reduced or even portions of files or other data that are similar can also be removed

Why Dedupe

Save disk space and money (less disks) Less disks = less power cooling and space Improve write performance (of duplicate data) Be efficient ndash donrsquot re-copy or store previously

stored data

Where does it Work Well Secondary Storage

BackupsArchives Online backups with limited

bandwidthreplication Save disk space ndash additional

full backups take little space

Virtual Machines (Primary amp Secondary)

File Shares

Not a Fit

Random data Video Pictures Music Encrypted files

ndash many vendors dedupe then encrypt

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 4: Open Source Data Deduplication

Why Dedupe

Save disk space and money (less disks) Less disks = less power cooling and space Improve write performance (of duplicate data) Be efficient ndash donrsquot re-copy or store previously

stored data

Where does it Work Well Secondary Storage

BackupsArchives Online backups with limited

bandwidthreplication Save disk space ndash additional

full backups take little space

Virtual Machines (Primary amp Secondary)

File Shares

Not a Fit

Random data Video Pictures Music Encrypted files

ndash many vendors dedupe then encrypt

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 5: Open Source Data Deduplication

Where does it Work Well Secondary Storage

BackupsArchives Online backups with limited

bandwidthreplication Save disk space ndash additional

full backups take little space

Virtual Machines (Primary amp Secondary)

File Shares

Not a Fit

Random data Video Pictures Music Encrypted files

ndash many vendors dedupe then encrypt

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 6: Open Source Data Deduplication

Not a Fit

Random data Video Pictures Music Encrypted files

ndash many vendors dedupe then encrypt

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 7: Open Source Data Deduplication

Types

Source Target Global FixedSliding Block File Based (SIS)

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 8: Open Source Data Deduplication

Drawbacks

Slow writes slower reads High CPUmemory utilization (dedicated server

is a must) Increases data loss risk corruption

Collision risk of 13x10^-49 chance per PB (256 bit hash amp 8KB Blocks)

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all "unique" data).

If another copy were put in (like another full backup), we should expect 100% reduction for that non-unique data (1x dedupe per run).

FS               Home Data   Home Reduction   VM Data   VM Reduction   Combined   Total Reduction   MBps
SDFS 4k          21 GB       4.5%             109 GB    64%            128 GB     61%               16
lessfs 4k (est)  24 GB       -9%              N/A       51%            N/A        50%               4
SDFS 128k        21 GB       4.5%             255 GB    16%            276 GB     15%               40
lessfs 128k      21 GB       4.5%             130 GB    57%            183 GB     44%               24
tar.gz --fast    21 GB       4.5%             178 GB    41%            199 GB     39%               35
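
As a sanity check on how the percentages are derived (using the reduction definition above): the SDFS 4k run stores about 128 GB combined out of the ~330 GB source set, and 1 - 128/330 ≈ 0.61, which is the 61% total reduction shown.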

Compression ("Unique" Data)

[Bar chart of Home, VM, and Total Reduction (%) for SDFS 4k, lessfs 4k (est), SDFS 128k, lessfs 128k, and tar.gz --fast; same reduction figures as the table above.]

Write Performance (don't trust this)

[Bar chart of write throughput in MBps, 0-40 scale, for raw, SDFS 4k, lessfs 4k, SDFS 128k, lessfs 128k, and tar.gz --fast.]

Kick the Tires Part 2

Test data set: two ~204 GB full backup archives from a popular commercial vendor.

Test environment: VirtualBox VM, 2 GB RAM, 2 cores, 2x 7200 RPM SATA drives (meta & data separated for LessFS); physical CPU: quad-core Xeon.

Write Performance

[Bar chart of write and re-write throughput in MBps, 0-40 scale, for raw, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, and LessFS 128k re-write.]

Compression (Backup Data)

[Bar chart: reduction with 128K blocks on the backup-archive data set: SDFS 7%, LessFS 34%, raw 0%.]

Load (SDFS 128k)

Open Source Dedupe

Pro: Free. Can be stable if well managed.

Con: Not in repos yet. Efforts behind them seem very limited (one developer each). No/poor documentation.

The Future

Eventual commodity: btrfs (dedupe planned, off-line only).

Conclusion / Recommendations

Dedupe is great if it works and meets your performance and storage requirements.

OSS dedupe has a way to go; SDFS/OpenDedupe is the best OSS option right now. JungleDisk is good and cheap, but not OSS.

About Red Wire Services

If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project.

Learn more at www.RedWireServices.com

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including:

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group, please call (206) 829-8621 or email [email protected]

Page 9: Open Source Data Deduplication

How Does it Work

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 10: Open Source Data Deduplication

Without Dedupe

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 11: Open Source Data Deduplication

With Dedupe

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 12: Open Source Data Deduplication

Block Reclamation

In general blocks are not removedfreed when a file is removed

We must periodically check blocks for references a block with no reference can be deleted freeing allocated space

Process can be expensive scheduled during off-peak

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 13: Open Source Data Deduplication

Commercial Implementations

Just about every backup vendor Symantec CommVault Cloud Asigra Baracuda Dropbox (global) JungleDisk

Mozy

NASSANBackup Targets NEC HydraStor DataDomainEMC Avamar Quantum NetApp

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 14: Open Source Data Deduplication

Open Source Implementations Fuse Based

Lessfs SDFS (OpenDedupe)

Others ZFS btrfs ( Off-line only)

Limited (file based SIS) BackupPC (reliable) Rdiff-backup

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 15: Open Source Data Deduplication

How Good is it

Many see 10-20x deduplicaiton meaning 10-20 times more logical object storage than physical

Especially true in backup or virtual environments

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 16: Open Source Data Deduplication

SDFS OpenDedupewwwopendeduporg

Java 7 Based platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great HW) Support for globalclustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 17: Open Source Data Deduplication

SDFS

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires, Part 2

Test data set: two ~204GB full backup archives from a popular commercial vendor.

Test Environment: VirtualBox VM, 2GB RAM, 2 cores, 2x 7200RPM SATA drives (meta & data separated for LessFS); physical CPU: Quad Core Xeon.

Write Performance

[Bar chart: write and re-write throughput in MB/s (0-40 scale) for raw, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, and LessFS 128k re-write.]

Compression (Backup Data)

[Bar chart: reduction with 128K blocks: SDFS ~7%, LessFS ~34%, raw 0%.]

Load (SDFS 128k)

[System-load graph during the SDFS 128k run.]

Open Source Dedupe

Pro: Free. Can be stable if well managed.

Con: Not in repos yet. Efforts behind them seem very limited (1 dev each). No/poor documentation.

The Future

Eventual commodity: btrfs, with dedupe planned (off-line only).

Conclusion / Recommendations

Dedupe is great if it works and it meets your performance and storage requirements.

OSS dedupe has a way to go. SDFS/OpenDedupe is the best OSS option right now; JungleDisk is good and cheap, but not OSS.

About Red Wire Services

If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project.

Learn more at www.RedWireServices.com

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including:

– Preserving Your Digital Legacy
– Getting Started with your Small Business Disaster Recovery Plan
– Archive Storage for SMBs

If interested in having Nick speak to your group, please call (206) 829-8621 or email [email protected]

Page 18: Open Source Data Deduplication

SDFS Install amp Go

Install Java rpm ndashUvh SDFS-107-2x86_64rpm

sudo mkfssdfs --volume-name=sdfs_128k

--io-max-file-write-buffers=32

--volume-capacity=550GB

--io-chunk-size=128

--chunk-store-data-location=mntdata

sudo modprobe fuse

sudo mountsdfs -v sdfs_128k -m

mntdedupe

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 19: Open Source Data Deduplication

SDFS Pro

Works when configured properly Appears to be multithreaded

Con Slow resource intensive (CPUMemory) Fragile easy to mess up options leading to crashes little

user feedback Standard POSIX utilities do not show accurate data (eg df

must use getfattr -d ltmount pointgt and calculate bytes rarr GBTB and free yourself)

Slow with 4k blocks recommended for VMs

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 20: Open Source Data Deduplication

LessFSwwwlessfscom

Written in C = Less CPU Overhead Have to build yourself (configure ampamp make ampamp make install)

Has replication encryption Uses fuse

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 21: Open Source Data Deduplication

LessFS Install

wget httplessfs-142targz

tar zxvf targz

wget httpdb-4830targz

yum install buildstuffhellip

echo never gt syskernelmmredhat_transparent_hugepagedefrag

echo no gt syskernelmmredhat_transparent_hugepagekhugepageddefrag

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 22: Open Source Data Deduplication

LessFS Go

sudo vi etclessfscfg

BLOCKDATA_PATH=mntdatadtablockdatadta

META_PATH=mntmetamta

BLKSIZE=4096 only 4k supported on centos 5

ENCRYPT_DATA=on

ENCRYPT_META=off

mklessfs -c etclessfscfg

lessfs etclessfscfg mntdedupe

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 23: Open Source Data Deduplication

LessFS

Pro Does inline compression by default as well Reasonable VM compression with 128k blocks

Con Fragile StatsFS info hard to see (per file accounting no totals) Kernel gt= 2626 required for blocks gt 4k (RHEL6 only) Running with 4k blocks is not really feasible

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 24: Open Source Data Deduplication

LessFS

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 25: Open Source Data Deduplication

Other OSS

ZFS Tried it and empirically it was a drag but I have no

hard data (got like 3x dedupe with identical full backups of VMs)

At least itrsquos stablehellip

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 26: Open Source Data Deduplication

Kick the Tires

Test data set ~330GB of data 22GB of documents pictures music Virtual Machines

ndash 220GB Windows 2003 Server with SQL Datandash 2003 AD DC ~60GBndash 2003 Server ~8GBndash Two OpenSolaris VMs 15 amp 27GBndash 3GB Windows 2000 VMndash 15GB XP Pro VM

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 27: Open Source Data Deduplication

Kick the Tires

Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~25GHz each ext4

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 28: Open Source Data Deduplication

Compression Performance

First round (all ldquouniquerdquo data)

If another copy was put in (like another full) we should expect 100 reduction for that non-unique data (1x dedupe per run)

FS Home Data

Home Reduction

VM Data

VM Reduction

Combined Total Reduction

MBps

SDFS 4k 21GB 450 109GB

64 128GB 61 16

lessfs 4k (est)

24GB -9 NA 51 NA 50 4

SDFS 128k 21GB 450 255GB

16 276GB 15 40

lessfs 128k 21GB 450 130GB

57 183GB 44 24

targz --fast 21GB 450 178GB

41 199GB 39 35

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 29: Open Source Data Deduplication

Compression (ldquoUniquerdquo Data)

SDFS 4klessfs 4k (est)

SDFS 128klessfs 128k

targz --fast-1000

000

1000

2000

3000

4000

5000

6000

7000

Home Reduction

VM Reduction

Total Reduction

5

-9

55

5

64

51

16

57

41

61

50

15

44

39

Home Reduction VM Reduction Total Reduction

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 30: Open Source Data Deduplication

Write Performance(dont trust this)

raw SDFS 4k lessfs 4k SDFS 128k lessfs 128k targz --fast0

5

10

15

20

25

30

35

40

MBps

MBps

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Write Performance

raw SDFS 128k W SDFS 128k Re-W LessFS 128k W LessFS 128k Re-W0

5

10

15

20

25

30

35

40

MBps

MBps

Compression (Backup Data)

SDFS LessFS Raw000

500

1000

1500

2000

2500

3000

3500

4000

7

34

0

Reduction (128K Blocks)

Reduction

Load(SDFS 128k)

Open Source Dedupe

Pro Free Can be stable if well managed

Con Not in repos yet Efforts behind them seem very limited 1 dev each NoPoor documentation

The Future

Eventual Commodity brtfs

Dedupe planned (off-line only)

ConclusionRecommendations

Dedupe is great if it works and it meets your performance and storage requirements

OSS Dedupe has a way to go SDFSOpenDedupe is best OSS option right

now JungleDisk is good and cheap but not OSS

About Red Wire Services

If you found this presentation helpful consider Red Wire Services for your next Backup Archive or IT Disaster Recovery Planning project

Learn more at wwwRedWireServicescom

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle WA Nick is available to speak on a variety of IT Disaster Recovery related topics including

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group please call (206) 829-8621 or email inforedwireservicescom

  • Open Source Data Deduplication
  • Introduction
  • What is Data Deduplication
  • Why Dedupe
  • Where does it Work Well
  • Not a Fit
  • Types
  • Drawbacks
  • How Does it Work
  • Without Dedupe
  • With Dedupe
  • Block Reclamation
  • Commercial Implementations
  • Open Source Implementations
  • How Good is it
  • SDFS OpenDedupe wwwopendeduporg
  • SDFS
  • SDFS Install amp Go
  • SDFS (2)
  • LessFS wwwlessfscom
  • LessFS Install
  • LessFS Go
  • LessFS
  • LessFS (2)
  • Other OSS
  • Kick the Tires
  • Kick the Tires (2)
  • Compression Performance
  • Compression (ldquoUniquerdquo Data)
  • Write Performance (dont trust this)
  • Kick the Tires Part 2
  • Write Performance
  • Compression (Backup Data)
  • Load (SDFS 128k)
  • Open Source Dedupe
  • The Future
  • ConclusionRecommendations
  • About Red Wire Services
  • About Nick Webb
Page 31: Open Source Data Deduplication

Kick the Tires Part 2

Test data set ndash two ~204GB full backup archives from a popular commercial vendor

Test Environment VirtualBox VM 2GB RAM 2 Cores 2x7200RPM

SATA drives (meta amp data separated for LessFS) Physical CPU Quad Core Xeon

Page 32: Open Source Data Deduplication

Write Performance

[Bar chart: write throughput in MBps (scale 0-40) for raw disk, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, and LessFS 128k re-write]

Page 33: Open Source Data Deduplication

Compression (Backup Data)

[Bar chart: "Reduction (128K Blocks)" - percent space reduction on the backup data set (scale 0-40) for SDFS, LessFS, and raw storage, with plotted values of 7, 34, and 0 respectively]
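
The reduction figure is the logical data written versus the physical space the dedupe store actually consumed. A rough way to reproduce the calculation, assuming a hypothetical dedupe mount at /mnt/dedupe whose backing chunk store lives under /path/to/chunkstore (the real location depends on how the volume was created):

  # Logical bytes as seen through the dedupe filesystem (placeholder mount)
  LOGICAL=$(du -sb /mnt/dedupe | awk '{print $1}')
  # Physical bytes used by the backing chunk store (placeholder path)
  PHYSICAL=$(du -sb /path/to/chunkstore | awk '{print $1}')
  # Percent reduction; e.g. 34 means the backing store is 34% smaller than the logical data
  awk -v l=$LOGICAL -v p=$PHYSICAL 'BEGIN { printf "reduction: %.0f%%\n", (1 - p/l) * 100 }'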

Page 34: Open Source Data Deduplication

Load (SDFS 128k)
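
This slide covered system load during the SDFS 128k run; dedupe is CPU- and memory-hungry, so it is worth watching load alongside throughput. One way to capture a comparable trace is to log vmstat while the copy runs (the interval and file names below are illustrative):

  # Sample CPU, memory, and I/O statistics every 5 seconds in the background
  vmstat 5 > /tmp/sdfs-128k-load.log &
  VMSTAT_PID=$!

  # Run the write pass being measured (reusing $SRC from the earlier sketch)
  dd if=$SRC of=/mnt/sdfs/copy1.dat bs=1M conv=fsync

  # Stop sampling; the log can then be graphed as a load-over-time chart
  kill $VMSTAT_PID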

Page 35: Open Source Data Deduplication

Open Source Dedupe

Pro: Free; can be stable if well managed

Con: Not in distribution repos yet; efforts behind them seem very limited (about one developer each); no/poor documentation

Page 36: Open Source Data Deduplication

The Future

Eventual commodity: btrfs

Dedupe planned (off-line only)

Page 37: Open Source Data Deduplication

Conclusion/Recommendations

Dedupe is great, if it works and meets your performance and storage requirements

OSS dedupe has a way to go; SDFS/OpenDedupe is the best OSS option right now

JungleDisk is good and cheap, but not OSS

Page 38: Open Source Data Deduplication

About Red Wire Services

If you found this presentation helpful, consider Red Wire Services for your next Backup, Archive, or IT Disaster Recovery Planning project

Learn more at www.RedWireServices.com

Page 39: Open Source Data Deduplication

About Nick Webb

Nick Webb is the founder of Red Wire Services in Seattle, WA. Nick is available to speak on a variety of IT Disaster Recovery related topics, including:

Preserving Your Digital Legacy

Getting Started with your Small Business Disaster Recovery Plan

Archive Storage for SMBs

If interested in having Nick speak to your group, please call (206) 829-8621 or email info@redwireservices.com
