Data Deduplication for Dummies 2011


  • 8/4/2019 Data Deduplication for Dummies 2011

    1/43

    Compliments of Quantum

    2nd Special Edition

    Get up to speed on the hottest topic in storage!

    Data Deduplication

    Mark R. Coppock

    Steve Whitner

    A Reference for the Rest of Us!

    FREE eTips at dummies.com


    These materials are the copyright of Wiley Publishing, Inc. and anydissemination, distribution, or unauthorized use is strictly prohibited.


    Data Deduplication For Dummies

    Quantum 2nd Special Edition

    by Mark R. Coppock and Steve Whitner


    Data Deduplication For Dummies, Quantum 2nd Special Edition

    Published by

    Wiley Publishing, Inc., 111 River Street, Hoboken, NJ 07030-5774

    www.wiley.com

    Copyright 2011 by Wiley Publishing, Inc., Indianapolis, Indiana

    Published by Wiley Publishing, Inc., Indianapolis, Indiana

    No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

    Trademarks: Wiley, the Wiley Publishing logo, For Dummies, the Dummies Man logo, A Reference for the Rest of Us!, The Dummies Way, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. Quantum and the Quantum logo are trademarks of Quantum Corporation. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

    LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

    For general information on our other products and services, please contact our Business Development Department in the U.S. at 317-572-3205. For details on how to create a custom For Dummies book for your business or organization, contact [email protected]. For information about licensing the For Dummies brand for products or services, contact BrandedRights&[email protected].

    ISBN: 978-1-118-03204-6

    Manufactured in the United States of America

    10 9 8 7 6 5 4 3 2 1


    Contents

    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

    How This Book Is Organized .................................................... 1

    Icons Used in This Book ............................................................ 2

    Chapter 1: Data Deduplication: Why Less Is More . . . . .3

    Duplicate Data: Empty Calories for Storage and Backup Systems .............................................................. 3

    Data Deduplication: Putting Your Data on a Diet .................. 4

    Why Data Deduplication Matters ............................................. 6

    Chapter 2: Data Deduplication in Detail . . . . . . . . . . . . . .7

    Making the Most of the Building Blocks of Data .................... 7

    Fixed-length blocks versus

    variable-length data segments ................................... 8

    Effect of change in deduplicated storage pools ......... 10

    Sharing a Common Data Deduplication Pool ....................... 12

    Data Deduplication Architectures ......................................... 13

    Chapter 3: The Business Case for Data Deduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15

    Deduplication to the Rescue: Replication

    and Disaster Recovery Protection ..................................... 16

    Reducing the Overall Cost of Storing Data ........................... 18

    Data Deduplication Also Works for Archiving ..................... 20

    Looking at the Quantum Data Deduplication Advantage ......20

    Chapter 4: Ten Frequently Asked Data Deduplication Questions (And Their Answers) . . . .23

    What Does the Term Data Deduplication Really Mean? .....23

    How Is Data Deduplication Applied to Replication? ............ 24

    What Applications Does Data Deduplication Support? ...... 24

    Is There Any Way to Tell How Much Improvement Data Deduplication Will Give Me? ...................................... 24

    What Are the Real Benefits of Data Deduplication? ............ 25

    What Is Variable-Block-Length Data Deduplication? ........... 25

    If the Data Is Divided into Blocks, Is It Safe? ......................... 26

    When Does Data Deduplication Occur during Backup? ...... 26

    Does Data Deduplication Support Tape? .............................. 27

    What Do Data Deduplication Solutions Cost? ...................... 28


    Appendix: Quantum's Data Deduplication Product Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    DXi4500 ........................................................................... 31

    DXi6500 Family ............................................................... 31

    DXi6700 ........................................................................... 31

    DXi8500 ........................................................................... 32


    Publisher's Acknowledgments

    We're proud of this book and of the people who worked on it. For details on how to create a custom For Dummies book for your business or organization, contact [email protected]. For details on licensing the For Dummies brand for products or services, contact BrandedRights&[email protected].

    Some of the people who helped bring this book to market include the following:

    Acquisitions, Editorial, and Media

    Development

    Project Editor: Linda Morris

    Editorial Managers: Jodi Jensen, Rev Mengle

    Acquisitions Editor: Kyle Looper

    Business Development Representative: Karen Hattan

    Custom Publishing Project Specialist: Michael Sullivan

    Composition Services

    Project Coordinator: Kristie Rees

    Layout and Graphics: Lavonne Roberts, Laura Westhuis

    Proofreaders: Jessica Kramer, Lindsay Littrell

    Publishing and Editorial for Technology Dummies

    Richard Swadley, Vice President and Executive Group Publisher

    Andy Cummings, Vice President and Publisher

    Mary Bednarek, Executive Director, Acquisitions

    Mary C. Corder, Editorial Director

    Publishing and Editorial for Consumer Dummies

    Diane Graves Steele, Vice President and Publisher, Consumer Dummies

    Ensley Eikenburg, Associate Publisher, Travel

    Composition Services

    Debbie Stailey, Director of Composition Services

    Business Development

    Lisa Coleman, Director, New Market and Brand Development


    Introduction

    Right now, duplicate data is stealing time and money from your organization. It could be a presentation sitting in hundreds of users' network folders or a group e-mail sitting in thousands of inboxes. This redundant data makes both storage and your backup process more costly, more time-consuming, and less efficient. Data deduplication, used on Quantum's DXi-Series disk backup and replication appliances, dramatically reduces this redundant data and the costs associated with it.

    Data Deduplication For Dummies, Quantum 2nd Special Edition, discusses the methods and rationale for reducing the amount of duplicate data maintained by your organization. This book is intended to provide you with the information you need to understand how data deduplication can make a meaningful impact on your organization's data management.

    How This Book Is Organized

    This book is arranged to guide you from the basics of data deduplication, through its details, and then to the business case for data deduplication.

    Chapter 1: Data Deduplication: Why Less Is More: Provides an overview of data deduplication, including why it's needed, the basics of how it works, and why it matters to your organization.

    Chapter 2: Data Deduplication in Detail: Gives a relatively technical description of how data deduplication functions, how it can be optimized, its various architectures, and what happens when it gets applied to replication.

    Chapter 3: The Business Case for Data Deduplication: Provides an overview of the business costs of duplicate data, how data deduplication can be effectively applied to your current data management process, and how it can aid in backup and recovery.


    Chapter 4: Ten Frequently Asked Data Deduplication Questions (And Their Answers): This chapter lists, well, frequently asked questions and their answers.

    Icons Used in This Book

    Here are the helpful icons you see used in this book.

    The Remember icon flags information that you should pay special attention to.

    The Technical Stuff icon lets you know that the accompanying text explains some technical information in detail.

    A Tip icon lets you know that some practical information that can really help you is on the way.

    A Warning lets you know of a potential problem that can occur if you don't take care.


    Chapter 1

    Data Deduplication: Why Less Is More

    In This Chapter

    Understanding where duplicate data comes from

    Identifying duplicate data

    Using data deduplication to reduce storage needs

    Figuring out why data deduplication is needed

    Maybe you've heard the cliché "Information is the lifeblood of an organization." But many clichés have truth behind them, and this is one such case. The organization that best manages its information is likely the most competitive.

    Of course, the data that makes up an organization's information must also be well-managed and protected. As the amount and types of data an organization must manage increase exponentially, this task becomes harder and harder. Complicating matters is the simple fact that so much data is redundant.

    To operate most effectively, every organization needs to reduce its duplicate data, increase the efficiency of its storage and backup systems, and reduce the overall cost of storage. Data deduplication is a powerful technology for doing just that.

    Duplicate Data: Empty Calories for Storage and Backup Systems

    Allowing duplicate data in your storage and backup systems is like eating whipped cream straight out of the bowl: You get


    plenty of calories, but no nutrition. Take it to an extreme, and you end up overweight and undernourished. In the IT world, that means buying lots more storage than you really need.

    The tricky part is that it's not really the IT team that controls how much duplicate data you have. All of your users and systems generate duplicate data, and the larger your organization and the more careful you are about backup, the bigger the impact is.

    For example, say that a sales manager sends out a 10MB presentation via e-mail to 500 salespeople and each person stores the file. The presentation now takes up 5GB of your storage space. Okay, you can live with that, but look at the impact on your backup!

    Because yours is a prudent organization, each user's network share is backed up nightly. So day after day, week after week, you are adding 5GB of data each day to your backup, and most of the data in those files consists of the same blocks repeated over and over and over again. Multiply this by untold numbers of other sources of duplicate data, and the impact on your storage and backup systems becomes clear. Your storage needs skyrocket, and your backup costs explode.

    Data Deduplication: Putting Your Data on a Diet

    If you want to lose weight, you either reduce your calories or increase your exercise. The same is sort of true for your data, except you can't make your storage and backup systems run laps to slim down.

    Instead, you need a way to identify duplicate data and then eliminate it. Data deduplication technology provides just such a solution. Systems like Quantum's DXi products that use block-based deduplication start by segmenting a dataset into variable-length blocks and then check for duplicates. When they find a block they've seen before, instead of storing it again, they store a pointer to the original. Reading the file is simple: the sequence of pointers makes sure all the blocks are accessed in the right order.


    Compared to other storage reduction methods that look for repeated whole files (single-instance storage is an example), data deduplication provides much more granularity. That means that in most cases, it dramatically reduces the amount of storage space needed.

    As an example, consider the sales deck that everybody saved. Imagine that everybody put their name on the title page. A single-instance system would identify all the files as unique and save all of them. A system with data deduplication, however, can tell the difference between unique and duplicate blocks inside files and between files, and it's designed to save only one copy of the redundant data segments. That means that you use much less storage.

    Data deduplication isn't a stand-alone technology; it can work with single-instance storage and conventional compression. That means data deduplication can be integrated into existing storage and backup systems to decrease storage requirements without making drastic changes to an organization's infrastructure.

    A brief history of data reduction

    One of the earliest approaches to data reduction was data compression, which searches for repeated strings within a single file. Different types of compression technologies exist for different types of files, but all share a common limitation: each reduces duplicate data only within specific parts of individual files.

    Next came single-instance storage, which reduces storage needs by recognizing when files are repeated. Single-instance storage is used in backup systems, for example, where a full backup is made first, and then incremental backups are made of only changed and new files. The effectiveness of single-instance storage is limited because it saves multiple copies of files that may have only minor differences.

    Data deduplication is the newest technique for reducing data. Because it recognizes differences at a variable-length block basis within files and between files, data deduplication is the most efficient data reduction technique yet developed and allows for the highest savings in storage costs.


    Data deduplication utilizes proven technology. Most data is already stored in non-contiguous blocks, even on a single-disk system, with pointers to where each file's blocks reside. In Windows systems, the File Allocation Table (FAT) maps the pointers. Each time a file is accessed, the FAT is referenced to read blocks in the right sequence. Data deduplication references identical blocks of data with multiple pointers, but it uses the same basic principles for reading multi-block files that you are using today.

    Why Data Deduplication Matters

    Increasing the data you can put on a given disk makes sense for an IT organization for lots of reasons. The obvious one is that it reduces direct costs. Although disk costs have dropped dramatically over the last decade, the increase in the amount of data being stored has more than eaten up the savings.

    Just as important, however, is that data deduplication also reduces network bandwidth needs for transmitting data: when you store less data, you have to move less data, too. That opens up new protection and disaster recovery capabilities (replication of backup data, for example) which make management of data much easier.

    Finally, there are major impacts on indirect costs: the amount of space required for storage, cooling requirements, and power use. Management time is also reduced, often dramatically. Quantum DXi customers in a recent survey averaged a 63 percent reduction in the amount of time they had to spend managing their backups.


    Chapter 2

    Data Deduplication in Detail

    In This Chapter

    Understanding how data deduplication works

    Optimizing data deduplication

    Defining the data deduplication architectures

    Data deduplication is really a simple concept with very smart technology behind it: You only store a block once. If it shows up again, you store a pointer to the first one, and that takes up less space than storing the whole thing again. When data deduplication is put into systems that you can actually use, however, there are several options for implementation. And before you pick an approach to use or a model to plug in, you need to look at your particular data needs to see whether data deduplication can help you. Factors to consider include the type of data, how much it changes, and what you want to do with it. So let's look at how data deduplication works.

    Making the Most of the Building Blocks of Data

    Basically, data deduplication segments a stream of data into variable-length blocks and writes those blocks to disk. Along the way, it creates a digital signature (like a fingerprint) for each data segment and an index of the signatures it has seen. The index, which can be recreated from the stored data segments, lets the system know when it's seeing a new block.


    When data deduplication software sees a duplicate block, it inserts a pointer to the original block in the dataset's metadata (the information that describes the dataset) rather than storing the block again. If the same block shows up more than once, multiple pointers to it are created. It's a slam dunk: pointers are smaller than blocks, so you need less disk space.

    Data deduplication technology clearly works best when it sees sets of data with lots of repeated segments. For most people, that's a perfect description of backup. Whether you back up everything every day (and lots of us do this) or once a week with incremental backups in between, backup jobs by their nature send the same pieces of data to a storage system over and over again. Until data deduplication, there wasn't a good alternative to storing all the duplicates. Now there is.
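    The write path just described can be sketched in a few lines of Python. This is a toy illustration only: the SHA-256 fingerprint, the in-memory index, and the pointer-list "recipe" are assumptions made for the sketch, not the internals of any shipping product:

```python
import hashlib

class DedupStore:
    """Toy block-level deduplication pool (illustrative only)."""

    def __init__(self):
        self.blockpool = {}  # signature -> block bytes, each stored once

    def write(self, blocks):
        """Store a dataset; return its metadata: a list of pointers."""
        recipe = []
        for block in blocks:
            sig = hashlib.sha256(block).hexdigest()  # digital "fingerprint"
            if sig not in self.blockpool:            # new block: store it
                self.blockpool[sig] = block
            recipe.append(sig)                       # known block: pointer only
        return recipe

    def read(self, recipe):
        """Follow the pointers in order to reassemble the dataset."""
        return b"".join(self.blockpool[sig] for sig in recipe)

store = DedupStore()
day1 = store.write([b"AAAA", b"BBBB", b"CCCC"])
day2 = store.write([b"AAAA", b"BBBB", b"DDDD"])  # a mostly identical backup
assert store.read(day1) == b"AAAABBBBCCCC"       # reads follow the pointers
assert len(store.blockpool) == 4                 # 6 blocks written, 4 stored
```

    Writing the second, mostly identical dataset adds only one new block to the pool; everything else becomes pointers, which is why repetitive backup streams deduplicate so well.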

    Fixed-length blocks versus variable-length data segments

    So why variable-length blocks? You have to think about the alternative. Remember, the trick is to find the differences between datasets that are made up mostly, but not completely, of the same segments. If segments are found by

    A word about words

    There's no science academy that forces IT writers to standardize word use, and that's a good thing. But it means that different companies use different terms. In this book, we use data deduplication to mean a variable-length block approach to reducing data storage requirements, and that's the way most people use the term. But some companies use the same word to describe systems that look for duplicate data in other ways, like at a file level. If you hear the term and you're not sure how it's being used, ask.


    dividing a data stream into fixed-length blocks, then changing any single block means that all the downstream blocks will look different the next time the data set is transmitted. Bottom line, you won't find very many common segments.

    So instead of fixed blocks, Quantum's deduplication technology divides the data stream into variable-length data segments using a system that can find the same block boundaries in different locations and contexts. This block-creation process lets the boundaries float within the data stream so that changes in one part of the dataset have little or no impact on the blocks in other parts of the dataset. Duplicate data segments can then be found globally: at different locations inside a file, inside different files, inside files created by different applications, and inside files created at different times. Figure 2-1 shows fixed-block data deduplication.

    Original blocks:     A  B  C  D
    After an insertion:  E  F  G  H

    Figure 2-1: Fixed-length block data in data deduplication.

    The upper line shows the original blocks; the lower shows the blocks after making a single change to Block A (an insertion). The shaded sequence is identical in both lines, but all of the blocks have changed and no duplication is detected: there are eight unique blocks.

    Data deduplication utilizes variable-length blocks. In Figure 2-2, Block A changes when the new data is added (it is now E), but none of the other blocks are affected. Blocks B, C, and D are all identical to the same blocks in the first line. In all, we have only five unique blocks.


    After an insertion:  E  B  C  D
    Original blocks:     A  B  C  D

    Figure 2-2: Variable-length block data in data deduplication.
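    The difference between the two figures can be demonstrated with a toy chunker. The windowed byte-sum "checksum" below is a deliberately simplified stand-in (vendors don't publish their boundary-detection math); only the principle is taken from the text: boundaries are chosen from the content itself, not from byte offsets, so they realign after an insertion:

```python
import hashlib

def fixed_chunks(data, size=8):
    """Fixed-length blocks: a boundary at every multiple of `size`."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=4, mask=0x0F):
    """Variable-length segments: a boundary wherever the checksum of the
    last `window` bytes matches `mask`, so boundaries float with content."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == mask:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

# Deterministic, non-repeating test data; then insert one byte at the front.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
edited = b"!" + base

shared_fixed = set(fixed_chunks(base)) & set(fixed_chunks(edited))
shared_variable = set(variable_chunks(base)) & set(variable_chunks(edited))

# The insertion shifts every fixed block, but the content-defined
# boundaries realign almost immediately after the change.
assert len(shared_variable) > len(shared_fixed)
assert b"".join(variable_chunks(base)) == base  # chunks reassemble losslessly
```

    Real systems use rolling hashes that slide in constant time per byte, plus minimum and maximum segment sizes, but the realignment effect is the same one the figures illustrate.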

    Effect of change in deduplicated storage pools

    When a dataset is processed for the first time by a data deduplication system, the number of duplicate data segments varies depending on the nature of the data (both file type and content). The gain can range from negligible to 50% or more in storage efficiency.

    But when multiple similar datasets (like a sequence of backup images from the same volume) are written to a common deduplication pool, the benefit is very significant because each new write only increases the size of the total pool by the number of new data segments. In typical business data sets, it's common to see block-level differences between two backups of only 1% or 2%, although higher change rates are also frequently seen.

    The number of new data segments in each new backup depends a little on the data type, but mostly on the rate of change between backups. And total storage requirement also depends to a very great extent on your retention policies: the number of backup jobs and the length of time they are held on disk. The relationship between the amount of data sent to the deduplication system and the disk capacity actually used to store it is referred to as the deduplication ratio.


    Figure 2-3 shows the formula used to derive the data deduplication ratio, and Figure 2-4 shows the ratio for four different backup datasets with different change rates (compression also figures in, so the figure also shows different compression effects). These charts assume full backups, but deduplication also works when incremental backups are included. As it turns out, though, the total amount of data stored in the deduplication appliance may well be the same for either method because the storage pool only stores new blocks under either system. The deduplication ratio differs, though, because the amount of data sent to the system is much greater in a daily full model. So the storage advantage is greater for full backups even if the amount of data stored is the same.

    Data deduplication ratio = Total data before reduction / Total data after reduction

    Figure 2-3: Deduplication ratio formula.
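    Plugging in illustrative numbers shows how the ratio climbs as similar backups accumulate. The 1,000 GB backup size and 1% daily change rate below are assumptions chosen to mirror the change rates discussed above, not measured figures, and compression is ignored for simplicity:

```python
# Model nightly full backups of 1,000 GB with a 1% daily change rate.
full_backup_gb = 1000
change_rate = 0.01
days = 10

protected = 0.0  # total data sent to the dedup system
stored = 0.0     # unique data actually kept on disk
for day in range(1, days + 1):
    protected += full_backup_gb
    # First backup is all new; later backups add only the changed blocks.
    stored += full_backup_gb if day == 1 else full_backup_gb * change_rate
    ratio = protected / stored
    print(f"Day {day:2}: sent {protected:6.0f} GB, "
          f"stored {stored:6.0f} GB, dedup ratio {ratio:4.1f}:1")
```

    After ten nightly fulls, roughly 10,000 GB has been sent but only about 1,090 GB is stored, a ratio a little over 9:1; longer retention keeps pushing the ratio up, which is exactly the retention effect described above.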

    It makes sense that data deduplication has the most powerful effect when it is used for backup data sets with low or modest change rates, but even for data sets with high rates of change, the advantage can be significant.

    To help you select the right deduplication appliance, Quantum uses a sizing calculator that models the growth of backup datasets based on the amount of data to be protected, the backup methodology, type of data, overall compressibility, rates of growth and change, and the length of time the data is to be retained. The sizing calculator helps you understand where data deduplication has the most advantage and where more conventional disk or tape backup systems provide more appropriate functionality.


    [Figure 2-4 consists of two charts plotting Cumulative Protected TB, Cumulative Unique TB (TB Stored), and De-dup Ratio over successive backup days.
    Backups for Data set 1: Compressibility = 5:1, Data change = 0%, backup events to reach a 20:1 ratio = 4.
    Backups for Data set 2: Compressibility = 2:1, Data change = 1%, backup events to reach a 20:1 ratio = 11.]

    Figure 2-4: Effects of data change on deduplication ratios.

    Contact your Quantum representative to participate in a deduplication sizing exercise.

    Sharing a Common Data Deduplication Pool

    Several data deduplication systems allow multiple streams of data from different servers and different applications to be sent into a common deduplication pool (also called a blockpool); that way, common blocks between different datasets can be deduplicated on a global basis. Quantum's DXi-Series appliances are an example of such systems.


    DXi-Series systems offer different connection personalities depending on the model and configuration, including NAS volumes (CIFS or NFS) and virtual tape libraries (VTLs). The series even supports Symantec's specific Logical Storage Unit (LSU) presentation, which is part of the OpenStorage Initiative (OST). Because all the presentations offered in the same unit access a common blockpool, redundant blocks are eliminated across all the datasets written to the appliance: global deduplication. This means that a DXi-Series appliance recognizes and deduplicates the same data segments on a print and file server coming in through one backup job and on an e-mail server backed up on a different server. Figure 2-5 demonstrates a sharing pool utilizing DXi-Series appliances.

    [Figure 2-5 shows Sources 1, 2, and 3 all feeding one DXi-Series Appliance Storage Pool. All the datasets written to the DXi appliance share a common, deduplicated storage pool irrespective of what presentation, interface, or application is used during ingest. One DXi-Series appliance can support multiple backup applications at the same time.]

    Figure 2-5: Sharing a global deduplication storage pool.

    Data Deduplication Architectures

    Data deduplication, like compression or encryption, introduces computational overhead, so the choice of where and how deduplication is carried out can affect backup performance. The most common approach today is to carry out deduplication at the destination end of backup, but deduplication can also occur at the source (that is, at the server where the backup data is initially processed by the backup software, or even at the host server where an application is backed up initially).

    Wherever the data deduplication is carried out, just as with compression or encryption, you get the fastest performance from purpose-built systems optimized for the process. If deduplication is carried out by backup software agents running on general-purpose servers, it's usually slower, you have to manage agents on all the servers, and deduplication can compete with and slow down primary applications. It can also be complex to deploy or change.

    The data deduplication approach with the highest performance and ease of implementation is generally one that is carried out on specialized hardware systems at the destination end of the backup. Backup is faster, and deduplication can work with any backup software, so it's easier to deploy and to change down the road.

    Deduplication appliances have been around for three or four years, and as vendors create later-generation products, the development teams are getting smarter about how to get the most performance and data reduction out of a system. Quantum's latest generation of products, for example, use different kinds of storage inside the appliances to store the data used for specific, often-repeated operations. Looking up and checking signatures happens all the time and is a pretty intensive operation, so that data is held on solid-state disks or on small, fast, conventional disk drives with a high-bandwidth connection. Since both have very fast seek times, the performance of the whole system is increased significantly. One recent new product more than tripled the performance of the model it replaced. Is there room for even more improvement? The engineers seem to think so, so keep an eye out.


    Chapter 3

    The Business Case for Data Deduplication

    In This Chapter

    Looking at the business value of deduplication

    Finding out why applying the technology to replication and disaster recovery is key

    Identifying the cost of storing duplicate data

    Looking at the Quantum data deduplication advantage

    As with all IT investments, data deduplication must make business sense to merit adoption. At one level, the value is pretty easy to establish. Adding disk to your backup strategy can provide faster backup and restore performance, as well as give you RAID levels of fault tolerance. But with conventional storage technology, the amount of disk people need for backup just costs too much. Data deduplication solves that problem for many users by letting them reduce the amount of disk they need to hold their backup data by 90 percent or more, which translates into immediate savings.

    Conventional disk backup has a second limitation that some users think is even more important: disaster recovery (DR) protection. Can data deduplication help there? Absolutely! The key is using the technology to power remote replication, and the outcome provides another compelling set of business advantages.


    Deduplication to the Rescue: Replication and Disaster Recovery Protection

    The minimum disaster recovery (DR) protection you need is to make backup data safe from site damage and other natural or man-made disasters. After all, equipment and applications can be replaced, but digital assets may be irreplaceable. And no matter how many layers of redundancy a system has, when all copies of anything are stored on a single hardware system, they are vulnerable to fires, floods, or other site damage.

    For most users, removable media provides all or most of their site loss protection. And it's one of the big reasons that disk backup isn't used more: When backup data is on disk, it just sits there. You have to do something else to get DR protection. People talk about replicating backup data over networks, but almost nobody actually does it: Backup sets are too big and network bandwidth is too limited.

    Data deduplication changes all that by finally making remote replication of backup practical and smart. How does data deduplication work? Just like you store only the new blocks in each backup, you have to replicate only the new blocks. Suppose 1 percent of a 500GB backup has changed since the previous backup. That means you have to move only 5GB of data to keep the two systems synchronized, and you can move that data in the background over several hours. That means you can use a standard WAN to replicate backup sets.
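    The arithmetic behind that example is easy to verify. The WAN speed below is an assumed figure, chosen only to show that the transfer fits comfortably in a background window:

    ```python
    # Back-of-envelope check of the example above.
    backup_gb = 500
    change_rate = 0.01                    # 1 percent of the blocks are new
    new_data_gb = backup_gb * change_rate
    print(new_data_gb)                    # 5.0 GB actually crosses the wire

    # Assumed WAN link of 10 Mb/s, chosen only for illustration:
    link_mbps = 10
    seconds = new_data_gb * 8 * 1000 / link_mbps   # GB -> megabits -> seconds
    print(round(seconds / 3600, 1))                # about 1.1 hours, fine as a background job
    ```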

    For disaster recovery, that means you can have an off-site replica image of all your backup data every day, and you can reduce the amount of removable media you handle. That's especially nice when you have smaller sites that don't have IT staff. Less removable media can mean lower costs and less risk. Daily replication means better protection. It's a win-win situation.

    How do you get them synched up in the first place? The first replication event may take longer, or you can co-locate devices and move data the first time over a faster network, or you can put backup data at the source site on tape and copy it locally onto the target system. After that first sync-up is finished, the replication needs to move only the new blocks.


    What about tape? Do you still need it? Disk-based deduplication and replication can reduce the amount of tape you use, but most IT departments combine the technologies, using tape for longer-term retention. This approach makes sense for most users. If you want to keep data for six months or three years or seven years, tape provides the right economics and portability, and the new encryption capabilities that tape drives offer now make securing the data that goes off site on tape easy.

    The best solution providers will help you get the right balance, and at least one of them, Quantum, lets you manage the disk and tape systems from a single management console, and it supports all your backup systems with the same service team.

    The asynchronous replication method employed by Quantum in its DXi-Series disk backup and replication solutions can give users extra bandwidth leverage. Before any blocks are replicated to a target, the source system sends a list of blocks it wants to replicate. The target checks this list of candidate blocks against the blocks it already has, and then it tells the source what it needs to send. So if the same blocks exist in two different offices, they have to be replicated to the target only one time.

    Figure 3-1 shows how the deduplication process works in replication over a WAN.

    Figure 3-1: Verifying data segments prior to transmission. Step 1: The source sends a list of elements to replicate; the target returns a list of the blocks not already stored there. Step 2: Only the missing data blocks are replicated and moved over the WAN.
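    The candidate-list exchange described above can be sketched in a few lines. The `fetch_block` callback and the in-memory dictionaries below are stand-ins for the appliance's real block store, not Quantum's actual protocol:

    ```python
    def replicate(source_sigs, target_store, fetch_block):
        # Step 1: the source offers a list of block signatures; the target
        # answers with the ones it does not already hold.
        missing = [sig for sig in source_sigs if sig not in target_store]
        # Step 2: only those missing blocks are read and sent over the WAN.
        for sig in missing:
            target_store[sig] = fetch_block(sig)
        return missing

    # The Figure 3-1 scenario: source offers A, B, C, D; target has A, B, D.
    source_blocks = {"A": b"...", "B": b"...", "C": b"...", "D": b"..."}
    target_store = {"A": b"...", "B": b"...", "D": b"..."}
    sent = replicate(list("ABCD"), target_store, source_blocks.get)
    print(sent)   # ['C'] -- only the one missing block travels
    ```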

    Because many organizations use public data exchanges to supply WAN services between distributed sites, and because data transmitted between sites can take multiple paths from source to target, deduplication appliances should offer encryption capabilities to ensure the security of data transmissions.


    In the case of DXi-Series appliances, all replicated data (both metadata and actual blocks of data) can be encrypted at the source using SHA-AES 128-bit encryption and decrypted at the target appliance.

    Reducing the Overall Cost of Storing Data

    Storing redundant backup data brings with it a number of costs, from hard costs such as storage hardware to operational costs such as the labor to manage removable backup media and off-site storage and retrieval fees. Data deduplication offers a number of opportunities for organizations to improve the effectiveness of their backup and to reduce overall data protection costs.

    These include the opportunity to reduce hardware acquisition costs, but even more important for many IT organizations is the combination of all the costs that go into backup. They include ongoing service costs, costs of removable media, the time spent managing backup at different locations, and the potential lost-opportunity or liability costs if critical data becomes unavailable.

    The situation is also made more complex by the fact that in the backup world, there are several kinds of technology, and different situations often call for different combinations of them. If data is changing rapidly, for example, or only needs to be retained for a few days, the best option may be conventional disk backup. If it needs to be retained for longer periods (six months, a year, or more), traditional tape-based systems may make more sense. For many organizations, the need is likely to be different for different kinds of data.

    Combining disk-based backup, deduplication, replication, and tape in an optimal way can yield very significant savings when users look at their total data-protection costs. A recent analysis at a major software supplier showed how the supplier could add deduplication and replication to its backup mix and save more than $1,000,000 over a five-year period, reducing overall costs by about one-third. Where were the savings? In reduced media usage, lower power and cooling, and savings on license and service costs. The key was data deduplication and combining it with traditional tape in an optimal way. If the supplier tried the same approach using conventional disk technology, it would have increased costs, both because of higher acquisition expenses and much higher requirements for space, power, and cooling. (See Figure 3-2.)

    Conventional disk: 1PB, 10 racks, versus Quantum's DXi appliance: 28:1 dedup = 1PB, 20U.

    Figure 3-2: Conventional disk technology versus Quantum's DXi-Series appliances.

    The key to finding the best answer is looking clearly at all the alternatives and finding the best way to combine them. A supplier like Quantum that can provide and support all the different options is likely to give users a wider range of solutions than a company that offers only one kind of technology, and such suppliers have teams of people that can help IT departments look at the alternatives in an objective way.

    Work with Quantum and the company's sizing calculator to help identify the right combination of technologies for the optimal backup solution, both in the short term and the long term. See Chapter 2 for more on the sizing calculator.


    Data Deduplication Also Works for Archiving

    We've talked about the power of data deduplication in the context of backup because that application includes so much redundant data. But data deduplication can also have very significant benefits for archiving and nearline storage applications that are designed to handle very large volumes of data. By boosting the effective capacity of disk storage, data deduplication can give these applications a practical way of increasing their use of disk-based resources cost effectively.

    Storage solutions that use Quantum's patented data deduplication technology work effectively with standard archiving storage applications as well as with backup packages, and the company has integrated the technology into its own StorNext data management software. Combining high-speed data sharing with cost-effective content retention, StorNext helps customers consolidate storage resources so that workflow operations run faster and the storage of digital business assets costs less. With StorNext, data sharing and retention are combined in a single solution that now also includes data deduplication to provide even greater levels of value across all disk storage tiers.

    Looking at the Quantum Data Deduplication Advantage

    The DXi-Series disk backup and replication systems use Quantum's data deduplication technology to reduce the amount of disk users need to store backup data by 90 percent or more. And they make automated replication of backup data over WANs a practical tool for DR protection. All DXi-Series systems share a common replication methodology, so users can connect distributed and midrange sites with enterprise data centers. The result is a cost-effective way for IT departments to store more backup data on disk, to provide high-speed, reliable restores, to increase DR protection, to centralize backup operations, and to reduce media management costs.


    Quantum deduplication products cover a broad range of sizes, from compact units for small businesses and remote offices, to midrange appliances, to enterprise systems that can hold 4 petabytes of backup data. All systems include deduplication and replication functionality in their base price, and the larger systems include software for creating tapes directly.

    The DXi-Series works with all leading backup software, including Symantec's OpenStorage API, to provide end-to-end support that spans multiple sites and integrates with tape backup systems, making it easy for users to integrate deduplication technology into an existing backup architecture. DXi-Series appliances are part of a comprehensive set of backup solutions from Quantum, the leading global specialist in backup, recovery, and archive. Whether the solution is disk with deduplication and replication, conventional disk, tape, or a combination of technologies, Quantum offers advanced technology, proven products, centralized management, and expert professional services offerings for all your backup and archive systems.

    The results that Quantum DXi customers report show the kind of direct business benefits that adding deduplication technology can have on IT departments. In a recent survey, IT departments that added DXi to their backup systems reported that:

    Average backup performance more than doubled (up 125 percent), while time for restores was reduced to a few minutes for most files.

    Failed backup jobs were reduced by 87 percent.

    Even though users still deployed tape for long-term retention and regulatory compliance, removable media purchase costs were reduced by an average 48 percent and media retrieval costs were reduced by 97 percent.

    Overall, the amount of time people spent managing their backup and restore processes was reduced by an average 63 percent. For environments that deployed deduplication-based replication for DR, overall savings were higher. Dollar savings varied, but it was common for IT departments to reduce costs enough that they could pay for their deployments in roughly a year.


    Chapter 4

    Ten Frequently Asked Data Deduplication Questions (And Their Answers)

    In This Chapter

    Figuring out what data deduplication really means

    Discovering the advantages of data deduplication

    In this chapter, we answer the ten questions most often asked about data deduplication.

    What Does the Term Data Deduplication Really Mean?

    There's really no industry-standard definition yet, but there are some things that everyone agrees on. For example, everybody agrees that it's a system for eliminating the need to store redundant data, and most people limit it to systems that look for duplicate data at a block level, not a file level. Imagine 20 copies of a presentation that have different title pages: To a file-level data-reduction system, they look like 20 completely different files. Block-level approaches see the commonality between them and use much less storage.

    The most powerful data deduplication uses a variable-length block approach. A product using this approach looks at a sequence of data, segments it into variable-length blocks, and, when it sees a repeated block, stores a pointer to the original instead of storing the block again. Because the pointer takes up less space than the block, you save space. In backup, where the same blocks show up again and again, users typically reduce disk needs by 90 percent or more.

    How Is Data Deduplication Applied to Replication?

    Replication is the process of sending duplicate data from a source to a target. Typically, a relatively high-performance network is required to replicate large amounts of backup data. But with deduplication, the source system (the one sending data) looks for duplicate blocks in the replication stream. Blocks already transmitted to the target system don't need to be transmitted again. The system simply sends a pointer, which is much smaller than the block of data and requires much less bandwidth.

    What Applications Does Data Deduplication Support?

    When used for backup, data deduplication supports all applications and all qualified backup packages. Certain file types (some rich media files, for example) don't see much advantage the first time they are sent through deduplication because the applications that wrote the files already eliminated redundancy. But if those files are backed up multiple times or backed up after small changes are made, deduplication can create very powerful capacity advantages.

    Is There Any Way to Tell How Much Improvement Data Deduplication Will Give Me?

    Four primary variables affect how much improvement you will realize from data deduplication:


    How much your data changes (that is, how many new blocks get introduced)

    How well your data compresses using conventional compression techniques

    How your backup methodology is designed (that is, full versus incremental or differential)

    How long you plan to retain the backup data

    Quantum offers sizing calculators to estimate the effect that data deduplication will have on your business. Pre-sales systems engineers can walk you through the process and show you what kind of benefit you will see.
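    To see how those variables interact, here is a deliberately simplified model. It is not Quantum's sizing calculator: it assumes a fixed methodology of weekly full backups, folds compressibility into a single multiplier, and exposes change rate and retention directly.

    ```python
    def estimate_dedup_ratio(weeks_retained, change_rate, compression=2.0):
        # Toy model: one full backup per week, each full counts as one unit
        # of logical data, and `change_rate` of the blocks are new each week.
        # The first full is all-new; later fulls add only their changed blocks.
        logical = weeks_retained
        unique = 1 + (weeks_retained - 1) * change_rate
        return (logical / unique) * compression

    # Longer retention and lower change rates drive the ratio up:
    print(round(estimate_dedup_ratio(12, 0.01), 1))   # 21.6 (about 22:1)
    print(round(estimate_dedup_ratio(12, 0.10), 1))   # 11.4 (about 11:1)
    ```

    Even this crude model shows the trend the text describes: the less the data changes and the longer it is retained, the higher the effective deduplication ratio.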

    What Are the Real Benefits of Data Deduplication?

    There are two main benefits of data deduplication. First, data deduplication technology lets you keep more backup data on disk than with any conventional disk backup system, which means that you can restore more data faster. Second, it makes it practical to use standard WANs and replication for disaster recovery (DR) protection, which means that users can provide DR protection while reducing the amount of removable media (that's tape) handling that they do.

    What Is Variable-Block-Length Data Deduplication?

    It's easiest to think of the alternative to variable-length, which is fixed-length. If you divided a stream of data into fixed-length segments, every time something changed at one point, all the blocks downstream would also change. The system of variable-length blocks that Quantum uses allows some of the segments to stretch or shrink, while leaving downstream blocks unchanged. This increases the ability of the system to find duplicate data segments, so it saves significantly more space.
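    A minimal content-defined chunking sketch shows the effect. The checksum-based boundary rule below is an illustrative stand-in for Quantum's patented algorithm, and the test data is just seeded random bytes:

    ```python
    import random
    import zlib

    def chunk_boundaries(data, window=16, mask=0x3F):
        # Cut wherever a checksum of the trailing `window` bytes has its low
        # six bits zero, giving roughly 64-byte average chunks. Cut points
        # depend on content, not byte position, so an insertion shifts the
        # boundaries along with the data instead of breaking every
        # downstream chunk.
        chunks, start = [], 0
        for i in range(window, len(data)):
            if zlib.adler32(data[i - window:i]) & mask == 0:
                chunks.append(data[start:i])
                start = i
        chunks.append(data[start:])
        return chunks

    random.seed(7)
    original = bytes(random.randrange(256) for _ in range(20000))
    edited = b"XX" + original                     # two bytes inserted up front

    a, b = chunk_boundaries(original), chunk_boundaries(edited)
    shared = set(a) & set(b)
    print(f"{len(shared)} of {len(a)} chunks unchanged")  # nearly all survive the edit
    ```

    With fixed-length blocks, the same two-byte insertion would shift every block boundary and leave almost no duplicates to find; with content-defined boundaries, only the chunk containing the edit changes.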


    If the Data Is Divided into Blocks, Is It Safe?

    The technology for using pointers to reference a sequence of data segments has been standard in the industry for decades: You use it every day, and it is safe. Whenever a large file is written to disk, it is stored in blocks on different disk sectors in an order determined by space availability. When you read a file, you are really reading pointers in the file's metadata that reference the various sectors in the right order. Block-based data deduplication applies a similar kind of technology, but it allows a single block to be referenced by multiple sets of metadata.

    When Does Data Deduplication Occur during Backup?

    There are really three choices. You can send all your backup data to a backup target and perform deduplication there (usually called target-based deduplication), you can perform the deduplication on each protected host, or you can use a central media server to carry out the deduplication. All three systems are available and have advantages.

    If you deduplicate on the host during backup, you send less data over your backup connection, but you have to manage software on all the protected hosts, backup slows down because deduplication adds overhead, and you're using a general-purpose server, which can slow down other applications.

    If deduplication is carried out in the backup application on the media server, you don't have to buy a special-purpose target deduplication device, but support is limited to one application, all the overhead of the deduplication is added to the server's other duties, and deduplication systems that provide good reduction require significant processing. So users deploying server-based deduplication report slower backup, limited scalability, and requirements to upgrade their disk storage and buy more, heavier-duty servers.

    If you use a target deduplication appliance, you send all the data to the device and deduplicate it there. You have to buy an appliance, but in most cases, the appliance is designed just for deduplication. This means the backup and restore performance stays high, and deduplication doesn't slow down other backups or require that you beef up your backup servers.

    Does Data Deduplication Support Tape?

    Yes and no. Data deduplication needs random access to data blocks for both writing and reading, so it must be implemented in a disk-based system. But tape can easily be written from a deduplication data store, and, in fact, that is the typical practice. Most deduplication customers keep a few weeks or months of backup data on disk, and then use tape for longer-term storage. Quantum makes that easy by providing a direct disk-to-tape connection in its larger deduplication appliances, so you can create tapes directly without sending the data back through a backup server. Support includes many of the leading backup software packages, including Symantec's OpenStorage API (OST).

    An important point: When you create a tape from data in a deduplicated datapool, most vendors re-expand the data and apply normal compression. That way, files can be read directly in a tape drive and do not have to be staged back to a disk system first. That is important because you want to be able to read those tapes directly in case of an emergency restore. A few suppliers write deduplicated data blocks to tape to save space, but there is a big downside: You'll have to write any data back to disk before you can restore it, so for a restore of a significant size, or one that involves files of different ages, you might have to have a lot of free disk space available. Most users find that being able to read data directly from tape is a much better solution.


    What Do Data Deduplication Solutions Cost?

    Costs can vary a lot, but seeing list prices in the range of 30 to 75 cents per GB of stored, deduplicated data is common. A good rule-of-thumb rate for deduplication is 20:1, meaning that you can store 20 times more data than conventional disk. Using that figure, systems that could retain 40TB of backup data would have a list price of $12,500, or 31 cents a GB. So even at the manufacturer's suggested list (and discounts are normally available), deduplication appliance costs are a lot lower than if you protected the same data using conventional disk. Even more important, customers commonly report that they save enough money from switching to a dedupe appliance to pay for their system in about a year.
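    Checking that arithmetic:

    ```python
    # Reproducing the pricing example above.
    retained_tb = 40                 # deduplicated backup data retained
    list_price = 12_500              # example list price in dollars
    price_per_gb = list_price / (retained_tb * 1000)   # using 1 TB = 1,000 GB
    print(f"${price_per_gb:.4f}/GB")                   # $0.3125/GB, i.e. about 31 cents

    # With the 20:1 rule of thumb, the physical disk behind those 40TB:
    dedup_ratio = 20
    raw_tb = retained_tb / dedup_ratio
    print(raw_tb)                                      # 2.0 TB of raw disk
    ```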


    Appendix

    Quantum's Data Deduplication Product Line

    In This Appendix

    Reviewing the Quantum DXi-Series disk backup and remote replication solutions

    Identifying the features and benefits of the DXi-Series

    Quantum Corp. is the leading global storage company specializing in backup, recovery, and archive. Combining focused expertise, customer-driven innovation, and platform independence, Quantum provides a comprehensive range of disk, tape, and software solutions supported by a world-class sales and service organization. As a long-standing and trusted partner, the company works closely with a broad network of resellers, original equipment manufacturers (OEMs), and other suppliers to meet customers' evolving data protection needs.

    Quantum's DXi-Series disk backup solutions leverage patented data deduplication technology to reduce the disk needed for backup by 90 percent or more and make remote replication of data between sites over existing wide area networks (WANs) a practical and cost-effective DR technique. Figure A-1 shows how DXi-Series replication uses existing WANs for DR protection, linking backup data across sites and reducing or eliminating media handling.


    Quantum's Replication Technology: Users replicate data over existing WANs to provide automated DR protection and centralized media management. Quantum replication features cross-site deduplication prior to data transmission for additional bandwidth savings. Remote offices A, B, and C (with DXi4500 and DXi6500 appliances) replicate to a DXi8500 located at the central data center, alongside a Scalar i500 tape library.

    Figure A-1: DXi-Series replication.

    The DXi Series spans the widest range of backup capacity points in the industry. Some of the features and benefits of Quantum's DXi Series include:

    Patented data deduplication technology that reduces disk requirements by 90 percent or more

    A broad solution set of turnkey appliances for small and medium business, distributed and midrange sites, and scalable systems for the enterprise

    High backup performance that provides enterprise-scale protection, even for tight backup windows

    Software licenses that are included in the base price to maximize value and streamline deployment

    Quantum's data deduplication also dramatically reduces the bandwidth needed to replicate backup data between sites for automated disaster recovery protection.


    All models share a common software layer, including deduplication and remote replication, allowing IT departments to connect all their sites in a comprehensive data protection strategy that boosts backup performance, reduces or eliminates media handling, and centralizes disaster recovery operations. Support includes the Symantec OpenStorage API (OST) for both disk and tape on DXi4500, DXi6500, and DXi8500 models.

    The following sections offer more details about the individual DXi systems.

    DXi4500

    The DXi4500 disk appliances with deduplication make it easy and affordable to increase backup performance, improve restores, and reduce data protection costs. Quantum's deduplication technology provides disk performance for your backups, while it reduces typical capacity needs. Backups can be economically retained on disk for instant restores, simplified management, and reduced use of removable media. DXi4500 units are designed for rapid, seamless integration and maximum client performance without changes to existing backup architectures or potentially disruptive media server upgrades, unlike software-based deduplication. Support for remote replication, the Symantec OpenStorage (OST) interface, and virtual environments is standard.

DXi6500 Family

The DXi6500 is a family of pre-configured disk backup appliances that provides simple and affordable solutions for user backup problems. They provide disk-to-disk backup and restore performance with all leading backup applications using a simple NAS interface, and they leverage deduplication technology to reduce typical capacity requirements. For DR protection, the DXi6500 models replicate encrypted backup data between sites using global deduplication to reduce typical network bandwidth needs by a factor of 20 or more.

DXi6700

The DXi6700 is a high-performance disk backup appliance for Fibre Channel environments that provides a simple and


affordable solution for backup problems using a proven VTL interface. The deduplication technology of the DXi6700

reduces typical capacity requirements by 90 percent or more so systems stop filling up, and it scales easily without a service visit, providing effective investment protection. For DR protection, the DXi6700 replicates encrypted backup data between sites to reduce typical network bandwidth needs by a factor of 20 or more. For long-term retention, the DXi6700 is designed to provide direct tape creation in conjunction with leading backup applications.

DXi8500

The DXi8500 is a high-performance deduplication solution with the power and flexibility to anchor an enterprise-wide backup, disaster recovery, and data protection strategy. The DXi8500 offers industry-leading performance and advanced deduplication technology that reduces typical disk and bandwidth requirements by 90 percent or more. The DXi8500 presents a wide range of interface choices. Featuring an automated, direct path to tape for both VTL and OST presentations, the DXi8500 integrates short-term protection and long-term retention requirements.
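The "90 percent or more" figures quoted throughout translate directly into deduplication ratios: a 90 percent reduction is a 10:1 ratio, and the ratio also tells you how much logical data a given amount of disk can hold. A quick sketch of the arithmetic (example figures only, not measured values; the function names are hypothetical):

```python
def reduction_percent(dedup_ratio: float) -> float:
    """Capacity reduction implied by a deduplication ratio (10:1 -> 90%)."""
    return 100 * (dedup_ratio - 1) / dedup_ratio

def effective_capacity(raw_tb: float, dedup_ratio: float) -> float:
    """Logical backup data that fits on raw_tb of disk at a given ratio."""
    return raw_tb * dedup_ratio

print(reduction_percent(10))       # 90.0
print(effective_capacity(20, 10))  # 200.0 -> 20 TB of disk holds ~200 TB of backups
print(reduction_percent(20))       # 95.0
```

The same ratio applies to replication: if only 1/20th of the blocks are new, roughly 1/20th of the bandwidth is needed, which matches the "factor of 20 or more" claim.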


What are the true costs in storage space, cooling requirements, and power use for all your redundant data? Redundant data increases disk needs and makes backup and replication more costly and more time-consuming. By using data deduplication techniques and technologies from Quantum, you can dramatically reduce disk requirements and media management overhead while increasing your DR options.

    Find listings of all our books

Choose from many different subject categories

Sign up for eTips at etips.dummies.com

Explanations in plain English

Get in, get out information

Icons and other navigational aids

Top ten lists

A dash of humor and fun

Use replication to automate disaster recovery across sites!

Make a meaningful impact on your data protection and retention

Eliminate duplicate data

Reduce disk requirements

Lower network bandwidth requirements