Transcript of "understanding deduplication" • 7/29/2019

Understanding the HP Data Deduplication Strategy

Why one size doesn't fit everyone

Table of contents

Executive Summary
Introduction
  A word of caution
Customer Benefits of Data Deduplication
  A word of caution
Understanding Customer Needs for Data Deduplication
HP Accelerated Deduplication for the Large Enterprise Customer
  Issues Associated with Object-Level Differencing
  What Makes HP Accelerated Deduplication Unique?
HP Dynamic Deduplication for Small and Medium IT Environments
  How Dynamic Deduplication Works
  Issues Associated with Hash-Based Chunking
Low-Bandwidth Replication Usage Models
Why HP for Deduplication?
Deduplication Technologies Aligned with HP Virtual Library Products
Summary
Appendix A: Glossary of Terminology
Appendix B: Deduplication compared to other data reduction technologies
For more information


    Executive Summary

Data deduplication technology represents one of the most significant storage enhancements in recent years, promising to reshape future data protection and disaster recovery solutions. Data deduplication offers the ability to store more on a given amount of storage and to replicate data over lower-bandwidth links at a significantly reduced cost.

HP offers two complementary deduplication technologies that meet very different customer needs:

Accelerated deduplication (object-level differencing) for the high-end enterprise customer who requires:

• Fastest possible backup performance
• Fastest restores
• The most scalable solution possible in terms of performance and capacity
• Multi-node low-bandwidth replication
• High deduplication ratios
• A wide range of replication models

Dynamic deduplication (hash-based chunking) for the midsize enterprise and remote office customers who require:

• A lower cost device through a smaller RAM footprint and optimized disk usage
• A fully integrated deduplication appliance with lights-out operation
• Backup application and data type independence for maximum flexibility
• A wide range of replication models

This whitepaper explains how HP deduplication technologies work in practice, the pros and cons of each approach, when to choose a particular type, and the types of low-bandwidth replication models HP plans to support.

Why HP for Deduplication?

The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that delivers high-performance deduplication for enterprise customers. HP is one of the few vendors to date with an object-level differencing architecture that combines the virtual tape library and the deduplication engine in the same appliance. Our competitors with object-level differencing use a separate deduplication engine and VTL, which tends to be inefficient, as data is shunted between the two appliances, as well as expensive.

HP D2D (Disk to Disk) Backup Systems use Dynamic deduplication technology that provides a significant price advantage over our competitors. The combination of HP patents allows optimal RAM and disk usage, intelligent chunking, and minimal paging. Together with the cost benefits of using HP industry-standard ProLiant servers, this sets a new price point for deduplication appliances.


HP D2D Backup Systems and VLS virtual libraries provide deduplication ratio monitoring, as can be seen in the following screenshots.

Figure 1. Deduplication ratio screens on HP VLS and D2D devices


    Introduction

Over recent years, virtual tape libraries have become the backbone of a modern data protection strategy because they offer:

• Disk-based backup at a reasonable cost
• Improved backup performance in a SAN environment, because new resources (virtual tape drives) are easier to provision
• Faster single-file restores than physical tape
• Seamless integration into an existing backup strategy, making it low risk
• The ability to offload or migrate the data to physical tape for off-site disaster recovery or for long-term archiving

Because virtual tape libraries are disk-based backup devices with a virtual file system, and the backup process itself tends to have a great deal of repetitive data, virtual tape libraries lend themselves particularly well to data deduplication. In storage technology, deduplication essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.

The amount of duplicate data that can typically be removed from a particular data type is estimated to be as follows:

• PACS: 5%
• Web and Microsoft Office data: 30%
• Engineering data directories: 35%
• Software code archive: 45%
• Technical publications: 52%
• Database backup: 70% or higher

In the above example, PACS is Picture Archiving and Communication Systems, a type of data used in X-rays and medical imaging. These have very little duplicate data. At the other end of the spectrum, databases contain a lot of redundant data: their structure means that there will be many records with empty fields or the same data in the same fields.
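The basic mechanism described above (keep one copy of each piece of data, retain an index entry for every occurrence) can be sketched with a small content-addressed store. This is an illustrative toy, not HP's implementation; the SHA-256 digest index is simply one common way to detect duplicate chunks.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: each unique chunk is kept once,
    and duplicates are recorded only as index entries."""

    def __init__(self):
        self.chunks = {}   # digest -> chunk bytes (unique data, stored once)
        self.index = []    # per-write list of digests (the retained "index")

    def write(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # Only unique data consumes capacity; a duplicate just adds an index entry.
        self.chunks.setdefault(digest, chunk)
        self.index.append(digest)
        return digest

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
sent = 0
for block in [b"header", b"payload", b"payload", b"header"]:
    store.write(block)
    sent += len(block)

print(f"{len(store.index)} writes, {len(store.chunks)} unique chunks")
print(f"bytes sent: {sent}, bytes stored: {store.stored_bytes()}")
# 4 writes, 2 unique chunks
# bytes sent: 26, bytes stored: 13
```

The gap between "bytes sent" and "bytes stored" is exactly the capacity saving the percentages above estimate for each data type.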

With a virtual tape library that has deduplication, the net effect is that, over time, a given amount of disk storage capacity can hold more data than is actually sent to it. To work, deduplication needs the random access capability offered by disk-based backup. This is not to say physical tape is dead; indeed, tape is still required for archiving and disaster recovery, and both disk and tape have their own unique attributes in a comprehensive data protection solution.

The capacity optimization offered by deduplication is dependent on:

• Backup policy (full, incremental)
• Retention periods
• Data change rate


Figure 2. A visual explanation of deduplication: why use deduplication? (Chart comparing TBs of data sent with TBs of data stored over a period of months.)

A word of caution

Some people view deduplication with the approach of "That is great! I can now buy less storage," but it does not work like that. Deduplication is a cumulative process that can take several months to yield impressive deduplication ratios. Initially, the amount of storage you buy has to be sized to reflect your existing backup tape rotation strategy and the expected data change rate within your environment. HP has developed deduplication sizing tools to assist in deciding the amount of storage capacity that is required with deduplication. However, these tools do rely on the customer having a degree of knowledge of the data change rate in their systems.

    HP Backup Sizer Tool

Deduplication has become popular because, as data growth soars, the cost of storing data also increases, especially backup data on disk. Deduplication reduces the cost of storing multiple backups on disk. Deduplication is the latest in a series of technologies that offer space savings to a greater or lesser degree. To compare deduplication with other data reduction, or space saving, technologies, please see Appendix B.

A worked example of deduplication is illustrated as follows:

http://h30144.www3.hp.com/SWDSizerWeb/default.htm

Figure 3. A worked example of deduplication for file system data over time

Retention policy: 1 week of daily incrementals (5); 6 months of weekly fulls (25)
Data parameters: data compression rate = 2:1; daily change rate = 1% (10% of data in 10% of files)
Example: 1 TB file server backup

Backup                           Data sent from backup host   Data stored with deduplication
1st daily full backup            1,000 GB                     500 GB
1st daily incremental backup     100 GB                       5 GB
2nd daily incremental backup     100 GB                       5 GB
3rd daily incremental backup     100 GB                       5 GB
4th daily incremental backup     100 GB                       5 GB
5th daily incremental backup     100 GB                       5 GB
2nd weekly full backup           1,000 GB                     25 GB
3rd weekly full backup           1,000 GB                     25 GB
(and so on for each weekly full)
25th weekly full backup          1,000 GB                     25 GB
TOTAL                            25,500 GB                    1,125 GB

~23:1 reduction in data stored. Without deduplication, 2.5 TB of disk backup would normally hold only two weeks of data retention.
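The totals in Figure 3 follow from simple arithmetic: the first full is stored whole at 2:1 compression, each incremental stores one day's change (1%) compressed, and each subsequent weekly full stores a week's change (5 working days at 1%) compressed. A quick check of the figure's numbers (variable names are ours):

```python
# Figure 3 arithmetic: 1 TB file server, 2:1 compression, 1% daily change,
# retention of 5 daily incrementals and 25 weekly fulls.
FULL_GB = 1000

# Data sent from the backup host over the retention period
sent = FULL_GB + 5 * 100 + 24 * FULL_GB          # 1st full + incrementals + later fulls

# Data stored with deduplication (2:1 compression throughout)
first_full = FULL_GB / 2                         # 500 GB: first full kept whole
incrementals = 5 * (FULL_GB * 0.01 / 2)          # 5 GB each: one day's change, compressed
weekly_fulls = 24 * (FULL_GB * 0.05 / 2)         # 25 GB each: a week's change (5 x 1%)
stored = first_full + incrementals + weekly_fulls

print(f"sent {sent:,.0f} GB, stored {stored:,.0f} GB, ratio ~{sent / stored:.0f}:1")
# sent 25,500 GB, stored 1,125 GB, ratio ~23:1
```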


Customer Benefits of Data Deduplication

What data deduplication offers to customers is:

• The ability to store dramatically more data online (by "online" we mean disk based)
• An increase in the range of Recovery Point Objectives (RPOs) available: data can be recovered from further back in time from the backup to better meet Service Level Agreements (SLAs). Disk recovery of a single file is always faster than tape
• A reduction of investment in physical tape, by restricting its use to more of a deep archiving and disaster recovery usage model
• Automation of the disaster recovery process, by providing the ability to perform site-to-site replication at a lower cost. Because deduplication knows what data has changed at a block or byte level, replication becomes more intelligent and transfers only the changed data as opposed to the complete data set. This saves time and replication bandwidth and is one of the most attractive propositions that deduplication offers. Customers who do not use disk-based replication across sites today will embrace low-bandwidth replication, as it enables better disaster tolerance without the need and operational costs associated with transporting data off-site on physical tape. Replication is performed at a tape cartridge level

Figure 4. Remote site data protection BEFORE low-bandwidth replication. At each local site, data is staged to disk and then copied to tape; tapes are made nightly and sent to an offsite vault for DR. The process is replicated at each site, requiring local operators to manage tape. Disk holds 2 weeks of data; restores beyond that come from tape (1 year retention). Risk and operational cost impact:

• Slow restores (from tape) beyond 2 weeks
• Loss of control of tapes when given to an offsite service
• Excessive cost for offsite vaulting services
• Frequent backup failures during off hours
• Tedious daily onsite media management of tapes, labels, and offsite shipment coordination


Figure 5. Remote site data protection AFTER low-bandwidth replication. Data on disk at the local sites is extended to 4 months and all restores come from disk; no tapes are created locally, and no operators are required at local sites for tape operations. Data is automatically replicated to the remote site across a WAN, with copies made to tape on a monthly basis for archive. Risk and operational cost impact:

• Improved RTO SLA: all restores are from disk
• No outside vaulting service required
• No administrative media management requirements at local sites
• Reliable backup process
• Copy-to-tape less frequently; consolidating tape usage to a single site reduces the number of tapes

To show how much of an impact deduplication can have on replication times, look at the following Figure 6. This model also takes into account a certain overhead in control information that has to be sent site to site, as well as the data deltas themselves. Currently, without deduplication, the full amount of data has to be transferred between sites, and in general this requires high-bandwidth links such as GbE or Fibre Channel. With deduplication, only the delta changes are transferred between sites, and this reduction allows lower-bandwidth links such as T3 or OC12 to be used at lower cost. The following example illustrates the estimated replication times for varying amounts of change. Most customers would be happy with a replication time of, say, 2 hours between sites using, say, a T3 link. The feed from HP D2D Backup Systems or HP Virtual Library Systems to the replication link is one or more GbE pipes.


Figure 6. Replication times with and without deduplication

Estimated time to replicate data for a 1 TB backup environment @ 2:1 compression.
Link rates (66% efficient): T1 = 1.5 Mb/s, T3 = 44.7 Mb/s, OC12 = 622.1 Mb/s

Without dedupe:
Backup type    Data sent    T1          T3          OC12
Full           500 GB       45.4 days   1.6 days    2.7 hrs
Incremental    50 GB        4.5 days    3.8 hrs     16 min

With dedupe:
Change rate    Data sent    T1          T3          OC12
0.5%           13.1 GB      29 hrs      59 min      4.3 min
1.0%           16.3 GB      35 hrs      73 min      5.3 min
2.0%           22.5 GB      49 hrs      102 min     7.3 min
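The link-speed arithmetic behind Figure 6 can be approximated as follows. This simple model ignores the control-information overhead mentioned above, so its results land close to, but not always exactly on, the table's figures:

```python
def replication_hours(data_gb: float, link_mbps: float, efficiency: float = 0.66) -> float:
    """Hours to push data_gb gigabytes over a link_mbps megabit/s link,
    derated to the 66% efficiency assumed in Figure 6."""
    bits = data_gb * 8e9                       # gigabytes -> bits
    return bits / (link_mbps * 1e6 * efficiency) / 3600

# With dedupe, only the deltas cross the link (Figure 6 change-rate rows)
for change, gb in [(0.5, 13.1), (1.0, 16.3), (2.0, 22.5)]:
    print(f"{change}% change ({gb} GB): "
          f"T1 {replication_hours(gb, 1.5):.0f} h, "
          f"T3 {replication_hours(gb, 44.7) * 60:.0f} min, "
          f"OC12 {replication_hours(gb, 622.1) * 60:.1f} min")
```

Running this reproduces the "with dedupe" rows to within rounding, which shows why delta-only replication makes a T3 link viable where a full 500 GB transfer would take weeks over T1.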

A word of caution

An initial synchronization of the backup device at the primary site and the one at the secondary site must be performed. Because the volume of data that requires synchronizing at this stage is high, a low-bandwidth link will not suffice. Synchronization can be achieved in three different ways:

• Provision the two devices on the same site and use a feature such as local replication over high-bandwidth Fibre Channel links to synchronize the data. Then ship one of the libraries to the remote site
• Install the two separate devices at separate sites and perform the initial backup at Site A. Copy the backup from Site A to physical tape, then transfer the physical tapes to Site B and import them. When the systems at both sites are synchronized, start low-bandwidth replication between the two
• After the initial backup at Site A, allow a multi-day window for initial synchronization, allowing the two devices to copy the initial backup data over a low-bandwidth link

Understanding Customer Needs for Data Deduplication

Both large and small organizations have remarkably similar concerns when it comes to data protection. What differs is the priority of their issues.

Figure 7. Common challenges with data protection amongst remote offices, SMEs, and large customers. The challenges shown include:

• Overcome a lack of dedicated IT resources
• Manage data growth
• Maintain backup application, file, and OS independence
• Spend less time managing backups
• Handle explosive data growth
• Meet and maintain backup windows
• Achieve greater backup reliability
• Accelerate restore from tape (including virtual tape)
• Manage remote site data protection

Different priorities are what have led HP to develop two distinct approaches to data deduplication. For example:

• Large enterprises have issues meeting backup windows, so any deduplication technology that could slow down the backup process is of no use to them. Medium and small enterprises are concerned about backup windows as well, but to a lesser degree
• Most large enterprise customers have Service Level Agreements (SLAs) pertaining to restore times; any deduplication technology that slows down restore times is not welcome either
• Many large customers back up hundreds of terabytes per night, and their backup solution with deduplication needs to scale up to these capacities without degrading performance. Fragmenting the approach by having to use several smaller deduplication stores would also make the whole backup process harder to manage
• Conversely, remote offices and smaller organizations generally need an easy approach: a dedicated appliance that is self-contained, at a reasonable cost
• Remote offices and SMEs do not want or need a system that is infinitely scalable, nor the cost associated with linearly scalable capacity and performance. They need a single-engine approach that can work transparently in any of their environments


HP Accelerated Deduplication for the Large Enterprise Customer

HP Accelerated deduplication technology is designed for large enterprise data centers. It is the technology HP has chosen for the HP StorageWorks Virtual Library Systems. Accelerated deduplication has the following features and benefits:

• Utilizes object-level differencing technology with a design centered on performance and scalability
• Delivers the fastest possible backup performance: it leverages post-processing technology to process data deduplication as backup jobs complete, deduplicating previous backups whilst other backups are still completing
• Delivers the fastest restore from recently backed up data: it maintains a complete copy of the most recent backup data and eliminates duplicate data in previous backups
• Scalable deduplication performance: it uses a distributed architecture where performance can be increased by adding additional nodes
• Flexible replication options to protect your investment

Figure 8. Object-level differencing compares only current and previous backups from the same hosts and eliminates duplicate data by means of pointers. The latest backup is always held intact. The stages shown in the diagram are:

• Data Grooming: identifies similar data objects in the current and previous backups
• Data Discrimination/Data Comparison: identifies differences at byte level and ensures data integrity; new data is stored, and duplicated data is replaced with a pointer to existing data
• Optional Second Integrity Check: compares deduplicated data to the original data objects
• Space Reclamation: deletes duplicated data and reallocates unused space
• Reassembly


How Accelerated Deduplication Works

When the backup runs, the data stream is processed as it is stored to disk, assembling a content database on the fly by interrogating the metadata attached by the backup application. This process has minimal performance impact.

1. After the first backup job completes, tasks are scheduled to begin the deduplication processing. The content database is used to identify subsequent backups from the same data sources. This is essential, since the way object-level differencing works is to compare the current backup from a host to the previous backup from that same host.

Figure 9. Identifying duplicated data by stripping away the metadata associated with backup formats, files, and databases. Two backup sessions containing the same file look different at a backup object level because of their different backup application metadata, but at a logical level they are identical. Object-level differencing deduplication strips away the backup metadata to reveal the real duplicated data.
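The idea in Figure 9 can be shown with a trivial sketch. The record format below is invented purely for illustration: two backup records wrap the same file payload in different session metadata, so they only compare equal once the wrapper is removed.

```python
# Sketch of the metadata-stripping idea in Figure 9 (record format is invented).
def strip_meta(record: dict) -> bytes:
    """Discard the backup-application metadata, keeping only the file payload."""
    return record["payload"]

rec1 = {"session": 1, "host": "srv01", "payload": b"actual file A"}
rec2 = {"session": 2, "host": "srv01", "payload": b"actual file A"}

print(rec1 == rec2)                           # False: the metadata differs
print(strip_meta(rec1) == strip_meta(rec2))   # True: real duplication revealed
```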

2. A data comparison is performed between the current backup and the previous backup from the same host. There are different levels of comparison. For example, some backup sessions are compared at an entire-session level: data is compared byte-for-byte between the two versions and common streams of data are identified. Other backup sessions compare versions of files within the backup sessions. Note that within Accelerated deduplication's object-level differencing, the comparison is done AFTER the backup metadata and file system metadata have been stripped away (see the example in Figure 9). This makes the deduplication process much more efficient, but it relies on an intimate knowledge of both the backup application metadata types and the data type metadata (file system file, database file, and so on).

3. When duplicate data is found in the comparison process, the duplicate data streams in the oldest backup are replaced by a set of pointers to a more recent copy of the same data. This ensures that the latest backup is always fully contiguous, and a restore from the latest backup will always take place at maximum speed.


Figure 10. With object-level differencing, the last backup is always fully intact. Duplicated objects in previous backups are replaced with pointers plus byte-level differences.

(Diagram: HP Accelerated Data Deduplication details. Backup sessions 1, 2, and 3 run on days 1, 2, and 3; A, B, C, and D are files within a backup session. Duplicated files in older sessions are shown replaced by differenced data plus a pointer to the current version.)

In the preceding diagram, backup session 1 had files A and B. When backup session 2 completed and was compared with backup session 1, file A was found, and a byte-level difference was calculated for the older version. So in the older backup (session 1), file A was replaced by pointers plus difference deltas to the file A data in backup session 2. Subsequently, when backup session 3 completes, it is compared with backup session 2, and file C is found to be duplicated. Hence a difference and a pointer are placed in backup session 2 pointing to the file C data in backup session 3; at the same time, the original pointer to file A in session 1 is readjusted to point to the new location of file A. This is to prevent multiple hops for pointers when restoring older data. So the process continues, every time comparing the current backup with the previous backup. Each time a difference plus pointer is written, storage capacity is saved. This process allows the deduplication to track even a byte-level change between files.
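The pointer bookkeeping just described can be sketched in a few lines. This is a much-simplified model, not HP's code: it treats files as whole-object matches, omits the byte-level deltas, and uses invented names throughout. The newest session always holds real data, and older pointers are re-targeted so a restore never follows more than one hop.

```python
# Simplified sketch of object-level differencing pointer management (illustrative only).
sessions = {}   # session number -> {file_name: ("data", bytes) or ("ptr", session)}

def ingest(session: int, files: dict) -> None:
    """Store a new backup session intact, then dedupe the previous session."""
    sessions[session] = {name: ("data", data) for name, data in files.items()}
    prev = session - 1
    if prev not in sessions:
        return
    # Replace duplicates in the previous session with pointers to this one.
    for name, entry in list(sessions[prev].items()):
        if entry[0] == "data" and name in files and entry[1] == files[name]:
            sessions[prev][name] = ("ptr", session)
    # Re-target any pointer that now aims at another pointer, so restores of
    # older sessions never chase multi-hop pointer chains.
    for contents in sessions.values():
        for name, entry in contents.items():
            if entry[0] == "ptr" and sessions[entry[1]][name][0] == "ptr":
                contents[name] = ("ptr", sessions[entry[1]][name][1])

def restore(session: int, name: str) -> bytes:
    entry = sessions[session][name]
    return entry[1] if entry[0] == "data" else sessions[entry[1]][name][1]

ingest(1, {"A": b"aaa", "B": b"bbb"})
ingest(2, {"A": b"aaa", "C": b"ccc"})   # session 1's copy of A becomes a pointer
ingest(3, {"A": b"aaa", "C": b"ccc"})   # old pointers re-target to session 3
print(restore(1, "A"), sessions[1]["A"])
# b'aaa' ('ptr', 3)
```

After the third ingest, session 3 holds all its data intact, while sessions 1 and 2 hold single-hop pointers, mirroring the "latest backup is always fully contiguous" property described above.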

4. Secondary integrity check: before a backup tape is replaced by a deduplicated version with pointers to a more recent occurrence of that data, a byte-for-byte comparison can take place, comparing the original backup with the reconstructed backup, including pointers, to ensure that the two are identical. Only when the compare succeeds will the original backup tape be replaced by a version including pointers. This step is optional (see the integrity check stage in Figure 8).

5. Space reclamation occurs when all the free space created by replacing duplicate data with pointers to a single instance of the data is complete. This can take some time and results in used capacity being returned to a free pool on the device.

Replication can take place from Step 3, because the changed data is available to be replicated even before the space has been reclaimed.


HP Accelerated deduplication:

• Will scale up to hundreds of TB
• Has no impact on backup performance, since the comparison is done after the backup job completes (post-process)
• Allows more deduplication compute nodes to be added, to increase deduplication performance and ensure the post-processing is complete before the backup cycle starts again
• Yields high deduplication ratios, because it strips away metadata to reveal true duplication and does not rely on data chunking
• Provides fast bulk data restore and tape cloning for recently backed up data: maintains the complete most recent copy of the backup data but eliminates duplicate data in previous backups

Issues Associated with Object-Level Differencing

The major issue with object-level differencing is that the device has to be knowledgeable in terms of backup formats and data types to understand the metadata. HP Accelerated deduplication will support a subset of backup applications and data types at launch.

Additionally, object-level differencing compares only backups from the same host against each other, so there is no deduplication across hosts; however, the amount of common data across different hosts can be quite low.

What Makes HP Accelerated Deduplication Unique?

The object-level differencing in HP Accelerated deduplication is unique in the marketplace. Unlike hash-based techniques, which are an all-or-nothing method of deduplication, object-level differencing applies intelligence to the process, giving users the ability to decide which data types are deduplicated and the flexibility to reduce the deduplication load if it is not yielding the expected or desired results. HP object-level differencing technology is also the only deduplication technology that can scale to hundreds of terabytes with no impact on backup performance, because the architecture does not depend on managing ever-increasing index tables, as is the case with hash-based chunking. It is also well suited to larger scalable systems, since it is able to distribute the deduplication workload across all the available processing resources and can even have dedicated nodes purely for deduplication activities.
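The general idea behind object-level differencing can be illustrated with a minimal, hypothetical sketch: compare the new version of a backup object against the prior version of the same object and keep only the changed regions. This is illustrative only; HP's actual engine is backup-format aware and compares whole backup objects at a byte level, not fixed-size blocks.

```python
# Hypothetical sketch of differencing: keep only the regions of the new
# version that differ from the old version, as (offset, bytes) deltas.
def diff(old: bytes, new: bytes, block: int = 512):
    """Return (offset, changed_bytes) pairs where new differs from old."""
    deltas = []
    for off in range(0, len(new), block):
        if new[off:off + block] != old[off:off + block]:
            deltas.append((off, new[off:off + block]))
    return deltas

def rebuild(old: bytes, deltas, new_len: int) -> bytes:
    """Re-create the new version from the old copy plus the deltas."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for off, data in deltas:
        buf[off:off + len(data)] = data
    return bytes(buf)

old = b"a" * 2048                            # yesterday's backup object
new = b"a" * 512 + b"b" * 512 + b"a" * 1024  # today's: one block changed
deltas = diff(old, new)
print(len(deltas))  # 1: only the changed block is kept
```

Because only the delta is stored, an almost-unchanged backup of the same object from the same host consumes almost no additional space.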

HP Accelerated deduplication will be supported on a range of backup applications:

- HP Data Protector
- Symantec NetBackup
- Tivoli Storage Manager
- Legato NetWorker

HP Accelerated deduplication will support a wide range of file types:

- Windows 2003
- Windows Vista
- HP-UX 11.x
- Solaris standard file backups
- Linux Red Hat
- Linux SuSE
- AIX file backups
- Tru64 file backups



HP Accelerated deduplication will support database backups over time:

- Oracle RMAN
- Hot SQL backups
- Online Exchange MAPI mailbox backups

For the latest details on which backup software and data types are supported with HP Accelerated Deduplication, please see the HP Enterprise Backup Solutions compatibility guide at http://www.hp.com/go/ebs

HP Accelerated deduplication technology is available by license on HP StorageWorks Virtual Library Systems (models 6000, 9000, and 12000). The license fee is per TB of user storage (before compression or deduplication takes effect).

Figure 11. Pros and cons of HP Accelerated Deduplication

PRO:
- Does not restrict backup rate, since data is processed after the backup has completed.
- Faster restore rate: forward-referencing pointers allow rapid access to data.
- Can handle datasets > 100 TB without having to partition backups; no hashing table dependencies.
- Can selectively compare data likely to match, increasing performance further and yielding higher deduplication ratios.
- Best suited to large Enterprise VTLs.

CON:
- Has to be ISV-format aware and data-type aware; content coverage will grow over time.
- May need additional compute nodes to speed up post-process deduplication in scenarios with long backup windows.
- Needs to cache 2 backups in order to perform the post-process comparison, so additional disk capacity equal to the size of the largest backup needs to be sized into the solution.

At ingest time, when the tape content database is generated, there is a small performance overhead (< 0.5%), and a small amount of disk space is required to hold this database (much less than the hash tables in hash-based chunking deduplication technology). Even if this content database were completely destroyed, it would still be possible to maintain access to the data, because the pointers are still fully intact and held within the re-written tape formats.

HP object-level differencing also has the ability to provide selective deduplication by content type, and in the future it could be used to index content, providing content-addressable archive searches.

The question often arises: "What happens if deduplication is not complete by the time the next run of the same backup from the same host arrives?" Typically the deduplication process takes about 2x as long as the backup process for a given backup, so as long as a single backup job does not take > 8 hours



this will not occur. In addition, the multi-node architecture ensures that each node is load-balanced to provide 33% of its processing capability to deduplication whilst still maintaining the necessary performance for backup and restore. Finally, additional dedicated 100% deduplication compute nodes can be added if necessary.

Let us now analyze HP's second type of deduplication technology, Dynamic deduplication, which uses hash-based chunking.

HP Dynamic Deduplication for Small and Medium IT Environments

HP Dynamic deduplication is designed for customers with smaller IT environments. Its main features and benefits include:

- Hash-based chunking technology with a design center around compatibility and cost
- Low cost and a small RAM footprint
- Independence from backup applications
- Systems with built-in data deduplication
- Flexible replication options for increased investment protection

Hash-based chunking techniques for data reduction have been around for years. Hashing consists of applying an algorithm to a specific chunk of data, yielding a unique fingerprint of that data. The backup stream is simply broken down into a series of chunks. For example, a 4K chunk in a data stream can be hashed so that it is uniquely represented by a 20-byte hash code. See Figure 12.

Figure 12. Hashing technology

[Diagram: three input strings ("HP invent", "HP StorageWorks", "HP Nearline Storage") each pass through the hashing function to produce a short hash output (DFCD3453, 785C3D92, 4673FD74B). In-line means deduplication on the fly as data is ingested, using hashing techniques; hashing is a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data.]
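The fingerprinting idea in Figure 12 can be sketched in a few lines of Python (illustrative only; the figure's hash values are shortened for display and will not match real SHA-1 output):

```python
import hashlib

# Hash a nominal 4K chunk of a backup stream into a 20-byte fingerprint.
chunk = b"HP StorageWorks".ljust(4096, b"\0")  # pad the sample to a 4K chunk
fingerprint = hashlib.sha1(chunk).digest()

print(len(fingerprint))       # 20 bytes, regardless of input size
print(fingerprint.hex()[:8])  # a short prefix, like the figure's examples
```

The same input always yields the same fingerprint, which is what makes the hash usable as a stand-in for the chunk itself.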

The larger the chunks, the less chance there is of finding an identical chunk that generates the same hash code; thus, the deduplication ratio will not be as high. The smaller the chunk size, the more efficient the data deduplication process, but then a larger number of indexes are created, which leads to problems storing enormous numbers of indexes (see the following example and the Glossary).

Figure 13. How hash-based chunking works

[Diagram: Backup 1 is split into chunks and the hashing function is applied, producing hashes #33, #13, #1, #65, #9, #245, #21, #127. Each hash is looked up in an index held in RAM that maps hash values to disk block numbers; new hashes are entered into the index and their chunks are stored on disk. Backup 2 produces hashes #33, #13, #222, #75, #9, #245, #86, #127; only the new hashes (#222, #75, #86) add index entries and stored chunks.]

How Dynamic Deduplication Works

1. As the backup data stream enters the target device (in this case the HP D2D2500 or D2D4000 Backup System), it is chunked into nominal 4K chunks, against which the SHA-1 hashing algorithm is run. The resulting hash values are placed in an index stored in RAM in the target D2D device. Each hash value is also stored as an entry in a recipe file, which represents the backup stream and points to the place in the deduplication store where the original 4K chunk is stored. This happens in real time as the backup is taking place. Step 1 continues for the whole backup data stream.

2. When another 4K chunk generates the same hash as a previous chunk, no entry is added to the index and the data is not written to the deduplication store. An entry with the hash value is simply added to the recipe file for that backup stream, pointing to the previously stored data, so space is saved. As you scale this up over many backups, there are many instances of the same hash value being generated, but the actual data is only stored once, so the space savings increase.

3. Now let us consider backup 2 in Figure 13. As the data stream is run through the hashing algorithm again, much of the data will generate the same hash codes as in backup 1; hence, there is no need to add indexes to the table or use storage in the deduplication store. In this backup, however, some of the data has changed. In some cases (#222, #75, and #86), the data is unique and generates new indexes for the index store and new data entries in the deduplication store.

4. The hashing process continues indefinitely until, as backups are overwritten by the tape rotation strategy, certain hash indexes are no longer required; a housekeeping operation then removes them.
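The ingest side of steps 1-4 can be sketched as follows. This is a minimal in-memory model, not HP's implementation: a real device keeps the index in RAM and the chunks and recipe files on disk.

```python
import hashlib

CHUNK_SIZE = 4096  # nominal 4K chunks

# index: hash -> stored chunk (stands in for the RAM index plus the
# deduplication store); recipes: backup name -> ordered list of hashes.
index = {}
recipes = {}

def ingest(name, stream):
    """Chunk a backup stream, store only unique chunks, build a recipe."""
    recipe = []
    for off in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[off:off + CHUNK_SIZE]
        h = hashlib.sha1(chunk).digest()  # 20-byte fingerprint
        if h not in index:                # unseen chunk: store it once
            index[h] = chunk
        recipe.append(h)                  # recipe records every chunk, stored or not
    recipes[name] = recipe

# Two backups sharing most of their data: the shared 8K prefix is stored once.
ingest("backup1", b"A" * 8192 + b"B" * 4096)
ingest("backup2", b"A" * 8192 + b"C" * 4096)
print(len(index))  # 3 unique chunks stored (A, B, C) instead of 6
```

Six 4K chunks were ingested, but only three distinct chunks occupy the store; each recipe still lists all of its chunks in order, so either backup can be rebuilt in full.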



Figure 14. How hash-based chunking performs restores

[Diagram: a recipe file stored in the dedupe store lists, in order, the hashes that constitute Backup 1 (#33, #13, #1, #65, #9, #245, #21, #127). When the restore commences, the recipe file is referenced; each hash is looked up in the RAM index to find its disk block number, the chunk is read from disk, and the tape blocks that constitute the backup are re-constructed.]

5. On receiving a restore command from the backup system, the D2D device selects the correct recipe file and starts sequentially re-assembling the file to restore:

a. Read the recipe file.
b. Look up each hash in the index to get the disk pointer.
c. Get the original chunk from disk.
d. Return the data to the restore stream.
e. Repeat for every hash entry in the recipe file.
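Steps a-e amount to walking the recipe and dereferencing each hash. A self-contained sketch, with in-memory stand-ins for the index and recipe file:

```python
import hashlib

# In-memory stand-ins: the backup was three chunks, one of them repeated,
# so the store holds only two distinct chunks.
chunks = [b"alpha" * 100, b"beta" * 100, b"alpha" * 100]
index = {hashlib.sha1(c).digest(): c for c in chunks}  # hash -> chunk on disk
recipe = [hashlib.sha1(c).digest() for c in chunks]    # ordered hash entries

# Steps a-e: read the recipe, look up each hash, fetch the chunk,
# and append it to the restore stream.
restored = b"".join(index[h] for h in recipe)
print(restored == b"".join(chunks))  # True
```

Every byte of the restore stream comes out of this lookup loop, which is why restores from a hash-based store are a reconstruction process rather than a straight sequential read.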



Issues Associated with Hash-Based Chunking

The main issue with hash-based chunking technology is the growth of the indexes and the limited amount of RAM available to store them. Let us take a simple example: suppose we have a 1TB backup data stream using 4K chunks, and every 4K chunk produces a unique hash value. This equates to 250 million 20-byte hash values, or 5GB of storage.

If we performed no other optimization (for example, paging of indexes onto and off disk), then the appliance would need 5GB of RAM for every TB of deduplicated unique data. Most server systems cannot support much more than 16GB of RAM. For this reason, hash-based chunking cannot easily scale to hundreds of terabytes.
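The arithmetic behind that example is easy to reproduce (shown here in binary units; decimal units give the same order of magnitude):

```python
TB = 2 ** 40         # 1 TB of backup data, in bytes
CHUNK = 4 * 2 ** 10  # 4K chunk size
HASH = 20            # bytes per SHA-1 hash value

n_chunks = TB // CHUNK         # 268,435,456 chunks (~250 million)
index_bytes = n_chunks * HASH  # 5,368,709,120 bytes = 5 GiB of index
print(n_chunks, index_bytes / 2 ** 30)
```

At 5 GiB of index per TB of unique data, a 100 TB store would need roughly 500 GiB of index, far beyond the RAM of the servers described above.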

Most lower-end to mid-range deduplication technologies use variations on hash-based chunking, with additional techniques to reduce the size of the indexes generated and hence the amount of RAM required, but generally at the expense of some deduplication efficiency or performance. If the index management is not efficient, it will either slow the backup down to unacceptable levels or miss many instances of duplicate data. The other option is to use larger chunk sizes to reduce the size of the index. As mentioned earlier, the downside of this is that deduplication will be less efficient. These algorithms can also be adversely affected by non-repeating data patterns that occur in some backup software tape formats. This becomes a bigger issue with larger chunk sizes.

HP has developed a unique, innovative technology, leveraging work from HP Labs, that dramatically reduces the amount of memory required for managing the index without sacrificing performance or deduplication efficiency. Not only does this technology enable low-cost, high-performance disk backup systems, but it also allows the use of much smaller chunk sizes, providing more effective data deduplication that is more robust to variations in backup stream formats and data types.

Restore times can be slow with hash-based chunking. As you can see from Figure 14, recovering a 4K piece of data from a hash-based deduplication store requires a reconstruction process. The restore can take longer than the backup did.

Finally, you may hear the term "hash collision": this means that two different chunks of data produce the same hash value, which would obviously undermine data integrity. The chances of this happening are remote, to say the least. HP Labs calculated that, using a twenty-byte (160-bit) hash such as SHA-1, the time required for a hash collision to occur is 100,000,000,000,000 years, based on backing up 1TB of data per working day.

Even so, HP Dynamic deduplication adds a further Cyclic Redundancy Check (CRC) at the tape record level that would catch the highly unlikely event of a hash collision.
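The quoted order of magnitude can be sanity-checked with a back-of-envelope birthday-bound estimate (a rough sketch; the exact figure depends on the collision probability one considers acceptable, so ballpark agreement is all that should be expected):

```python
import math

HASH_BITS = 160                  # SHA-1 digest size
CHUNKS_PER_TB = 2 ** 40 // 4096  # ~268 million 4K chunks per TB

# Birthday bound: a collision only becomes likely once roughly
# sqrt(2^160) = 2^80 distinct chunk hashes have been generated.
chunks_needed = math.isqrt(2 ** HASH_BITS)
days = chunks_needed / CHUNKS_PER_TB  # at 1 TB per working day
years = days / 250                    # ~250 working days per year
print(f"{years:.1e}")                 # on the order of 10^13 years
```

The estimate lands within the same ballpark as the figure quoted above, and either way it dwarfs any practical backup retention horizon.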

Despite the above limitations, deduplication using hash-based chunking is a well-proven technology and serves remote offices and medium-sized businesses very well. The biggest benefit of hash-based chunking is that it is totally data-format-independent: it does not have to be engineered to work with specific backup applications and data types. Products using hash-based deduplication technology still have to be tested with the various backup applications, but the design approach is generic.

HP is deploying Dynamic deduplication technology on its latest D2D Backup Systems, which are designed for remote offices and small to medium organizations. HP D2D2500 and 4000 Backup Systems come with deduplication as standard, with no additional licensing costs.



Figure 15. Pros and cons of hash-based chunking deduplication

PRO:
- Deduplication performed at backup time.
- Can instantly handle any data format.
- Fast search: algorithms already proven to aid hash detection.
- Low storage overhead: doesn't have to hold complete backups (TBs) for post analysis.
- Best suited to smaller-size VTLs.

CON:
- Can restrict ingest rate (backup rate) if not done efficiently, and could slow backups down.
- Restore time may be longer than object-level differencing because of the data regeneration process.
- Significant processing overhead, but deduplication is keeping pace with processor developments.
- Concerns over scalability when using very large hash indexes; for data sets > 100 TB, may have to start partitioning backups to ensure better hash index management.

What makes HP Dynamic Deduplication technology unique are algorithms developed with HP Labs that dramatically reduce the amount of memory required for managing the index, without sacrificing performance or deduplication effectiveness. Specifically, this technology:

- Uses far less memory, by implementing algorithms that determine the most optimal indexes to hold in RAM for a given backup data stream
- Allows the use of much smaller chunk sizes to provide more effective data deduplication, which is more robust to variations in backup stream formats or data types
- Provides intelligent storage of chunks and recipe files to limit disk I/O and paging
- Works well in a broad range of environments, since it is independent of backup software formats and data types



Low-Bandwidth Replication Usage Models

The second main benefit of deduplication is the ability to replicate the changes in data on site A to a remote site B at a fraction of the cost, because high-bandwidth links are no longer required. A general guideline is that a T1 link is about 10% of the cost of a 4Gb FC link over the same distance. Low-bandwidth replication will be available on both D2D and VLS products. Up to two GbE ports will be available for replication on D2D devices, and one GbE port per node will be available on the VLS products.

HP will support three topologies for low-bandwidth replication:

- Box-to-Box
- Active-Active
- Many-to-One

The unit of replication is a cartridge. On VLS, it will be possible to partition slots in a virtual library replication target device to be associated with specific source replication cartridges.

Figure 16. Active-Active replication on HP VLS and D2D systems with deduplication

[Diagram: two sites, each with a backup server and a VLS (VLS1 and VLS2), each hosting virtual libraries (VLib1 and VLib2), replicating to each other over TCP/IP. Generally datacenter-to-datacenter replication, with each device performing local backups and also acting as the replication store for the other datacenter.]



Figure 17. Many-to-one replication on HP VLS and D2D systems with deduplication

[Diagram: three source sites (VLS1, VLS2, VLS3), each with a backup server and a virtual library (VLib1), replicate over TCP/IP into a single central VLS4. A single destination target can be divided into multiple slot ranges to allow many-to-one replication without needing a separate replication library for each source.]

Initially it will not be possible for D2D devices to replicate into the much larger VLS devices, since their deduplication technologies are so different, but HP plans to offer this feature in the near future. What will be possible is to replicate multiple HP D2D2500 units into a central D2D4000, or to replicate smaller VLS6200 models into a central VLS12000 (see Figure 18).

Deduplication technology is leading us to the point where many remote sites can replicate data back to a central data center at a reasonable cost, removing the need for tedious off-site vaulting of tapes and fully automating the process, saving even more costs. This ensures:

- The most cost-effective solution is deployed at each specific site
- The costs and issues associated with off-site vaulting of physical tape are removed
- The whole disaster recovery process is automated
- The solution is scalable at all sites



Figure 18. Enterprise deployment with replication across remote and branch offices back to data centers

[Diagram: an enterprise deployment with small and large remote and branch offices (ROBOs). Small remote offices (1-4 servers, > 200 GB storage, mobile/desktop client agents and a backup server) back up to a D2D Appliance; large remote offices and regional sites or small datacenters back up over the LAN to D2D Appliances or a Virtual Library System. These replicate back to Virtual Library Systems in the large primary and secondary datacenters, where SAN-attached backup servers, disk storage, and a physical tape library complete the solution.]

Why HP for Deduplication?

Deduplication is a powerful technology and there are many different ways to implement it, but most vendors offer only one method and, as we have seen, no one method is best in all circumstances. HP offers a choice of deduplication technologies depending on your needs. HP does not pretend that one size fits all.

Choose HP Dynamic deduplication for small and mid-size IT environments, because it offers the best technology footprint for deduplication at an affordable price point. Flexible replication options further enhance the solution.

Choose HP Accelerated deduplication for Enterprise data centers where scalability and backup performance are paramount. Flexible replication options further enhance the solution.

Some competitors address the scalability issues associated with hash-based chunking by creating multiple separate deduplication stores behind a single management interface, but this creates islands of deduplication, so the customer sees reduced benefits and excessive costs because the solution is not inherently scalable.

At the data center level, HP's major competitors using object-level differencing have used bolt-on deduplication engines with existing virtual tape library architectures and have not integrated the deduplication engine within the VTL itself. This leads to data being moved back and forth between the virtual library and the deduplication engine, which is very inefficient.



Deduplication Technologies Aligned with HP Virtual Library Products

HP has a range of disk-based backup products with deduplication, starting with the entry-level D2D2500 at 2.25TB per unit for small businesses and remote offices, right up to the VLS12000 EVA Gateway with capacities over 1PB for the high-end enterprise data center customer. They emulate a range of HP physical tape autoloaders and libraries.

Figure 19. HP disk-based backup portfolio with deduplication

[Diagram: the HP StorageWorks disk-to-disk and virtual library portfolio with deduplication, arranged by capacity. Entry-level: D2D2500 and D2D4000 (iSCSI and FC), manageable and reliable, for midsized businesses or IT with remote branch offices or small data centres, using Dynamic deduplication (hash-based chunking). Mid-range: VLS6000 Family, a scalable, manageable, reliable appliance for medium to large data centers and medium to large FC SANs. Enterprise: VLS9000 and VLS12000 EVA Gateway, high-capacity, high-performance, available and scalable multi-node systems for enterprise data centers and large FC SANs, using Accelerated deduplication (object-level differencing).]

The HP StorageWorks D2D2500 and D2D4000 Backup Systems support HP Dynamic deduplication. These range in size from 2.25TB to 7.5TB and are aimed at remote offices or small enterprise customers. The D2D2500 has an iSCSI interface to reduce the cost of implementation at remote offices, while the D2D4000 offers a choice of iSCSI or 4Gb FC.

The HP StorageWorks Virtual Library Systems are all 4Gb SAN-attached devices, which range in native user capacity from 4.4TB to over a petabyte with the VLS9000 and VLS12000 EVA Gateway. Hardware compression is available on the VLS6000, 9000, and 12000 models, achieving even higher capacities. The VLS9000 and VLS12000 use a multi-node architecture that allows performance to scale in a linear fashion. With eight nodes, these devices can sustain a throughput of up to 4800 MB/sec at 2:1 data compression, providing the SAN hosts can supply data at this rate. HP Virtual Library Systems will deploy the HP Accelerated deduplication technology.



Summary

Data deduplication technology represents one of the most significant storage enhancements in recent years, promising to reshape future data protection and disaster recovery solutions. Deduplication offers the ability to store more data on a given amount of storage and enables replication over low-bandwidth links, both of which improve cost effectiveness.

HP offers two complementary deduplication technologies for different customer needs:

Accelerated deduplication (with object-level differencing) for high-end enterprise customers who require:

- Fastest possible backup performance
- Fastest restore
- Most scalable solution in terms of performance and capacity
- Multi-node low-bandwidth replication
- Highest deduplication ratios
- Wide range of replication models

Dynamic deduplication (with hash-based chunking) for midsize organizations and remote offices that require:

- Lower cost and a smaller footprint
- An integrated deduplication appliance with lights-out operation
- Backup application and data type independence for maximum flexibility
- Wide range of replication models

This whitepaper explained how HP's deduplication technologies work in practice, the pros and cons of each approach, when to choose a particular type, and the low-bandwidth replication models HP plans to support.

The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that scales to large multi-node systems and delivers high-performance deduplication for enterprise customers.

HP D2D (Disk to Disk) Backup Systems use Dynamic deduplication technology that provides a significant price advantage over competitors: a combination of HP patents allows optimal RAM usage (RAM footprint) with minimal new hash values being generated on similar backup streams. HP D2D backup systems with integrated deduplication set a new price point for deduplication devices.



Appendix A: Glossary of Terminology

Source-based Deduplication
Where data is deduplicated in the host(s) prior to transmission over the storage network. This generally tends to be a proprietary approach.

Target-based Deduplication
Where the data is deduplicated in a target device, such as a virtual tape library, and is available to all hosts using that target device.

Hashing
A reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data.

Chunks
A method of breaking down a data stream into segments (chunks); the hashing algorithm is run on each chunk.

SHA-1
Secure hashing algorithm 1. For example, SHA-1 can enable a 4K chunk of data to be uniquely represented by a 20-byte hash value.

Object-Level Differencing
A general IT term describing a process that has an intimate knowledge of the data it is handling, down to the logical format level. Object-level differencing deduplication means the deduplication process has an intimate knowledge of the backup application format and the file types being backed up (for example, Windows file system, Exchange files, and SQL files). This intimate knowledge allows file comparisons at a byte level to remove duplicated data.

Box-to-Box
Replication from a source to a destination in one direction.

Active-Active
Replication from a source device on Site A to a target device on Site B, and vice versa.

Many-to-One
Replication from multiple sources to a single destination device.

Deduplication ratio
The reduction in storage required for a backup (after several other backups have taken place). Figures between 10:1 and 300:1 have been quoted by different vendors. The ratio is highly dependent on:

- Rate of change of data (for example, 10% of the data in 10% of the files)
- Retention period of backups
- Efficiency of the deduplication technology implementation

Space Reclamation
With all deduplication devices, time is required to free up the space that was used by the duplicated data and return it to a free pool. Because this can be quite time consuming, it tends to occur in off-peak periods.

Post Processing
This is where the deduplication is done AFTER the backup completes, to ensure there is no way the deduplication process can slow down the backup and increase the backup window required.



In-Line
This is where the deduplication process takes place in REAL TIME, as the backup is actually taking place. Depending on the implementation, this may or may not slow the backup process down.

Multi-thread
Within HP object-level differencing, the compare and space reclamation processes are run along multiple paths simultaneously to ensure faster execution times.

Multi-node
HP VLS9000 and VLS12000 products scale to offer very high performance levels: up to eight nodes can run in parallel, giving throughput capabilities up to 4800 MB/sec at a 2:1 compression ratio. This multi-node architecture is fundamental to HP's Accelerated deduplication technology because it allows maximum processing power to be applied to the deduplication process.

    A pp endix B Dedupl ication comp ared to other data

    reduction technologies

    Technology: Deduplication
    Description: Advanced technique for efficiently storing data by referencing existing blocks of data that have been previously stored, and only storing new data that is unique.
    Pro: Two-fold benefits: space savings of between 10:1 and 100:1 being quoted, plus the further benefit of low-bandwidth replication.
    Con: Can slow backup down if not implemented efficiently. Hash-based technologies may not scale to 100s of TB. Object-level differencing technologies need to be multi-format aware, which takes time to engineer.
    Comments: Deduplication is by far the most impressive disk storage reduction technology to emerge in recent years. Implementation varies by vendor; benchmarking is highly recommended.

    Technology: Single instancing
    Description: Is really deduplication at a file level. Available as part of the Microsoft file system and as a feature of the file system of a NetApp filer.
    Pro: System-based approach to space savings.
    Con: Will not eliminate redundancy within a file, only when two files are exactly the same. For example, adding files to a PST file, or adding a slide to a presentation, defeats it.
    Comments: Limited use.

    Technology: Array-based snapshots
    Description: Capture changed blocks on a disk LUN. Used primarily for fast roll-back to a consistent state using image recovery; not really focused on storage efficiency.
    Con: Does not eliminate redundant data within the changed blocks. Captures any change made by the file system; for example, it does not distinguish between real data and deleted/free space on disk.
    Comments: Well established. Generally used for quick recovery to a known point in time.

    Technology: Incremental Forever backups
    Description: Re-create a full restore image from just one full backup and many incrementals.
    Pro: Minimizes the need for frequent full backups and hence allows for smaller backup windows.
    Con: More focused on time savings than on space savings. Generally only works with file system backups, not database backups.

    Technology: Compression (software or hardware)
    Pro: Fast if done in hardware, slower if done in software. Well established and understood.
    Con: Maximum space savings are generally 2:1.
    Comments: Can be used in addition to deduplication.
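    The block-level approach described in the first table entry, storing each unique chunk once and referencing it thereafter, can be sketched in a few lines. This is a minimal illustration only (fixed-size 4 KB chunks, SHA-256 digests); it is not HP's implementation, which the paper describes as using other techniques at scale:

    ```python
    import hashlib

    def dedupe_store(stream: bytes, store: dict, chunk_size: int = 4096) -> list:
        """Split data into fixed-size chunks; store each unique chunk once,
        keyed by its SHA-256 digest, and return the list of digests (the
        'recipe') needed to reconstruct the original stream."""
        recipe = []
        for i in range(0, len(stream), chunk_size):
            chunk = stream[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in store:      # only new, unique data is stored
                store[digest] = chunk
            recipe.append(digest)
        return recipe

    def restore(recipe: list, store: dict) -> bytes:
        return b"".join(store[d] for d in recipe)

    # Two "backups" sharing most of their data: shared chunks are stored once.
    store = {}
    first = b"A" * 8192 + b"B" * 4096
    second = b"A" * 8192 + b"C" * 4096   # first 8 KB unchanged
    r1 = dedupe_store(first, store)
    r2 = dedupe_store(second, store)
    assert restore(r1, store) == first
    assert restore(r2, store) == second
    print(len(store))  # 3 unique chunks stored instead of 6
    ```

    Note how this also illustrates the single-instancing limitation from the table: whole-file matching would treat the two backups as entirely different, while chunk-level hashing recovers the shared 8 KB.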


    For more information

    www.hp.com/go/tape
    www.hp.com/go/D2D
    www.hp.com/go/VLS
    www.hp.com/go/deduplication

    HP StorageWorks customer success stories

    © Copyright 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

    Linux is a U.S. registered trademark of Linus Torvalds. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered trademark of The Open Group.
