
CA ARCserve r16.0 - Data Deduplication

Frequently Asked Questions


Table of Contents

For any backup on to Deduplication device, how many files does ARCserve generate and what do they contain?
Can I create Deduplication device on Non-NTFS volume?
Can I configure data and index path on the same volume?
Can I submit multiplexing jobs to Dedupe device?
Where does Deduplication happen in ARCserve?
Will the high CPU utilization impact backup throughput?
Even though I take a backup of a small file size (say 10MB), the data file size created on the disk is 512MB for the first time backup. Why is it so?
What is the minimum size of the hash file and how is it calculated? Is the size of the hash file calculated based on the number of files or the size of the source data?
What happens when a Deduplication Device is backed up and at the same time some sessions are being backed up to the same Deduplication device?
Which are the agents for which Deduplication cannot be performed effectively and why?
Is there any way to say the backed up data on to Deduplication Device is safe? Is there any way to guarantee that the data is accurate (if the MD5 algorithm gets the same hash for different blocks then the backup will be successful but the data is not accurate)?
Can I configure a Deduplication device on a remote machine?
How does the purge concept work for a Deduplication device? If I configure to purge a session 30 minutes after the backup job, will this happen immediately after 30 minutes?
When does hash_ToPurge change to hash_InPurge?
When data is purged, how is fragmentation handled?
Do you support "Overwrite same media" option for Deduplication device?
How is a hash key built by the MD5 algorithm and is this key unique?
Can I delete a particular session from a Deduplication device?
What is the difference between normal incremental/differential and Deduplication incremental/differential backup jobs? Is deduplication supported only for the full backup jobs?
How does Deduplication support optimization and how does optimization happen? Is optimization supported for all types of backups on to the Deduplication Device?
Do you support optimization during tapecopy?
Do you support compression and encryption backups on to Deduplication devices?
I have some sessions which were encrypted sessions. Can I use tapecopy to copy such sessions on to a Deduplication device?
I have backup sessions on my tape and these sessions are from older version agents. Can I tapecopy these sessions on to a Deduplication device?
How does data migrate from the Deduplication Device to a tape library? Will it create the original data in a temp location and then copy the temp to tape?
I have a 1GB file in D:\ drive on machine A and I have the same 1GB file in E:\ drive on the same machine A. Is Deduplication possible in this case?
I have 50 client agent machines in my environment. What is the best solution to submit backup on to the Deduplication device in the final destination?
For the subsequent backups on to Deduplication, can I know any information related to the number of new chunks inserted, the number of duplicate blocks found and the total amount that is written on to the Deduplication device?
Is the reclaim of disk space from a purged session a time consuming process?
I got an error "Deduplication Device internal error". What does it mean?
The folder structure for data and index locations has a structure like 000/000/0... What is the reason for such a structure and what are its benefits?
Deduplication devices cannot be assigned to media pools. So, how can a GFS job still be submitted to a Deduplication device without a media pool?
How to handle hash collision in Deduplication backups?
What is Global Deduplication? Where can I find the logs of Global Deduplication?
What are the limitations of Global Deduplication?
The space consumed by the Deduplication Device is high. There are old redundant sessions in the Data folder occupying this space. What action should be taken?
Are there any tools available for troubleshooting Deduplication?
Deduplication as a staging device and Data Migration: what is the best practice to follow so that there are no redundant unpurged sessions occupying space in the Dedupe Device?
Would the source file type impact deduplication compression? Why is there a low compression on a specific backup source?


For any backup on to Deduplication device, how many files does ARCserve generate and what do they contain?

Answer: ARCserve generates 3 files: a .data file in the data folder, and .hash and .ref files in the index folder. The .data file contains the actual source data, the .hash file contains all the hash entries, and the .ref file contains the reference counters to the data and the offsets to the data.
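As a rough illustration, the following minimal Python sketch locates the three per-session files, assuming a hypothetical common session prefix in the file names (the actual naming scheme is not described here):

import glob
import os

def find_session_files(data_dir, index_dir, session_prefix):
    """Illustrative only: collect the .data, .hash and .ref files that
    belong to one deduplication session, assuming a shared prefix."""
    return {
        "data": glob.glob(os.path.join(data_dir, session_prefix + "*.data")),
        "hash": glob.glob(os.path.join(index_dir, session_prefix + "*.hash")),
        "ref":  glob.glob(os.path.join(index_dir, session_prefix + "*.ref")),
    }

# Example usage with made-up paths:
# print(find_session_files(r"H:\dedupe\data", r"H:\dedupe\index", "sis_1055452626"))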

Can I create Deduplication device on Non-NTFS volume?

Answer: No. Deduplication devices are supported only on NTFS volumes. Even for the classic FSD, it is recommended to configure it on an NTFS volume. Currently this is not blocked for FSD, to stay compatible with the previous behavior. An error message is displayed on the GUI if a user tries to create a Deduplication device on a non-NTFS volume.

Can I configure data and index path on the same volume?

Answer: Yes, you can configure data and index on the same volume, but not in the same path. It is recommended to configure the data and index files on different volumes. For example, the index file path should reside on a disk with a fast seek time, such as a solid state disk, and the data files should be configured on a disk with high I/O speed.

Can I submit multiplexing jobs to Dedupe device?

Answer: No. You cannot submit multiplexing jobs to File System Device, Deduplication device, RAID, NAS, OLO or Removable Drive groups.

Where does Deduplication happen in ARCserve?

Answer: At the server. The costly MD5 processing is performed on the backup servers without taxing the agent machines. Deduplication at the server side has the advantage that agent servers, which are production servers, are not taxed. Server-side Deduplication was chosen because there is no need to change all the agents in order to work with Deduplication.



Will the high CPU utilization impact backup throughput?

Answer: When the first backup happens, 2 backup streams per CPU provide a throughput that is about 15% less than the non-Deduplication throughput.

When the second backup happens, because of optimization, the throughput is much higher (almost twice the non-Deduplication throughput). The CPU utilization is also low if the backup is not happening for the first time. For sessions which cannot be optimized (such as SQL and RMAN stream-based sessions), the CPU utilization will remain high.

Even though I take a backup of a small file size (say 10MB), the data file size created on the disk is 512MB for the first time backup. Why is it so?

Answer: To reduce fragmentation, the Tape Engine pre-allocates 1GB by default; this value can be changed through the registry. If the empty space is more than 512MB, the size is reduced by 512MB. So even though only 10MB is saved into a separate session, the file size is 512MB. This was the behavior with r12.5. From r15 onwards there is an enhancement so that the file will be only 10MB in the above case: the data file is truncated to the actual size of the valid data at the end of the backup job.

What is the minimum size of the hash file and how is it calculated? Is the size of the hash file calculated based on the number of files or the size of the source data?

Answer: For every hash file, the minimum size is 4.78MB. SIS.dll pre-allocates 4MB+800KB in every allocation of the hash file. This improves performance and reduces fragmentation, and is normal behavior by design.

The size of the hash file is based on the size of the backup. The relationship between the two is: the hash file grows by 4MB+800KB (call this a CHUNK; it is about 4.8MB) for every 100,000 hash entries, and every hash entry represents 8KB to 32KB of data (in a backup, the average value is about 20KB).

So for 1TB of data there are about 1TB/20KB = 53,687,092 hash entries, which is 53,687,092/100,000 = 537 chunks, and 537 * 4.8MB is about 2.5GB.

The above is just an example. In a real backup it may be greater or less than 2.5GB.
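The arithmetic above can be reproduced with a short sketch (a rough estimate only, using the averages quoted in the answer):

def estimate_hash_file_size(backup_bytes, avg_chunk_kb=20,
                            entries_per_alloc=100_000,
                            alloc_kb=4 * 1024 + 800):
    """Estimate the hash file size from the backup size, using the figures
    from the answer: one 4MB+800KB allocation per 100,000 hash entries,
    one hash entry per ~20KB of source data on average."""
    hash_entries = backup_bytes // (avg_chunk_kb * 1024)
    allocations = -(-hash_entries // entries_per_alloc)   # ceiling division
    return allocations * alloc_kb * 1024                  # size in bytes

# 1 TB of source data -> roughly 2.5 GB of hash file, as in the example
print(estimate_hash_file_size(1024**4) / 1024**3)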

What happens when a Deduplication Device is backed up and at the same time some sessions are being backed up to the same Deduplication device?

Answer: If there are sessions being backed up to the Deduplication Device while the Deduplication Device itself is being backed up to another destination device, the active sessions might be incomplete on the new destination device. For those sessions, the merge operations fail, but there is no data loss and no inconsistency.



Which are the agents for which Deduplication cannot be performed effectively and why?

Answer: Deduplication is not very effective for Oracle RMAN backups because of the special way the RMAN session data is populated: Deduplication cannot find the redundant data even though the same source is repeatedly backed up.

Deduplication is not very effective for AS400 backups for the same reason: because of the special way the AS400 session data is populated, Deduplication cannot find the redundant data even though the same source is backed up repeatedly.

For NetWare sessions, if the source path exceeds 8, Deduplication does not happen.

Is there any way to say the backed up data on to the Deduplication Device is safe? Is there any way to guarantee that the data is accurate (if the MD5 algorithm gets the same hash for different blocks then the backup will be successful but the data is not accurate)?

Answer: There are two options in ARCserve: Media Assure & Scan and the Compare utility. With Media Assure & Scan it can be said that the data is safe, and with the Compare utility it can be said that the destination data is accurate.

Can I configure a Deduplication device on a remote machine?

Answer: Yes, you can configure a Deduplication device on any remote Windows machine. To do this, provide the security credentials for the remote machine. The Deduplication data path and index path can be configured as follows:
- Both are local paths
- One is a local path and the other is a remote path
- Both are remote paths; in that case they must use the same access credentials

How does the purge concept work for a Deduplication device? If I configure to purge a session 30 minutes after the backup job, will this happen immediately after 30 minutes?

Answer: For one Deduplication Device session there are 3 files: the .hash and .ref files in the index folder, and the .data file in the data folder. Purging a Deduplication Device session takes 2 steps:

Step 1: The .hash file is renamed to .hash_ToPurge and the session record is removed from the database. After this step the session can no longer be restored or scanned. The data file and ref file still exist in the same folders as before.

Step 2: The .hash_ToPurge file is renamed to .hash_InPurge and then deleted. If no other sessions are deduplicated against this session, the ref and data files are deleted as well; otherwise the ref file and data file are not deleted.

To check whether a Deduplication Device session was purged or not, check the index folder to see whether the hash file was deleted or renamed. There is no need to check the data file.
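A minimal sketch of the two-step purge described above (the rename and delete steps follow the answer; the reference check is simplified to a flag and the helper names are hypothetical):

import os

def purge_step1(index_dir, session):
    """Step 1: rename .hash to .hash_ToPurge; the session can no longer be
    restored or scanned, but its .ref and .data files remain on disk."""
    hash_file = os.path.join(index_dir, session + ".hash")
    os.rename(hash_file, hash_file + "_ToPurge")      # -> session.hash_ToPurge

def purge_step2(index_dir, data_dir, session, still_referenced):
    """Step 2: rename to .hash_InPurge and delete it; drop the .ref and .data
    files only if no other session still references this session's data."""
    to_purge = os.path.join(index_dir, session + ".hash_ToPurge")
    in_purge = os.path.join(index_dir, session + ".hash_InPurge")
    os.rename(to_purge, in_purge)
    os.remove(in_purge)
    if not still_referenced:
        os.remove(os.path.join(index_dir, session + ".ref"))
        os.remove(os.path.join(data_dir, session + ".data"))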



When does hash_ToPurge change to hash_InPurge?

Answer: The Tape Engine checks the Deduplication Device every 6 hours. Once the hash file has been renamed to .hash_ToPurge, the next time the check runs the Tape Engine renames the hash file to .hash_InPurge. It then evaluates the reference counts in the ref file. If all the references are 0, no other session refers to this session, so the Tape Engine deletes the 3 files of this session. If any reference is not 0, the data file and ref file are retained. The default time interval is 6 hours; it can be reduced with the following registry setting: create a DWORD value PurgeIntervalForDDD and set it to 3600 (i.e. 1 hour) under

HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\CA ARCServe Backup\Base\TapeEngine\DEBUG\

PurgeIntervalForDDD = 3600 (decimal)
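For reference, a hedged sketch of setting that DWORD with Python's winreg module; the key path is taken from the answer above, but whether a 32-bit registry view applies depends on the installation, so treat this as illustrative only and run it on the backup server with administrative rights:

import winreg

KEY_PATH = r"SOFTWARE\ComputerAssociates\CA ARCServe Backup\Base\TapeEngine\DEBUG"

# Create the DEBUG key if it is missing and set PurgeIntervalForDDD to
# 3600 seconds (1 hour).
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "PurgeIntervalForDDD", 0, winreg.REG_DWORD, 3600)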

When data is purged, how is fragmentation handled?

Answer: There are two types of disk reclamation: Delayed Disk Reclamation and Expedited Disk Reclamation. The ref files maintain the reference count of each hash. When a session is purged, the ref count of each hash in that session is reduced.

During this process, if some ref counts become 0, the corresponding data is not deleted right away but remembered as holes.

If Delayed Disk Reclamation is enabled, reclamation is performed whenever the number of holes is >= 25% of the number of chunks in the data file, and the unused space is reclaimed. These steps take care of the fragmentation problem. With Expedited Disk Reclamation, reclamation is performed as long as the number of holes is > 0.
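A small sketch of the decision rule described above, assuming simple counters for holes and chunks (the function name is illustrative):

def should_reclaim(num_holes, num_chunks, expedited=False):
    """Delayed Disk Reclamation waits until holes reach 25% of the chunks
    in the data file; Expedited Disk Reclamation reclaims as soon as any
    hole exists."""
    if expedited:
        return num_holes > 0
    return num_chunks > 0 and num_holes >= 0.25 * num_chunks

# Example: 27466 holes out of 27576 chunks (as in the gdd.log sample later on)
print(should_reclaim(27466, 27576))   # True under Delayed Disk Reclamation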

Do you support “Overwrite same media” option for Deduplication device?

Answer:

1. Deduplication devices do not support "Overwrite Same Media Name". A backup job will always append to a Deduplication device irrespective of the options selected ("Overwrite Same Media Name or Blank Media" or "Overwrite Same Media Name, or Blank Media First, then Any Media").
2. If you want to format the content of a Deduplication device, format it manually in the ARCserve Manager.

How is a hash key built by the MD5 algorithm and is this key unique?

Answer: ARCserve uses MD5 as the hash algorithm. To reduce the chance of an MD5 conflict, a "weak hash" and the data length are added to the hash value (the data is verified against these 3 values). The total length of the hash value is therefore 24 bytes (16 bytes MD5, 4 bytes weak hash and 4 bytes length value). But if there is a conflict (which is rare), ARCserve cannot work appropriately: since the hash is the same, ARCserve cannot detect that the 2 data blocks are actually different.

As every hash algorithm has a probability of conflict, the only way to catch a conflict is to compare the raw data (using the Compare utility in ARCserve after backup), but the performance of comparing data during the backup job is unacceptable.
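As an illustration of the 24-byte layout (16-byte MD5 + 4-byte weak hash + 4-byte length), here is a hedged sketch; the actual weak hash algorithm is not documented here, so Adler-32 is used purely as a stand-in:

import hashlib
import struct
import zlib

def build_hash_key(block: bytes) -> bytes:
    """Build a 24-byte key for a data block: 16 bytes MD5, 4 bytes weak
    hash (Adler-32 as a placeholder), 4 bytes data length."""
    md5 = hashlib.md5(block).digest()                            # 16 bytes
    weak = struct.pack("<I", zlib.adler32(block) & 0xFFFFFFFF)   # 4 bytes
    length = struct.pack("<I", len(block))                       # 4 bytes
    return md5 + weak + length

key = build_hash_key(b"example data block")
print(len(key))   # 24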


Can I delete a particular session from a Deduplication device?

Answer: Yes. You can delete a particular session from the Deduplication device by going to the Restore by Session view, selecting the session and clicking the "delete selected records" option.

What is the difference between normal incremental/differential and Deduplication incremental/differential backup jobs? Is deduplication supported only for the full backup jobs?

Answer: In a normal incremental/differential backup to an FSD, ARCserve backs up the whole file that has changed since the last backup and writes it to the disk, whereas in a Deduplication incremental/differential backup only the changed content within the file is backed up, not the entire file.

For example, assume the first backup was a full backup of 5 files of 1 GB each, i.e. 5 GB. Assume there are no redundancies, so 5 GB is written to the disk. (If there are redundancies, less than 5 GB may be written.)

Now assume that 0.1 GB of one file has changed, so the next incremental backup tries to back up 1.1 GB. When the hashes of this backup are compared against the full backup's hashes, it is found that most of the blocks are duplicates. Hence only about 0.1 GB might be written to the disk.

Assume that 0.1 GB changes every day and the next full backup is done on the 5th day, which means on the 5th day the total data size is 5.5 GB. When Deduplication happens within this full backup, the current hashes are compared against the last full backup's hashes. After comparison it may be found that 5 GB of data is redundant, so only about 0.5 GB might be written to disk during this backup.
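A minimal sketch of the comparison described above: chunk hashes of the new backup are checked against the previous session's hashes, and only new chunks count toward what is written. Fixed-size chunking is an assumption made for the illustration:

import hashlib
import os

def dedupe_against_previous(new_data: bytes, previous_hashes: set,
                            chunk_size: int = 32 * 1024):
    """Return (bytes_written, updated_hashes): only chunks whose hash is not
    already known from the previous session count as written."""
    written = 0
    known = set(previous_hashes)
    for i in range(0, len(new_data), chunk_size):
        chunk = new_data[i:i + chunk_size]
        digest = hashlib.md5(chunk).digest()
        if digest not in known:
            written += len(chunk)
            known.add(digest)
    return written, known

# "Full backup" of 1 MB, then an "incremental" where only the last 4 KB changed:
full = os.urandom(1024 * 1024)
_, hashes = dedupe_against_previous(full, set())
changed = full[:-4096] + os.urandom(4096)
written, _ = dedupe_against_previous(changed, hashes)
print(written)   # only the chunk containing the change is written (32 KB here)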

How does Deduplication support optimization and how does optimization happen? Is optimization supported for all types of backups on to the Deduplication Device?

Answer: When the MD5 calculations happen, the CPU utilization shoots up, which is understandable during the first backup. But doing the MD5 calculation for every backup, even if there are no changes in the files, is questionable. Hence the concept of optimization, i.e. MD5 calculations only for the changed files.

Optimization mechanism: first, the Tape Engine component loads the catalog file of the last session and checks the following information:

Number of files/directories processed = [11], size of data = [102400] KB.
Number of unmodified files/directories = [11], size of data = [102400] KB.
Number of modified files/directories = [0], size of data = [0] KB.
Number of newly inserted files/directories = [0], size of data = [0] KB.
Number of deleted files/directories = [0], size of data = [0] KB.
Total amount of data for which MD5 was calculated = [1100] KB
Total amount of data for which MD5 was not calculated = [101363] KB



Based on the above information the Tape Engine knows which data it needs to calculate MD5 on again. The main advantage of implementing optimization in Deduplication is better throughput for subsequent backups; calculating MD5 every time is also a CPU overhead, which the optimization feature avoids.

The optimization feature is only supported for Windows file system backups; it is not supported for stream-based backups such as SQL, Oracle, etc.
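A simplified sketch of the idea behind optimization: only files whose catalog entries show them as modified or newly inserted go through MD5 again. The per-file metadata used here (size and modification time) is a hypothetical stand-in for the catalog counters shown above:

def select_files_for_md5(previous_catalog: dict, current_files: dict):
    """previous_catalog / current_files map path -> (size, mtime).
    Files that are new or whose size/mtime changed need MD5 recalculation;
    unchanged files can reuse the hashes from the last session."""
    need_md5, reuse = [], []
    for path, meta in current_files.items():
        if previous_catalog.get(path) == meta:
            reuse.append(path)
        else:
            need_md5.append(path)
    return need_md5, reuse

prev = {"C:/data/a.txt": (100, 1111), "C:/data/b.txt": (200, 2222)}
curr = {"C:/data/a.txt": (100, 1111), "C:/data/b.txt": (250, 3333),
        "C:/data/c.txt": (50, 4444)}
print(select_files_for_md5(prev, curr))
# (['C:/data/b.txt', 'C:/data/c.txt'], ['C:/data/a.txt'])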

Do you support optimization during tapecopy?

Answer: Optimization is a feature at the Deduplication group level, so it does not matter whether it is a backup job or a tapecopy job; the optimization feature applies to both.

Do you support compression and encryption backups on to Deduplication devices?

Answer:
a. If you are using only a Deduplication device as the destination device with no other device in the staging phase, or if you are using an FSD/tape in staging and Deduplication in the destination, then no type of encryption or compression is supported. ARCserve displays a pop-up stating that the Deduplication device is the final destination, so compression and encryption are not supported. If you still submit the job with these options, the job will be submitted without applying encryption or compression.
b. If you are using a Deduplication device in the staging phase and an FSD/tape in the destination phase, then encryption and compression can happen only on the server side during the migration phase.

I have some sessions which were encrypted sessions. Can I use tapecopy to copy such sessions on to a Deduplication device?

Answer: No. Such encrypted sessions will be skipped by the tapecopy job.

I have backup sessions on my tape and these sessions are from older version agents. Can I tapecopy these sessions on to a Deduplication device?

Answer: No. ARCserve does not support tapecopy of older version agent sessions on to a Deduplication device. Even a direct backup of older version agents on to a Deduplication device is not allowed.

How does data migrate from the Deduplication Device to a tape library? Will it create the original data in a temp location and then copy the temp to tape?

Answer: No data is copied to a temp location and then on to the tape. Data is copied directly from the data file, by referencing the hash and reference files, and migrated on to the tape. In this way the un-deduplicated source data is reconstructed.


I have a 1GB file in D:\ drive on machine A and I have the same 1GB file in E:\ drive on the same machine A. Is Deduplication possible in this case?

Answer: No. Deduplication is possible only when the root path is the same as in the previous session. In the above case the root path for one file is D:\ and the root path for the other file is E:\, so Deduplication cannot happen between these sessions. Also, Deduplication refers only to the last session of that root directory.

I have 50 client agent machines in my environment. What is the best solution to submit backup on to the Deduplication device in the final destination?

Answer: The best way is to select all the client agent machines and submit a backup on to the Deduplication device with multiple streams, and it is recommended not to exceed 6 streams. The reason is that each stream consumes 110MB of memory; if you submit all the streams simultaneously, there is a chance of the job failing with errors such as shared memory errors and "unable to close session". If you submit stream by stream, the GUI has a mechanism for checking the available memory, but if you submit everything in a single shot this might cause problems.

For the subsequent backups on to Deduplication, can I know any information related to the number of new chunks inserted, the number of duplicate blocks found and the total amount that is written on to the Deduplication device?

Answer: The tape log provides information as displayed below:

**************SIS STATISTICS FOR THIS RUN START**************
Num of duplicates found in this run = 9548
Num of new chunks inserted in this run = 36
Mean chunk size = 22 KB
DATA THAT WAS SUPPOSED TO BE WRITTEN IN THIS RUN = 198016 KB
DATA THAT WAS ACTUALLY WRITTEN IN THIS RUN = 816 KB
COMPRESSION as percent = 100
COMPRESSION as ratio = 242
**************SIS STATISTICS FOR THIS RUN END**************

Is the reclaim of disk space from a purged session a time consuming process?

Answer: No, it is not a time consuming process. In general, reclaiming 10GB takes around 1 minute, and this can be seen in the tape log as shown below:

File size = 163053568 KB
File size on disk = 163053568 KB
After reclaimed:
File size = 163053568 KB
File size on disk = 153181504 KB
Number of Holes reclaimed in this run = 471477
Size of data occupied by the holes = 10036037 KB
Size of data actually reclaimed in this run = 9872064 KB
Time taken to reclaim = 64704 msecs



The above holds only for Expedited Disk Reclamation; Delayed Disk Reclamation is a time consuming process.

I got an error "Deduplication Device internal error". What does it mean?

Answer: This message is usually displayed when:
a. The session header is corrupted.
b. Some hash or ref files are missing for the current backups.
c. The data file got corrupted for some reason.
d. The corresponding session is not present on the device.
e. The database is updated with a wrong status for the session.

It is recommended to review the tape log for that particular period and take the necessary steps.

The folder structure for data and index locations has a structure like 000/000/0... What is the reason for such a structure and what are its benefits?

Answer: Prior to r12.5 there was no Data Deduplication; FSDs were the only disk storage devices in ARCserve, and only 65656 sessions could be stored on an FSD. For a Deduplication device it is quite possible to have more than 65656 sessions over a span of time. If a single folder holds a huge number of sessions, performance problems can arise while browsing the sessions from the Restore Manager. To avoid such performance problems, the 000/000/0... structure is used. In this structure, when the last subfolder reaches 999 sessions, the next session is created in a subfolder named 1, and so on. In this way up to 1000*1000*1000*1000 = 1,000,000,000,000 sessions can be stored without any performance problems.
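As an analogy for spreading sessions across nested subfolders so that no single directory grows too large, here is a hedged sketch; the exact mapping ARCserve uses is not spelled out here, so this is only an illustration of the idea:

import os

def session_subfolder(session_number: int) -> str:
    """Spread sessions over nested folders so that each directory level holds
    at most 1000 entries (illustrative mapping, not ARCserve's actual one)."""
    low = session_number % 1000
    mid = (session_number // 1000) % 1000
    top = (session_number // 1_000_000) % 1000
    return os.path.join(f"{top:03d}", f"{mid:03d}", str(low))

print(session_subfolder(5))          # 000/000/5 (backslashes on Windows)
print(session_subfolder(1_234_567))  # 001/234/567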

Deduplication devices cannot be assigned to media pools. So, how can a GFS job still be submitted to a Deduplication device without a media pool?

Answer: When you select a Deduplication device as the destination device in a GFS or rotation job in a non-staging operation, a media pool is not used and media will never be overwritten. Data is appended to formatted media in the Deduplication device group if one exists; if not, blank media is formatted with the current date and time. When you select a Deduplication device as the destination device in a GFS or rotation job in a staging operation, the behavior of the staging phase is not changed, but the migration phase will never use a media pool and never overwrite media. Data is appended to formatted media in the Deduplication device group if one exists; if not, blank media is formatted with the current date and time.



How to handle hash collision in Deduplication backups?

Answer: A registry switch controls data comparison during Deduplication backups. By default the registry key is not set and the current Deduplication behavior is followed.

If the key is set, then during the Deduplication backup, whenever a duplicate hash is found, the following is done:
- Read the data corresponding to the previous hash.
- Do a memory comparison of the current data with the previous data.
- If the data matches, continue as usual.
- If the data does not match, there is a signature collision.

In this situation an error message stating that a signature collision has occurred is written to the Activity Log. The current data is then written to the disk and the current hash is pointed to the current ref, instead of pointing the current hash to the previous ref. This ensures that the signature collision is detected and the backup still completes appropriately. The registry key that needs to be set is: Tape engine->config->CatchSignatureCollision
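A sketch of the collision check described above, reduced to its core logic; the in-memory index, data store and logging function are stand-ins, not the actual Tape Engine code:

import hashlib

def store_block(block: bytes, index: dict, data_store: list, log=print):
    """index maps hash -> position in data_store. On a hash hit the raw data
    is compared; a mismatch is a signature collision, so the block is written
    anyway and the hash is re-pointed at the newly written copy."""
    key = hashlib.md5(block).digest()
    if key in index:
        if data_store[index[key]] == block:
            return index[key]                  # true duplicate, nothing written
        log("Signature collision detected")    # same hash, different data
    data_store.append(block)                   # write the current data
    index[key] = len(data_store) - 1           # point the hash at the current copy
    return index[key]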

What is Global Deduplication? Where can I find the logs of Global Deduplication?

Answer: In a Deduplication backup job, only the current session is compared with the previous session that has the same root directory. But sessions from different system volumes may contain a lot of duplicate data. Therefore, Global Deduplication was introduced to dedupe the duplicated data across system volume sessions. Global Deduplication is not a backup or post-backup process; it runs as a scheduled task.

There is a separate log for Global Deduplication under the ARCserve home folder, named gdd.log. The following information is displayed when Global Deduplication occurs; this example shows backups of the C:\ volumes of two machines:

---------- Begin of Phase 1 ----------
Initialize memory pool/queue
> Session: 1
> Type: 42
> Root directory: \\PALRA04-QA51\C:\gdd
> NewPrefix: 0
BHash file is already there. Skip this session.
> Session: 2
> Type: 42
> Root directory: \\BRI6-ECNODE2\C:\gdd1
> NewPrefix: 1
GDD process: Session: 2,Type: 42,Root directory: \\BRI6-ECNODE2\C:\gdd1
Sis_GDD_Scan>>
Sis_GDD_Scan<<
Create a thread[121c] for writer.
Get Zero hash value, quit thread.
INF: A_HASH: MAX boundaries[23], Natural boundaries[647]
---------- End of Phase 1 ----------

---------- Begin of Phase 2 ----------
Initialize memory pool
Launch BHash writer.
Enumerate all BHash files
> BHash file: h:\gddindex1\000\000\0\sis_1055452626_0000000001.bhash
> BHASH count: 669
> Status: 0
> BHash file: h:\gddindex1\000\000\0\sis_3610723846_0000000002.bhashn
> BHASH count: 670
> Status: 1
Process BHash files completed
Process temporary file completed
Stop UDT file writer... Stopped.
---------- End of Phase 2 ----------

---------- Begin of Phase 3 ----------
> Process UDT: h:\gddindex1\temp\sis_3610723846_0000000002.udt
Lock global mutex... Locked.
Prefix in [h:\gddindex1\backingup.sis]:
Prefix in [h:\gddindex1\restoring.sis]:
Prefix in [h:\gddindex1\purging.sis]:
Append prefix[sis_1055452626] to purging.sis.
Append prefix[sis_3610723846] to purging.sis.
Unlock global mutex.
Holes in session [2] is more than 25%[27466/27576], insert it to reclaim.sis.
Lock global mutex... Locked.
Prefix in [h:\gddindex1\purging.sis]: sis_1055452626: sis_3610723846:
< Total Deduplicationd: 506MB
---------- End of Phase 3 ----------



Please find below the explanation of the 3 phases of Global Deduplication:

Phase 1: While performing Deduplication, hashes are calculated for each session based on the session's data. Global Deduplication calculates hashes again on top of the hashes calculated in the Deduplication process; these hashes are called BHashes. Global Deduplication writes these BHashes into .bhashn files. There may also be some .bhash files, which indicate that the data in the corresponding session has already been globally deduplicated.

Phase 2: In this phase, all the BHashes from the .bhash files are first fed into an STL (Standard Template Library) map. After entering all the BHashes from the .bhash files, the BHashes from the .bhashn files are entered. While doing this, it checks whether each BHash is already present in the STL map. If a BHash is already in the map and the new one comes from a .bhashn file, it means this is a duplicated data block. Global Deduplication records the source hash entries, source ref entries and target ref entries into an updating file (.udt). In this way it processes all the .bhashn files.

Phase 3: In this phase, Global Deduplication processes the .udt files created in Phase 2 and updates the duplicate hash entries with the corresponding target ref entries.
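A simplified sketch of the Phase 2 logic: BHashes from .bhash files seed a map (a Python dict standing in for the STL map), and BHashes coming from .bhashn files that are already in the map are recorded as duplicates. The .udt record is reduced to a simple tuple for illustration:

def gdd_phase2(bhash_files: dict, bhashn_files: dict):
    """bhash_files / bhashn_files map session-id -> list of BHash values.
    Returns .udt-style records: (duplicate bhash, session it was first seen
    in, session that repeats it)."""
    seen = {}                                        # bhash -> owning session
    for session, bhashes in bhash_files.items():     # already globally deduped
        for bh in bhashes:
            seen.setdefault(bh, session)
    udt_records = []
    for session, bhashes in bhashn_files.items():    # not yet globally deduped
        for bh in bhashes:
            if bh in seen:
                udt_records.append((bh, seen[bh], session))   # duplicate block
            else:
                seen[bh] = session
    return udt_records

print(gdd_phase2({"s1": ["h1", "h2"]}, {"s2": ["h2", "h3"]}))
# [('h2', 's1', 's2')]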

What are the limitations of Global Deduplication?

Answer: The limitation of Global Deduplication is that it supports only Windows C:\ volume sessions. Global Deduplication is not supported for other sessions.

From r15 onwards there is an enhancement to Global Deduplication for Oracle RMAN sessions. The reason for the low dedupe ratio on Oracle RMAN sessions is as follows: for Oracle RMAN sessions, all the root directories look like \\Machine Name\Oracle, even if they come from different tablespaces or databases. As described above, ARCserve deduplication compares the data against the previous session with the same root directory, so for Oracle RMAN sessions it might wrongly compare data against another tablespace/database, because all Oracle RMAN sessions of the same source machine share the same root directory name. The solution is to support Global Deduplication for Oracle RMAN sessions:

Global dedupe will handle Oracle RMAN sessions too.

Global dedupe will not compare system volume sessions against Oracle RMAN sessions, because there will be nothing duplicated between these kinds of sessions. This means global dedupe runs across all system volume sessions and then separately across all Oracle RMAN sessions.



Global dedupe will not compare Oracle RMAN sessions which come from different servers, because Oracle RMAN sessions from different servers are unlikely to contain much identical data. This means global dedupe is performed separately for the Oracle RMAN sessions of every source server.

In summary, data for Oracle RMAN sessions can now be deduplicated by leveraging Global Deduplication.

The space consumed by the Deduplication Device is high. There are old redundant sessions in the Data folder occupying this space. What action should be taken?

Answer: There should not be any pending migration jobs, as dedupe sessions will not be purged until migration has taken place. If there are pending migration jobs, then as a workaround, check 'Do not Copy Data'.



Make sure that you check "purge failed and cancelled sessions from the disk":
- on the Miscellaneous tab of the Deduplication Purge Policy, if the DDD is used as the destination
- on the Miscellaneous tab of the Deduplication Staging Policy, if the DDD is used as the staging device

An important point is not to let data migration jobs that failed for any reason accumulate as HOLD jobs. This prevents the respective backup sessions from being purged, because the data has not yet been migrated. A workaround is to delete the data migration jobs on HOLD, but the best practice is to identify the cause of the data migration job failures and correct it.



While the data migration problem is being troubleshot, it may be necessary to release disk space by manually copying such sessions to the final destination with the tapecopy command and then deleting the specific source sessions.

There are 2 ways of deleting the sessions:
- Delete the sessions from the Restore by Session view
- Use ca_devmgr -purge (command line tool) to delete dedupe sessions

IMPORTANT: Never manually delete any deduplication index or data file, or the whole device can be corrupted.

Are there any tools available for troubleshooting Deduplication?

Answer: The following tools are available and can be obtained by contacting CA Support.

1. Deduplication tool

Usage:
- Install the VC++ redistributable package, which is a prerequisite for the dedupe tool.
- Extract DeDupeTool.12.9.2011.zip and run DeDupeTool.exe.
- Click the 'Reclaim' button to reclaim the space occupied by the sparse hole files.
- The 'Overall' and 'Detail' buttons, when clicked, generate a report log file in the dedupe tool folder that can be used for analysis.

Also, fix T16C973, which incorporates sis.dll, can be applied to reclaim space occupied by sparse files.



2. MigDBInfo tool

This tool is used to troubleshoot dedupe issues, since it reports whether there are non-migrated sessions, along with the due time of the migration and the purge.

Usage:
a. Save the attachment as MigDBInfo.exe.
b. Copy this file to the server where the ASDB is installed.
c. Open a DOS prompt.
d. Change directory to the saved folder.
e. Run the command as follows: MigDBInfo <Tapename> [RandomID]
f. If you want to save the result, redirect the output to a text file. For example: MigDBInfo.exe "11/28/11 6:52 PM" 0x35DF > tr1.txt

Here is a sample output; check the Migration status line to see when a migration was expected and whether it has completed or not. If the due time passed many days ago but the migration is still not done, this indicates a problem that should be checked in the Activity log, the job queue log or the Tape log.

Sample output:

Tape:
ID:147 Name:DEDUPE RID:0E53 Seq:1 SN: Type:0 FirstFormat:2012-09-04 14:46:18 LastFormat:2012-09-04 14:46:18 Expired:1900-01-01 00:00:00 Destory:1900-01-01 00:00:00 BlockSize:512 Pool: Status:0 BackupType:0 TapeFlag:0x10000100

Session:
ID:451 NO:1 Job:1052 start:2012-09-04 14:46:32 end:2012-09-04 14:46:32 Total:46kB CompSize:64kB TapeSize:64kB Method:FULL_CLR_ARCHIVE Flag:0x00001080 QFABlockNum:0 SrcPathID:1073741825 Root:[C:]

Migration status:
Due:2012-09-04 14:47:48 prune:2012-09-04 14:51:48 exec:2012-09-04 14:46:28 BJobID:4991500 MJobID:1051 Flag:0x00200010 IsMigrated:Ready To be Migrated [0] StagingFlag:0x00200010 DM Group:WRONG Tape:04/09/12 12:47 ID:0xFB3F SN: Pool: SrcGroup:PGRP3

Note: Here "IsMigrated:Ready To be Migrated [0]" means the session has not been migrated yet.



Deduplication as a staging device and Data Migration: what is the best practice to follow so that there are no redundant unpurged sessions occupying space in the Dedupe Device?

Answer: If you use a deduplication device for staging, pay special attention to any failure during data migration. If a data migration job fails, it spawns a makeup job in "hold" status, expecting the backup administrator to remove the cause of the failure and put the job in "ready" status to complete the migration. In the backup policies >> Miscellaneous tab you can unselect "create makeup job if data migration fails"; the option is set by default. If it is unchecked, you have to take care of migrations manually using tapecopy, but purging is not allowed for non-migrated sessions. If the option is kept set, makeup jobs cannot simply be ignored; otherwise non-migrated sessions will not be purged and the deduplication device will fill up with redundant unpurged sessions. Ignoring a data migration makeup job is a big risk for the whole deduplication device.

It is also possible that a data migration job cannot even start because the destination is unavailable; clicking on the job should open the DM job status window, which will show the job as "pending". In such cases the backup administrator should fix the issue, usually by checking whether the group and tape requested by the DM job are available in the library and in the correct pool. If you do not wish to run this data migration job, select the "Do not copy" option. This clears the pending data migration job and allows the respective sessions to be purged as per the purge policy set.

So it is critical to follow up on data migration results and ensure they all complete successfully; otherwise non-migrated sessions accumulate and the deduplication disk usage grows, because the disk space cannot be reclaimed.

Would the source file type impact deduplication compression? Why is there a low compression on a specific backup source?

Answer: http://23.23.252.151/index.php?View=entry&EntryID=3043