
Wayne State University, Cluster and Internet Computing Laboratory

RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication

Fan Ni* (fan@netapp.com)

ATG, NetApp

Song Jiang (song.jiang@uta.edu)

UT Arlington

* The work was done when he was a Ph.D. student at UT Arlington

Data is Growing Rapidly

§ Much of this data needs to be stored for preservation and processing.
§ Efficient data storage and management has become a big challenge.

From storagenewsletter.com


The Opportunity: Data Duplication is Common

§ Sources of duplicate data:
– The same files are stored by multiple users in the cloud.
– Continuous updating of files generates multiple versions.
– Use of checkpointing and repeated data archiving.

§ Significant data duplication has been observed for both backup and primary storage workloads.


The Deduplication Technique can Help

[Figure: logical vs. physical view of two files. When duplication is detected using fingerprinting, i.e., SHA1(File1) == SHA1(File2), only one physical copy is stored.]

§ Benefits:
– Storage space
– I/O bandwidth
– Network traffic

§ An important feature in commercial storage systems:
– NetApp ONTAP system
– Dell-EMC Data Domain system

§ Two critical issues:
– How to deduplicate more data?
– How to deduplicate faster?

[Figure: deduplication pipeline: chunking and fingerprinting, then removal of duplicate chunks]
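To make the pipeline concrete, here is a minimal C sketch of the store path. next_chunk(), compute_sha1(), index_contains(), index_insert(), and store_chunk() are hypothetical helpers standing in for a real system's chunker, fingerprint routine, and chunk index, not APIs from the paper.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical helpers: a chunker (FSC or CDC), a SHA-1 routine,
 * and a fingerprint index over the chunk store. */
size_t next_chunk(const uint8_t *data, size_t len);
void   compute_sha1(const uint8_t *data, size_t len, uint8_t out[20]);
bool   index_contains(const uint8_t fp[20]);
void   index_insert(const uint8_t fp[20]);
void   store_chunk(const uint8_t fp[20], const uint8_t *data, size_t len);

/* Chunk the stream, fingerprint each chunk, and store only first
 * copies; duplicates are recorded as references to the existing
 * fingerprint instead of being written again. */
void dedup_stream(const uint8_t *data, size_t len)
{
    for (size_t off = 0; off < len; ) {
        size_t n = next_chunk(data + off, len - off);
        uint8_t fp[20];
        compute_sha1(data + off, n, fp);
        if (!index_contains(fp)) {       /* unique chunk: store it */
            store_chunk(fp, data + off, n);
            index_insert(fp);
        }                                /* duplicate: reference only */
        off += n;
    }
}
```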

Deduplicate at Smaller Chunks …

… for higher deduplication ratio

§ Two potentially major sources of cost in deduplication:
– Chunking
– Fingerprinting

§ Can chunking be very fast?

Fixed-Size Chunking (FSC)

File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HOWAREYOU?OK?REALLY?YES?NO

§ FSC: partitions files (or data streams) into equal, fixed-size chunks.
– Very fast!

§ But the deduplication ratio can be significantly compromised.
– The boundary-shift problem.


With FSC, inserting a single byte (the extra 'H' shown with File B below) shifts every subsequent fixed-size boundary, so chunks whose contents are otherwise identical to File A's no longer align and go undetected as duplicates:

File A: HOWAREYOU?OK?REALLY?YES?NO
File B (one extra byte 'H' inserted): HOWAREYOU?OK?REALLY?YES?NO
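For illustration, a fixed-size chunker is essentially one line of C (the CHUNK_SIZE value is an arbitrary assumption). Since the content is never examined, any insertion or deletion shifts all later boundaries:

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_SIZE 4096   /* assumed fixed chunk size */

/* FSC: boundaries fall at multiples of CHUNK_SIZE regardless of
 * content, which is exactly why a one-byte insertion causes the
 * boundary-shift problem. */
size_t fsc_next_chunk(const uint8_t *data, size_t len)
{
    (void)data;            /* content is never examined */
    return len < CHUNK_SIZE ? len : CHUNK_SIZE;
}
```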

Content-Defined Chunking (CDC)

File A: HOWAREYOU?OK?REALLY?YES?NO
File B (one extra byte 'H' inserted): HOWAREYOU?OK?REALLY?YES?NO

§ CDC: determines chunk boundaries according to content (a predefined special marker).
– Variable chunk size.
– Addresses the boundary-shift problem.

§ Assume the special marker is '?'.


The Advantage of CDC

§ Real-world datasets include two weeks' Google News, Linux kernels, and various Docker images.

§ CDC's deduplication ratio is much higher than FSC's.
§ However, CDC can be very expensive.

[Figure: deduplication ratios (0-40) of CDC vs. FSC on Google-news, Linux-tar, Cassandra, Redis, Debian-docker, Neo4j, Wordpress, and Nodejs]

CDC can be Too Expensive!

File A: HOWAREYOU?OK?REALLY?YES?NO
File B (one extra byte 'H' inserted): HOWAREYOU?OK?REALLY?YES?NO

Assume the special marker is '?'.

§ The markers identifying chunk boundaries must be spaced out with a controllable expected distance between them.

§ In practice, a marker is found by applying a hash function to a window of bytes.
– E.g., hash("YOU?") == pre-defined-value

§ The window rolls forward byte by byte and the hash is recomputed continuously, which is what makes CDC expensive (see the sketch below).
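As a concrete illustration, here is a minimal sketch of a rolling-window chunker in C, loosely in the style of the Gear hash mentioned later in the evaluation. The gear table contents, mask, and size bounds are illustrative assumptions, not the paper's exact parameters.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed parameters: a table of per-byte random 64-bit values, a mask
 * giving an expected average chunk size of ~8KB, and min/max bounds. */
uint64_t gear[256];   /* assume: seeded with random 64-bit values */
#define CDC_MASK  0x0000000000001FFFULL   /* 13 bits -> ~8KB average */
#define MIN_CHUNK 2048
#define MAX_CHUNK 65536

/* Return the length of the next chunk starting at data[0]: roll the
 * hash forward one byte at a time and declare a boundary whenever the
 * masked hash hits the predefined value (here, zero). */
size_t cdc_next_chunk(const uint8_t *data, size_t len)
{
    uint64_t fp = 0;
    size_t limit = len < MAX_CHUNK ? len : MAX_CHUNK;

    for (size_t i = 0; i < limit; i++) {
        fp = (fp << 1) + gear[data[i]];   /* Gear-style rolling hash */
        if (i + 1 >= MIN_CHUNK && (fp & CDC_MASK) == 0)
            return i + 1;                 /* content-defined boundary */
    }
    return limit;   /* no marker before the max size or end: cut here */
}
```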

CDC Chunking Becomes a Bottleneck

§ Chunking time > 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.

[Figures: breakdown of CPU time (Fingerprinting vs. Chunking) and of I/O time (I/O Busy vs. I/O Idle) for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04), where dr denotes the deduplication ratio]

Efforts on Acceleration of CDC Chunking

§ Make hashing faster
– Example functions: SampleByte, Gear, and AE
– More likely to generate small chunks
• which increases the amount of metadata that must be cached in memory for performance

§ Use GPU/multi-core to parallelize the chunking process
– Extra hardware cost
– Substantial effort to deploy
– The speedup is bounded by the hardware parallelism.

§ Significant software/hardware efforts, but limited performance return


We propose RapidCDC, which …

§ is still sequential and doesn’t require additional cores/threads.

§ makes the hashing speed almost irrelevant.

§ accelerates the CDC chunking often by 10-30 times.

§ has a deduplication ratio the same as regular CDC methods.

§ can be adopted in an existing CDC deduplication system by adding 100~200 LOC in a few functions.


The Path to the Breakthrough

[Figure sequence: unique chunks sit on the disk; an incoming chunk's fingerprint matches that of a previously stored chunk (e.g., a 15KB chunk), and the match is confirmed. The chunks that followed the matched chunk in the earlier occurrence (16KB, 7KB, 20KB, ...) then predict the positions (P) of the next boundaries in the incoming stream, and the predicted boundary almost always holds.]

Duplicate Locality

§ Duplicate locality: if two chunks are duplicates, their next chunks (in their respective files or data streams) are likely duplicates of each other.

§ Duplicate chunks tend to stay together.

[Figure (Debian): percentage of chunks (%) vs. number of files (10, 20, 40, 80, 90), comparing all duplicate chunks against duplicate chunks immediately following another duplicate chunk]


RapidCDC: Using Next Chunk in History as a Hint

[Figure: chunking File A yields records <FP1, s2>, <FP2, s3>, <FP3, s4>, <FP4, …>, each pairing a chunk's fingerprint with the size of the chunk that followed it. While chunking File B, once FP(B1) == FP(A1) at offset P1, the next boundary is predicted as P2 = P1 + s2, then P3 = P2 + s3, and P4 = P3 + s4.]

§ History recording: whenever a chunk is detected, its size is attached to its previous chunk's fingerprint record.

§ Hint-assisted chunking: whenever a duplicate is detected, the recorded size of its historical successor is used as a hint for the next chunk boundary.

§ Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found (a sketch of this loop follows).
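The core loop might look like the following minimal C sketch, reusing cdc_next_chunk() from the earlier sketch. fp_record_t, fp_lookup(), boundary_valid(), and compute_sha1() are hypothetical stand-ins for the prototype's actual index and validation code; a single next_size field simplifies the size list discussed on the next slide, and insertion of new fingerprints into the index is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers: the dedup index keyed by SHA-1, a boundary
 * validation test (see the criteria below), and the earlier chunker. */
typedef struct fp_record {
    uint8_t sha1[20];   /* chunk fingerprint */
    size_t  next_size;  /* size of the chunk that followed it (0 = unknown) */
} fp_record_t;

fp_record_t *fp_lookup(const uint8_t sha1[20]);  /* NULL if not a duplicate */
int    boundary_valid(const uint8_t *chunk, size_t hinted_size);
void   compute_sha1(const uint8_t *data, size_t len, uint8_t out[20]);
size_t cdc_next_chunk(const uint8_t *data, size_t len);

void rapidcdc_chunk(const uint8_t *data, size_t len)
{
    size_t off = 0;
    fp_record_t *prev = NULL;   /* record of the previous duplicate chunk */

    while (off < len) {
        size_t n = 0;

        /* Hint-assisted fast path: the previous chunk was a duplicate,
         * so try its historical successor's size as the next boundary. */
        if (prev && prev->next_size && off + prev->next_size <= len &&
            boundary_valid(data + off, prev->next_size))
            n = prev->next_size;

        /* Slow path: no usable hint; fall back to regular CDC. */
        if (n == 0)
            n = cdc_next_chunk(data + off, len - off);

        uint8_t sha1[20];
        compute_sha1(data + off, n, sha1);

        /* History recording: attach this chunk's size to the previous
         * chunk's fingerprint record for use as a future hint. */
        if (prev)
            prev->next_size = n;

        prev = fp_lookup(sha1);  /* non-NULL only for duplicate chunks */
        off += n;
    }
}
```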


More Design Considerations …


§ A chunk may have been followed by chunks of different sizes.
– Maintain a size list.

§ Validation of hinted next-chunk boundaries:
– Four alternative criteria with different efficiency and confidence (the Rolling Window Test is sketched below):
Ø FF (fast-forwarding only)
Ø FF+RWT (Rolling Window Test)
Ø FF+MT (Marker Test)
Ø FF+RWT+FPT (Fingerprint Test)

§ Please refer to the paper for details.
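For instance, under the same illustrative Gear parameters as the earlier chunking sketch, a Rolling Window Test could recompute the rolling hash over just the window ending at the hinted boundary and accept the hint only if the boundary condition holds. The window width here is an assumption, not the paper's value.

```c
#include <stdint.h>
#include <stddef.h>

extern uint64_t gear[256];           /* from the chunking sketch */
#define CDC_MASK 0x0000000000001FFFULL
#define WIN_SIZE 48                  /* assumed rolling-window width */

/* FF+RWT: rebuild the rolling hash over the last WIN_SIZE bytes before
 * the hinted boundary. With a Gear-style hash the masked low bits
 * depend only on the most recent bytes, so this matches what full
 * rechunking would compute at that position. */
int boundary_valid(const uint8_t *chunk, size_t hinted_size)
{
    uint64_t fp = 0;
    if (hinted_size < WIN_SIZE)
        return 0;                    /* too short to test: reject hint */
    for (size_t i = hinted_size - WIN_SIZE; i < hinted_size; i++)
        fp = (fp << 1) + gear[chunk[i]];
    return (fp & CDC_MASK) == 0;     /* boundary condition holds? */
}
```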

Evaluation of RapidCDC

§ Prototype: based on a rolling-window CDC system.
– Rabin/Gear as the rolling function for the rolling-window computation.
– SHA1 to calculate fingerprints.

§ Three disks with different speeds are tested:
– SATA hard disk: 138 MB/s and 150 MB/s for sequential read/write.
– SATA SSD: 520 MB/s and 550 MB/s for sequential read/write.
– NVMe SSD: 1.2 GB/s and 2.4 GB/s for sequential read/write.


Synthetic Datasets: Insert/Delete

§ Chunking speedup correlates with the deduplication ratio.
§ The deduplication ratio is little affected (except for one very aggressive validation criterion).

[Figures: deduplication ratio (0-7) and chunking speedup (0-6) vs. number of modifications (1000, 2000, 5000, 10000, 20000) for Regular CDC and the FF+RWT+FPT, FF+RWT, FF+MT, and FF criteria]

Real-world Datasets: Chunking Speed

§ Chunking speedup approaches the deduplication ratio: up to 33X faster!
§ Negligible deduplication ratio reductions (if any).

[Figures: chunking speedup (0-30) and deduplication ratio (0-40) on Debian, Neo4j, Wordpress, and Nodejs for Regular CDC and the FF+RWT+FPT, FF+RWT, FF+MT, and FF criteria]

Conclusions

§ RapidCDC represents a disruptive new approach to improving CDC chunking speed.

§ It increases chunking speed by up to 33X without loss of deduplication ratio.

§ Its adoption in an existing CDC deduplication system does not require any major change of its current operation flow.

§ Its implementation in an existing CDC deduplication system requires minimal code changes (100-200 lines of C code in our prototype).

§ A prototype implementation is available at https://github.com/moking/rapidcdc
