
Page 1: WON ET AL.: MUCH: MULTITHREADED CONTENT-BASED FILE CHUNKING · 2015. 6. 8.

0018-9340 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2014.2322600, IEEE Transactions on Computers

WON ET AL.: MUCH: MULTITHREADED CONTENT-BASED FILE CHUNKING 1

MUCH: Multithreaded Content-Based File Chunking

Youjip Won, Kyeongyeol Lim and Jaehong Min

Abstract—In this work, we developed a novel multithreaded variable size chunking method, MUCH, which exploits the multicore architecture of modern microprocessors. The legacy single threaded variable size chunking method leaves much to be desired in terms of effectively exploiting the bandwidth of state of the art storage devices. MUCH guarantees chunking invariability: the result of chunking does not change regardless of the degree of multithreading or the segment size. This is achieved by inter- and intra-segment coalescing at the master thread and Dual Mode Chunking at the chunker threads. We developed an elaborate performance model to determine the optimal multithreading degree and the segment size. MUCH is implemented in the prototype deduplication system. By fully exploiting the available CPU cores (quad-core), we achieved up to a ×4 increase in chunking performance (MByte/sec). MUCH successfully addresses the performance issue of file chunking, one of the performance bottlenecks in modern deduplication systems, by parallelizing the file chunking operation while guaranteeing Chunking Invariability.

Index Terms—Content-based Chunking, Deduplication, Multithread


1 Introduction

It is known that in most modern information systems, only 3% of the data is newly added to the storage and 5% of the existing data is updated[1]. Reducing the data size plays a key role in filesystems, backup systems[2], web proxies[3], and even small storage devices[4], [5], [6]. Elaborate techniques, e.g., compressing the data[7], eliminating redundant information within or between files (deduplication)[1], and storing only the updated parts of data (delta encoding)[8], have been developed to effectively address the objective of reducing the data size.

File chunking and duplication detection are two essential components of deduplication. In this work, we dedicate our efforts to eliminating the performance overhead of variable size chunking. Numerous techniques have been proposed for faster duplication detection: creating efficient key-value stores[9], [10], arranging the fingerprints with respect to the access locality in deduplication[2], [11], [6], and minimizing cache miss penalties in in-memory databases[12]. Variable size chunking is very CPU intensive, and chunking speed is far behind the I/O bandwidth of modern storage devices. For example, a single core CPU (2.80 GHz, Core i5) can chunk a data stream at 97 MByte/sec[11]. Meanwhile, most modern storage devices offer more than 500 MByte/sec of I/O bandwidth[11], [13]. The speed of variable size chunking is merely 1/5 of that of the storage devices.

• Y. Won, K. Lim, and J. Min are with the Division of Computer Science and Engineering, Hanyang University, Seoul 133-791, Korea. E-mail: {yjwo|lkyeol}@hanyang.ac.kr, [email protected]

We carefully argue that relatively little attention has been paid to improving the speed of variable size chunking despite its significant performance implications. The efficiency of variable size chunking becomes more important as storage becomes faster. File chunking mechanisms need to be devised to effectively exploit the bandwidth of the underlying I/O interface. Augmenting hardware logic (ASIC or FPGA) for chunking may be a possible solution. However, this will work only for dedicated deduplication systems, e.g., VTL[14], and will raise cost, standardization, and usability issues in general purpose computers (desktop PCs, servers).

In this work, we exploit the multiple cores of modern CPUs for file chunking. Most commercially available CPUs have two or more cores. Recently, even smartphones[15] have been loaded with quad-core CPUs. The objective of this work is to develop a file chunking algorithm that effectively exploits the multiple cores of modern CPUs. MUCH is implemented in our prototype deduplication system, PRUNE[16], and is used in a commercial deduplicated backup product[17].

2 Related Work

There are a number of areas where deduplication techniques play a key role: distributed and shared file systems[18], [19], backup[2], peer-to-peer file systems[20], and web-proxy servers[3]. Recently, deduplication techniques have been employed in SSD controllers to reduce the amount of writing to NAND pages[4], [5], [6]. In backup systems, a number of techniques are traditionally used to save network bandwidth and storage space: incremental backup, compression, and delta encoding. The


incremental backup[21] technique is used to avoid redundant backup of unchanged files. Compression is widely used to reduce the size of the data[22]. Delta encoding is effective when changes are small. It is used in many applications including source control[23] and backup[7]. Douglis et al. proposed redundancy elimination at the block level (REBL), which is a combination of block suppression, delta encoding, and compression. It needs to examine every pair of files to detect redundancy across multiple files[8], [7].

Policroniades et al. examined the effectiveness of variable size chunking and fixed size chunking using website data, different data profiles in academic data, source codes, compressed data, and packed files. They showed that variable size chunking yields the best deduplication ratio[22]. Tang et al. introduced the chunk size control algorithm TTTD (two thresholds, two divisors)[24]. Meister et al. compared the influence of different chunking approaches[25].

In variable size chunking, Rabin's algorithm[26] is widely used to establish the chunk boundaries[14], [19], [2]. Despite the computational simplicity of Rabin's algorithm, variable size file chunking is still a very CPU intensive operation[11], [1] since modulo arithmetic needs to be performed at every byte position of a file. Min et al. developed the INC-K (Incremental Modulo-K) algorithm[11] to improve the computational complexity of Rabin's algorithm. INC-K is an algebraically optimized version of Rabin's algorithm.

Most works, if not all, on deduplication are grounded on the assumption that two different data blocks are very unlikely to yield the same fingerprints[19], [18]. Henson[27], however, stated that this assumption may not be valid since backed up files last much longer than the hardware, and therefore the probability of fingerprint conflicts can be much higher than we might expect. Computer hardware is upgraded every three to five years[28]. The data should be preserved for much longer than that1.

A number of in-memory data structures, such as the Bloom filter[31] and the skip list[32], are employed to maintain fingerprints in a more compact manner. The Bloom filter is widely used in distributed file systems[33], deduplication systems[2], and peer-to-peer file systems[20].

To expedite the fingerprint lookup, a number of new search structures have been proposed. Hamilton et al. proposed "Commonality Factoring", fingerprinting the fingerprints[34]. SISL (Stream-Informed Segment Layout) stores the fingerprints in the same order in which they are generated to preserve the locality of fingerprint lookup[2]. Lillibridge et al. proposed a sampling technique that reduces the disk I/O with a tolerable level of penalty on the deduplication ratio[14]. In distributed backup, it is unlikely that a sequence of fingerprints bears any locality since the fingerprints generated from the individual backup nodes are interleaved. Bhagwat et al. proposed a technique called Extreme Binning to handle distributed backup[35]. Min et al.[16] proposed index partitioning to limit the size of a single fingerprint lookup structure and to maintain a set of fingerprints with multiples of such structures. Debnath et al. proposed to use an SSD as a key-value store for fingerprint lookup and developed ChunkStash, a new key-value store structure optimized for SSDs[10]. A number of works proposed to protect the important chunks via ECC and redundancy groups[36], [37].

1. The financial and medical records should be preserved for seven[29] to 30 years[30].

Yang Zhang et al. proposed to use distributed systems for deduplication[38]. They partitioned the incoming data stream using fixed size chunking and sent the partitions to storage nodes. Each node adopts multithreading to generate MD5-based fingerprints. A few works proposed to exploit GPUs to expedite deduplication[39], [40].

None of the above mentioned works actually addressed the issue of improving the chunking performance. The multithreaded variable size file chunking proposed in this study complements the prior works in deduplication. The multi-core chunking algorithm proposed in this work makes variable size chunking faster (up to 400% on a current state of the art quad-core CPU) by exploiting the existing CPU cores. MUCH enables the deduplication system to fully exploit the performance of the underlying storage devices.

The rest of the paper is organized as follows. Section 3 provides the background information on MUCH. Section 4 explains the concept of multithreaded chunking. Section 5 describes the performance model for MUCH. Section 6 provides an in-depth analysis of the performance model of MUCH. In Section 7, we present the results of the physical experiments. Section 8 concludes the paper.

3 Problem Assessment: Overhead of Variable Size File Chunking

3.1 Synopsis: Deduplication

Deduplication consists of four key technical ingredients: (i) file chunking, (ii) duplication detection, (iii) enhancing reliability for deduplicated data, and (iv) information life cycle management. There are two types of chunking: variable size chunking[19], [14] and fixed size chunking[19], [14]. Duplication detection is the process of determining whether a given chunk is redundant or not. In this phase, the deduplication process generates summary information (a fingerprint) for a chunk and compares the fingerprint against the set of existing fingerprints. In deduplicated data, certain data blocks are shared by multiple files, where loss cannot be tolerated. A number of


Fig. 1. Deduplication process

works proposed to adjust the size of the parity/checksum and the degree of replication based on the number of references to the respective chunk[11]. Information Lifecycle Management (ILM) is the process of migrating files to the next lower level in the storage hierarchy. If the data is old enough and has not been accessed for a certain period, it is removed from the storage and archived. Deduplication adds another dimension of complexity to Information Lifecycle Management[41]. While the deduplication effort is about exploiting the redundancies along the spatial axis (within a file or among files) or along the temporal axis (across the different versions of the same file), ILM in deduplication is about reconstructing the eliminated redundancies. The essence of the issue here is how to efficiently scan the reference counters of the data blocks in the deduplicated storage[11].

3.2 Fixed Size Chunking

There are two approaches to partitioning a file into chunks: fixed size chunking and variable size chunking. In fixed size chunking, a file is partitioned into fixed size units, e.g., 8 KByte blocks. It is simple, fast, and computationally very cheap. A number of preceding works have adopted fixed size chunking for backup applications[42] and for large-scale file systems[18]. However, when a small amount of content is inserted into or deleted from the original file, fixed size chunking may generate a set of chunks that are entirely different from the original ones even though most of the file contents remain intact.

3.3 Variable Size Chunking

Variable size chunking partitions a file based on the content of the file, not the offset. Variable size chunking is relatively robust against insertions into or deletions from the file. The BSW (Basic Sliding Window) algorithm[19] is widely used in variable size chunking. Fig. 2 explains the BSW algorithm. The BSW algorithm establishes a window over the byte stream starting from the beginning of a file. It computes a signature, which is a hash value of the byte stream in the

Fig. 2. The basic sliding window algorithm

window region. If the signature matches the predefined bit pattern, the algorithm sets a chunk boundary at the end of the window. After each comparison, the window slides by one byte position and the hash is computed again. Since the window slides one byte position at a time, we can use incremental hash functions, e.g., Rabin's algorithm[26] and INC-K[11], which are much cheaper than cryptographic hash functions such as SHA-1 and MD5.

Let us briefly explain Rabin's algorithm. For a given byte string b1, ..., bn, we need to obtain a signature for each substring {b1, ..., bβ}, {b2, ..., bβ+1}, {b3, ..., bβ+2}, .... In Rabin's algorithm, the signature is represented as Rf(b1, b2, ..., bβ) = (b1·p^(β−1) + b2·p^(β−2) + ... + bβ) mod M, where p is the base and M is an irreducible polynomial. When we scan the byte string and compute Rabin's signature over the sliding window, each signature value can be computed from the previous one in an incremental fashion. Eq. 1 illustrates Rabin's signature computation formula, which uses the previous signature value. It requires only one shift operation, one addition operation, and one modulo operation.

Rf(bi, ..., bi+β−1) = ((Rf(bi−1, ..., bi+β−2) − bi−1·p^(β−1))·p + bi+β−1) mod M   (1)

Min et al. algebraically optimized Rabin's algorithm with respect to the register size[16] and proposed the INC-K algorithm.
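As a rough sketch of the incremental computation in Eq. 1, the rolling signature can be written as follows. This uses integer modular arithmetic rather than the GF(2) polynomial arithmetic of Rabin's original scheme, and the window size, base P, and modulus M are illustrative choices, not parameters from the paper:

```python
BETA = 48          # window size in bytes (illustrative)
P = 257            # base (illustrative)
M = (1 << 61) - 1  # modulus (illustrative; a Mersenne prime for convenience)

def direct_signature(window):
    """Compute Rf(b1, ..., b_beta) from scratch: sum of b_i * P^(beta-i) mod M."""
    sig = 0
    for b in window:
        sig = (sig * P + b) % M
    return sig

def rolling_signatures(data):
    """Yield (offset, signature) for every BETA-byte window, incrementally per Eq. 1."""
    if len(data) < BETA:
        return
    p_high = pow(P, BETA - 1, M)  # P^(beta-1) mod M, precomputed once
    sig = direct_signature(data[:BETA])
    yield 0, sig
    for i in range(1, len(data) - BETA + 1):
        # Eq. 1: remove the outgoing byte, shift by one position, add the incoming byte.
        sig = ((sig - data[i - 1] * p_high) * P + data[i + BETA - 1]) % M
        yield i, sig
```

Each incrementally rolled signature equals the signature recomputed from scratch over the same window, which is what makes the per-byte cost constant.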

Even though we use incremental hash functions for chunking, variable size chunking is still a computationally intensive task. To reduce the computational complexity, most variable size chunking approaches establish lower and upper bounds on the chunk size. When the minimum chunk size is set, e.g., at 2 KByte, the chunking module skips computing signatures for positions within the minimum chunk size. If the average chunk size is 10 KByte, establishing the minimum chunk size at 2 KByte can eliminate approximately 20% of the signature computation. The length of the target bit pattern governs the average chunk size. If the target bit pattern is 13 bits long, the probability that a given pattern matches the target corresponds to 2^−13 and the average chunk size approximately corresponds to 8 KByte. Min et al. developed an analytical model that calculates the optimal minimum and maximum chunk sizes[16].
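The BSW loop with these bounds can be sketched as follows. To keep the example self-contained, a simplified direct hash over the window stands in for the incremental hash (a real chunker would roll it as in Eq. 1); the 13-bit mask matches the ~8 KByte expected chunk size discussed above, and the bounds are illustrative:

```python
MASK_BITS = 13
MASK = (1 << MASK_BITS) - 1  # boundary where the low 13 signature bits are zero
C_MIN = 2 * 1024             # lower bound on chunk size
C_MAX = 64 * 1024            # upper bound on chunk size

def signature(window):
    # Stand-in hash over the window; a real chunker rolls this incrementally.
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFF
    return h

def bsw_chunk(data, window=48):
    """Return chunk lengths produced by bounded basic-sliding-window chunking."""
    chunks, last = [], 0
    pos = last + C_MIN                       # skip the minimum chunk size
    while pos <= len(data):
        size = pos - last
        if size >= C_MAX or \
           (pos >= window and signature(data[pos - window:pos]) & MASK == 0):
            chunks.append(size)              # cut: bound reached or pattern matched
            last = pos
            pos = last + C_MIN               # skip C_MIN again after the cut
        else:
            pos += 1                         # slide the window one byte
    if last < len(data):
        chunks.append(len(data) - last)      # tail chunk may be undersized
    return chunks
```

By construction every cut chunk lies between C_MIN and C_MAX; only the file's tail chunk may fall below the minimum.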

We physically examined the I/O bandwidth of the storage devices and the speed of variable size chunking. We used three storage devices with different bandwidths: RAMDISK (DDR3 DRAM, 1333 MHz), SSD (OCZ RevoDrive 3 X2), and RAID 0 (three HDDs). In variable size chunking, the average chunk size was set at approximately 8 KByte (13 bit signature size) with the minimum and the maximum chunk sizes being 2 KByte and 8 KByte, respectively. Fig. 3 illustrates the results. The read bandwidths of RAMDISK, SSD, and RAID 0 (three HDDs) were 4.9 GByte/sec, 660 MByte/sec, and 280 MByte/sec, respectively. On the other hand, the speed of variable size chunking was 97 MByte/sec for all storage devices. In variable size chunking, we used the INC-K algorithm[11], which is an algebraically optimized version of Rabin's algorithm. The variable size chunking operation saturates the CPU, and the speed of variable size chunking is far slower than the I/O bandwidths of the storage devices.

Fig. 3. Read and chunking bandwidth

Fig. 4. System organization of multithreaded chunking, MUCH

4 MUCH: Multithreaded Variable Size Chunking

4.1 Organization

The basic idea of multithreaded chunking (MUCH) is simple and straightforward. MUCH partitions a file into small fractions called segments and distributes them to the chunker threads. A chunker thread partitions the allocated fraction of the file into chunks. There are two types of threads in MUCH: the master thread and the chunker thread. The master thread partitions a file into segments, distributes them to the chunker threads, collects the chunking results, and performs post-processing, if necessary. The chunker thread is responsible for partitioning a given segment into chunks by applying variable size chunking. MUCH

(a) Original file (b) Single thread chunking

(c) Two threads chunking

Fig. 5. Multithreaded Chunking Anomaly

works in an iterative manner. Fig. 4 illustrates the organization of MUCH, the multithreaded chunking system.

Despite its conceptual simplicity, MUCH requires elaborate technical treatment to be effective. The first issue is to guarantee chunking invariability: the result of chunking should not be affected by the number of chunker threads. The legacy Basic Sliding Window algorithm[19], under multithreading, will produce different chunks depending on the degree of multithreading. We call this situation the Multithreaded Chunking Anomaly. To address this issue, we developed Dual Mode Chunking, which enables us to guarantee chunking invariability. We formally proved that Dual Mode Chunking guarantees Chunking Invariability. The second issue is the handling of small files. When a file is small, multi-core chunking may not be able to exploit the computing capabilities of all CPU cores. This issue is particularly important when the files for deduplication are mostly smaller than a few KBytes, e.g., the Linux source tree. We developed a technique called Dynamic Segment Set Prefetching to handle this issue.

4.2 Multithreaded Chunking Anomaly

Most existing chunking algorithms establish lower and upper bounds on the chunk size to expedite the chunking process[19]. This is to avoid too frequent chunking and to keep chunk sizes from growing indefinitely. Establishing the lower and upper bounds on the chunk size complicates the problem when chunking becomes multithreaded. Fig. 5 illustrates an example. In the original file, there are four prospective chunk boundaries, A, B, C, and D (Fig. 5a). In chunking, the minimum chunk size is set at 2 KByte. With single threaded chunking, chunk boundaries are set at A, B, C, and D (Fig. 5b). We label the three chunks as C1, C2, and C3. Now, we chunk the file with two chunker threads. Fig. 5c illustrates the situation. In Fig. 5c, the file is partitioned into S1 and S2, with E being the segment boundary. Each segment is to be processed by the respective thread. In S1, a chunk boundary is established at A. A chunk boundary is also set at E because the chunker thread reaches the end of the segment. In S2, the chunker thread skips

Page 5: WON ET AL.: MUCH: MULTITHREADED CONTENT-BASED FILE … · 2015. 6. 8. · variable size chunking and fixed size chunking using website data, di erent data profiles in academic data,

0018-9340 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Seehttp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citationinformation: DOI 10.1109/TC.2014.2322600, IEEE Transactions on Computers

WON ET AL.: MUCH: MULTITHREADED CONTENT-BASED FILE CHUNKING 5

Fig. 6. Dual Mode Chunking

the first 2 KByte (minimum chunk size), and then starts chunking. The problem is that chunk boundary B is 1.5 KByte from the beginning of S2 and, therefore, goes undetected. In S2, the first chunk boundary is set at C. Depending on whether the master thread coalesces the last chunk of S1 or the first chunk of S2, the final result corresponds to either {CAE, CEB, CBC, C3} or {CAE, CEC, C3}. Neither of these is identical to the result of single threaded chunking, {C1, C2, C3}. This problem occurs because of the lower and upper bounds on the chunk size. We call this phenomenon the Multithreaded Chunking Anomaly.
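The anomaly can be reproduced in a toy setting. Instead of a real hash, the sketch below fixes the prospective boundary offsets, which makes the effect deterministic; all offsets and sizes are illustrative, and master-thread coalescing is deliberately omitted so the raw per-segment results are visible:

```python
C_MIN = 2048                       # minimum chunk size: skip this much after a cut
BOUNDARIES = {3000, 9728, 12000}   # prospective content-defined cut points (illustrative)

def chunk_range(begin, end):
    """Chunk [begin, end), skipping C_MIN bytes after every cut."""
    cuts, last = [], begin
    for pos in sorted(BOUNDARIES):
        if begin <= pos < end and pos - last >= C_MIN:
            cuts.append(pos)
            last = pos
    return cuts

FILE_SIZE = 16000
single = chunk_range(0, FILE_SIZE)

# Two threads: segment boundary at 8192; each thread also cuts at its segment end.
SEG = 8192
two_threads = chunk_range(0, SEG) + [SEG] + chunk_range(SEG, FILE_SIZE)

# The boundary at 9728 lies within C_MIN of the second segment's start (9728 - 8192
# = 1536 < 2048), so the second thread skips right past it: the results diverge.
```

Here `single` is [3000, 9728, 12000] while `two_threads` is [3000, 8192, 12000]: the boundary at 9728 is lost and a spurious one appears at the segment boundary, which is exactly the anomaly Fig. 5 depicts.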

Normally, computer hardware is upgraded every three to five years[28]. The multithreading degree and the segment size need to be updated to properly reflect the characteristics of the underlying hardware, e.g., the number of CPU cores, storage bandwidth, etc., and to exploit the hardware capabilities. A multithreaded chunking algorithm should be guaranteed to generate identical chunks on the same input file independent of the number of chunker threads and the segment size. We call this constraint "Chunking Invariability".

Definition. A given multithreaded chunking algorithm satisfies chunking invariability if it generates identical chunks independent of the multithreading degree and the segment size.

4.3 Dual Mode Chunking

Though the upper and lower bounds on the chunk size cause the Multithreaded Chunking Anomaly, removing these bounds is not a viable option because doing so would increase the computational overhead. We developed a novel method to guarantee Chunking Invariability: Dual Mode Chunking.

In Dual Mode Chunking, the chunker thread starts chunking in slow mode and then switches to accelerated mode when certain conditions are met. In slow mode, the chunker thread imposes neither the minimum nor the maximum chunk size restriction and, therefore, chunking proceeds slowly. In accelerated mode, the chunking algorithm enforces the lower and upper bounds on the chunk size to expedite chunking. Chunking in the accelerated mode is the same as the legacy single threaded chunking.

The chunker thread works in slow mode until it finds a chunk whose size is larger than the minimum chunk size and smaller than or equal to the maximum chunk size minus the minimum chunk size. If this chunk is the first

(a) Marker = C2 (b) Marker = C4 (c) Marker = C2

Fig. 7. Examples of marker chunks

chunk in the segment, the chunker thread does not switch modes and keeps chunking in the slow mode. This chunk is called the Marker Chunk. The marker chunk is defined in Definition 4.3.

Definition. A marker chunk is a chunk that satisfies the following three conditions:

i) It is generated in the slow mode.
ii) Its size is larger than Cmin and smaller than or equal to Cmax − Cmin, where Cmin and Cmax are the minimum and the maximum chunk sizes, respectively.
iii) It is not the first chunk in a segment.

Fig. 7 illustrates examples of marker chunks. Let us assume that the minimum and the maximum chunk sizes are 2 KByte and 64 KByte, respectively. In Fig. 7a, the first chunk, C1, is larger than 2 KByte and smaller than 64 KByte. However, C1 cannot be a marker chunk because it is the first chunk in the segment (violation of condition (iii)). In Fig. 7b, C4 is the marker chunk because C1, C2, and C3 are smaller than Cmin (violation of condition (ii)). In Fig. 7c, C2 is the marker chunk because C1 is larger than the maximum chunk size, 64 KByte (violation of condition (ii)).
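The three conditions of Definition 4.3 reduce to a small predicate. The sketch below uses the Fig. 7 bounds; the flag names are illustrative bookkeeping a chunker thread would maintain, not identifiers from the paper:

```python
C_MIN = 2 * 1024   # Cmin
C_MAX = 64 * 1024  # Cmax

def is_marker_chunk(size, slow_mode, first_in_segment):
    """Marker-chunk test per Definition 4.3 (sizes in bytes)."""
    return (slow_mode                          # (i)   generated in slow mode
            and C_MIN < size <= C_MAX - C_MIN  # (ii)  size in (Cmin, Cmax - Cmin]
            and not first_in_segment)          # (iii) not the segment's first chunk
```

For example, a 5 KByte slow-mode chunk qualifies unless it is the first chunk of its segment, mirroring the Fig. 7a case.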

4.4 Chunk Marshalling

Some chunks generated in the slow mode can be smaller than the minimum chunk size or larger than the maximum chunk size. The master thread performs post-processing so that the final results are identical to the results obtained from single threaded chunking. We call this post-processing activity chunk marshalling. In this phase, the master thread performs two types of tasks. The first is chunk coalescing. The master thread coalesces a chunk with the next one if the preceding one is smaller than the minimum size. Chunk coalescing repeats until the newly created chunk becomes larger than or equal to the minimum size. The second task is chunk splitting. If a chunk is larger than the maximum chunk size, it is partitioned into two chunks so that the size of the first chunk equals the maximum chunk size. The master thread splits the chunk recursively until the remaining chunk size is smaller than or equal to the maximum chunk size.
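The two marshalling passes can be sketched over lists of chunk sizes. This is a simplified model of the master thread's post-processing (the undersized-tail handling here is an assumption, as the text does not specify it); names are illustrative:

```python
C_MIN = 2 * 1024
C_MAX = 64 * 1024

def marshal(sizes):
    """Coalesce undersized chunks with their successors, then split oversized ones."""
    # Pass 1: coalescing — merge a runt chunk into the next one until the
    # accumulated chunk reaches the minimum size.
    coalesced, acc = [], 0
    for s in sizes:
        acc += s
        if acc >= C_MIN:
            coalesced.append(acc)
            acc = 0
    if acc:                              # undersized tail (assumption: fold into last)
        if coalesced:
            coalesced[-1] += acc
        else:
            coalesced.append(acc)
    # Pass 2: splitting — recursively peel off C_MAX-sized chunks until the
    # remainder fits within the maximum chunk size.
    out = []
    for s in coalesced:
        while s > C_MAX:
            out.append(C_MAX)
            s -= C_MAX
        out.append(s)
    return out
```

Marshalling preserves the total number of bytes while forcing every resulting chunk under the maximum bound; for instance, two runts of 500 and 600 bytes are absorbed into the following chunk, and a 200000-byte slow-mode chunk is split into three 65536-byte chunks plus a remainder.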

As a result of chunk marshalling, the sizes of all chunks are guaranteed to lie between the minimum and the maximum chunk size. Remember that, for all chunks created by a chunker thread, the last chunk produced in the slow mode is the marker chunk. The marker chunk plays a critical role in guaranteeing


that, as a result of chunk marshalling, the sizes ofall chunks lie between the minimum and maximumchunk sizes. A marker chunk is larger than or equal tothe minimum chunk size and smaller than or equal tothe (maximum - minimum) chunk size. When a certainchunk, ci, is coalesced with the next one, ci+1, and ci+1is a marker chunk, the newly created chunk cannotbe larger than the maximum size since ci < cmin andci+1 ≤ cmin + cmax, where cmin and cmax denote theminimum and the maximum chunk sizes.

Theorem 1: Dual Mode Chunking and chunk marshalling guarantee chunking invariability.

Proof: Refer to Appendix B.

4.5 Handling Small Files: Dynamic Segment Set Prefetching

When the master thread fetches a fraction of a file from disk to memory, the remaining portion of the file may be smaller than the size of a segment set. We call this phenomenon Segment Set Fragmentation. With a fragmented segment set, either only a subset of the available threads is allocated to the segments, or the master thread partitions the fragmented segment set into units smaller than a segment to distribute them to all threads. When the segment size becomes smaller, the overhead of Dual Mode Chunking and chunk coalescing may overwhelm the performance gain obtained from parallelizing the chunking operation.

We developed an effective technique, Dynamic Segment Set Prefetching, to address the Segment Set Fragmentation problem. The master thread checks whether the next segment set is fragmented before it reads the segment set from storage. If the next segment set is not fragmented, the master thread proceeds and fetches the segment set. Otherwise, the master thread fetches the current segment set together with all remaining data. Then, the master thread allocates equal amounts of data to each chunker thread.
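The fetch-planning decision above can be sketched as a small generator. This is our own illustrative sketch (the function name and the byte-count interface are assumptions); it absorbs a would-be fragment into the preceding fetch, as DSSP does:

```python
def plan_fetches(file_size, set_size):
    """Yield the sizes of successive segment-set fetches under DSSP.

    If the set *after* the current one would be a fragment (smaller than
    set_size), the current fetch absorbs all remaining data instead.
    """
    remaining = file_size
    while remaining > 0:
        if remaining >= 2 * set_size:
            yield set_size
            remaining -= set_size
        else:
            yield remaining  # current set plus the would-be fragment
            remaining = 0
```

For instance, a 250 KByte file with a 100 KByte segment set is fetched as 100 KByte followed by 150 KByte, so no thread is left with a fragment on the final round.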

5 Performance Modeling

5.1 Signature Generation Overhead in Single Threaded Chunking

Let p be the probability that the chunk boundary condition is met at a given byte position. For example, if the chunk boundary condition is that the signature matches a predefined 10 bit pattern, p corresponds to 1/2^10. Let S be the chunk size. The average chunk size in single threaded chunking can be formulated as in Eq. 2.

E[S] = Σ_{i=cmin}^{cmax} i(1 − p)^(i−1) p
     = cmin(1 − p)^(cmin−1) + (1/p)(1 − p)^(cmin) − (1 − p)^(cmax)(1/p + cmax)   (2)
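Eq. 2 can be sanity-checked numerically against its defining sum. The script below is our own verification, not part of the paper; the parameter values follow the 10 bit example above and Table 1's cmin and cmax:

```python
# Numerical check of Eq. 2: the closed form for E[S] should match the
# defining sum term for term.
p = 2 ** -10          # boundary-match probability (10 bit pattern example)
cmin, cmax = 2048, 65536
q = 1.0 - p

# Brute-force sum: E[S] = sum_{i=cmin}^{cmax} i (1-p)^(i-1) p
qpow = q ** (cmin - 1)
brute = 0.0
for i in range(cmin, cmax + 1):
    brute += i * qpow * p
    qpow *= q

# Closed form from Eq. 2.
closed = cmin * q ** (cmin - 1) + (1 / p) * q ** cmin - q ** cmax * (1 / p + cmax)
```

The two values agree to floating-point precision, which confirms the algebraic reduction of the truncated geometric sum.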

TABLE 1
Modeling parameters: p = 2^−13, cmin = 2 KB, cmax = 64 KB, λ = 14 nsec, δ = 4 msec, ζ = 8 usec, w = 48 byte

Symbol   Meaning
lacc     no. of signatures in accelerated mode
lslow    no. of signatures in slow mode
nacc     average no. of chunks in accelerated mode
nslow    average no. of chunks in slow mode
sa       average chunk size in accelerated mode
sb       average chunk size in slow mode
p        signature match probability
cmin     minimum chunk size
cavg     average chunk size
cmax     maximum chunk size
cmark    upper bound of the marker chunk size
λ        time taken to generate a single signature
δ        disk seek time
ζ        time taken to merge two chunks
w        window size
Braw     disk bandwidth
M        segment size
N        number of threads

Let M and cmin be the file size and the minimum chunk size, respectively. In single threaded chunking, the expected number of signature generations, lsingle, can be formulated as in Eq. 3.

lsingle = M − (M / E[S]) · cmin   (3)

For each chunk, the chunker thread skips the first cmin bytes and then computes signatures at every byte position until the chunk boundary is found. The number of chunks in a file is M / E[S], so (M / E[S]) · cmin corresponds to the number of byte positions where signatures are not computed.

5.2 Signature Generation in Multithreaded Chunking

We computed the number of signature generations in multithreaded chunking. Multithreaded chunking has a notion of a segment. Let M be the size of a segment. Let lMUCH be the number of signature generations for a segment. Let lslow and lacc be the numbers of signature generations in slow mode and in accelerated mode, respectively. In slow mode, signatures are computed at all byte positions. The number of signature generations in a chunker thread of MUCH can be formulated as in Eq. 4.

lMUCH = lslow + lacc = M − ((M − lslow) / E[S]) · cmin   (4)

The number of signature generations in accelerated mode can be obtained in the same manner as in Eq. 3. However, the size of the accelerated mode region is the size of a segment minus the size of the slow mode region. The number of signature computations in accelerated mode, lacc, can be formulated as in Eq. 5.

lacc = (M − lslow) − ((M − lslow) / E[S]) · cmin   (5)

Now, we delve into finding the expected size of the slow mode region. In slow mode, if a marker chunk is found, the chunking mode changes from slow to accelerated mode. Let pm be the probability that a chunk satisfies the chunk size requirement of a marker chunk in Definition 4.3, i.e., the chunk size is larger than or equal to the minimum chunk size and smaller than or equal to (max size − min size). Then, pm can be formulated as in Eq. 6.

pm = Σ_{i=cmin}^{cmax−cmin} (1 − p)^(i−1) p = (1 − p)^(cmin−1) − (1 − p)^(cmark), where cmark = cmax − cmin   (6)

We computed the expected number of chunks in slow mode. Let P(M = i) be the probability that the ith chunk is the marker. From the definition of the marker chunk, the first chunk cannot be the marker. Then, P(M = 1) = 0, P(M = 2) = pm, . . .. The probability that the ith chunk is the marker is formulated as P(M = i) = (1 − pm)^(i−2) pm, i ≥ 2. The expected number of chunks in the slow mode region can be formulated as E[M] = Σ_{i≥2} i · P(M = i) and can be summarized as in Eq. 7.

E[M] = 1/pm + 1   (7)
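Both Eq. 6 and Eq. 7 can be verified numerically. The script below is our own check, using the parameter values of Table 1:

```python
# Numerical check of Eq. 6 and Eq. 7 (parameters from Table 1).
p = 2 ** -13
cmin, cmax = 2048, 65536
cmark = cmax - cmin
q = 1.0 - p

# Eq. 6: brute-force sum of the geometric probabilities vs. the closed form.
qpow = q ** (cmin - 1)
pm_brute = 0.0
for i in range(cmin, cmark + 1):
    pm_brute += qpow * p
    qpow *= q
pm = q ** (cmin - 1) - q ** cmark

# Eq. 7: E[M] = sum_{i>=2} i (1-pm)^(i-2) pm, truncated where terms vanish.
em_series = sum(i * (1 - pm) ** (i - 2) * pm for i in range(2, 400))
em_closed = 1 / pm + 1
```

With these parameters pm is roughly 0.78, so the expected slow mode region holds only a little more than two chunks on average.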

There is another interpretation of Eq. 7. The probability that there are i non-marker chunks in slow mode can be formulated as (1 − pm)^(i−1) pm. Then, the expected number of non-marker chunks in the slow mode region can be formulated as Σ_{i=1}^{∞} i · (1 − pm)^(i−1) pm = 1/pm. In the slow mode region, there are, on average, one marker chunk and 1/pm non-marker chunks.

We computed the expected chunk size in slow mode. Let Sb and Sm be the size of a non-marker chunk and the size of a marker chunk in the slow mode region, respectively. The probability that the size of a chunk is i, given that it is a marker, is a conditional probability. Therefore, P(Sm = i) can be formulated as P(Sm = i) = P(S = i)/pm = (1 − p)^(i−1) p / ((1 − p)^(cmin−1) − (1 − p)^(cmark)). The expected size of a marker chunk, E[Sm], can be formulated as in Eq. 8. Details of the derivation are in Appendix A.

E[Sm] = (1 / ((1 − p)^(cmin−1) − (1 − p)^(cmark))) · (cmin(1 − p)^(cmin−1) + (1/p)(1 − p)^(cmin) − (cmax − cmin + 1/p)(1 − p)^(cmax−cmin))   (8)

In a similar manner, we computed the expected size of non-marker chunks in slow mode. The probability that the size of a non-marker chunk is i corresponds to P(Sb = i) = P(S = i) / (((1 − p)^(w−1) − (1 − p)^(cmin−1)) + ((1 − p)^(cmark) − (1 − p)^(M+2))) = (1 − p)^(i−1) p / (((1 − p)^(w−1) − (1 − p)^(cmin−1)) + ((1 − p)^(cmark) − (1 − p)^(M+2))). A signature is computed for every w byte window. In our implementation, the window size is set to 48 Byte. The expected size of non-marker chunks in slow mode, E[Sb], can be formulated as in Eq. 9. Details of the derivation are in Appendix A.

E[Sb] = (w(1 − p)^(w−1) + (1/p)(1 − p)^(w) − (1 − p)^(cmin−1)(1/p + cmin − 1) + (cmark + 1)(1 − p)^(cmark) + (1/p)((1 − p)^(cmark+1) − (1 − p)^(M+1)) − M(1 − p)^(M+1)) · (1 / (((1 − p)^(w−1) − (1 − p)^(cmin−1)) + ((1 − p)^(cmark) − (1 − p)^(M+2))))   (9)

In slow mode, the chunker thread computes signatures at every byte position; therefore, the number of signature generations equals the size of the slow mode region. The number of chunks in the slow mode region corresponds to 1/pm, and the last chunk in the slow mode region is the marker chunk. 1/p + w − 1 is the average size of the first chunk of a segment in slow mode chunking. Finally, the number of signature generations in the slow mode region, lslow, can be formulated as in Eq. 10.

lslow = (1/pm − 1) · E[Sb] + E[Sm] + (1/p + w − 1)   (10)

Comparing Eq. 3 and Eq. 4, MUCH introduces additional signature generations, (lslow / E[S]) · cmin. This additional overhead is caused by Slow Mode chunking, which imposes neither the minimum nor the maximum size requirement on the chunk sizes. With large segments, this overhead becomes less significant since a relatively smaller portion of each segment is processed in Slow Mode chunking. The normalized multithreaded chunking overhead can be denoted by

γ = (lMUCH − lsingle) / lsingle = ((lslow / E[S]) · cmin) / lsingle.

We visualized the normalized multithreaded chunking overhead in Fig. 8. For segments larger than 200 KByte, the overhead is less than 4%. This additional overhead is offset by the performance improvement obtained from chunking multiple segments in parallel.
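The shrinkage of γ with segment size can be sketched with a small helper. This is our own illustrative function; lslow and E[S] are supplied as plain numbers rather than derived from Eq. 10 and Eq. 2, and the sample values are assumptions:

```python
def gamma(l_slow, e_s, cmin, M):
    """Normalized multithreaded chunking overhead (the gamma of the text).

    l_slow and e_s (E[S]) are taken as given; l_single follows Eq. 3
    with M as the segment size.
    """
    l_single = M - (M / e_s) * cmin   # Eq. 3
    return (l_slow / e_s) * cmin / l_single
```

Because lslow is a fixed per-segment cost while lsingle grows with M, γ decreases as the segment grows, matching the trend in Fig. 8.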

We verified the accuracy of our model, Eq. 4, through physical experiments. We examined the length of the slow mode region and the number of chunks in slow mode under varying target pattern sizes. We set the minimum and the maximum chunk sizes at 2 KByte and 64 KByte, respectively. The degree of multithreading is 4 and the file size is 1 GByte. Fig. 9 illustrates the results: Fig. 9a and Fig. 9b illustrate the length of the slow mode region and the number of chunks in the slow mode region from the analytical models (Eq. 10) and from physical experiments. As shown in Fig. 9, the performance model in Eq. 10 precisely models the real world behavior.

Fig. 8. Normalized multithreaded chunking overhead (γ vs. segment size in KByte).

Fig. 9. Slow mode model validation: file size = 250 MByte, cmin = 2 KByte, cmax = 64 KByte, segment size = 100 KByte. (a) No. of signature generations; (b) No. of chunks.

Let N be the number of threads. The number of signature generations in each thread corresponds to lMUCH / N. Then, the normalized performance improvement from multithreaded chunking can be formulated as lsingle / (lMUCH / N) = N · lsingle / lMUCH.

5.3 Chunking Performance in MUCH

Three components constitute the overhead of multicore chunking: (i) the chunking overhead, Tchunking; (ii) the time to fetch a segment set from the disk, Tdisk; and (iii) the chunk marshalling overhead, Tmarshall. Chunking is performed by the chunker threads, while segment set fetching and chunk marshalling are performed by the master thread. We developed a performance model for multicore chunking. Without loss of generality, we computed the overhead for a single segment set.

Let λ be the time to compute one signature. Then, the time to chunk a segment can be modeled as in Eq. 11, where M is the size of a segment.

Tchunking = λ · lMUCH = λ · (lslow + lacc) = λ · (M − ((M − lslow) / E[S]) · cmin)   (11)

With an M byte segment and N chunker threads, the segment set fetching overhead, Tdisk, corresponds to Eq. 12, where δ and Braw are the average disk overhead (seek and rotational latency) and the maximum disk bandwidth, respectively. We assumed that a single segment is not fragmented, which is not an unreasonable assumption given the aggressive block pre-allocation policy of modern filesystems [43] and the modeling scheme of existing disk scheduling efforts [44].

Tdisk = δ + (M · N) / Braw, where Braw is the sequential read bandwidth   (12)

In chunk marshalling, the number of coalescing operations is bounded by the total number of chunks generated in the slow mode: Tclsc = ζ · (1/pm) · N, where ζ is the time to coalesce two chunks. Finally, the marshalling overhead corresponds to Tmarshall = Tclsc + Tsplit, where the time for the split operation, Tsplit, is negligible. The chunking bandwidth of MUCH can be formulated as in Eq. 13.

BMUCH = (M · N) / (max{Tchunking, Tdisk} + Tmarshall)   (13)
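Eq. 11 through Eq. 13 combine into a small bandwidth calculator. The function below is our own sketch: E[S], lslow, and pm are passed in as plain numbers (they would come from Eq. 2, Eq. 10, and Eq. 6), and the sample values in the usage are illustrative assumptions, not measurements from the paper:

```python
def b_much(M, N, e_s, l_slow, p_m,
           lam=14e-9, delta=4e-3, zeta=8e-6, braw=280e6, cmin=2048):
    """Chunking bandwidth of MUCH (bytes/sec) following Eq. 11-13.

    Defaults mirror Table 1 (lambda, delta, zeta) and the RAID 0
    sequential read bandwidth used in Fig. 10.
    """
    l_much = M - (M - l_slow) * cmin / e_s           # Eq. 4
    t_chunking = lam * l_much                         # Eq. 11
    t_disk = delta + M * N / braw                     # Eq. 12
    t_marshall = zeta * (1 / p_m) * N                 # coalescing; split cost neglected
    return M * N / (max(t_chunking, t_disk) + t_marshall)  # Eq. 13
```

For instance, with an assumed E[S] of 9000 bytes, lslow of 20000 bytes, and pm of 0.78, a 400 KByte segment yields a higher modeled bandwidth with four threads than with one, reflecting the parallelism captured by the M · N numerator.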

Fig. 10. Chunking bandwidths under varying segment sizes: Bmax = 280 MByte/sec, δ = 4 msec, cmin = 2 KByte, cmax = 64 KByte, target pattern size: 13 bit, λ = 14 nsec (chunking bandwidth in MByte/sec vs. segment size in KByte, 1 to 32 threads).

6 Storage Speed and the Number of CPU Cores

It is not practical to increase the segment size beyond a certain threshold since the additional performance gain becomes marginal. To address this issue, we modeled the theoretical upper bound on the chunking bandwidth with an infinite size segment and then obtained the segment size which achieves a fraction ρ of the ideal chunking speed.

Let B*chunking and B*IO be the ideal bandwidths when the overall throughput is governed by the CPU and by I/O, respectively. They can be formulated as in Eq. 14 and Eq. 15.

B*chunking = (E[S] / (ζ + λ(E[S] − cmin))) · N   (14)

B*IO = Braw   (15)

Let ρ be the ratio between the actual bandwidth and the ideal bandwidth, i.e., ρ = Bchunking / B*chunking or ρ = BIO / B*IO. Let Mchunking(ρ) and MIO(ρ) be the segment sizes to achieve utilization ρ for CPU bound systems and I/O bound systems, respectively. Then, Mchunking(ρ) and MIO(ρ) can be formulated as in Eq. 16 and Eq. 17.

Mchunking(ρ) = ρ(E[S](ζ + ζ/pm) − lslow(ζ − λcmin)) / ((1 − ρ)(ζ + λ(E[S] − cmin)))   (16)

MIO(ρ) = ρ · Braw(δ + N · ζ · (1/pm)) / (N(1 − ρ))   (17)

Fig. 11. The optimal segment size: (a) CPU bound (segment size in KByte vs. ρ), (b) disk bound (segment size in MByte vs. ρ).

Fig. 11 illustrates the segment sizes under a varying number of threads. When a system is CPU bound (Fig. 11a), the segment size is not significantly governed by the number of threads. When the system is I/O bound, it is not necessary to increase the segment set size beyond a certain threshold; the size of a single segment decreases as the number of threads in multicore chunking increases.

Based on Eq. 13, we can compute the effective number of threads needed to saturate a given I/O interface. We assumed that the time to generate a single signature corresponds to 14 nsec, measured on an Intel Core i5 CPU (2.80 GHz) with the INC-K algorithm [1]. Up to an I/O bandwidth of 100 MByte/sec, a single thread is sufficient to fully exploit the I/O bandwidth. For SATA 2.0 and SAS 300 interfaces, we needed to allocate four chunker threads. For a SATA 3.0 (600 MByte/sec) drive, we needed to allocate three threads. For FC over optical fiber (2 GByte/sec), we needed 32 cores; we would need to exploit a different approach, e.g., GPGPU, instead of relying on CPU cores, to fully utilize the underlying I/O bandwidth.
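A back-of-the-envelope estimate of the required thread count can be read off Eq. 14. The sketch below is our own; the default E[S] is an assumed illustrative value, and the full analysis in the paper uses Eq. 13 with all overhead terms, so the resulting counts are only indicative:

```python
import math

def threads_needed(b_target, e_s=9400, cmin=2048, lam=14e-9, zeta=8e-6):
    """Smallest N with B*chunking >= b_target, from Eq. 14 (illustrative E[S])."""
    per_thread = e_s / (zeta + lam * (e_s - cmin))  # bytes/sec one chunker thread sustains
    return math.ceil(b_target / per_thread)
```

With these assumed parameters a single chunker thread sustains roughly 85 MByte/sec, so a 300 MByte/sec interface calls for about four threads, in line with the SAS 300 figure above.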

7 Experiment

7.1 Experiment Setup

We implemented MUCH in a prototype deduplicated backup system, PRUNE [11]. We examined the effectiveness of multicore chunking on three different data sets: a 2 GByte ISO image, a mixture of different sized files (totaling 200 GByte), and the Linux source tree (totaling 341 MByte). Table 2 summarizes the characteristics of the data sets used in our experiment. We examined MUCH on three storage devices with different performance characteristics: RAMDISK, a high performance SSD, and a RAID 0 based storage device that consists of three hard disk drives (Table 3). We ran MUCH on an Intel Core i5 (2.80 GHz, quad-core) with 4 GByte of DRAM and Linux 2.6.32 with the EXT4 filesystem.

TABLE 2
Datasets

Set          Size       Type     Size (MB)   Avg. (MB)
ISO img.     2 GB       iso      2048        -
Mixed Files  200 GB     iso      66764.8     551.7
                        rar      48742.4     18.1
                        avi      61030.4     1245.5
                        zip      13516.8     281.6
                        jpg      4300.8      1.9
                        exe      7372.8      37.8
                        cab      3276.8      16.1
                        all      204853.1    36.9
Linux        341.2 MB   *.c,*.h  341.2       0.011

TABLE 3
Specifications of three storage devices

Type      Vendor     Model
RAMDISK   SAMSUNG    DDR3, 1333 MHz
SSD       OCZ        REVODRIVE3 X2 (PCI-E, 480 GByte)
RAID 0    Seagate    ST31000528AS x 3EA (SATA2, 1 TByte HDD)

Yoon [45] performed an extensive survey of various statistical characteristics of filesystems on different servers: a web server, an e-learning server, a DB server, and a mail server. The median file sizes of the web server, e-learning server, DB server, and mail server correspond to 7.1 KByte, 1.5 KByte, 7.3 KByte, and 2.8 KByte, respectively. The average file sizes correspond to 207.7 KByte, 173.3 MByte, 29.9 KByte, and 306.6 KByte, respectively. It is worth noting that the e-learning server and the mail server have heavy tailed file size distributions. Table 4 summarizes the statistics.

TABLE 4
Various statistical characteristics of filesystems

Server             Average       Median
web server         207.7 KByte   7.1 KByte
e-learning server  173.3 MByte   1.5 KByte
DB server          29.9 KByte    7.3 KByte
mail server        306.6 KByte   2.8 KByte

7.2 Chunking Performance: Large Sized Files

In chunking, the overhead of the open() and close() system calls is non-trivial and becomes more significant as the underlying storage device gets faster. When the file is large, e.g., a tar image of a filesystem snapshot, the overhead of opening and closing the file is not significant. When the file is small, e.g., Linux source files, the overhead of opening and closing a file constitutes a dominant fraction of the overall chunking time and the chunking performance degrades. We tested the chunking performance under three data sets with different file sizes.

First, we chunked a large file (2 GByte), varying the segment sizes from 10 KByte to 800 KByte. Fig. 12a, Fig. 12b, and Fig. 12c illustrate the chunking bandwidths under varying segment size for RAMDISK, SSD, and RAID 0, respectively, plotting the results from the theoretical model and from physical experiments. Our analytical model accurately estimates the chunking bandwidths. With one thread, the chunking speed reaches approximately 96 MByte/sec for all three storage devices. The sequential read bandwidths of RAMDISK (DRAM), SSD, and RAID 0 correspond to 4.7 GByte/sec, 660 MByte/sec, and 280 MByte/sec, respectively; the storage bandwidths are significantly underutilized.

Fig. 12. Chunking bandwidths with ISO image: (a) in memory, (b) in SSD, (c) in RAID 0 (chunking speed in MByte/sec vs. segment size in KByte, 1 to 4 threads, experiment vs. theory).

Fig. 13. Chunking bandwidths with mixed files: (a) SSD, (b) RAID 0 (chunking speed in MByte/sec vs. segment size in KByte, 1 to 4 threads).

For RAMDISK and SSD, the chunking speed reached 340 MByte/sec, which is still only 7% and 51% of the sequential read speeds of RAMDISK and SSD, respectively. When the data was in RAID 0, we could fully exploit the storage bandwidth (280 MByte/sec).

It is not possible to fully exploit the I/O bandwidths of modern high performance storage devices, even when all the CPU cores are fully utilized.

The recent emergence of PCIe attached high-end SSDs with multiple GByte/sec sequential I/O bandwidth [13] will only widen the chasm between the storage speed and the chunking speed. A possible remedy for this problem is to utilize GPGPU to chunk files and/or to use dedicated hardware [4], [39], [40].

In all three storage devices, the chunking speed increased with the segment size. When the file resides in memory, increasing the segment size quickly saturates the CPU. When the file is on disk, we need a 400 KByte segment size to fully exploit the underlying storage (Fig. 12c).

7.3 Chunking Performance: Medium Sized Files

We examined the chunking performance with medium sized files. The total file size was 200 GByte. These sets of files were created to mimic a multimedia data server that harbors relatively large files, e.g., images, music files, video files, executables, etc. There were a total of 5546 files and the average file size was 36.9 MByte. Fig. 13a and Fig. 13b illustrate the results. For the SSD, with four chunker threads and a 500 KByte segment size, the chunking speed reaches 350 MByte/sec, a 350% increase over a single thread, and still 70% of the SSD bandwidth. For RAID 0, with four chunker threads and a 1 MByte segment, the chunking speed reached 250 MByte/sec, which is 98% of the sustained sequential read bandwidth.

7.4 Chunking Performance: Small Sized Files

We tested the chunking bandwidth of MUCH using the Linux source tree. The objective of this experiment is to examine the effectiveness of MUCH when the files are small. The average file size is 11 KByte and 85% of the files are smaller than 20 KByte.

Fig. 14a, Fig. 14b, and Fig. 14c illustrate the results of our experiment. As we can see, when the file is small, increasing the segment size does not improve the chunking performance. This is because when we partition a single file that is smaller than tens of KByte into multiple segments, each segment yields only a few chunks, e.g., one or two. The coalescing overhead then constitutes a significant fraction of the entire chunking overhead and, therefore, the benefit of employing parallelism becomes marginal.

7.5 Scalability

We examined the effects of parallelism. For the three data sets, we increased the number of chunker threads to four and examined the chunking bandwidths. Chunking consists of three components: (i) reading a set of segments, (ii) chunking, and (iii) chunk coalescing. MUCH aims at improving the overall chunking performance by parallelizing the second component: chunking. On the other hand, if the storage device is slow, the time to read a set of segments constitutes a dominant fraction of chunking and, therefore, the benefit of employing multithreading becomes marginal. If the file is small (or if the segment size is small), the benefit of employing multithreading also becomes marginal since a relatively significant fraction of the chunking overhead consists of chunk coalescing. Since I/O and chunk coalescing cannot be parallelized, as the fraction of these components gets larger, MUCH becomes less scalable. Fig. 15 illustrates the results of the experiment. The X and Y axes denote the number of chunker threads and the chunking bandwidths, respectively. As the average file size gets larger, MUCH becomes more scalable. When the file is small (Fig. 16c), employing multiple threads barely helps the chunking performance on the SSD and RAID 0 storage devices.

7.6 Effect of Dynamic Segment Set Prefetching

We examined the relationship between the file size and the effects of Dynamic Segment Set Prefetching. We varied the file sizes from 100 KByte to 4 MByte. In this experiment, MUCH used four threads with a segment size of 100 KByte. Fig. 17 illustrates the results. When a file is in memory, DSSP significantly improves the chunking bandwidth. With a 500 KByte file, the chunking speed increased by 50% when using DSSP. The effect of DSSP becomes less significant as the file size increases. However, even for a 3.7 MByte file, using DSSP increased the chunking bandwidth by more than 10%. Given that most of the files we use are less than 1 MByte, DSSP is a critical mechanism to expedite the chunking process.

Fig. 18 and Fig. 19 illustrate the effects of DSSP for a set of mixed files and for the Linux source files, respectively. DSSP is more effective when the files are relatively small and when each file consists of a small number of segments. For Linux sources, when the files were stored in memory, e.g., RAMDISK, DSSP improved the chunking performance twofold. When the files were in RAID storage, the performance benefit of DSSP was approximately 10%.

Fig. 18. Effect of multithreading degree in DSSP, mixed files: (a) in SSD, (b) in RAID (chunking speed in MByte/sec vs. number of threads, w/ and w/o DSSP).

7.7 Multicore Chunking in Smartphones

Recently introduced high-end smartphones are equipped with four cores [46] or eight cores [47]. We examined the performance of the proposed multicore chunking algorithm on a mobile platform. We ported the MUCH algorithm to a commercially available smartphone (Samsung Galaxy S3, Android 4.1.2, Exynos 4412 1.4 GHz quad-core). The files were stored in the internal flash based storage (/data partition, internal eMMC, Samsung KMVTU000LM). We measured the chunking performance while varying the number of cores. In chunking a large file (ISO image, 2 GByte), the chunking bandwidth increased from 20 MByte/sec to 60 MByte/sec when we increased the number of cores used in chunking from one to four on the smartphone CPU (Exynos 4412). When the files are small (Linux source tree, Fig. 20b), the performance improvement from multicore chunking is not as significant as in the case of large files, since the open() and close() overhead constitutes a significant fraction of the overall chunking latency. Fig. 20 illustrates the results of our experiment. In summary, by utilizing all cores of a quad-core CPU, we increased the chunking performance by up to 4x and achieved up to 350 MByte/sec chunking performance, increasing the storage bandwidth utilization from 20% to 70%.

8 Conclusion

In this work, we proposed a novel multicore chunking algorithm, MUCH, which parallelizes variable size chunking. To date, most of the existing works on deduplication focus on expediting the redundancy detection process, while less attention has been paid to making file chunking faster. We found that variable size chunking is computationally very expensive and is a significant bottleneck in the overall


Fig. 14. Chunking bandwidths with Linux source tree: (a) in RAMDISK, (b) in SSD, (c) in RAID 0 (chunking speed in MByte/sec vs. segment size in KByte, 1 to 4 threads).

Fig. 15. Chunking bandwidths vs. multithreading degree (M: segment size): (a) ISO image, M = 800 KByte; (b) mixed files, M = 1000 KByte; (c) Linux source tree, M = 50 KByte.

Fig. 16. Scalability of multithreading: (a) ISO image, (b) mixed files, (c) Linux source tree (chunking speed and normalized chunking speed vs. number of threads).

Fig. 17. Performance of DSSP vs. file size: (a) in memory, (b) in SSD, (c) in RAID 0. (y-axis: chunking speed (MByte/sec); x-axis: file size (KByte); curves: w/ and w/o DSSP, experiment and theory.)

Fig. 19. Effect of the number of threads in DSSP, Linux source tree: (a) in memory, (b) in SSD, (c) in RAID. (y-axis: chunking speed (MByte/sec); x-axis: number of threads; curves: w/ and w/o DSSP.)


Fig. 20. MUCH in a smartphone (Galaxy S3, Exynos 4412 quad-core (1.4 GHz), Android 4.1.2, Samsung KMVTU000LM eMMC (16 GB)): (a) ISO image, (b) Linux source tree. (y-axis: chunking speed (MByte/sec); x-axis: segment size (KByte); curves: 1–4 threads.)

deduplication process. The performance gap between the CPU chunking speed and the I/O bandwidth is expected to widen further with the recent introduction of faster I/O interconnects and faster storage devices.

Incorporating these technology trends, i.e., the increase in the number of CPU cores and the emergence of faster storage devices, we developed a parallel chunking algorithm that aims to bring variable size chunking speed on par with storage I/O bandwidth. We found that the legacy variable size chunking algorithm yields a different set of chunks when the degree of parallelism changes, a phenomenon we refer to as Multithreaded Chunking Anomaly. We propose a multicore chunking algorithm, MUCH, which guarantees Chunking Invariability. We developed a performance model to compute the segment size that maximizes the chunking bandwidth while minimizing the memory requirement. Through extensive physical experiments, we showed that the performance of MUCH scales linearly with the number of cores: on quad-core CPUs, MUCH brings up to a 4x performance increase when the storage device is sufficiently fast. The benefits of MUCH are most evident when it chunks large files, e.g., tar images of filesystem snapshots, on high performance storage. MUCH increases chunking performance by a factor as high as the number of available CPU cores, without any additional hardware assistance.

Acknowledgments
This work is supported in part by the IT R&D program of MKE/KEIT (No. 10041608, Embedded System Software for New-memory based Smart Device).

References
[1] Y. Won, R. Kim, J. Ban, J. Hur, S. Oh, and J. Lee, "Prun: Eliminating information redundancy for large scale data backup system," in Proceedings of the IEEE International Conference on Computational Sciences and Its Applications (ICCSA '08), Perugia, Italy, March 2008.
[2] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in FAST '08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, Berkeley, CA, USA, 2008, pp. 1–14.
[3] M. Mitzenmacher, "Compressed Bloom filters," IEEE/ACM Transactions on Networking, vol. 10, no. 5, pp. 604–612, October 2002.
[4] J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. Lee, S. Kang, Y. Won, and J. Cha, "Deduplication in SSDs: Model and quantitative analysis," in Proceedings of the 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), Pacific Grove, CA, USA, April 2012, pp. 1–12.
[5] F. Chen, T. Luo, and X. Zhang, "CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives," in Proceedings of the 9th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, February 2011, pp. 77–90.
[6] A. Gupta, R. Pisolkar, B. Urgaonkar, and A. Sivasubramaniam, "Leveraging value locality in optimizing NAND flash-based SSDs," in Proceedings of the 9th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, February 2011, pp. 91–104.
[7] M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer, "Compactly encoding unstructured input with differential compression," Journal of the ACM (JACM), vol. 49, no. 3, pp. 318–367, May 2002.
[8] F. Douglis and A. Iyengar, "Application-specific delta-encoding via resemblance detection," in Proceedings of the USENIX Annual Technical Conference, San Antonio, TX, USA, June 2003.
[9] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky, "SILT: A memory-efficient, high-performance key-value store," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011, pp. 1–13.
[10] B. Debnath, S. Sengupta, and J. Li, "ChunkStash: Speeding up inline storage deduplication using flash memory," in Proceedings of the 2010 USENIX Annual Technical Conference, Boston, MA, USA, June 2010, pp. 215–229.
[11] Y. Won, J. Ban, J. Min, J. Hur, S. Oh, and J. Lee, "Efficient index lookup for de-duplication backup system," in Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems (MASCOTS 2008), Baltimore, MD, USA, September 2008, pp. 1–3 (poster paper).
[12] A. Badam and V. S. Pai, "SSDAlloc: Hybrid SSD/RAM memory management made easy," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[13] OCZ Technology, "OCZ Revo3 X2 product sheet," http://www.ocztechnology.com/.
[14] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse indexing: Large scale, inline deduplication using sampling and locality," in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09), San Francisco, CA, USA, February 2009, pp. 111–124.
[15] C. van Berkel, "Multi-core for mobile phones," in Proceedings of the Conference on Design, Automation and Test in Europe, Nice, France, March 2009, pp. 1260–1265.
[16] J. Min, D. Yoon, and Y. Won, "Efficient deduplication techniques for modern backup operation," IEEE Transactions on Computers, vol. 60, no. 6, pp. 824–840, 2011.
[17] MacroImpact, "Sanique Smart Vault," http://www.macroimpact.com/subpage.php?p=m14.
[18] B. Hong, D. Plantenberg, D. Long, and M. Sivan-Zimet, "Duplicate data elimination in a SAN file system," in Proceedings of the 12th NASA Goddard / 21st IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), Greenbelt, MD, USA, April 2004, pp. 301–314.
[19] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 174–187, 2001.
[20] P. Reynolds and A. Vahdat, "Efficient peer-to-peer keyword searching," Lecture Notes in Computer Science, vol. 2672, no. 2, pp. 21–40, 2003.
[21] A. Chervenak, V. Vellanki, and Z. Kurmas, "Protecting file systems: A survey of backup techniques," in Proceedings of the Joint NASA and IEEE Mass Storage Conference, vol. 3, College Park, MD, USA, March 1998.


[22] C. Policroniades and I. Pratt, "Alternatives for detecting redundancy in storage systems data," in Proceedings of the 2004 USENIX Annual Technical Conference, Boston, MA, USA, June 2004, pp. 73–86.
[23] W. Tichy, "RCS: A system for version control," Software: Practice & Experience, vol. 15, no. 7, pp. 637–654, July 1985.
[24] K. Eshghi and H. Tang, "A framework for analyzing and improving content-based chunking algorithms," Hewlett-Packard Labs Technical Report TR, vol. 30, 2005.
[25] D. Meister and A. Brinkmann, "Multi-level comparison of data deduplication in a backup scenario," in SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, Haifa, Israel, May 2009, pp. 1–12.
[26] M. Rabin, Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory, Harvard University, 1981.
[27] V. Henson, "An analysis of compare-by-hash," in Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, USA, May 2003, pp. 13–18.
[28] S. Hanselman and M. Pegah, "The wild wild waste: e-waste," in Proceedings of the 35th Annual ACM SIGUCCS Fall Conference, Lake Buena Vista, FL, USA, October 2007, pp. 157–162.
[29] "Final rule: Retention of records relevant to audits and reviews," Securities and Exchange Commission, 17 CFR Part 210, Release Nos. 33-8180; 34-47241; IC-25911; FR-66; File No. S7-46-02, RIN 3235-AI74.
[30] W. H. Roach Jr., Medical Records and the Law. Jones & Bartlett Publishers, 2006.
[31] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[32] W. Pugh, "Skip lists: A probabilistic alternative to balanced trees," Communications of the ACM, vol. 33, no. 6, pp. 668–676, 1990.
[33] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, p. 43, 2003.
[34] J. Hamilton and E. Olsen, "Design and implementation of a storage repository using commonality factoring," in Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2003), San Diego, CA, USA, April 2003, pp. 178–182.
[35] D. Bhagwat, K. Eshghi, D. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), London, UK, September 2009.
[36] D. Bhagwat, K. Pollack, D. D. E. Long, T. Schwarz, E. L. Miller, and J.-F. Paris, "Providing high reliability in a minimum redundancy archival storage system," in MASCOTS '06: Proceedings of the 14th IEEE International Symposium on Modeling, Analysis, and Simulation, Monterey, CA, USA, September 2006, pp. 413–421.
[37] C. Liu, Y. Gu, L. Sun, B. Yan, and D. Wang, "R-ADMAD: High reliability provision for large-scale de-duplication archival storage systems," in ICS '09: Proceedings of the 23rd International Conference on Supercomputing, San Francisco, CA, USA, September 2009, pp. 370–379.
[38] Y. Zhang, Y. Wu, and G. Yang, "Droplet: A distributed solution of data deduplication," in Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing (GRID), 2012, pp. 114–121.
[39] Z. Tang and Y. Won, "Multithread content based file chunking system in CPU-GPGPU heterogeneous architecture," in Proceedings of the First International Conference on Data Compression, Communications and Processing (CCP), Palinuro, Italy, June 2011, pp. 58–64.
[40] C. Kim, K.-W. Park, and K. H. Park, "GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system," in Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, 2012, pp. 17–26.
[41] R. Sudarsan, S. Fenves, R. Sriram, and F. Wang, "A product information modeling framework for product lifecycle management," Computer-Aided Design, vol. 37, no. 13, pp. 1399–1411, 2005.
[42] S. Quinlan and S. Dorward, "Venti: A new approach to archival storage," in Proceedings of FAST '02, Monterey, CA, USA, January 2002, pp. 89–101.
[43] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier, "The new ext4 filesystem: Current status and future plans," in Proceedings of the Linux Symposium, vol. 2, 2007, pp. 21–33.
[44] J. Bruno, J. Brustoloni, E. Gabber, B. Ozden, and A. Silberschatz, "Disk scheduling with quality of service guarantees," in Proceedings of the IEEE International Conference on Multimedia Computing and Systems, vol. 2, 1999, pp. 400–405.
[45] Y. D. Young, "Empirical study for efficient deduplication methods," Hanyang University, 2011.
[46] Samsung, "Samsung Galaxy S3 tech specs," http://www.samsung.com/africa_en/consumer/mobile-phone/mobile-phone/smart-phone/GT-I9305RWDXFA-spec.
[47] Samsung, "Samsung Galaxy S4 tech specs," http://www.samsung.com/africa_en/consumer/mobile-phone/mobile-phone/smart-phone/GT-I9500ZWEXFA-spec.

Youjip Won received the BS and MS degrees in computer science from Seoul National University, Korea, in 1990 and 1992, respectively. He received the PhD degree in computer science from the University of Minnesota, Minneapolis, in 1997. After receiving the PhD degree, he joined Intel as a server performance analyst. Since 1999, he has been with the Division of Computer Science and Engineering, Hanyang University, Seoul, Korea, as a professor. His research interests include operating systems, file and storage subsystems, multimedia networking, and network traffic modeling and analysis.

Kyeongyeol Lim received the BS degree in Information & Communication from Baekseok University in 2011. He is currently working toward the MS degree in the Division of Computer Science and Engineering, Hanyang University, Seoul, Korea. His research interests include deduplication systems and operating systems.

Jaehong Min received the BS and MS degrees in computer science from Hanyang University, Seoul, Korea, in 2008 and 2010, respectively. He is interested in file systems and operating systems. After graduation, he worked for three years as a software engineer at MacroImpact Co., developing deduplication based backup software. He is now working on software architecture design at Samsung Electronics.


Appendix A
Expected Chunk Size

$$
\begin{aligned}
E[S_m] &= \sum_{i=c_{min}}^{c_{max}-c_{min}} i \cdot P(S_m = i)\\
&= \sum_{i=c_{min}}^{c_{max}-c_{min}} \frac{i\,(1-p)^{i-1}p}{(1-p)^{c_{min}-1} - (1-p)^{c_{mark}}}\\
&= \frac{p}{(1-p)^{c_{min}-1} - (1-p)^{c_{mark}}} \sum_{i=c_{min}}^{c_{max}-c_{min}} i\,(1-p)^{i-1}\\
&= \frac{1}{(1-p)^{c_{min}-1} - (1-p)^{c_{mark}}}\Big( c_{min}(1-p)^{c_{min}-1} + \frac{1}{p}(1-p)^{c_{min}}\\
&\qquad - \Big(c_{max}-c_{min}+\frac{1}{p}\Big)(1-p)^{c_{max}-c_{min}} \Big)
\end{aligned}
\tag{18}
$$

$$
\begin{aligned}
E[S_b] &= \sum_{i=w}^{c_{min}-1} i \cdot P(S_b = i) + \sum_{i=c_{mark}+1}^{M} i \cdot P(S_b = i)\\
&= \frac{\sum_{i=w}^{c_{min}-1} i\,(1-p)^{i-1}p + \sum_{i=c_{mark}+1}^{M} i\,(1-p)^{i-1}p}{\big((1-p)^{w-1} - (1-p)^{c_{min}-1}\big) + \big((1-p)^{c_{mark}} - (1-p)^{M+2}\big)}\\
&= \frac{1}{\big((1-p)^{w-1} - (1-p)^{c_{min}-1}\big) + \big((1-p)^{c_{mark}} - (1-p)^{M+2}\big)}\\
&\qquad \Big( w(1-p)^{w-1} + \frac{1}{p}(1-p)^{w} - \Big(\frac{1}{p}+c_{min}-1\Big)(1-p)^{c_{min}-1}\\
&\qquad + (c_{mark}+1)(1-p)^{c_{mark}} + \frac{1}{p}\big((1-p)^{c_{mark}+1} - (1-p)^{M+1}\big) - M(1-p)^{M+1} \Big)
\end{aligned}
\tag{19}
$$
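The closed form of Eq. (18) can be sanity-checked numerically against direct summation of the truncated geometric distribution. The script below assumes $c_{mark} = c_{max} - c_{min}$, i.e., that the upper summation limit coincides with $c_{mark}$ in the normalizing constant; this is an interpretation consistent with the summation limits above, not a definition stated in this appendix.

```python
# Sanity check of Eq. (18): E[S_m] for the truncated geometric chunk-size law.
# Assumption: c_mark = c_max - c_min, so the normalizing constant
# Z = (1-p)^(c_min-1) - (1-p)^c_mark equals sum_{i=c_min}^{c_max-c_min} (1-p)^(i-1) p.
p = 0.01                       # per-offset boundary-match probability
x = 1.0 - p
c_min, c_max = 4, 40
b = c_max - c_min              # upper summation limit (= c_mark, assumed)
Z = x ** (c_min - 1) - x ** b  # normalizing constant of P(S_m = i)

# E[S_m] by direct summation of i * P(S_m = i)
direct = sum(i * x ** (i - 1) * p for i in range(c_min, b + 1)) / Z

# E[S_m] by the closed form of Eq. (18)
closed = (c_min * x ** (c_min - 1) + x ** c_min / p
          - (b + 1.0 / p) * x ** b) / Z

assert abs(direct - closed) < 1e-9
assert c_min <= direct <= b    # the expectation lies in the support
```

The agreement follows from the identity $p\sum_{i=a}^{b} i x^{i-1} = a x^{a-1} + \frac{1}{p}x^{a} - (b + \frac{1}{p})x^{b}$ with $x = 1-p$, which is exactly the bracketed term in Eq. (18).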

Appendix B
Proof of Chunking Invariability

We prove that Dual Mode Chunking and Inter-Segment Chunk Coalescing guarantee Chunking Invariability. Let b(c) and e(c) be the beginning and the end of a chunk c, respectively, each represented by a file offset.

Definition. For a chunk c and a segment S, c ⊂ S if b(c) ≥ b(S) and e(c) ≤ e(S).

Definition. Two chunks c_i and c_j are identical, denoted c_i = c_j, if b(c_i) = b(c_j) and e(c_i) = e(c_j).

Lemma 2: Let S be a segment, and let B = {b_1, ..., b_l} and C = {c_1, ..., c_n} be the sets of chunks generated from S by slow mode chunking and by single threaded chunking, respectively. If we apply chunk coalescing to B, then the resulting set of coalesced chunks B* satisfies B* = C. We omit the proof of Lemma 2.

Theorem 3: For a given file, let M = {m_1, m_2, ..., m_l} and C = {c_1, c_2, ..., c_n} be the sets of chunks created by the multithreaded chunking algorithm, MUCH, and by single threaded chunking, respectively. Then M = C.

Proof: If a file consists of one segment, the theorem holds by Lemma 2. We therefore assume that the file consists of more than one segment. The proof is by induction on the number of chunks. c_i ⊂ S_j means that chunk c_i belongs to segment S_j. When c_i lies across two segments S_k and S_{k+1}, we write c_i^L = c_i ∩ S_k and c_i^R = c_i ∩ S_{k+1}.

Fig. 21. Cases of the theorem: (a) Case 1, (b) Case 2, (c) Case 3.

Basis: m_1 = c_1. In MUCH, the first segment of a file is processed in accelerated mode, exactly as in single threaded chunking. Hence the claim.

Induction: Assume c_i = m_i for i = 1, ..., k. We show that c_{k+1} = m_{k+1}. We distinguish three cases based on the relative positions of c_k and c_{k+1} within a segment (Fig. 21).

Case 1: c_k ⊂ S_j and c_{k+1} ⊂ S_j for some S_j.

Assume m_{k+1} is created in accelerated mode. Since b(m_{k+1}) = b(c_{k+1}) and the single threaded and multithreaded algorithms use the same boundary condition, the chunking results are identical. Now assume m_{k+1} is formed from chunks produced in slow mode. Intra-segment chunk coalescing merges slow mode chunks until the resulting chunk satisfies the minimum and maximum chunk size bounds. If e(m_{k+1}) < e(c_{k+1}), then e(c_{k+1}) is not the first byte offset that matches the bit pattern at which the chunk size is at least the minimum chunk size: contradiction. If e(m_{k+1}) > e(c_{k+1}), intra-segment chunk coalescing would have set the end position of m_{k+1} at e(c_{k+1}): contradiction. Hence, when m_{k+1} is formed from slow mode chunks, m_{k+1} = c_{k+1}. Hence the claim.

Case 2: c_k ⊂ S_j and c_{k+1} ∩ S_{j+1} ≠ ∅ for some j.

Assume c_{k+1} ≠ m_{k+1}. Then either e(m_{k+1}) < e(c_{k+1}) or e(m_{k+1}) > e(c_{k+1}).

Assume e(m_{k+1}) < e(c_{k+1}). Let m_1^b, ..., m_l^b denote the chunks generated in slow mode for S_{j+1}, with m_l^b being the marker chunk. The master thread coalesces m_1^b, m_2^b, ..., m_p^b so that the newly created chunk becomes at least the minimum chunk size. Then m_{k+1} = m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_p^b, p ≤ l, where size(m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_p^b) ≥ c_min and size(m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_{p-1}^b) < c_min must hold. By the definition of single threaded chunking, e(c_{k+1}) is the first position after b(c_{k+1}) at which size(c_{k+1}) ≥ c_min. Contradiction. Therefore e(m_{k+1}) ≥ e(c_{k+1}) must hold.

Assume e(m_{k+1}) > e(c_{k+1}). Then m_{k+1} = m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_p^b with e(m_{k+1}) = e(m_p^b), so e(m_{k+1}^L) < e(c_{k+1}). Here e(c_{k+1}) is the first position after b(c_{k+1}) that satisfies the chunk boundary condition with e(c_{k+1}) − b(c_{k+1}) ≥ c_min. By the definition of chunk coalescing, size(m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_p^b) ≥ c_min and size(m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_{p-1}^b) < c_min must hold. Since e(m_{k+1}) > e(c_{k+1}), we have e(c_{k+1}) = e(m_m^b) for some m ≤ p. Then c_{k+1} = m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_m^b with m < p and size(c_{k+1}) ≥ c_min. This contradicts size(m_{k+1}^L ∪ m_1^b ∪ ... ∪ m_{p-1}^b) < c_min. Hence e(m_{k+1}) ≤ e(c_{k+1}). From e(m_{k+1}) ≤ e(c_{k+1}) and e(m_{k+1}) ≥ e(c_{k+1}), we obtain e(m_{k+1}) = e(c_{k+1}), and hence the claim.

Case 3: c_k ∩ S_j ≠ ∅, c_k ∩ S_{j+1} ≠ ∅, and c_{k+1} ⊂ S_{j+1}.

Construct a new segment S'_{j+1} such that b(S'_{j+1}) = b(c_{k+1}). Then both c_{k+1} and m_{k+1} are the first chunks of S'_{j+1}. If m_{k+1} lies in the accelerated mode region, m_{k+1} = c_{k+1}. If m_{k+1} lies in the slow mode region, then by Lemma 2, c_{k+1} = m_{k+1}.

From Cases 1, 2, and 3, m_{k+1} = c_{k+1} holds, and hence the claim. ■