Cbfs Report
8/3/2019 Cbfs Report
ABSTRACT
File systems abstract raw disk data into files and directories, thereby providing
an easy-to-use interface for user applications to store and retrieve persistent information.
Most common file systems that exist today treat file data as opaque entities, and they do not
utilize information about their contents to perform useful optimizations. For example, today's
file systems do not detect files with the same data. In this project, we design and implement a new
file system that understands the contents of the data it stores, to enable interesting
functionality. Specifically, we focus on detecting and eliminating duplicate data items across
files. Eliminating duplicate data has two key advantages: first, the disk just stores a single
copy of a data item even if multiple files share it, thereby saving storage space. Second, the
disk I/O required for reading and writing copies of the same data is eliminated, thereby
improving performance. To implement this, we plan to work on the existing Linux Ext2 file
system and reuse part of its source code. Through our implementation we hope to demonstrate
the utility of eliminating duplicate file data both in terms of space savings and performance
improvements.
TABLE OF CONTENTS
1. INTRODUCTION
2. MOTIVATION
3. BACKGROUND
3.1. File System Background
3.2. Overview of Linux VFS
3.3. Layout of the Ext2 File System
3.4. Content Based File System
4. DESIGN
4.1. Overall Structure
4.2. Detecting Duplicate Data
4.3. Tracking Content Information
4.4. Online Duplicate Elimination
4.5. Discussion
i. Concept in LBFS
ii. File-level Content Checking
5. IMPLEMENTATION
5.1. Data Structures
5.2. Hash Table Operations
5.2.1. Initializing the Hash Table
5.2.2. Compute Hash
5.2.3. Add Entry
5.2.4. Check Duplicate
5.2.5. Remove Entry
5.2.6. Free Hash Table
5.3. Read Data Flow
5.4. Write Data Flow
5.4.1. Overall Write Flow
5.4.2. Unique Data (New Block)
5.4.3. Unique Data (Overwrite Existing Block)
5.4.4. Duplicate Data (New Block)
5.4.5. Duplicate Data (Overwrite)
5.4.6. Delete Data
5.5. Duplicate Eliminated Cache
5.6. Code Snippets
6. EVALUATION
6.1. Correctness Check
6.2. Performance Check
6.3. Overall Analysis
7. RELATED WORK
8. FUTURE WORK
9. REFERENCES
INTRODUCTION:
A file system is a method for storing and organizing files and the data they contain to
make it easy to find and access them. The present file systems in Linux are content-oblivious:
they do not know the nature of the data present in the disk blocks. Therefore, even if two or
more blocks of data are identical, they are stored as redundant copies on the disk. This is
the case dealt with in this project; the outcome is block-level duplicate elimination on disk.
Content hashing is the method used to determine whether two blocks are redundant.
The MD5 algorithm is used to compute the hash value of each block in the disk. Before
each write to the disk, the hash value of the new data is calculated and compared against the
already existing hash values. If the same hash value is already present in the hash table, the
block is a duplicate.
This project is implemented in the Linux kernel 2.6, wherein the existing ext2 file
system functions are modified accordingly to accomplish the goal of duplicate elimination.
The evaluation of this project is done in two phases, namely the correctness check and the
performance check. A testing program that performs exhaustive file system operations is
executed, and the file system is found to be stable and to produce the expected results.
The performance check is done by evaluating the PostMark results.
This report details the design and implementation of the content-based file system.
Since the main change is done only in the write phase, each write scenario is explained in
detail with the status of the hash table illustrated by diagrams both before and after the program
code segment execution.
MOTIVATION:
The existing ext2 file system in Linux is devoid of any information regarding the
data on the disk. It cannot distinguish whether two blocks of data are unique or
identical. This means that two identical files will be stored in two separate locations on the disk,
requiring double the space of a single copy. This results not only in wasted
disk space but also in increased I/O: reading both files separately, holding two
identical pages of data in the cache, and so on. Let us consider some example cases wherein
this disadvantage proves to be a bigger problem.
Consider the case wherein virtual machines are used for testing purposes. When
Linux is installed in a virtual machine whose host OS is also Linux, all the packages of
the host's Linux distribution are already stored on the disk, and installing Linux in the
virtual machine stores the same set of packages separately on the
disk. The average size of the whole set of packages in a Linux distribution is around 5
gigabytes. Thus, about 10 gigabytes of disk space are needed to
have a Linux virtual machine with Linux as the host OS. When the virtual machine runs,
there will again be more I/O operations, resulting in degraded performance. But if the
duplicate blocks are eliminated, considerable disk space (about 5 gigabytes) can be
saved, and the reduced number of I/O operations improves performance to a good extent.
Therefore, if by some means the file system knows the content of the blocks on the
disk before it writes a new block, this disadvantage can very well be eliminated. Hashing is a
common technique to generate a fixed-size, practically unique hash for any arbitrary-sized input. Thus,
when the content of each data block in the disk is hashed, the blocks can easily be compared with
one another and the file system can control the read and write operations accordingly. Content
hashing in this project is done using the Message Digest Algorithm (MD5).
BACKGROUND:
1) File System Background:
A file system is a method for storing and organizing files and the data they contain
to make it easy to find and access them. More formally, it is a set of abstract data types that are
implemented for the storage, hierarchical organization, manipulation, navigation, access, and
retrieval of data.
A disk is just a group of sectors and tracks, so it can only perform operations at
the granularity of sectors, e.g., read a sector or write to a sector. But we need a
hierarchical structure for maintaining files. With just a disk, this cannot be done,
because the hard disk is merely a linear collection of bits (0s and 1s) arranged into tracks and
sectors. File systems serve this purpose: with a file system, we have an interface
between the user programs and the hard disk. We tell the file system to write a file to the disk,
and it is the file system that knows the disk structure and copies the blocks of data to the disk.
A file system treats the data from the disk as fixed-sized blocks that contain
information, but it has no semantic information pertaining to the data. It treats
all data, be it that of a file, a dentry, or an inode, in a single notion as a block of
data.
2) Overview of Linux VFS:
Linux includes the Virtual File System Switch (VFS) layer, which lies
between the applications and the various file systems. Every request to the disk, before
passing to the file system, goes through the VFS. It acts as a generalization layer over all the
underlying file systems. Its function is to locate the file system for a particular file from its file
object and then map the request to the file system-specific functions. Some common
operations, such as read and write, work in the same pattern for most file systems. Therefore, a
generic function is available for these types of operations and the VFS layer maps the request to
these generic functions. There are four basic objects in the VFS, namely:
Super Block Object
Inode Object
File Object
Dentry Object
Consider a process P1 that makes a read request for a file F1 stored in a disk partition
formatted with the Ext2 file system. Similarly, process P2 requests file F2 on an Ext3 file
system. Both requests get transferred to the corresponding system call (sys_read() in this
case). The system call handling routine transfers it to the VFS layer, which in turn
transfers the read request to the corresponding file system's read function. The VFS knows the
file system associated with any file from the file object that is passed to it. Some file systems
in turn map the basic operation requests to generic functions, which ultimately carry out
the request and return the result to the layer above.
3) Page Cache:
The page cache is the main disk cache used by the Linux kernel. In most cases, the
kernel refers to the page cache when reading from or writing to disk. New pages are added to
the page cache to satisfy User Mode processes read requests. If the page is not already in the
cache, a new entry is added to the cache and filled with the data read from the disk. If there is
enough free memory, the page is kept in the cache for an indefinite period of time and can then
be reused by other processes without accessing the disk.
Fig 3.1 : Overview of Linux Filesystems
Similarly, before writing a page of data to a block device, the kernel verifies whether
the corresponding page is already included in the cache; if not, a new entry is added to the
cache and filled with the data to be written on disk. The I/O data transfer does not start
immediately: the disk update is delayed for a few seconds, thus giving a chance to the
processes to further modify the data to be written (in other words, the kernel implements
deferred write operations).
Kernel code and kernel data structures don't need to be read from or written to disk.
Kernel designers have implemented the page cache to fulfill two main requirements:
Quickly locate a specific page containing data relative to a given owner. To take the
maximum advantage from the page cache, searching it should be a very fast operation.
Keep track of how every page in the cache should be handled when reading or writing
its content. For instance, reading a page from a regular file, a block device file, or a swap
area must be performed in different ways, thus the kernel must select the proper operation
depending on the page's owner.
The unit of information kept in the page cache is, of course, a whole page of data. A
page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by
a device number and a block number. Instead, a page in the page cache is identified by an
owner and by an index within the owner's data: usually, an inode and an offset inside the
corresponding file.
4) Buffer Pages:
In old versions of the Linux kernel, there were two different main disk caches: the
page cache, which stored whole pages of disk data resulting from accesses to the contents of the
disk files, and the buffer cache , which was used to keep in memory the contents of the blocks
accessed by the VFS to manage the disk-based file systems.
Starting from stable version 2.4.10, the buffer cache does not really exist anymore. In
fact, for reasons of efficiency, block buffers are no longer allocated individually; instead, they
are stored in dedicated pages called "buffer pages ," which are kept in the page cache.
Formally, a buffer page is a page of data associated with additional descriptors called
"buffer heads", whose main purpose is to quickly locate the disk address of each individual
block in the page. In fact, the chunks of data stored in a page belonging to the page cache are
not necessarily adjacent on disk.
Whenever the kernel must individually address a block, it refers to the buffer page
that holds the block buffer and checks the corresponding buffer head. Here are two common
cases in which the kernel creates buffer pages:
When reading or writing pages of a file that are not stored in contiguous disk blocks.
This happens either because the file system has allocated noncontiguous blocks to the file,
or because the file contains "holes".
When accessing a single disk block (for instance, when reading a superblock or an
inode block).
In the first case, the buffer page's descriptor is inserted in the radix tree of a regular
file. The buffer heads are preserved because they store precious information: the block device
and the logical block number that specify the position of the data on the disk.
In the second case, the buffer page's descriptor is inserted in the radix tree rooted at
the address_space object of the inode in the bdev special file system associated with the block
device. This kind of buffer page must satisfy a strong constraint: all the block buffers must
refer to adjacent blocks of the underlying block device.
An instance of where this is useful is when the VFS wants to read the 1,024-byte
inode block containing the inode of a given file. Instead of allocating a single buffer, the kernel
must allocate a whole page storing four buffers; these buffers will contain the data of a group of
four adjacent blocks on the block device, including the requested inode block.
All the block buffers within a single buffer page must have the same size; hence, on
the 80 x 86 architecture, a buffer page can include from one to eight buffers, depending on the
block size.
When a page acts as a buffer page, all buffer heads associated with its block buffers
are collected in a singly linked circular list. The private field of the descriptor of the buffer page
points to the buffer head of the first block in the page; every buffer head stores in the
b_this_page field a pointer to the next buffer head in the list. Moreover, every buffer head
stores the address of the buffer page's descriptor in the b_page field. Figure 3.2 shows a
buffer page containing four block buffers and the corresponding buffer heads.
Because the private field contains valid data, the PG_private flag of the page is also set; hence,
if the page contains disk data and the PG_private flag is set, then the page is a buffer page.
Notice, however, that other kernel components not related to the block I/O subsystem use the
private and PG_private fields for other purposes.
4) Writing Dirty Pages to Disk
The kernel keeps filling the page cache with pages containing data of block devices.
Whenever a process modifies some data, the corresponding page is marked as dirty that is, its
PG_dirty flag is set.
Figure3.2 : A buffer page including four buffers and their buffer heads
Unix systems allow the deferred writes of dirty pages into block devices, because
this noticeably improves system performance. Several write operations on a page in cache
could be satisfied by just one slow physical update of the corresponding disk sectors.
Moreover, write operations are less critical than read operations, because a process is usually
not suspended due to delayed writings, while it is most often suspended because of delayed
reads. Thanks to deferred writes, each physical block device will service, on the average, many
more read requests than write ones.
A dirty page might stay in main memory until the last possible moment, that is, until
system shutdown. However, pushing the delayed-write strategy to its limits has two major
drawbacks:
If a hardware or power supply failure occurs, the contents of RAM can no longer be
retrieved, so many file updates that were made since the system was booted are lost.
The size of the page cache, and hence of the RAM required to contain it, would have to
be huge, at least as big as the size of the accessed block devices.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache gets too full and more pages are needed, or the number of dirty pages
becomes too large.
Too much time has elapsed since a page has stayed dirty.
A process requests all pending changes of a block device or of a particular file to be
flushed; it does this by invoking a sync(), fsync(), or fdatasync() system call.
Buffer pages introduce a further complication. The buffer heads associated with each buffer
page allow the kernel to keep track of the status of each individual block buffer. The PG_dirty
flag of the buffer page should be set if at least one of the associated buffer heads has the
BH_Dirty flag set. When the kernel selects a dirty buffer page for flushing, it scans the
associated buffer heads and effectively writes to disk only the contents of the dirty blocks. As
soon as the kernel flushes all dirty blocks in a buffer page to disk, it clears the PG_dirty flag of
the page.
5) Layout of the Ext2 File System:
The first block in each Ext2 partition is never managed by the Ext2 file system,
because it is reserved for the partition boot sector. The rest of the Ext2 partition is split into
block groups, each of which has the layout shown in Figure 3.3. As you will
notice from the figure, some data structures must fit in exactly one block, while others may
require more than one block. All the block groups in the file system have the same size and are
stored sequentially, thus the kernel can derive the location of a block group on disk simply
from its integer index.
Figure 3.3 : Layouts of an Ext2 partition and of an Ext2 block group
Block groups reduce file fragmentation, because the kernel tries to keep the data blocks
belonging to a file in the same block group, if possible. Each block in a block group contains
one of the following pieces of information:
A copy of the file system's superblock
A copy of the block group descriptors
A data block bitmap
An inode bitmap
A table of inodes
A chunk of data that belongs to a file; i.e., data blocks
If a block does not contain any meaningful information, it is said to be free. As
seen from Figure 3.3, both the superblock and the group descriptors are duplicated in each
block group. Only the superblock and the group descriptors included in block group 0 are used
by the kernel, while the remaining superblocks and group descriptors are left unchanged; in
fact, the kernel doesn't even look at them. When the e2fsck program executes a consistency
check on the file system status, it refers to the superblock and the group descriptors stored in
block group 0, and then copies them into all other block groups. If data corruption occurs and the
main superblock or the main group descriptors in block group 0 become invalid, the system
administrator can instruct e2fsck to refer to the old copies of the superblock and the group
descriptors stored in a block group other than the first. Usually, the redundant copies store
enough information to allow e2fsck to bring the Ext2 partition back to a consistent state.
Figure 3.4 shows the actual mapping of an inode to the corresponding data
blocks in a single group.
|--Inode table---| |---Indirect blocks pointing to data blks---------| |---Data Blks----|
Fig 3.4 Inode pointers in Ext2 filesystem
As shown in the figure above, each entry in the inode table points to a specific
data block, and the contents of the data blocks are never examined. Therefore, multiple
copies of the same information may exist in many data blocks on the disk, and space is
wasted on them.
6) Data Block Addressing in Ext2:
Each nonempty regular file consists of a group of data blocks. Such blocks may be
referred to either by their relative position inside the file (their file block number) or by their
position inside the disk partition (their logical block number).
Deriving the logical block number of the corresponding data block from an offset f
inside a file is a two-step process:
1. Derive from the offset f the file block number, the index of the block that contains the
character at offset f.
2. Translate the file block number to the corresponding logical block number.
Because Unix files do not include any control characters, it is quite easy to derive the
file block number containing the f-th character of a file: simply take the quotient of f and the file
system's block size and round down to the nearest integer.
For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the
character is contained in the first data block of the file, which has file block number 0. If f is
equal to or greater than 4,096 and less than 8,192, the character is contained in the data block
that has file block number 1, and so on.
This is fine as far as file block numbers are concerned. However, translating a file
block number into the corresponding logical block number is not nearly as straightforward,
because the data blocks of an Ext2 file are not necessarily adjacent on disk.
The Ext2 file system must therefore provide a method to store the connection
between each file block number and the corresponding logical block number on disk. This
mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside
the inode. It also involves some specialized blocks that contain extra pointers, which are an
inode extension used to handle large files.
The i_block field in the disk inode is an array of EXT2_N_BLOCKS components
that contain logical block numbers. In the following discussion, we assume that
EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a
larger data structure, which is illustrated in Figure 3.5. As can be seen in the figure, the 15
components of the array are of 4 different types:
The first 12 components yield the logical block numbers corresponding to the first 12
blocks of the file, that is, to the blocks that have file block numbers from 0 to 11.
The component at index 12 contains the logical block number of a block, called the
indirect block, that represents a second-order array of logical block numbers. Its entries
correspond to the file block numbers ranging from 12 to b/4+11, where b is the file
system's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the
formula). Therefore, the kernel must look in this component for a pointer to a block, and
then look in that block for another pointer to the ultimate block that contains the file
contents.
The component at index 13 contains the logical block number of an indirect block
containing a second-order array of logical block numbers; in turn, the entries of this
second-order array point to third-order arrays, which store the logical block numbers that
correspond to the file block numbers ranging from b/4+12 to (b/4)^2+(b/4)+11.
Finally, the component at index 14 uses triple indirection: the fourth-order arrays store
the logical block numbers corresponding to the file block numbers ranging from
(b/4)^2+(b/4)+12 to (b/4)^3+(b/4)^2+(b/4)+11.
Figure 3.5 : Data structures used to address the file's data blocks
In Figure 3.5, the number inside a block represents the corresponding file block number. The
arrows, which represent logical block numbers stored in array components, show how the
kernel finds its way through indirect blocks to reach the block that contains the actual contents
of the file.
Notice how this mechanism favors small files. If the file does not require more than 12 data
blocks, the data can be retrieved in two disk accesses: one to read a component in the i_block
array of the disk inode and the other to read the requested data block. For larger files, however,
three or even four consecutive disk accesses may be needed to access the required block. In
practice, this is a worst-case estimate, because the dentry, inode, and page caches contribute
significantly to reduce the number of real disk accesses.
Notice also how the block size of the file system affects the addressing mechanism, because a
larger block size allows Ext2 to store more logical block numbers inside a single block.
Table 1 shows the upper limit placed on a file's size for each block size and each addressing
mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of
data, the first 12 KB of the file can be accessed through direct mapping and the remaining 13-268
KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-
bit architectures by specifying the O_LARGEFILE opening flag.
Table 1. File-size upper limits for data block addressing
Block size Direct 1-Indirect 2-Indirect 3-Indirect
1,024 12 KB 268 KB 64.26 MB 16.06 GB
2,048 24 KB 1.02 MB 513.02 MB 256.5 GB
4,096 48 KB 4.04 MB 4 GB ~ 4 TB
THE CONTENT-BASED FILE SYSTEM:
The content-based file system employs a technique called content hashing to
compare the contents of blocks and find whether they are duplicates. The Message Digest
Algorithm is used to compute the hash value of each block. For every data block in the disk, the
corresponding hash value is calculated. At any given instant, the hash table will have one
entry for every valid data block in the disk. There are two hash table structures: the
checksum hash table, which is indexed by the checksum field, and the block hash table, which is
indexed by the block number field. Both structures point to a single copy of the hash
node.
In the context of the file system, there is no change in any of the inode data
structures. The only difference is that the inode table may have multiple entries pointing to the
same block number on the disk. For example, if a 50-megabyte file is created in a partition
having the content-based file system, and the file contains just 40 MB of unique
information with the remaining 10 MB being duplicate data, then the inode table for that
file will have the same number of pointers as a fully unique 50 MB file. But the inode
table will have duplicate pointers for the remaining 10 MB, and therefore the essential space
occupied by the file is just 40 MB. In this way, disk space can be saved to a good
extent. When the file is read, the inode table is accessed in the normal way and the
corresponding data blocks are read. When a data block has already been read from the disk and
is present in the buffer cache, then, if the same block is needed again, the file system does not
need to invoke a disk read again; it can just use the copy of the data in the buffer cache. In this
way, through better cache hits, performance can be enhanced during read operations. Therefore,
for any operation like file create, file overwrite, file append, or truncation, before the disk write
operation is invoked, the checksum values are compared; only if there is no duplicate block is
the new block written. Otherwise, no new data block is written and the disk usage remains the
same. As a result, at any given instant, the disk holds only a single copy of any data and
there is no place for redundancy. In addition to efficient utilization of disk space, content
hashing also helps in maintaining data integrity. In certain cases, this technique can save a
considerable amount of disk space and also increase the performance of the operations.
codes. These codes introduce little overhead and are thus inexpensive to compute. But they
are not collision resistant, which means there may be duplicate blocks that are easily missed
by the error detection codes. Therefore, this technique cannot be relied upon to detect
duplicate blocks. Considering these cases, collision-resistant hashing proves to be a more
viable method for accomplishing the goal of detecting duplicate blocks.
Collision-Resistant Hashing:
Collision-resistant hashing is a technique by which a unique hash value is generated
for each unique block content. The Message Digest Algorithm 5 (MD5) is used for this
purpose. The MD5 algorithm is widely used in many cryptographic applications, and it is
found to be more collision resistant than its predecessors. It takes an input of arbitrary
length and produces an MD5 hash of 128 bits. For any unique input, a unique MD5 hash is
produced. This is a one-way hashing technique wherein, from given data, its corresponding
hash value can very well be calculated, but given an MD5 hash, it is not possible to derive the
input data. We employ this technique to compare the contents of blocks. As the
MD5 hashing algorithm gives a unique hash for every unique input, it can be said that no
two distinct blocks of data will give rise to the same hash value. Therefore, if the hash value is
calculated for every data block, then comparing the hash values amounts to comparing the
actual blocks of data.
4.3 TRACKING CONTENT INFORMATION:
The hash table is used to track the content information pertaining to the various disk
data blocks. For every unique content in the disk, there is an entry in the hash table. The
hash table provides a checksum-to-block-number mapping, which is used to locate duplicate
blocks (if any) before any write operation takes place on the disk. The overall structure of the
hash table is illustrated by the following figure.
Figure 4.3: Hash table structure
(Checksum Hash Table | Disk Blocks | Block Hash Table)
4.4 ONLINE DUPLICATE ELIMINATION:
WRITE SCENARIOS:
1) Unique Data (New Block):
This happens in the case of file creation or a file append. A new block is
allocated and the data is copied to it. Before writing the data to the
block, the hash table is checked and the data is found to be new. Therefore, the new block
entry, its checksum value, and a reference count of 1 are added to the hash table, and the
normal write operation continues.
2) Unique Data (Existing Block):
This happens in the case of file modification. When a file is modified, one or
more blocks corresponding to the file get modified, and before writing them to the disk, the
routine redundancy check is performed. In this case, the new checksum does not
find an entry in the hash table, but there is already an entry for that block
with a different checksum. Therefore, the reference count of that block is
decremented, a new block is allocated, the new contents are copied to it, and the hash table
is updated for the new block.
3) Duplicate Data (New Block):
This happens in the case of file creation or a file append, wherein the efficiency
of the content-based file system is utilized. In this case, before writing to the disk, the hash
table is checked for a duplicate block, and that existing block is mapped to the
corresponding inode pointer. The reference count of the block is incremented.
4) Duplicate Data (Existing Block):
This is a case where an existing block is modified and the modified
contents are found to be duplicates. In this case, the reference count of the old block is
decremented and the reference count of the already existing duplicate block is incremented.
Then, the corresponding inode pointers are updated with that block's number.
5) Delete Data:
This occurs when part of a file's contents is deleted, or when a file is removed. Only the reference count is decremented; when the reference count reaches 0, the block is freed and the hash entry is removed.
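The reference-count rule above can be illustrated with a minimal userspace sketch (hypothetical names, not the kernel routines):

```c
#include <assert.h>

/* Minimal sketch of the delete rule: every share bumps the count, every
 * delete drops it, and the block is actually freed (and its hash entry
 * removed) only when the count reaches zero. */
struct toy_block { int ref_count; int freed; };

void toy_share(struct toy_block *b)  { b->ref_count++; }

void toy_delete(struct toy_block *b)
{
    if (--b->ref_count == 0)
        b->freed = 1;   /* free the disk block, drop the hash entry */
}
```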
DISCUSSION:
Comparison with LBFS:
LBFS is a network file system that conserves communication bandwidth between clients and servers by taking advantage of cross-file similarities. When transferring a file between the client and server, LBFS identifies chunks of data that the recipient already has in other files and avoids transmitting the redundant data over the network. No block-level redundancy check is performed; instead, variable-sized chunks of data are compared to detect duplication. When a modification is made to shared data, the chunk size grows and the kernel must be changed to handle that, which is more complex to implement than the block-level redundancy check performed in this project.
File-Level Redundancy Check:
Another technique, analogous to block-level hashing, is file-level redundancy checking, in which whole files are compared and redundant files are eliminated. The usefulness of this technique is limited by the availability of fully redundant files: the advantage is felt only when a disk partition holds more than one copy of the same file. Block-level redundancy checking, by contrast, eliminates duplicate blocks in an inter-file setting. Even when two files differ as a whole, duplicate elimination can still be applied to the blocks they share. Its benefit is therefore felt much more frequently, since redundant blocks may be spread over many different files.
Therefore, block-level content hashing and redundancy elimination is a beneficial method that yields efficient disk-space usage as well as better performance, by reducing the number of disk accesses and achieving a better cache-hit rate.
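The inter-file advantage argued above can be demonstrated concretely: two buffers that differ as a whole can still share individual fixed-size blocks. The sketch below uses a 4-byte "block" for brevity; the real file system works on disk blocks.

```c
#include <assert.h>
#include <string.h>

/* Count how many of a's blocks also occur somewhere in b: each such
 * block would need to be stored only once under block-level dedup. */
#define BLK 4

int shared_blocks(const char *a, int na, const char *b, int nb)
{
    int i, j, count = 0;
    for (i = 0; i + BLK <= na; i += BLK)
        for (j = 0; j + BLK <= nb; j += BLK)
            if (memcmp(a + i, b + j, BLK) == 0) {
                count++;    /* duplicate block found in the other file */
                break;
            }
    return count;
}
```

File-level checking would find nothing to eliminate here, since the files are not identical; block-level checking still deduplicates the shared blocks.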
5. IMPLEMENTATION:
Figure 5.0: Control logic of CBFS (on each write, the checksum table is checked for a duplicate: if one is found, the page mapping is changed, the hash table is updated, the inode pointer is changed, and the write exits with success; if not found, the block hash table is checked, a stale hash entry is removed or a new one is added, and a new block is allocated before the write completes).
5.1 DATA STRUCTURES:
1) Hash Table Structure:
There are two hash table structures: a checksum hash table (indexed by the checksum field) and a block hash table (indexed by the block number field). Both hash tables point to a single copy of each hash node. The following figure illustrates the overall structure of the hash tables.
Figure 5.1: Structure of hash tables
2) Components of the Hash Table:
Each hash node comprises three fields: a 128-bit checksum, a block number (the logical block number on the disk), and the reference count of the block. The block number field locates the physical block on the disk. Therefore, for every valid data block on the disk there is an entry in the hash table.
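A plausible C rendering of the node just described is given below. The field sizes are taken from the text (a 128-bit checksum); the real kernel structure additionally carries two list_head links, one for each table, which are omitted here.

```c
#include <assert.h>
#include <string.h>

#define CHECKSUM_SIZE 16  /* 128 bits, as stated in the text */

/* One shared node, reachable from both the checksum-indexed and the
 * block-indexed table. */
struct hash_node {
    unsigned char checksum[CHECKSUM_SIZE]; /* content checksum          */
    long block_number;                     /* logical disk block number */
    int ref_count;                         /* files sharing this block  */
};
```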
5.2 HASH TABLE OPERATIONS:
There are two hash table pointers, checksum_htable and block_htable. The operations associated with the hash tables are the following.
1) Initializing the hash table:
Initializing a hash table involves creating the hash table structure and the hash buckets. This is done at the time of mounting the content-based file system, in the kernel function cbfs_fill_super(). The code snippets for initializing the two hash tables are given below:
struct cbfs_hash_table* cbfs_init_checksum_hash_table(int (*hash)(void *)) {
	int i;
	struct cbfs_hash_table *newtable;

	newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
	BUG_ON(!newtable);
	memset(newtable, 0, sizeof(struct cbfs_hash_table));
	newtable->buckets = (struct cbfs_hash_bucket *)
		__get_free_pages(GFP_KERNEL,
			get_order(NUM_CHECKSUM_BUCKETS *
				sizeof(struct cbfs_hash_bucket)));
	BUG_ON(!newtable->buckets);
	memset(newtable->buckets, 0,
		NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket));
	for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++)
		INIT_LIST_HEAD(&newtable->buckets[i].node_list);
	newtable->hash = hash;
	spin_lock_init(&newtable->lock);
	return newtable;
}
struct cbfs_hash_table* cbfs_init_block_hash_table(int (*hash)(void *)) {
	int i;
	struct cbfs_hash_table *newtable;

	newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
	BUG_ON(!newtable);
	memset(newtable, 0, sizeof(struct cbfs_hash_table));
	newtable->buckets = (struct cbfs_hash_bucket *)
		__get_free_pages(GFP_KERNEL,
			get_order(NUM_BLOCK_BUCKETS *
				sizeof(struct cbfs_hash_bucket)));
	BUG_ON(!newtable->buckets);
	memset(newtable->buckets, 0,
		NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket));
	for (i = 0; i < NUM_BLOCK_BUCKETS; i++)
		INIT_LIST_HEAD(&newtable->buckets[i].node_list);
	newtable->hash = hash;
	spin_lock_init(&newtable->lock);
	return newtable;
}
2) Check duplicate:
This function takes the contents of a block and reports whether another block with the same contents already exists. It internally invokes the cbfs_compute_checksum() function to calculate the checksum and then compares it with the checksum fields in the hash table. Given below is the code snippet for the cbfs_check_duplicate_block() function:
long cbfs_check_duplicate_block(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		char *data, long new_blk_no) {
	long err = 0;
	char *checksum;
	long blk_no;
	struct cbfs_hash_node *node_c, *node_b;

	checksum = cbfs_compute_checksum(data);
	spin_lock(&checksum_htable->lock);
	node_c = cbfs_node_lookup_by_checksum(checksum_htable, checksum);
	node_b = cbfs_node_lookup_by_block(block_htable, new_blk_no);
	if (!node_c && !node_b) {
		/* Unique data, new block: record it. The node keeps the
		 * checksum buffer, so it must not be freed here. */
		cbfs_add_hash_entry(checksum_htable, block_htable,
				checksum, new_blk_no);
		err = new_blk_no;
		goto out;
	} else if (!node_c && node_b) {
		/* Unique data, existing block: caller must reallocate. */
		kfree(checksum);
		err = -1;
	} else {
		/* Duplicate content: share the existing block. */
		node_c->ref_count++;
		blk_no = node_c->block_number;
		kfree(checksum);
		err = blk_no;
		goto out;
	}
out:
	spin_unlock(&checksum_htable->lock);
	return err;
}
char* cbfs_compute_checksum(char *data) {
	char *checksum;

	checksum = kmalloc(CHECKSUM_SIZE, GFP_KERNEL);
	if (!checksum)
		return NULL;
	hmac(data, DATA_SIZE, KEY, KEY_SIZE, (void *)checksum);
	return checksum;
}
3) Add hash entry:
This function takes the checksum value and block number, adds the entry to the hash table, and initializes the reference count to 1. It is invoked when a new block is allocated and its contents are not found to be duplicates. The code snippet for the cbfs_add_hash_entry() function is given below:
void cbfs_add_hash_entry(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		char *checksum, long blk_no) {
	struct cbfs_hash_node *newnode;
	int checksum_bucket;
	int block_bucket;
	long *blk = &blk_no;

	newnode = kmem_cache_alloc(cacheptr, GFP_KERNEL);
	BUG_ON(!newnode);
	newnode->checksum = checksum;
	newnode->block_number = blk_no;
	newnode->ref_count = 1;
	INIT_LIST_HEAD(&newnode->block_ptr);
	INIT_LIST_HEAD(&newnode->checksum_ptr);
	checksum_bucket = checksum_htable->hash((void *)checksum);
	block_bucket = block_htable->hash((void *)blk);
	list_add_tail(&newnode->checksum_ptr,
		&checksum_htable->buckets[checksum_bucket].node_list);
	checksum_htable->len++;
	list_add_tail(&newnode->block_ptr,
		&block_htable->buckets[block_bucket].node_list);
	block_htable->len++;
}
4) Remove hash entry:
This function removes the hash entry if its reference count is 1, and decrements the reference count otherwise. It is called inside the cbfs_free_block() function; therefore, for every block freed in the file system this function is invoked, keeping the hash table up to date at every instant. The cbfs_remove_hash_entry() function is given below:
int cbfs_remove_hash_entry(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		long blk_no) {
	int err = 0;
	int checksum_bucket;
	int block_bucket;
	char *checksum = NULL;
	long *blk = &blk_no;
	struct list_head *pos1, *pos2;
	struct cbfs_hash_node *node;

	block_bucket = block_htable->hash((void *)blk);
	spin_lock(&block_htable->lock);
	list_for_each(pos1, &block_htable->buckets[block_bucket].node_list) {
		node = list_entry(pos1, struct cbfs_hash_node, block_ptr);
		if (node == NULL) {
			printk("\nNo node to free !!");
			err = 0;
			goto out;
		}
		if (node->block_number == blk_no) {
			if (node->ref_count == 1) {
				/* Last reference: unlink from the block table,
				 * then drop the matching checksum entry below. */
				checksum = node->checksum;
				list_del(pos1);
				block_htable->len--;
				goto cs;
			} else {
				/* Block still shared: drop one reference only. */
				node->ref_count--;
				err = -1;
				goto out;
			}
		}
	}
	goto out;
cs:
	checksum_bucket = checksum_htable->hash((void *)checksum);
	list_for_each(pos2,
			&checksum_htable->buckets[checksum_bucket].node_list) {
		node = list_entry(pos2, struct cbfs_hash_node, checksum_ptr);
		if (memcmp((void *)node->checksum, (void *)checksum,
				CHECKSUM_SIZE) == 0) {
			list_del(pos2);
			kmem_cache_free(cacheptr, node);
			checksum_htable->len--;
			err = 0;
			goto out;
		}
	}
out:
	spin_unlock(&block_htable->lock);
	return err;
}
5) Freeing the hash table:
This removes the entire hash table from memory. It is done when the CBFS module is removed from the kernel, in the cbfs_module_exit() function, which is called at module removal time. Given below is the code snippet for the cbfs_hash_free() function:
void cbfs_hash_free(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable) {
	int i;
	struct cbfs_hash_node *node;
	struct list_head *pos1, *pos2, *n;

	for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++) {
		list_for_each_safe(pos1, n,
				&checksum_htable->buckets[i].node_list) {
			node = list_entry(pos1, struct cbfs_hash_node,
					checksum_ptr);
			list_del(pos1);
		}
	}
	for (i = 0; i < NUM_BLOCK_BUCKETS; i++) {
		list_for_each_safe(pos2, n,
				&block_htable->buckets[i].node_list) {
			node = list_entry(pos2, struct cbfs_hash_node,
					block_ptr);
			list_del(pos2);
			/* Nodes are shared by both tables: free them once,
			 * on the second pass. */
			kmem_cache_free(cacheptr, node);
		}
	}
	free_pages((unsigned long)checksum_htable->buckets,
		get_order(NUM_CHECKSUM_BUCKETS *
			sizeof(struct cbfs_hash_bucket)));
	free_pages((unsigned long)block_htable->buckets,
		get_order(NUM_BLOCK_BUCKETS *
			sizeof(struct cbfs_hash_bucket)));
	printk("\nBuckets freed");
	kfree(checksum_htable);
	kfree(block_htable);
	printk("\nHash tables freed");
}
5.3 READ DATA FLOW :
Figure 5.3: Read data flow (sys_read() → VFS layer, which maps to the file-system-specific or generic read function → generic_file_read() → the inode pointers are checked to locate the blocks → the corresponding blocks are read from disk).
As illustrated in the above flow diagram, a read request travels through a series of layers and functions and finally searches the inode pointers for the disk blocks to read. The inode pointers are identical for redundant blocks, so such blocks are read from disk only once; every subsequent read of the same block is served from the cache. Performance is therefore improved by the content-based file system. Also, no extra complexity is added to the read path or the inode structure, which makes the content-based file system a more viable option.
5.4 WRITE DATA FLOW :
1) Unique Data (New block):
This is the most common write scenario in a system: a file is either created or appended. Here, a new block is allocated, the block address is added into the appropriate place in the inode pointers, and the contents to be written are copied from the user's address space
to the page that is mapped to that block. At this stage, the hash value of the new content available in the page is calculated.
This hash value is looked up in the checksum hash table. Since the data is unique, the lookup returns a miss. Therefore, the already-allocated new block's number, the hash value of the newly written content, and a reference count of 1 are added to the hash table. After this, the buffer is marked dirty and the disk write is carried out. These steps are performed by invoking the cbfs_check_duplicate_block() function.
Figure: Write data flow (sys_write() → VFS layer → generic_file_write() → generic_file_buffered_write() → cbfs_commit_write(), where the hash table is checked for duplicate blocks: if a duplicate is found the write is not performed, otherwise a normal write is done → exit with success).
2) Unique Data (Existing block):
Once it finds this, the next step is to determine whether the old version of the block had the same contents as the current version. To find this, the checksum of the current contents is computed and compared with the checksum stored in the hash table against that block number. If the checksums don't match, the block now contains a different piece of data and can no longer be identified by the old checksum.
3) Duplicate Data (New block):
Another write scenario is one where a new block is to be written and the contents of the to-be-written block are already present on the disk. Here, a new block is allocated, its address is added to the inode pointers, and the new content is copied to the page in memory that is mapped to the newly allocated block. At this stage, the hash value of the new content is calculated and looked up in the checksum hash table. Since the content is already present on the disk, the lookup finds the block holding the content and returns its block number; this block number is the mapped block. After this, the cbfs_free_branches() function is called, which removes the old block number from the inode pointers and frees the previously allocated block. Then the mapped block number is added to the inode pointer, and the reference count of the mapped block is incremented in the hash table.
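The whole duplicate-write path can be mimicked in a few lines of userspace C. The sketch below uses a toy table and a trivial polynomial sum standing in for the real 128-bit HMAC checksum; a second write of identical content is mapped onto the first block and only the reference count grows.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

struct toy_entry { unsigned long sum; long block; int ref; };
static struct toy_entry table[MAX_ENTRIES];
static int entries;
static long next_block = 100;   /* next free "disk block" number */

static unsigned long toy_sum(const char *data, int len)
{
    unsigned long s = 0;
    while (len--) s = s * 31 + (unsigned char)*data++;
    return s;
}

/* Returns the block number the content ends up in. */
long toy_write(const char *data, int len)
{
    unsigned long s = toy_sum(data, len);
    int i;
    for (i = 0; i < entries; i++)
        if (table[i].sum == s) {       /* duplicate: share the block */
            table[i].ref++;
            return table[i].block;
        }
    table[entries].sum = s;            /* unique: allocate a new block */
    table[entries].block = next_block++;
    table[entries].ref = 1;
    return table[entries++].block;
}
```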
BEFORE: (Figure: hash table entries with checksums and reference counts, and the inode's block pointers, before the duplicate write.)
AFTER: (Figure: hash table entries and inode pointers after the duplicate write; the inode now points to the existing block, whose reference count has been incremented.)
4) Duplicate Data (Overwrite existing block):
This is a comparatively rare and complex write case: an already existing block is to be modified, and the modified content happens to be already present on the disk. The overwrite may or may not require allocation of a new block, which is determined by checking for the block in the block hash table. If it is an existing block, its contents cannot be changed straightforwardly, because it might be a shared block. So the hash table is looked up and the reference count is checked. If the reference count is 1, the new checksum value is computed and the hash table entry for that block is updated with the new checksum. If the reference count is greater than 1, a duplicate-block check is made by calculating the new checksum and looking it up in the checksum hash table. Since the content is already present on the disk, the lookup finds the block holding the content and returns its block number; this block number is the mapped block.
Then the old block number is removed from the inode pointers and the mapped block is spliced in, and the reference count of the mapped block is incremented.
(Figure: hash table entries and inode pointers before and after a duplicate overwrite of an existing block; the old block's reference count is decremented and the mapped block's is incremented.)
5) Delete Data:
This is handled in the cbfs_free_block() function, which is responsible for freeing blocks of data and is called at the time of file truncation or a block remove. cbfs_free_block() invokes the cbfs_remove_hash_entry() function, which takes the block number as an argument and looks up the entry in the block hash table. It
finds the entry and checks the reference count. If the reference count is 1, the entry is removed from the hash table and 0 is returned to cbfs_free_block(). If the reference count is greater than 1, it is decremented and a negative value is returned instead. On receiving 0, cbfs_free_block() proceeds with actually freeing the block by clearing the bit in the block bitmap; on receiving the negative value, execution stops and the block is not actually freed.
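The contract between the two functions can be mimicked in userspace; toy_remove() below is a hypothetical stand-in for cbfs_remove_hash_entry(), returning 0 when the caller should clear the bitmap bit and -1 when the block is still shared (the return convention used by the cbfs_remove_hash_entry() code shown earlier).

```c
#include <assert.h>

struct toy_node { long block; int ref; int present; };

/* Returns 0 when the last reference is dropped (caller frees the
 * block), -1 when the block is still shared (caller must not free). */
int toy_remove(struct toy_node *n)
{
    if (n->ref == 1) {
        n->present = 0;   /* drop the hash entry */
        return 0;
    }
    n->ref--;
    return -1;
}
```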
5.5 DUPLICATE-ELIMINATED CACHE:
In the block-level duplicate elimination done here, the hash table is a global table, so blocks belonging to all the files in the disk partition are present in it. This means that two processes accessing two different files that share one or more blocks will have only a single copy of each shared block in the page cache. When the first file accesses the shared block, it is read from disk and made available in the page cache. When another process subsequently accesses that shared block, it first checks the page cache; since a page corresponding to the shared block is already there, a disk read is saved. In this way, duplicate elimination is carried into the page cache as well, which helps increase performance by reducing disk accesses.
6. EVALUATION:
6.1) Correctness:
The correctness of the project was checked with the help of a testing program that exhaustively exercises the transactions a file system can perform. The testing program checks the file system state in different situations, such as copying a duplicate file, copying a file with duplicate content, and modifying a file that has shared blocks. The testing program returned the expected results and the file system remained stable. Given below is the testing program and its output:
tester.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#define NUM_DATA 256
#define NUM_FILES 20
#define STAGE_SIZE 4096
#define FILE_SIZE 256
int main(int argc, char **argv) {
char **data;
int i=0, j=0,err=0,fp;
char filename[32];
data = (char **)malloc(sizeof(char*) * NUM_DATA);
for (i = 0; i < NUM_DATA; i++) {
printf("\nERROR: Write failed");
exit(1);
}
}
close(fp);
}
fp = open("nondup.dat", O_CREAT|O_WRONLY, 0644);
if (fp < 0) {
perror("tester");
printf("\nERROR: Cannot open file");
exit(1);
}
for (i=0; i
printf("\nERROR: Cannot open file");
exit(1);
}
for (j=0; j
printf("\nERROR: Write failed");
exit(1);
}
close(fp);
}
sync();
printf("\nPerforming non-duplicate overwrites ... (dup-
>nondup)");
fflush(stdout);
for (i=0; i
printf("\nPerforming duplicate overwrites ... (nondup-
>dup)");
fflush(stdout);
fp = open("nondup.dat", O_CREAT|O_RDWR, 0644);
if (fp < 0) {
perror("tester");
printf("\nERROR: Cannot open file");
exit(1);
}
for (i=0; i
6.2) Performance:
We conducted all tests on a virtual machine running on a 2.8 GHz Celeron processor with 1 GB of RAM and an 80 GB Western Digital Caviar IDE disk. The operating system was Fedora Core 4 running a 2.6.15 kernel.
We tested the content-based file system using the Postmark benchmark. As an I/O-intensive benchmark that tests the worst-case I/O performance of the file system, we ran Postmark [12]. Postmark stresses the file system by performing a series of operations such as directory lookups, creations, and deletions on small files. Postmark has three phases:
The file creation phase, which creates a working set of files,
The transactions phase, which involves creations, deletions, appends, and reads, and
The file deletion phase, which removes all files in the working set.
We configured Postmark to create 20,000 files (between 512 bytes and 10 KB) and perform 200,000 transactions. Figure 6.1 shows the results of Postmark on Ext2 and CBFS.
Postmark results (Figure 6.1: total time and transaction time for Ext2 and CBFS):
CODE SNIPPETS :
1) __CBFS_COMMIT_WRITE () :
static int
__cbfs_commit_write(struct inode *inode, struct page *page,
unsigned from, unsigned to) {
unsigned block_start, block_end;
int part = 0;
unsigned blocksize;
struct buffer_head *bh, *head;
sector_t iblock, block;
unsigned bbits;
long int allocated_block, mapped_block;
void *paddr;
int err = -EIO;
unsigned long goal;
int offsets[4];
Indirect chain[4];
Indirect *partial;
int boundary = 0;
int depth = 0;
	blocksize = 1 << inode->i_blkbits;
	bbits = inode->i_blkbits;
	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	/* Sanity check: the page-to-block mapping must agree. */
	if (unlikely(pno_to_blockno(inode, page->index) != iblock)) {
		printk(KERN_WARNING "\nCalculation wrong!!");
		BUG();
	}
	paddr = page_address(page);
	for (bh = head = page_buffers(page), block_start = 0;
	     bh != head || !block_start;
	     iblock++, block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				part = 1;
		}
else if (S_ISREG(inode->i_mode)) {
if (buffer_new(bh))
clear_buffer_new(bh);
recheck:
depth = cbfs_block_to_path(inode, iblock,
offsets, &boundary);
cbfs_get_branch(inode, depth, offsets, chain,
&err);
allocated_block = (long) chain[depth-1].key;
mapped_block =
cbfs_check_duplicate_block(c_htable, b_htable, (char *) paddr,
allocated_block);
if (mapped_block == allocated_block)
goto out;
else if (mapped_block < 0) {
cbfs_free_branches(inode, chain[depth-
1].p, chain[depth-1].p+1, 0);
cbfs_get_block(inode, iblock, bh, 1);
goto recheck;
}
else {
goal = mapped_block;
cbfs_free_branches(inode, chain[depth-1].p,
chain[depth-1].p+1, 0);
cbfs_get_block_direct(inode, iblock, bh, 1,
goal);
}
}
else {
out:
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
}
if (bh->b_blocknr != pno_to_blockno(inode, page->index))
{
printk("\nBUG: page mapping is screwed up! %ld, and
%ld", (long int)bh->b_blocknr,
(long int)pno_to_blockno(inode, page-
>index));
}
/*
* If this is a partial write which happened to make all
buffers
* uptodate then we can optimize away a bogus readpage()
for
* the next read(). Here we 'discover' whether the page
went
* uptodate as a result of this (potentially partial)
write.
*/
if (!part)
SetPageUptodate(page);
return 0;
}
2) CBFS_GET_BLOCK_DIRECT():
int cbfs_get_block_direct(struct inode *inode, sector_t
iblock, struct buffer_head *bh, int create, unsigned long goal)
{
int err = -EIO;
int offsets[4];
Indirect chain[4];
Indirect *partial;
int boundary = 0;
int depth = 0;
int left;
depth = cbfs_block_to_path(inode, iblock, offsets,
&boundary);
if (depth == 0)
goto out;
reread:
partial = cbfs_get_branch(inode, depth, offsets, chain,
&err);
/* Simplest case - block found, no allocation needed */
if (!partial) {
got_it:
map_bh(bh, inode->i_sb, le32_to_cpu(chain[depth-
1].key));
if (boundary)
set_buffer_boundary(bh);
/* Clean up and exit */
partial = chain+depth-1; /* the whole chain */
goto cleanup;
}
/* Next simple case - plain lookup or failed read of
indirect block */
if (err == -EIO) {
cleanup:
while (partial > chain) {
brelse(partial->bh);
partial--;
}
out:
return 0;
}
/*
* Indirect block might be removed by truncate while we
were
* reading it. Handling of that case (forget what we've
got and
* reread) is taken out of the main path.
*/
if (err == -EAGAIN)
goto changed;
left = (chain + depth) - partial;
err = cbfs_alloc_branch_direct(inode, left, goal,
offsets+(partial-chain), partial);
if (err)
goto cleanup;
if (cbfs_use_xip(inode->i_sb)) {
/*
* we need to clear the block
*/
err = cbfs_clear_xip_target (inode,
le32_to_cpu(chain[depth-1].key));
if (err)
goto cleanup;
}
if (cbfs_splice_branch_direct(inode, iblock, chain,
partial, left) < 0) {
goto changed;
}
set_buffer_new(bh);
goto got_it;
changed:
while (partial > chain) {
brelse(partial->bh);
partial--;
}
goto reread;
}
3)CBFS_ALLOC_BRANCH_DIRECT:
static int cbfs_alloc_branch_direct(struct inode *inode,
int num, unsigned long
goal,
int *offsets,
Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int n = 0;
int err = 0;
int i;
int parent;
parent = goal;
branch[0].key = cpu_to_le32(parent);
if (parent) for (n = 1; n < num; n++) {
struct buffer_head *bh;
/* Allocate the next block */
int nr = cbfs_alloc_block(inode, parent, &err);
if (!nr)
break;
branch[n].key = cpu_to_le32(nr);
/*
* Get buffer_head for parent block, zero it out and
set
* the pointer to new one, then send parent to disk.
*/
bh = sb_getblk(inode->i_sb, parent);
if (!bh) {
err = -EIO;
break;
}
lock_buffer(bh);
branch[n].bh = bh;
branch[n].p = (__le32 *) bh->b_data + offsets[n];
*branch[n].p = branch[n].key;
set_buffer_uptodate(bh);
unlock_buffer(bh);
mark_buffer_dirty_inode(bh, inode);
/* We used to sync bh here if IS_SYNC(inode).
* But we now rely upon generic_osync_inode()
* and b_inode_buffers. But not for directories.
*/
if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
sync_dirty_buffer(bh);
parent = nr;
}
if (n == num)
return 0;
/* Allocation failed, free what we already allocated */
for (i = 1; i < n; i++)
bforget(branch[i].bh);
return err;
}
4)CBFS_SPLICE_BRANCH_DIRECT():
static inline int cbfs_splice_branch_direct(struct inode
*inode,
long block,
Indirect chain[4],
Indirect *where,
int num)
{
struct cbfs_inode_info *ei = CBFS_I(inode);
int i;
/* Verify that place we are splicing to is still there
and vacant */
write_lock(&ei->i_meta_lock);
// if (!verify_chain(chain, where-1) || *where->p)
// goto changed;
/* That's it */
*where->p = where->key;
ei->i_next_alloc_goal = le32_to_cpu(where[0].key);
write_unlock(&ei->i_meta_lock);
/* We are done with atomic stuff, now do the rest of
housekeeping */
inode->i_ctime = CURRENT_TIME_SEC;
/* had we spliced it onto indirect block? */
if (where->bh)
mark_buffer_dirty_inode(where->bh, inode);
mark_inode_dirty(inode);
return 0;
changed:
write_unlock(&ei->i_meta_lock);
for (i = 1; i < num; i++)
bforget(where[i].bh);
for (i = 0; i < num; i++)
cbfs_free_blocks(inode, le32_to_cpu(where[i].key),
1);
return -EAGAIN;
}
5)CBFS_FREE_BLOCKS() :
void cbfs_free_blocks (struct inode * inode, unsigned long
block,
unsigned long count)
{
int i;
for (i = 0; i < count; i++) {
cbfs_free_block (inode, block, 1);
block++;
}
}
6)CBFS_FREE_BLOCK():
void cbfs_free_block (struct inode * inode, unsigned long
block,
unsigned long count)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head * bh2;
unsigned long block_group;
unsigned long bit;
unsigned long i;
unsigned long overflow;
struct super_block * sb = inode->i_sb;
struct cbfs_sb_info * sbi = CBFS_SB(sb);
struct cbfs_group_desc * desc;
struct cbfs_super_block * es = sbi->s_es;
struct cbfs_hash_node *node;
unsigned freed = 0, group_freed = 0;
int err = 0;
err = cbfs_remove_hash_entry(c_htable, b_htable, (long)
block);
if (err < 0) {
err = 1;
goto error_return;
}
if (block < le32_to_cpu(es->s_first_data_block) ||
block + count < block ||
block + count > le32_to_cpu(es->s_blocks_count)) {
cbfs_error (sb, "cbfs_free_blocks",
"Freeing blocks not in datazone - "
"block = %lu, count = %lu", block, count);
goto error_return;
}
cbfs_debug ("freeing block(s) %lu-%lu\n", block, block +
count - 1);
do_more:
overflow = 0;
block_group = (block - le32_to_cpu(es-
>s_first_data_block)) /
CBFS_BLOCKS_PER_GROUP(sb);
bit = (block - le32_to_cpu(es->s_first_data_block)) %
CBFS_BLOCKS_PER_GROUP(sb);
/*
* Check to see if we are freeing blocks across a group
* boundary.
*/
if (bit + count > CBFS_BLOCKS_PER_GROUP(sb)) {
overflow = bit + count - CBFS_BLOCKS_PER_GROUP(sb);
count -= overflow;
}
brelse(bitmap_bh);
bitmap_bh = read_block_bitmap(sb, block_group);
if (!bitmap_bh)
goto error_return;
desc = cbfs_get_group_desc (sb, block_group, &bh2);
if (!desc)
goto error_return;
if (in_range (le32_to_cpu(desc->bg_block_bitmap), block,
count) ||
in_range (le32_to_cpu(desc->bg_inode_bitmap), block,
count) ||
in_range (block, le32_to_cpu(desc->bg_inode_table),
sbi->s_itb_per_group) ||
in_range (block + count - 1, le32_to_cpu(desc-
>bg_inode_table),
sbi->s_itb_per_group))
cbfs_error (sb, "cbfs_free_blocks",
"Freeing blocks in system zones - "
"Block = %lu, count = %lu",
block, count);
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi,
block_group),
bit + i, bitmap_bh->b_data)) {
cbfs_error(sb, __FUNCTION__,
"bit already cleared for block %lu", block
+ i);
} else {
group_freed++;
}
block++;
}
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
group_release_blocks(sb, block_group, desc, bh2,
group_freed);
freed += group_freed;
if (overflow) {
block += count;
count = overflow;
printk("\nOverflow");
goto do_more;
}
error_return:
brelse(bitmap_bh);
release_blocks(sb, freed);
DQUOT_FREE_BLOCK(inode, freed);
}
7)INIT_CBFS_FS
static int __init init_cbfs_fs(void)
{
int err = init_cbfs_xattr();
c_htable = cbfs_init_checksum_hash_table(myhash_cs);
b_htable = cbfs_init_block_hash_table(myhash_b);
cacheptr = kmem_cache_create("nodespace", sizeof(struct
cbfs_hash_node), 0, 0, NULL, NULL);
if (err)
return err;
err = init_inodecache();
if (err)
goto out1;
err = register_filesystem(&cbfs_fs_type);
if (err)
goto out;
return 0;
out:
destroy_inodecache();
out1:
exit_cbfs_xattr();
return err;
}
8)EXIT_CBFS_FS:
static void __exit exit_cbfs_fs(void)
{
	cbfs_hash_free(c_htable, b_htable);
	unregister_filesystem(&cbfs_fs_type);
	destroy_inodecache();
	kmem_cache_destroy(cacheptr);
	exit_cbfs_xattr();
}
SCREENSHOTS :