Cbfs Report
8/3/2019 Cbfs Report
ABSTRACT
File systems abstract raw disk data into files and directories, thereby providing
an easy-to-use interface for user applications to store and retrieve persistent information.
Most common file systems that exist today treat file data as opaque entities, and they do not
utilize information about their contents to perform useful optimizations. For example, today's
file systems do not detect files with the same data. In this project, we design and implement a new
file system that understands the contents of the data it stores, to enable interesting
functionality. Specifically, we focus on detecting and eliminating duplicate data items across
files. Eliminating duplicate data has two key advantages: first, the disk just stores a single
copy of a data item even if multiple files share it, thereby saving storage space. Second, the
disk I/O required for reading and writing copies of the same data is eliminated, thereby
improving performance. To implement this, we plan to work on the existing Linux Ext2 file
system and reuse part of its source code. Through our implementation we hope to demonstrate
the utility of eliminating duplicate file data both in terms of space savings and performance
improvements.
TABLE OF CONTENTS
1. INTRODUCTION
2. MOTIVATION
3. BACKGROUND
3.1. File System Background
3.2. Overview of Linux VFS
3.3. Layout of the Ext2 File System
3.4. Content Based File System
4. DESIGN
4.1. Overall Structure
4.2. Detecting Duplicate Data
4.3. Tracking Content Information
4.4. Online Duplicate Elimination
4.5. Discussion
i. Concept in LBFS
ii. File-level Content Checking
5. IMPLEMENTATION
5.1. Data Structures
5.2. Hash Table Operations
5.2.1. Initializing the Hash Table
5.2.2. Compute Hash
5.2.3. Add Entry
5.2.4. Check Duplicate
5.2.5. Remove Entry
5.2.6. Free Hash Table
5.3. Read Data Flow
5.4. Write Data Flow
5.4.1. Overall Write Flow
5.4.2. Unique Data (New Block)
5.4.3. Unique Data (Overwrite Existing Block)
5.4.4. Duplicate Data (New Block)
5.4.5. Duplicate Data (Overwrite)
5.4.6. Delete Data
5.5. Duplicate Eliminated Cache
5.6. Code Snippets
6. EVALUATION
6.1. Correctness Check
6.2. Performance Check
6.3. Overall Analysis
7. RELATED WORK
8. FUTURE WORK
9. REFERENCES
INTRODUCTION:
A file system is a method for storing and organizing files and the data they contain to
make it easy to find and access them. The present file systems in Linux are content-oblivious:
they do not know the nature of the data present in the disk blocks. Therefore, even if two or
more blocks of data are identical, they are stored as redundant copies on the disk. This is
the case dealt with in this project; the outcome is block-level duplicate elimination on disk.
Content hashing is the method used to determine whether two blocks are redundant.
The MD5 algorithm is used to compute the hash value of each block in the disk. Before
each write to the disk, the hash value of the new data is calculated and compared against the
already existing hash values. If the same hash value is already present in the hash table, the
block is a duplicate.
This project is implemented in the Linux kernel 2.6, wherein the existing ext2 file
system functions are modified accordingly to accomplish the goal of duplicate elimination.
The evaluation of this project is done in two phases, namely the correctness check and the
performance check. A testing program that performs exhaustive file system operations is
executed, and the file system is found to be stable and to produce the expected results.
The performance check is done by evaluating the PostMark results.
This report details the design and implementation of the content-based file system.
Since the main change is done only in the write phase, each write scenario is explained in
detail with the status of the hash table illustrated by diagrams both before and after the program
code segment execution.
MOTIVATION:
The existing ext2 file system in Linux is devoid of any information regarding the
data on the disk. It cannot distinguish whether two blocks of data are unique or
identical. This means that two identical files will be stored in two separate locations on the disk,
requiring double the space of a single copy. This results not only in wasted
disk space but also in increased I/O: reading both files separately, holding two
identical pages of data in the cache, and so on. Let us consider some example cases wherein
this disadvantage proves to be a bigger problem.
Consider the case wherein virtual machines are used for testing purposes. When
Linux is installed in a virtual machine whose host OS is also Linux, all the packages of
the host's Linux distribution are already stored on the disk, and installing Linux in the
virtual machine stores the same set of packages separately on the
disk. The average size of the whole set of packages in a Linux distribution is around 5
gigabytes. Thus, about 10 gigabytes of disk space are needed to
have a Linux virtual machine with Linux as the host OS. When the virtual machine runs,
there will again be more I/O operations, resulting in degraded performance. But if the
duplicate blocks are eliminated, considerable disk space (about 5 gigabytes) can be
saved, and the reduced number of I/O operations improves performance to a good extent.
Therefore, if by some means the file system knows the content of the blocks on the
disk before it writes a new block, this disadvantage can very well be eliminated. Hashing is a
common technique to generate a fixed-size, practically unique hash for any arbitrary-sized input. Thus,
when the content of each data block in the disk is hashed, the blocks can easily be compared with
one another and the file system can control the read and write operations accordingly. Content
hashing in this project is done using the Message Digest Algorithm (MD5).
BACKGROUND:
1) File System Background:
A file system is a method for storing and organizing files and the data they contain
to make it easy to find and access them. More formally, it is a set of abstract data types that are
implemented for the storage, hierarchical organization, manipulation, navigation, access, and
retrieval of data.
A disk is just a group of sectors and tracks, so it can only perform operations at
the granularity of sectors, e.g., read a sector or write to a sector. But we need a
hierarchical structure for maintaining files. With just a disk, this cannot be done,
because the hard disk is merely a linear collection of bits (0s and 1s) arranged into tracks and
sectors. File systems serve this purpose: with a file system, we have an interface
between the user programs and the hard disk. We tell the file system to write a file to the disk,
and it is the file system that knows the disk structure and copies the blocks of data to the disk.
A file system treats the data from the disk as fixed-sized blocks that contain
information, but it has no semantic information pertaining to the data. It treats
all data, be it that of a file, a dentry, or an inode, in a single notion as a block of
data.
2) Overview of Linux VFS:
Linux includes the Virtual File System Switch (VFS) layer, which lies
between the applications and the various file systems. Every request to the disk, before
passing to the file system, goes through the VFS. It acts as a generalization layer over all the
underlying file systems. Its function is to locate the file system for a particular file from its file
object and then map the request to the file system-specific functions. Some common
operations, such as read and write, work in the same pattern for most file systems. Therefore, a
generic function is available for these types of operations and the VFS layer maps the request to
these generic functions. There are four basic objects in the VFS, namely:
Super Block Object
Inode Object
File Object
Dentry Object
Consider a process P1 that makes a read request for a file F1 stored in a disk partition
formatted with the Ext2 file system. Similarly, process P2 requests file F2 on an Ext3 file
system. Both requests get transferred to the corresponding system call (sys_read() in this
case). The system call handling routine transfers it to the VFS layer, which in turn
transfers the read request to the corresponding file system's read function. The VFS knows the
file system associated with any file from the file object that is passed to it. Some file systems
in turn map the basic operation requests to generic functions, which ultimately carry out
the request and return the result to the layer above.
3) Page Cache:
The page cache is the main disk cache used by the Linux kernel. In most cases, the
kernel refers to the page cache when reading from or writing to disk. New pages are added to
the page cache to satisfy User Mode processes read requests. If the page is not already in the
cache, a new entry is added to the cache and filled with the data read from the disk. If there is
enough free memory, the page is kept in the cache for an indefinite period of time and can then
be reused by other processes without accessing the disk.
Fig 3.1 : Overview of Linux Filesystems
Similarly, before writing a page of data to a block device, the kernel verifies whether
the corresponding page is already included in the cache; if not, a new entry is added to the
cache and filled with the data to be written on disk. The I/O data transfer does not start
immediately: the disk update is delayed for a few seconds, thus giving a chance to the
processes to further modify the data to be written (in other words, the kernel implements
deferred write operations).
Kernel code and kernel data structures don't need to be read from or written to disk.
Kernel designers have implemented the page cache to fulfill two main requirements:
Quickly locate a specific page containing data relative to a given owner. To take the
maximum advantage from the page cache, searching it should be a very fast operation.
Keep track of how every page in the cache should be handled when reading or writing
its content. For instance, reading a page from a regular file, a block device file, or a swap
area must be performed in different ways, thus the kernel must select the proper operation
depending on the page's owner.
The unit of information kept in the page cache is, of course, a whole page of data. A
page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by
a device number and a block number. Instead, a page in the page cache is identified by an
owner and by an index within the owner's data: usually, an inode and an offset inside the
corresponding file.
4) Buffer Pages:
In old versions of the Linux kernel, there were two different main disk caches: the
page cache, which stored whole pages of disk data resulting from accesses to the contents of the
disk files, and the buffer cache , which was used to keep in memory the contents of the blocks
accessed by the VFS to manage the disk-based file systems.
Starting from stable version 2.4.10, the buffer cache does not really exist anymore. In
fact, for reasons of efficiency, block buffers are no longer allocated individually; instead, they
are stored in dedicated pages called "buffer pages ," which are kept in the page cache.
Formally, a buffer page is a page of data associated with additional descriptors called
"buffer heads", whose main purpose is to quickly locate the disk address of each individual
block in the page. In fact, the chunks of data stored in a page belonging to the page cache are
not necessarily adjacent on disk.
Whenever the kernel must individually address a block, it refers to the buffer page
that holds the block buffer and checks the corresponding buffer head. Here are two common
cases in which the kernel creates buffer pages:
When reading or writing pages of a file that are not stored in contiguous disk blocks.
This happens either because the file system has allocated noncontiguous blocks to the file,
or because the file contains "holes".
When accessing a single disk block (for instance, when reading a superblock or an
inode block).
In the first case, the buffer page's descriptor is inserted in the radix tree of a regular
file. The buffer heads are preserved because they store precious information: the block device
and the logical block number that specify the position of the data on the disk.
In the second case, the buffer page's descriptor is inserted in the radix tree rooted at
the address_space object of the inode in the bdev special file system associated with the block
device. This kind of buffer page must satisfy a strong constraint: all the block buffers must
refer to adjacent blocks of the underlying block device.
An instance of where this is useful is when the VFS wants to read the 1,024-byte
inode block containing the inode of a given file. Instead of allocating a single buffer, the kernel
must allocate a whole page storing four buffers; these buffers will contain the data of a group of
four adjacent blocks on the block device, including the requested inode block.
All the block buffers within a single buffer page must have the same size; hence, on
the 80 x 86 architecture, a buffer page can include from one to eight buffers, depending on the
block size.
When a page acts as a buffer page, all buffer heads associated with its block buffers
are collected in a singly linked circular list. The private field of the descriptor of the buffer page
points to the buffer head of the first block in the page; every buffer head stores in the
b_this_page field a pointer to the next buffer head in the list. Moreover, every buffer head
stores the address of the buffer page's descriptor in the b_page field. Figure 3.2 shows a
buffer page containing four block buffers and the corresponding buffer heads.
Because the private field contains valid data, the PG_private flag of the page is also set; hence,
if the page contains disk data and the PG_private flag is set, then the page is a buffer page.
Notice, however, that other kernel components not related to the block I/O subsystem use the
private and PG_private fields for other purposes.
4) Writing Dirty Pages to Disk
The kernel keeps filling the page cache with pages containing data of block devices.
Whenever a process modifies some data, the corresponding page is marked as dirty that is, its
PG_dirty flag is set.
Figure3.2 : A buffer page including four buffers and their buffer heads
Unix systems allow the deferred writes of dirty pages into block devices, because
this noticeably improves system performance. Several write operations on a page in cache
could be satisfied by just one slow physical update of the corresponding disk sectors.
Moreover, write operations are less critical than read operations, because a process is usually
not suspended due to delayed writings, while it is most often suspended because of delayed
reads. Thanks to deferred writes, each physical block device will service, on the average, many
more read requests than write ones.
A dirty page might stay in main memory until the last possible moment, that is, until
system shutdown. However, pushing the delayed-write strategy to its limits has two major
drawbacks:
If a hardware or power supply failure occurs, the contents of RAM can no longer be
retrieved, so many file updates that were made since the system was booted are lost.
The size of the page cache, and hence of the RAM required to contain it, would have to
be huge, at least as big as the size of the accessed block devices.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache gets too full and more pages are needed, or the number of dirty pages
becomes too large.
Too much time has elapsed since a page has stayed dirty.
A process requests all pending changes of a block device or of a particular file to be
flushed; it does this by invoking a sync(), fsync(), or fdatasync() system call.
Buffer pages introduce a further complication. The buffer heads associated with each buffer
page allow the kernel to keep track of the status of each individual block buffer. The PG_dirty
flag of the buffer page should be set if at least one of the associated buffer heads has the
BH_Dirty flag set. When the kernel selects a dirty buffer page for flushing, it scans the
associated buffer heads and effectively writes to disk only the contents of the dirty blocks. As
soon as the kernel flushes all dirty blocks in a buffer page to disk, it clears the PG_dirty flag of
the page.
5) Layout of the Ext2 File System:
The first block in each Ext2 partition is never managed by the Ext2 file system,
because it is reserved for the partition boot sector. The rest of the Ext2 partition is split into
block groups, each of which has the layout shown in Figure 3.3. As you will
notice from the figure, some data structures must fit in exactly one block, while others may
require more than one block. All the block groups in the file system have the same size and are
stored sequentially, thus the kernel can derive the location of a block group on disk simply
from its integer index.
Figure 3.3 : Layouts of an Ext2 partition and of an Ext2 block group
Block groups reduce file fragmentation, because the kernel tries to keep the data blocks
belonging to a file in the same block group, if possible. Each block in a block group contains
one of the following pieces of information:
A copy of the file system's superblock
A copy of the block group descriptors
A data block bitmap
An inode bitmap
A table of inodes
A chunk of data that belongs to a file; i.e., data blocks
If a block does not contain any meaningful information, it is said to be free. As
seen from Figure 3.3, both the superblock and the group descriptors are duplicated in each
block group. Only the superblock and the group descriptors included in block group 0 are used
by the kernel, while the remaining superblocks and group descriptors are left unchanged; in
fact, the kernel doesn't even look at them. When the e2fsck program executes a consistency
check on the file system status, it refers to the superblock and the group descriptors stored in
block group 0, and then copies them into all other block groups. If data corruption occurs and the
main superblock or the main group descriptors in block group 0 become invalid, the system
administrator can instruct e2fsck to refer to the old copies of the superblock and the group
descriptors stored in a block group other than the first. Usually, the redundant copies store
enough information to allow e2fsck to bring the Ext2 partition back to a consistent state.
Figure 3.4 shows the actual mapping of an inode to the corresponding data
blocks in a single group.
|--Inode table---| |---Indirect blocks pointing to data blks---------| |---Data Blks----|
Fig 3.4 Inode pointers in Ext2 filesystem
As shown in the figure above, each entry in the inode table points to a specific
data block, and the contents of the data blocks are never examined. Therefore, multiple
copies of the same information may exist in many data blocks on the disk, and space is
wasted on them.
6) Data Block Addressing in Ext2:
Each nonempty regular file consists of a group of data blocks. Such blocks may be
referred to either by their relative position inside the file (their file block number) or by their
position inside the disk partition (their logical block number).
Deriving the logical block number of the corresponding data block from an offset f
inside a file is a two-step process:
1. Derive from the offset f the file block number, the index of the block that contains the
character at offset f.
2. Translate the file block number to the corresponding logical block number.
Because Unix files do not include any control characters, it is quite easy to derive the
file block number containing the f-th character of a file: simply take the quotient of f and the file
system's block size and round down to the nearest integer.
For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the
character is contained in the first data block of the file, which has file block number 0. If f is
equal to or greater than 4,096 and less than 8,192, the character is contained in the data block
that has file block number 1, and so on.
This is fine as far as file block numbers are concerned. However, translating a file
block number into the corresponding logical block number is not nearly as straightforward,
because the data blocks of an Ext2 file are not necessarily adjacent on disk.
The Ext2 file system must therefore provide a method to store the connection
between each file block number and the corresponding logical block number on disk. This
mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside
the inode. It also involves some specialized blocks that contain extra pointers, which are an
inode extension used to handle large files.
The i_block field in the disk inode is an array of EXT2_N_BLOCKS components
that contain logical block numbers. In the following discussion, we assume that
EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a
larger data structure, which is illustrated in Figure 3.5. As can be seen in the figure, the 15
components of the array are of 4 different types:
The first 12 components yield the logical block numbers corresponding to the first 12
blocks of the file, that is, to the blocks that have file block numbers from 0 to 11.
The component at index 12 contains the logical block number of a block, called the
indirect block, that represents a second-order array of logical block numbers. Its entries
correspond to the file block numbers ranging from 12 to b/4+11, where b is the file
system's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the
formula). Therefore, the kernel must look in this component for a pointer to a block, and
then look in that block for another pointer to the ultimate block that contains the file
contents.
The component at index 13 contains the logical block number of an indirect block
containing a second-order array of logical block numbers; in turn, the entries of this
second-order array point to third-order arrays, which store the logical block numbers that
correspond to the file block numbers ranging from b/4+12 to (b/4)^2+(b/4)+11.
Finally, the component at index 14 uses triple indirection: the fourth-order arrays store
the logical block numbers corresponding to the file block numbers ranging from
(b/4)^2+(b/4)+12 to (b/4)^3+(b/4)^2+(b/4)+11.
Figure 3.5 : Data structures used to address the file's data blocks
In Figure 3.5, the number inside a block represents the corresponding file block number. The
arrows, which represent logical block numbers stored in array components, show how the
kernel finds its way through indirect blocks to reach the block that contains the actual contents
of the file.
Notice how this mechanism favors small files. If the file does not require more than 12 data
blocks, the data can be retrieved in two disk accesses: one to read a component in the i_block
array of the disk inode and the other to read the requested data block. For larger files, however,
three or even four consecutive disk accesses may be needed to access the required block. In
practice, this is a worst-case estimate, because the dentry, inode, and page caches contribute
significantly to reduce the number of real disk accesses.
Notice also how the block size of the file system affects the addressing mechanism, because a
larger block size allows Ext2 to store more logical block numbers inside a single block.
Table 1 shows the upper limit placed on a file's size for each block size and each addressing
mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of
data, the first 12 KB of the file can be accessed through direct mapping and the remaining 13-268
KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-
bit architectures by specifying the O_LARGEFILE opening flag.
Table 1. File-size upper limits for data block addressing
Block size Direct 1-Indirect 2-Indirect 3-Indirect
1,024 12 KB 268 KB 64.26 MB 16.06 GB
2,048 24 KB 1.02 MB 513.02 MB 256.5 GB
4,096 48 KB 4.04 MB 4 GB ~ 4 TB
THE CONTENT-BASED FILE SYSTEM:
The content-based file system employs a technique called content hashing to
compare the contents of blocks and find whether they are duplicates. The Message Digest
Algorithm is used to compute the hash value of each block. For every data block in the disk, the
corresponding hash value is calculated. At any given instant, the hash table will have one
entry for every valid data block in the disk. There are two hash table structures: the
checksum hash table, which is indexed by the checksum field, and the block hash table, which is
indexed by the block number field. Both structures point to a single copy of the hash
node.
In the context of the file system, there is no change in any of the inode data
structures. The only difference is that the inode table may have multiple entries pointing to the
same block number on the disk. For example, if a 50-megabyte file is created in a partition
having the content-based file system, and the file contains just 40 MB of unique
information with the remaining 10 MB being duplicate data, then the inode table for that
file will have the same number of pointers as a fully unique 50 MB file. But the inode
table will have duplicate pointers for the remaining 10 MB, and therefore the essential space
occupied by the file is just 40 MB. In this way, disk space can be saved to a good
extent. When the file is read, the inode table is accessed in the normal way and the
corresponding data blocks are read. When a data block has already been read from the disk and
is present in the buffer cache, then, if the same block is needed again, the file system does not
need to invoke a disk read again; it can just use the copy of the data in the buffer cache. In this
way, through better cache hits, performance can be enhanced during read operations. Therefore,
for any operation like file create, file overwrite, file append, or truncation, before the disk write
operation is invoked, the checksum values are compared; only if there is no duplicate block is
the new block written. Otherwise, no new data block is written and the disk usage remains the
same. As a result, at any given instant, the disk holds only a single copy of any data and
there is no place for redundancy. In addition to efficient utilization of disk space, content
hashing also helps in maintaining data integrity. In certain cases, this technique can save a
considerable amount of disk space and also increase the performance of the operations.
codes. These codes introduce little overhead and are thus inexpensive to compute. But they
are not collision resistant, which means there may be duplicate blocks that are easily missed
by the error detection codes. Therefore, this technique cannot be relied upon to detect
duplicate blocks. Considering these cases, collision-resistant hashing proves to be a more
viable method for accomplishing the goal of detecting duplicate blocks.
Collision-Resistant Hashing:
Collision-resistant hashing is a technique by which a unique hash value is generated
for each unique block content. The Message Digest Algorithm 5 (MD5) is used for this
purpose. The MD5 algorithm is widely used in many cryptographic applications, and it is
found to be more collision resistant than its predecessors. It takes an input of arbitrary
length and produces an MD5 hash of 128 bits. For any unique input, a unique MD5 hash is
produced. This is a one-way hashing technique wherein, from given data, its corresponding
hash value can very well be calculated, but given an MD5 hash, it is not possible to derive the
input data. We employ this technique to compare the contents of blocks. As the
MD5 hashing algorithm gives a unique hash for every unique input, it can be said that no
two distinct blocks of data will give rise to the same hash value. Therefore, if the hash value is
calculated for every data block, then comparing the hash values amounts to comparing the
actual blocks of data.
4.3 TRACKING CONTENT INFORMATION:
The hash table is used to track the content information pertaining to the various disk
data blocks. For every unique content in the disk, there is an entry in the hash table. The
hash table provides a checksum-to-block-number mapping, which is used to locate duplicate
blocks (if any) before any write operation takes place on the disk. The overall structure of the
hash table is illustrated by the following figure.
Figure 4.3: Hash table structure
(Checksum Hash Table | Disk Blocks | Block Hash Table)
4.4 ONLINE DUPLICATE ELIMINATION:
WRITE SCENARIOS:
1) Unique Data (New Block):
This happens in the case of file creation or a file append. A new block is
allocated and the data is copied to it. Before writing the data to the
block, the hash table is checked and the data is found to be new. Therefore, the new block
entry, its checksum value, and a reference count of 1 are added to the hash table, and the
normal write operation continues.
2) Unique Data (Existing Block):
This happens in the case of file modification. When a file is modified, one or
more blocks corresponding to the file get modified, and before writing them to the disk, the
routine redundancy check is performed. In this case, the new checksum does not
find an entry in the hash table, but there is already an entry for that block
with a different checksum. Therefore, the reference count of that block is
decremented, a new block is allocated, the new contents are copied to it, and the hash table
is updated for the new block.
3) Duplicate Data (New Block):
This happens in the case of file creation or a file append, wherein the efficiency
of the content-based file system is utilized. In this case, before writing to the disk, the hash
table is checked for a duplicate block, and that existing block is mapped to the
corresponding inode pointer. The reference count of the block is incremented.
4) Duplicate Data (Existing Block):
This is a case where an existing block is modified and the modified
contents are found to be duplicates. In this case, the reference count of the old block is
decremented and the reference count of the already existing duplicate block is incremented.
Then, the corresponding inode pointers are updated with that block's number.
5) Delete Data:
This occurs when part of a file's contents is deleted, or when a file is removed. Only the reference count is decremented; when the reference count reaches 0, the block is freed and the hash entry is removed.
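The reference-count rule above can be illustrated with a minimal userspace sketch (hypothetical names, not the kernel routines):

```c
#include <assert.h>

/* Minimal sketch of the delete rule: every share bumps the count, every
 * delete drops it, and the block is actually freed (and its hash entry
 * removed) only when the count reaches zero. */
struct toy_block { int ref_count; int freed; };

void toy_share(struct toy_block *b)  { b->ref_count++; }

void toy_delete(struct toy_block *b)
{
    if (--b->ref_count == 0)
        b->freed = 1;   /* free the disk block, drop the hash entry */
}
```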
DISCUSSION:
Comparison with LBFS:
LBFS is a network file system that conserves communication bandwidth between clients and servers by taking advantage of cross-file similarities. When transferring a file between the client and server, LBFS identifies chunks of data that the recipient already has in other files and avoids transmitting the redundant data over the network. No block-level redundancy check is performed; instead, variable-sized chunks of data are compared to detect duplication. When a modification is made to shared data, the chunk size grows and the kernel must be changed to handle that, which is more complex to implement than the block-level redundancy check performed in this project.
File-Level Redundancy Check:
Another technique, analogous to block-level hashing, is file-level redundancy checking, in which whole files are compared and redundant files are eliminated. The usefulness of this technique is limited by the availability of fully redundant files: the advantage is felt only when a disk partition holds more than one copy of the same file. Block-level redundancy checking, by contrast, eliminates duplicate blocks in an inter-file setting. Even when two files differ as a whole, duplicate elimination can still be applied to the blocks they share. Its benefit is therefore felt much more frequently, since redundant blocks may be spread over many different files.
Therefore, block-level content hashing and redundancy elimination is a beneficial method that yields efficient disk-space usage as well as better performance, by reducing the number of disk accesses and achieving a better cache-hit rate.
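The inter-file advantage argued above can be demonstrated concretely: two buffers that differ as a whole can still share individual fixed-size blocks. The sketch below uses a 4-byte "block" for brevity; the real file system works on disk blocks.

```c
#include <assert.h>
#include <string.h>

/* Count how many of a's blocks also occur somewhere in b: each such
 * block would need to be stored only once under block-level dedup. */
#define BLK 4

int shared_blocks(const char *a, int na, const char *b, int nb)
{
    int i, j, count = 0;
    for (i = 0; i + BLK <= na; i += BLK)
        for (j = 0; j + BLK <= nb; j += BLK)
            if (memcmp(a + i, b + j, BLK) == 0) {
                count++;    /* duplicate block found in the other file */
                break;
            }
    return count;
}
```

File-level checking would find nothing to eliminate here, since the files are not identical; block-level checking still deduplicates the shared blocks.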
5. IMPLEMENTATION:
Figure 5.0: Control logic of CBFS (on each write, the checksum table is checked for a duplicate: if one is found, the page mapping is changed, the hash table is updated, the inode pointer is changed, and the write exits with success; if not found, the block hash table is checked, a stale hash entry is removed or a new one is added, and a new block is allocated before the write completes).
5.1 DATA STRUCTURES:
1) Hash Table Structure:
There are two hash table structures: a checksum hash table (indexed by the checksum field) and a block hash table (indexed by the block number field). Both hash tables point to a single copy of each hash node. The following figure illustrates the overall structure of the hash tables.
Figure 5.1: Structure of hash tables
2) Components of the Hash Table:
Each hash node comprises three fields: a 128-bit checksum, a block number (the logical block number on the disk), and the reference count of the block. The block number field locates the physical block on the disk. Therefore, for every valid data block on the disk there is an entry in the hash table.
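A plausible C rendering of the node just described is given below. The field sizes are taken from the text (a 128-bit checksum); the real kernel structure additionally carries two list_head links, one for each table, which are omitted here.

```c
#include <assert.h>
#include <string.h>

#define CHECKSUM_SIZE 16  /* 128 bits, as stated in the text */

/* One shared node, reachable from both the checksum-indexed and the
 * block-indexed table. */
struct hash_node {
    unsigned char checksum[CHECKSUM_SIZE]; /* content checksum          */
    long block_number;                     /* logical disk block number */
    int ref_count;                         /* files sharing this block  */
};
```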
5.2 HASH TABLE OPERATIONS:
There are two hash table pointers, checksum_htable and block_htable. The operations associated with the hash tables are the following.
1) Initializing the hash table:
Initializing a hash table involves creating the hash table structure and the hash buckets. This is done at the time of mounting the content-based file system, in the kernel function cbfs_fill_super(). The code snippets for initializing the two hash tables are given below:
struct cbfs_hash_table* cbfs_init_checksum_hash_table(int (*hash)(void *)) {
	int i;
	struct cbfs_hash_table *newtable;

	newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
	BUG_ON(!newtable);
	memset(newtable, 0, sizeof(struct cbfs_hash_table));
	newtable->buckets = (struct cbfs_hash_bucket *)
		__get_free_pages(GFP_KERNEL,
			get_order(NUM_CHECKSUM_BUCKETS *
				sizeof(struct cbfs_hash_bucket)));
	BUG_ON(!newtable->buckets);
	memset(newtable->buckets, 0,
		NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket));
	for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++)
		INIT_LIST_HEAD(&newtable->buckets[i].node_list);
	newtable->hash = hash;
	spin_lock_init(&newtable->lock);
	return newtable;
}
struct cbfs_hash_table* cbfs_init_block_hash_table(int (*hash)(void *)) {
	int i;
	struct cbfs_hash_table *newtable;

	newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
	BUG_ON(!newtable);
	memset(newtable, 0, sizeof(struct cbfs_hash_table));
	newtable->buckets = (struct cbfs_hash_bucket *)
		__get_free_pages(GFP_KERNEL,
			get_order(NUM_BLOCK_BUCKETS *
				sizeof(struct cbfs_hash_bucket)));
	BUG_ON(!newtable->buckets);
	memset(newtable->buckets, 0,
		NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket));
	for (i = 0; i < NUM_BLOCK_BUCKETS; i++)
		INIT_LIST_HEAD(&newtable->buckets[i].node_list);
	newtable->hash = hash;
	spin_lock_init(&newtable->lock);
	return newtable;
}
2) Check duplicate:
This function takes the contents of a block and reports whether another block with the same contents already exists. It internally invokes the cbfs_compute_checksum() function to calculate the checksum and then compares it with the checksum fields in the hash table. Given below is the code snippet for the cbfs_check_duplicate_block() function:
long cbfs_check_duplicate_block(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		char *data, long new_blk_no) {
	long err = 0;
	char *checksum;
	long blk_no;
	struct cbfs_hash_node *node_c, *node_b;

	checksum = cbfs_compute_checksum(data);
	spin_lock(&checksum_htable->lock);
	node_c = cbfs_node_lookup_by_checksum(checksum_htable, checksum);
	node_b = cbfs_node_lookup_by_block(block_htable, new_blk_no);
	if (!node_c && !node_b) {
		/* Unique data, new block: record it. The node keeps the
		 * checksum buffer, so it must not be freed here. */
		cbfs_add_hash_entry(checksum_htable, block_htable,
				checksum, new_blk_no);
		err = new_blk_no;
		goto out;
	} else if (!node_c && node_b) {
		/* Unique data, existing block: caller must reallocate. */
		kfree(checksum);
		err = -1;
	} else {
		/* Duplicate content: share the existing block. */
		node_c->ref_count++;
		blk_no = node_c->block_number;
		kfree(checksum);
		err = blk_no;
		goto out;
	}
out:
	spin_unlock(&checksum_htable->lock);
	return err;
}
char* cbfs_compute_checksum(char *data) {
	char *checksum;

	checksum = kmalloc(CHECKSUM_SIZE, GFP_KERNEL);
	if (!checksum)
		return NULL;
	hmac(data, DATA_SIZE, KEY, KEY_SIZE, (void *)checksum);
	return checksum;
}
3) Add hash entry:
This function takes the checksum value and block number, adds the entry to the hash table, and initializes the reference count to 1. It is invoked when a new block is allocated and its contents are not found to be duplicates. The code snippet for the cbfs_add_hash_entry() function is given below:
void cbfs_add_hash_entry(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		char *checksum, long blk_no) {
	struct cbfs_hash_node *newnode;
	int checksum_bucket;
	int block_bucket;
	long *blk = &blk_no;

	newnode = kmem_cache_alloc(cacheptr, GFP_KERNEL);
	BUG_ON(!newnode);
	newnode->checksum = checksum;
	newnode->block_number = blk_no;
	newnode->ref_count = 1;
	INIT_LIST_HEAD(&newnode->block_ptr);
	INIT_LIST_HEAD(&newnode->checksum_ptr);
	checksum_bucket = checksum_htable->hash((void *)checksum);
	block_bucket = block_htable->hash((void *)blk);
	list_add_tail(&newnode->checksum_ptr,
		&checksum_htable->buckets[checksum_bucket].node_list);
	checksum_htable->len++;
	list_add_tail(&newnode->block_ptr,
		&block_htable->buckets[block_bucket].node_list);
	block_htable->len++;
}
4) Remove hash entry:
This function removes the hash entry if its reference count is 1, and decrements the reference count otherwise. It is called inside the cbfs_free_block() function; therefore, for every block freed in the file system this function is invoked, keeping the hash table up to date at every instant. The cbfs_remove_hash_entry() function is given below:
int cbfs_remove_hash_entry(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable,
		long blk_no) {
	int err = 0;
	int checksum_bucket;
	int block_bucket;
	char *checksum = NULL;
	long *blk = &blk_no;
	struct list_head *pos1, *pos2;
	struct cbfs_hash_node *node;

	block_bucket = block_htable->hash((void *)blk);
	spin_lock(&block_htable->lock);
	list_for_each(pos1, &block_htable->buckets[block_bucket].node_list) {
		node = list_entry(pos1, struct cbfs_hash_node, block_ptr);
		if (node == NULL) {
			printk("\nNo node to free !!");
			err = 0;
			goto out;
		}
		if (node->block_number == blk_no) {
			if (node->ref_count == 1) {
				/* Last reference: unlink from the block table,
				 * then drop the matching checksum entry below. */
				checksum = node->checksum;
				list_del(pos1);
				block_htable->len--;
				goto cs;
			} else {
				/* Block still shared: drop one reference only. */
				node->ref_count--;
				err = -1;
				goto out;
			}
		}
	}
	goto out;
cs:
	checksum_bucket = checksum_htable->hash((void *)checksum);
	list_for_each(pos2,
			&checksum_htable->buckets[checksum_bucket].node_list) {
		node = list_entry(pos2, struct cbfs_hash_node, checksum_ptr);
		if (memcmp((void *)node->checksum, (void *)checksum,
				CHECKSUM_SIZE) == 0) {
			list_del(pos2);
			kmem_cache_free(cacheptr, node);
			checksum_htable->len--;
			err = 0;
			goto out;
		}
	}
out:
	spin_unlock(&block_htable->lock);
	return err;
}
5) Freeing the hash table:
This removes the entire hash table from memory. It is done when the CBFS module is removed from the kernel, in the cbfs_module_exit() function, which is called at module removal time. Given below is the code snippet for the cbfs_hash_free() function:
void cbfs_hash_free(struct cbfs_hash_table *checksum_htable,
		struct cbfs_hash_table *block_htable) {
	int i;
	struct cbfs_hash_node *node;
	struct list_head *pos1, *pos2, *n;

	for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++) {
		list_for_each_safe(pos1, n,
				&checksum_htable->buckets[i].node_list) {
			node = list_entry(pos1, struct cbfs_hash_node,
					checksum_ptr);
			list_del(pos1);
		}
	}
	for (i = 0; i < NUM_BLOCK_BUCKETS; i++) {
		list_for_each_safe(pos2, n,
				&block_htable->buckets[i].node_list) {
			node = list_entry(pos2, struct cbfs_hash_node,
					block_ptr);
			list_del(pos2);
			/* Nodes are shared by both tables: free them once,
			 * on the second pass. */
			kmem_cache_free(cacheptr, node);
		}
	}
	free_pages((unsigned long)checksum_htable->buckets,
		get_order(NUM_CHECKSUM_BUCKETS *
			sizeof(struct cbfs_hash_bucket)));
	free_pages((unsigned long)block_htable->buckets,
		get_order(NUM_BLOCK_BUCKETS *
			sizeof(struct cbfs_hash_bucket)));
	printk("\nBuckets freed");
	kfree(checksum_htable);
	kfree(block_htable);
	printk("\nHash tables freed");
}
5.3 READ DATA FLOW :
Figure 5.3: Read data flow (sys_read() → VFS layer, which maps to the file-system-specific or generic read function → generic_file_read() → the inode pointers are checked to locate the blocks → the corresponding blocks are read from disk).
As illustrated in the above flow diagram, a read request travels through a series of layers and functions and finally searches the inode pointers for the disk blocks to read. The inode pointers are identical for redundant blocks, so such blocks are read from disk only once; every subsequent read of the same block is served from the cache. Performance is therefore improved by the content-based file system. Also, no extra complexity is added to the read path or the inode structure, which makes the content-based file system a more viable option.
5.4 WRITE DATA FLOW :
1) Unique Data (New block):
This is the most common write scenario in a system: a file is either created or appended. Here, a new block is allocated, the block address is added into the appropriate place in the inode pointers, and the contents to be written are copied from the user's address space
to the page that is mapped to that block. At this stage, the hash value of the new content available in the page is calculated.
This hash value is looked up in the checksum hash table. Since the data is unique, the lookup returns a miss. Therefore, the already-allocated new block's number, the hash value of the newly written content, and a reference count of 1 are added to the hash table. After this, the buffer is marked dirty and the disk write is carried out. These steps are performed by invoking the cbfs_check_duplicate_block() function.
Figure: Write data flow (sys_write() → VFS layer → generic_file_write() → generic_file_buffered_write() → cbfs_commit_write(), where the hash table is checked for duplicate blocks: if a duplicate is found the write is not performed, otherwise a normal write is done → exit with success).
2) Unique Data (Existing block):
Once it finds this, the next step is to determine whether the old version of the block had the same contents as the current version. To find this, the checksum of the current contents is computed and compared with the checksum stored in the hash table against that block number. If the checksums don't match, the block now contains a different piece of data and can no longer be identified by the old checksum.
3) Duplicate Data (New block):
Another write scenario is one where a new block is to be written and the contents of the to-be-written block are already present on the disk. Here, a new block is allocated, its address is added to the inode pointers, and the new content is copied to the page in memory that is mapped to the newly allocated block. At this stage, the hash value of the new content is calculated and looked up in the checksum hash table. Since the content is already present on the disk, the lookup finds the block holding the content and returns its block number; this block number is the mapped block. After this, the cbfs_free_branches() function is called, which removes the old block number from the inode pointers and frees the previously allocated block. Then the mapped block number is added to the inode pointer, and the reference count of the mapped block is incremented in the hash table.
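The whole duplicate-write path can be mimicked in a few lines of userspace C. The sketch below uses a toy table and a trivial polynomial sum standing in for the real 128-bit HMAC checksum; a second write of identical content is mapped onto the first block and only the reference count grows.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

struct toy_entry { unsigned long sum; long block; int ref; };
static struct toy_entry table[MAX_ENTRIES];
static int entries;
static long next_block = 100;   /* next free "disk block" number */

static unsigned long toy_sum(const char *data, int len)
{
    unsigned long s = 0;
    while (len--) s = s * 31 + (unsigned char)*data++;
    return s;
}

/* Returns the block number the content ends up in. */
long toy_write(const char *data, int len)
{
    unsigned long s = toy_sum(data, len);
    int i;
    for (i = 0; i < entries; i++)
        if (table[i].sum == s) {       /* duplicate: share the block */
            table[i].ref++;
            return table[i].block;
        }
    table[entries].sum = s;            /* unique: allocate a new block */
    table[entries].block = next_block++;
    table[entries].ref = 1;
    return table[entries++].block;
}
```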
BEFORE: (Figure: hash table entries with checksums and reference counts, and the inode's block pointers, before the duplicate write.)
AFTER: (Figure: hash table entries and inode pointers after the duplicate write; the inode now points to the existing block, whose reference count has been incremented.)
4) Duplicate Data (Overwrite existing block):
This is a comparatively rare and complex write case: an already existing block is to be modified, and the modified content happens to be already present on the disk. The overwrite may or may not require allocation of a new block, which is determined by checking for the block in the block hash table. If it is an existing block, its contents cannot be changed straightforwardly, because it might be a shared block. So the hash table is looked up and the reference count is checked. If the reference count is 1, the new checksum value is computed and the hash table entry for that block is updated with the new checksum. If the reference count is greater than 1, a duplicate-block check is made by calculating the new checksum and looking it up in the checksum hash table. Since the content is already present on the disk, the lookup finds the block holding the content and returns its block number; this block number is the mapped block.
Then the old block number is removed from the inode pointers and the mapped block is spliced in, and the reference count of the mapped block is incremented.
(Figure: hash table entries and inode pointers before and after a duplicate overwrite of an existing block; the old block's reference count is decremented and the mapped block's is incremented.)
5) Delete Data:
This is handled in the cbfs_free_block() function, which is responsible for freeing blocks of data and is called at the time of file truncation or a block remove. cbfs_free_block() invokes the cbfs_remove_hash_entry() function, which takes the block number as an argument and looks up the entry in the block hash table. It
finds the entry and checks the reference count. If the reference count is 1, the entry is removed from the hash table and 0 is returned to cbfs_free_block(). If the reference count is greater than 1, it is decremented and a negative value is returned instead. On receiving 0, cbfs_free_block() proceeds with actually freeing the block by clearing the bit in the block bitmap; on receiving the negative value, execution stops and the block is not actually freed.
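The contract between the two functions can be mimicked in userspace; toy_remove() below is a hypothetical stand-in for cbfs_remove_hash_entry(), returning 0 when the caller should clear the bitmap bit and -1 when the block is still shared (the return convention used by the cbfs_remove_hash_entry() code shown earlier).

```c
#include <assert.h>

struct toy_node { long block; int ref; int present; };

/* Returns 0 when the last reference is dropped (caller frees the
 * block), -1 when the block is still shared (caller must not free). */
int toy_remove(struct toy_node *n)
{
    if (n->ref == 1) {
        n->present = 0;   /* drop the hash entry */
        return 0;
    }
    n->ref--;
    return -1;
}
```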
5.5 DUPLICATE-ELIMINATED CACHE:
In the block-level duplicate elimination done here, the hash table is a global table, so blocks belonging to all the files in the disk partition are present in it. This means that two processes accessing two different files that share one or more blocks will have only a single copy of each shared block in the page cache. When the first file accesses the shared block, it is read from disk and made available in the page cache. When another process subsequently accesses that shared block, it first checks the page cache; since a page corresponding to the shared block is already there, a disk read is saved. In this way, duplicate elimination is carried into the page cache as well, which helps increase performance by reducing disk accesses.
6. EVALUATION:
6.1) Correctness:
The correctness of the project was checked with the help of a testing program that exhaustively exercises the transactions a file system can perform. The testing program checks the file system state in different situations, such as copying a duplicate file, copying a file with duplicate content, and modifying a file that has shared blocks. The testing program returned the expected results and the file system remained stable. Given below is the testing program and its output:
tester.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#define NUM_DATA 256
#define NUM_FILES 20
#define STAGE_SIZE 4096
#define FILE_SIZE 256
int main(int argc, char **argv) {
char **data;
int i=0, j=0,err=0,fp;
char filename[32];
data = (char **)malloc(sizeof(char*) * NUM_DATA);
for (i = 0; i < NUM_DATA; i++) {
printf("\nERROR: Write failed");
exit(1);
}
}
close(fp);
}
fp = open("nondup.dat", O_CREAT|O_WRONLY, 0644);
if (fp < 0) {
perror("tester");
printf("\nERROR: Cannot open file");
exit(1);
}
for (i=0; i
printf("\nERROR: Cannot open file");
exit(1);
}
for (j=0; j
printf("\nERROR: Write failed");
exit(1);
}
close(fp);
}
sync();
printf("\nPerforming non-duplicate overwrites ... (dup-
>nondup)");
fflush(stdout);
for (i=0; i
printf("\nPerforming duplicate overwrites ... (nondup-
>dup)");
fflush(stdout);
fp = open("nondup.dat", O_CREAT|O_RDWR, 0644);
if (fp < 0) {
perror("tester");
printf("\nERROR: Cannot open file");
exit(1);
}
for (i=0; i
6.2) Performance:
We conducted all tests on a virtual machine running on a 2.8 GHz Celeron processor with 1 GB of RAM and an 80 GB Western Digital Caviar IDE disk. The operating system was Fedora Core 4 running a 2.6.15 kernel.
We tested the content-based file system using the Postmark benchmark. As an I/O-intensive benchmark that tests the worst-case I/O performance of the file system, we ran Postmark [12]. Postmark stresses the file system by performing a series of operations such as directory lookups, creations, and deletions on small files. Postmark has three phases:
The file creation phase, which creates a working set of files,
The transactions phase, which involves creations, deletions, appends, and reads, and
The file deletion phase, which removes all files in the working set.
We configured Postmark to create 20,000 files (between 512 bytes and 10 KB) and perform 200,000 transactions. Figure 6.1 shows the results of Postmark on Ext2 and CBFS.
Postmark results (Figure 6.1: total time and transaction time for Ext2 and CBFS):
CODE SNIPPETS :
1) __CBFS_COMMIT_WRITE () :
static int
__cbfs_commit_write(struct inode *inode, struct page *page,
unsigned from, unsigned to) {
unsigned block_start, block_end;
int part = 0;
unsigned blocksize;
struct buffer_head *bh, *head;
sector_t iblock, block;
unsigned bbits;
long int allocated_block, mapped_block;
void *paddr;
int err = -EIO;
unsigned long goal;
int offsets[4];
Indirect chain[4];
Indirect *partial;
int boundary = 0;
int depth = 0;
	blocksize = 1 << inode->i_blkbits;
	bbits = inode->i_blkbits;
	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	/* Sanity check: the page-to-block mapping must agree. */
	if (unlikely(pno_to_blockno(inode, page->index) != iblock)) {
		printk(KERN_WARNING "\nCalculation wrong!!");
		BUG();
	}
	paddr = page_address(page);
	for (bh = head = page_buffers(page), block_start = 0;
	     bh != head || !block_start;
	     iblock++, block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				part = 1;
		}
else if (S_ISREG(inode->i_mode)) {
if (buffer_new(bh))
clear_buffer_new(bh);
recheck:
depth = cbfs_block_to_path(inode, iblock,
offsets, &boundary);
cbfs_get_branch(inode, depth, offsets, chain,
&err);
allocated_block = (long) chain[depth-1].key;
mapped_block =
cbfs_check_duplicate_block(c_htable, b_htable, (char *) paddr,
allocated_block);
if (mapped_block == allocated_block)
goto out;
else if (mapped_block < 0) {
cbfs_free_branches(inode, chain[depth-
1].p, chain[depth-1].p+1, 0);
cbfs_get_block(inode, iblock, bh, 1);
goto recheck;
}
else {
goal = mapped_block;
cbfs_free_branches(inode, chain[depth-1].p,
chain[depth-1].p+1, 0);
cbfs_get_block_direct(inode, iblock, bh, 1,
goal);
}
}
else {
out:
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
}
if (bh->b_blocknr != pno_to_blockno(inode, page->index))
{
printk("\nBUG: page mapping is screwed up! %ld, and
%ld", (long int)bh->b_blocknr,
(long int)pno_to_blockno(inode, page-
>index));
}
/*
* If this is a partial write which happened to make all
buffers
* uptodate then we can optimize away a bogus readpage()
for
* the next read(). Here we 'discover' whether the page
went
* uptodate as a result of this (potentially partial)
write.
*/
if (!part)
SetPageUptodate(page);
return 0;
}
2) CBFS_GET_BLOCK_DIRECT():
int cbfs_get_block_direct(struct inode *inode, sector_t
iblock, struct buffer_head *bh, int create, unsigned long goal)
{
int err = -EIO;
int offsets[4];
Indirect chain[4];
Indirect *partial;
int boundary = 0;
int depth = 0;
int left;
depth = cbfs_block_to_path(inode, iblock, offsets,
&boundary);
if (depth == 0)
goto out;
reread:
partial = cbfs_get_branch(inode, depth, offsets, chain,
&err);
/* Simplest case - block found, no allocation needed */
if (!partial) {
got_it:
map_bh(bh, inode->i_sb, le32_to_cpu(chain[depth-
1].key));
if (boundary)
set_buffer_boundary(bh);
/* Clean up and exit */
partial = chain+depth-1; /* the whole chain */
goto cleanup;
}
/* Next simple case - plain lookup or failed read of
indirect block */
if (err == -EIO) {
cleanup:
while (partial > chain) {
brelse(partial->bh);
partial--;
}
out:
return 0;
}
/*
* Indirect block might be removed by truncate while we
were
* reading it. Handling of that case (forget what we've
got and
* reread) is taken out of the main path.
*/
if (err == -EAGAIN)
goto changed;
left = (chain + depth) - partial;
err = cbfs_alloc_branch_direct(inode, left, goal,
offsets+(partial-chain), partial);
if (err)
goto cleanup;
if (cbfs_use_xip(inode->i_sb)) {
/*
* we need to clear the block
*/
err = cbfs_clear_xip_target (inode,
le32_to_cpu(chain[depth-1].key));
if (err)
goto cleanup;
}
if (cbfs_splice_branch_direct(inode, iblock, chain,
partial, left) < 0) {
goto changed;
}
set_buffer_new(bh);
goto got_it;
changed:
while (partial > chain) {
brelse(partial->bh);
partial--;
}
goto reread;
}
3)CBFS_ALLOC_BRANCH_DIRECT:
static int cbfs_alloc_branch_direct(struct inode *inode,
int num, unsigned long
goal,
int *offsets,
Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int n = 0;
int err = 0;
int i;
int parent;
parent = goal;
branch[0].key = cpu_to_le32(parent);
if (parent) for (n = 1; n < num; n++) {
struct buffer_head *bh;
/* Allocate the next block */
int nr = cbfs_alloc_block(inode, parent, &err);
if (!nr)
break;
branch[n].key = cpu_to_le32(nr);
/*
* Get buffer_head for parent block, zero it out and
set
* the pointer to new one, then send parent to disk.
*/
bh = sb_getblk(inode->i_sb, parent);
if (!bh) {
err = -EIO;
break;
}
lock_buffer(bh);
branch[n].bh = bh;
branch[n].p = (__le32 *) bh->b_data + offsets[n];
*branch[n].p = branch[n].key;
set_buffer_uptodate(bh);
unlock_buffer(bh);
mark_buffer_dirty_inode(bh, inode);
/* We used to sync bh here if IS_SYNC(inode).
* But we now rely upon generic_osync_inode()
* and b_inode_buffers. But not for directories.
*/
if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
sync_dirty_buffer(bh);
parent = nr;
}
if (n == num)
return 0;
/* Allocation failed, free what we already allocated */
for (i = 1; i < n; i++)
bforget(branch[i].bh);
return err;
}
4)CBFS_SPLICE_BRANCH_DIRECT():
static inline int cbfs_splice_branch_direct(struct inode
*inode,
long block,
Indirect chain[4],
Indirect *where,
int num)
{
struct cbfs_inode_info *ei = CBFS_I(inode);
int i;
/* Verify that place we are splicing to is still there
and vacant */
write_lock(&ei->i_meta_lock);
// if (!verify_chain(chain, where-1) || *where->p)
// goto changed;
/* That's it */
*where->p = where->key;
ei->i_next_alloc_goal = le32_to_cpu(where[0].key);
write_unlock(&ei->i_meta_lock);
/* We are done with atomic stuff, now do the rest of
housekeeping */
inode->i_ctime = CURRENT_TIME_SEC;
/* had we spliced it onto indirect block? */
if (where->bh)
mark_buffer_dirty_inode(where->bh, inode);
mark_inode_dirty(inode);
return 0;
changed:
write_unlock(&ei->i_meta_lock);
for (i = 1; i < num; i++)
bforget(where[i].bh);
for (i = 0; i < num; i++)
cbfs_free_blocks(inode, le32_to_cpu(where[i].key),
1);
return -EAGAIN;
}
5)CBFS_FREE_BLOCKS() :
void cbfs_free_blocks (struct inode * inode, unsigned long
block,
unsigned long count)
{
int i;
for (i = 0; i < count; i++) {
cbfs_free_block (inode, block, 1);
block++;
}
}
6)CBFS_FREE_BLOCK():
void cbfs_free_block (struct inode * inode, unsigned long
block,
unsigned long count)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head * bh2;
unsigned long block_group;
unsigned long bit;
unsigned long i;
unsigned long overflow;
struct super_block * sb = inode->i_sb;
struct cbfs_sb_info * sbi = CBFS_SB(sb);
struct cbfs_group_desc * desc;
struct cbfs_super_block * es = sbi->s_es;
struct cbfs_hash_node *node;
unsigned freed = 0, group_freed = 0;
int err = 0;
err = cbfs_remove_hash_entry(c_htable, b_htable, (long)
block);
if (err < 0) {
err = 1;
goto error_return;
}
if (block < le32_to_cpu(es->s_first_data_block) ||
block + count < block ||
block + count > le32_to_cpu(es->s_blocks_count)) {
cbfs_error (sb, "cbfs_free_blocks",
"Freeing blocks not in datazone - "
"block = %lu, count = %lu", block, count);
goto error_return;
}
cbfs_debug ("freeing block(s) %lu-%lu\n", block, block +
count - 1);
do_more:
overflow = 0;
block_group = (block - le32_to_cpu(es-
>s_first_data_block)) /
CBFS_BLOCKS_PER_GROUP(sb);
bit = (block - le32_to_cpu(es->s_first_data_block)) %
CBFS_BLOCKS_PER_GROUP(sb);
/*
* Check to see if we are freeing blocks across a group
* boundary.
*/
if (bit + count > CBFS_BLOCKS_PER_GROUP(sb)) {
overflow = bit + count - CBFS_BLOCKS_PER_GROUP(sb);
count -= overflow;
}
brelse(bitmap_bh);
bitmap_bh = read_block_bitmap(sb, block_group);
if (!bitmap_bh)
goto error_return;
desc = cbfs_get_group_desc (sb, block_group, &bh2);
if (!desc)
goto error_return;
if (in_range (le32_to_cpu(desc->bg_block_bitmap), block,
count) ||
in_range (le32_to_cpu(desc->bg_inode_bitmap), block,
count) ||
in_range (block, le32_to_cpu(desc->bg_inode_table),
sbi->s_itb_per_group) ||
in_range (block + count - 1, le32_to_cpu(desc-
>bg_inode_table),
sbi->s_itb_per_group))
cbfs_error (sb, "cbfs_free_blocks",
"Freeing blocks in system zones - "
"Block = %lu, count = %lu",
block, count);
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi,
block_group),
bit + i, bitmap_bh->b_data)) {
cbfs_error(sb, __FUNCTION__,
"bit already cleared for block %lu", block
+ i);
} else {
group_freed++;
}
block++;
}
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
group_release_blocks(sb, block_group, desc, bh2,
group_freed);
freed += group_freed;
if (overflow) {
block += count;
count = overflow;
printk("\nOverflow");
goto do_more;
}
error_return:
brelse(bitmap_bh);
release_blocks(sb, freed);
DQUOT_FREE_BLOCK(inode, freed);
}
7)INIT_CBFS_FS
static int __init init_cbfs_fs(void)
{
int err = init_cbfs_xattr();
c_htable = cbfs_init_checksum_hash_table(myhash_cs);
b_htable = cbfs_init_block_hash_table(myhash_b);
cacheptr = kmem_cache_create("nodespace", sizeof(struct
cbfs_hash_node), 0, 0, NULL, NULL);
if (err)
return err;
err = init_inodecache();
if (err)
goto out1;
err = register_filesystem(&cbfs_fs_type);
if (err)
goto out;
return 0;
out:
destroy_inodecache();
out1:
exit_cbfs_xattr();
return err;
}
8)EXIT_CBFS_FS:
static void __exit exit_cbfs_fs(void)
{
	cbfs_hash_free(c_htable, b_htable);
	unregister_filesystem(&cbfs_fs_type);
	destroy_inodecache();
	kmem_cache_destroy(cacheptr);
	exit_cbfs_xattr();
}
SCREENSHOTS :