Cbfs Report


    ABSTRACT

File systems abstract raw disk data into files and directories, thereby providing an easy-to-use interface for user applications to store and retrieve persistent information. Most common file systems that exist today treat file data as opaque entities and do not use information about file contents to perform useful optimizations. For example, today's file systems do not detect files with the same data. In this project, we design and implement a new file system that understands the contents of the data it stores, to enable interesting functionality. Specifically, we focus on detecting and eliminating duplicate data items across files. Eliminating duplicate data has two key advantages: first, the disk stores just a single copy of a data item even if multiple files share it, thereby saving storage space. Second, the disk I/O required for reading and writing copies of the same data is eliminated, thereby improving performance. To implement this, we plan to work on the existing Linux Ext2 file system and reuse part of its source code. Through our implementation we hope to demonstrate the utility of eliminating duplicate file data, both in terms of space savings and performance improvements.


    TABLE OF CONTENTS

    1. INTRODUCTION

    2. MOTIVATION

    3. BACKGROUND

    3.1. File System Background

    3.2. Overview of Linux VFS

    3.3. Layout of the Ext2 File System

    3.4. Content Based File System

    4 DESIGN

    4.1 Overall structure

    4.2 Detecting Duplicate Data

    4.3 Tracking Content Information

    4.4 Online Duplicate Elimination

    4.5 DISCUSSION

    i. Concept in LBFS

    ii. File-level content checking

    5 IMPLEMENTATION

    5.1 Data Structures

    5.2 HASH TABLE OPERATIONS

    5.2.1. Initializing the hash table

    5.2.2. Compute Hash

    5.2.3. Add Entry

    5.2.4. Check Duplicate

    5.2.5. Remove Entry

    5.2.6. Free Hash Table

    5.3 Read Data flow

    5.4 WRITE DATA FLOW

    5.4.1 Overall Write Flow

    5.4.2 Unique Data (New Block)


    5.4.3 Unique Data (Overwrite Existing Block)

    5.4.4 Duplicate Data (New Block)

    5.4.5 Duplicate Data (Overwrite)

    5.4.6 Delete Data

    5.5 Duplicate Eliminated Cache

    5.6 Code snippets

    6 EVALUATION

    6.1 Correctness Check

    6.2 Performance Check

    6.3 Overall Analysis

    7 RELATED WORK

    8 FUTURE WORK

    9 REFERENCES


    INTRODUCTION :

    A file system is a method for storing and organizing files and the data they contain to

    make it easy to find and access them. The present file systems in Linux are content-oblivious.

    They do not know about the nature of the data present in the disk blocks. Therefore, even if two or more blocks of data are identical, they are stored as redundant copies on the disk. This is the case dealt with in this project. The outcome is block-level duplicate elimination on disk. Content hashing is the method used to determine whether two blocks are redundant.

    The MD5 algorithm is used to compute the hash value of each block on the disk. Before each write to disk, the hash value is calculated for the new data and compared with the existing hash values. If the same hash value is already present in the hash table, the block is a duplicate.

    This project is implemented in the Linux kernel 2.6, wherein the existing ext2 file

    system functions are modified accordingly to accomplish the goal of duplicate elimination.

    The evaluation of this project is done in two phases namely the correctness check and the

    performance check. A testing program which does exhaustive file system operations is

    executed and the file system is found to be stable and also seen to produce expected results.

    The performance check is done by evaluating the postmark results.

    This report details the design and implementation of the content-based file system.

    Since the main change is confined to the write path, each write scenario is explained in detail, with the status of the hash table illustrated by diagrams both before and after the corresponding code segment executes.

    MOTIVATION :

    The existing ext2 file system in Linux is devoid of any information regarding the

    data on the disk. It cannot tell whether two blocks of data are unique or identical. This means two identical files will be stored in two separate locations on the disk and thus require double the space of a single copy. This results not only in wasted disk space but also in increased I/O: reading both files separately, holding two identical pages of data in the cache, and so on. Let us consider some example cases where this disadvantage becomes a bigger problem.

    Consider the case wherein the virtual machines are used for testing purposes. When

    Linux is installed in the virtual machine and if the host OS of the virtual machine is also Linux,

    then, all the packages in Linux will first be stored in the disk and when the Linux is again

    installed in the Virtual machine, again the same set of packages will be separately stored in the


    disk. The packages of a full Linux distribution occupy around 5 Gigabytes of storage, so about 10 Gigabytes would be needed to run a Linux virtual machine on a Linux host. When the virtual machine runs, there will again be more I/O operations, degrading performance. But if the duplicate blocks are eliminated, considerable disk space (about 5 Gigabytes) can be saved, and the number of I/O operations drops considerably, improving performance to a good extent.

    Therefore, if by some means, the file system knows the content of the blocks in the

    disk before it writes a new block, this disadvantage can be eliminated. Hashing is a common technique for generating a fixed-size hash from arbitrary-sized input. Thus, when the content of each data block on the disk is hashed, the blocks can easily be compared with one another and the file system can control the read and write operations accordingly. Content hashing in this project is done using the Message Digest (MD5) algorithm.

    BACKGROUND :

    1) File System Background:

    A file system is a method for storing and organizing files and the data they contain

    to make it easy to find and access them. More formally, it is a set of abstract data types that are

    implemented for the storage, hierarchical organization, manipulation, navigation, access, and retrieval of data.

    A disk is just a collection of sectors and tracks, so it can only perform operations at the granularity of sectors, e.g., read a sector or write to a sector. But we need a hierarchical structure for maintaining files. With just a disk this cannot be done, because the hard disk is merely a linear collection of bits (0s and 1s) arranged into tracks and

    between the user programs and the hard disk. We tell the file system to write a file to the disk

    and it is the file system that knows the disk structure and copies the blocks of data in the disk.

    A file system treats the data from the disk as fixed-size blocks that contain information, but it has no semantic information pertaining to the data. It treats all data, whether it belongs to a file, a dentry, or an inode, in the same way: as a block of data.


    2) Overview of Linux VFS:

    Linux includes the Virtual Filesystem Switch (VFS) layer, which lies between applications and the various file systems. Every request to the disk goes through the VFS before being passed to the file system. It acts as a generalization layer over all the underlying file systems. Its function is to locate the file system for a particular file from its file object and then map the request to the file system-specific functions. Some common functions, like read and write, work in the same pattern for most file systems. Therefore, a generic function is available for these types of functions and the VFS layer maps the request to these generic functions. There are four basic objects in the VFS, namely:

    Super Block Object

    Inode Object

    File Object

    Dentry Object.

    Consider a process P1 that makes a read request for a file F1 stored in a disk partition formatted with the Ext2 file system. Similarly, process P2 requests file F2 on an Ext3 file system. Both requests are transferred to the corresponding system call (sys_read() in this case). The system call handling routine transfers the request to the VFS layer, which in turn transfers the read request to the corresponding file system's read function. The VFS knows the file system associated with any file from the file object that is passed to it. Some file systems in turn map the basic operation requests to generic functions, which ultimately carry out the request and return the result to the layer above.

    3) Page Cache:

    The page cache is the main disk cache used by the Linux kernel. In most cases, the

    kernel refers to the page cache when reading from or writing to disk. New pages are added to

    the page cache to satisfy User Mode processes' read requests. If the page is not already in the

    cache, a new entry is added to the cache and filled with the data read from the disk. If there is

    enough free memory, the page is kept in the cache for an indefinite period of time and can then

    be reused by other processes without accessing the disk.


    Fig 3.1 : Overview of Linux Filesystems

    Similarly, before writing a page of data to a block device, the kernel verifies whether

    the corresponding page is already included in the cache; if not, a new entry is added to the

    cache and filled with the data to be written on disk. The I/O data transfer does not start

    immediately: the disk update is delayed for a few seconds, thus giving a chance to the

    processes to further modify the data to be written (in other words, the kernel implements

    deferred write operations).



    Kernel code and kernel data structures don't need to be read from or written to disk.

    Kernel designers have implemented the page cache to fulfill two main requirements:

    Quickly locate a specific page containing data relative to a given owner. To take the

    maximum advantage from the page cache, searching it should be a very fast operation.

    Keep track of how every page in the cache should be handled when reading or writing

    its content. For instance, reading a page from a regular file, a block device file, or a swap

    area must be performed in different ways, thus the kernel must select the proper operation

    depending on the page's owner.

    The unit of information kept in the page cache is, of course, a whole page of data. A

    page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by

    a device number and a block number. Instead, a page in the page cache is identified by an owner and by an index within the owner's data: usually, an inode and an offset inside the corresponding file.

    4) Buffer Pages:

    In old versions of the Linux kernel, there were two different main disk caches: the

    page cache, which stored whole pages of disk data resulting from accesses to the contents of the

    disk files, and the buffer cache , which was used to keep in memory the contents of the blocks

    accessed by the VFS to manage the disk-based file systems.

    Starting from stable version 2.4.10, the buffer cache does not really exist anymore. In

    fact, for reasons of efficiency, block buffers are no longer allocated individually; instead, they

    are stored in dedicated pages called "buffer pages ," which are kept in the page cache.

    Formally, a buffer page is a page of data associated with additional descriptors called

    "buffer heads", whose main purpose is to quickly locate the disk address of each individual

    block in the page. In fact, the chunks of data stored in a page belonging to the page cache are

    not necessarily adjacent on disk.

    Whenever the kernel must individually address a block, it refers to the buffer page

    that holds the block buffer and checks the corresponding buffer head. Here are two common

    cases in which the kernel creates buffer pages:

    When reading or writing pages of a file that are not stored in contiguous disk blocks.

    This happens either because the file system has allocated noncontiguous blocks to the file,

    or because the file contains "holes".


    When accessing a single disk block (for instance, when reading a superblock or an

    inode block).

    In the first case, the buffer page's descriptor is inserted in the radix tree of a regular

    file. The buffer heads are preserved because they store precious information: the block device and the logical block number that specify the position of the data on the disk.

    In the second case, the buffer page's descriptor is inserted in the radix tree rooted at

    the address_space object of the inode in the bdev special file system associated with the block

    device. This kind of buffer pages must satisfy a strong constraint: all the block buffers must

    refer to adjacent blocks of the underlying block device.

    An instance of where this is useful is when the VFS wants to read the 1,024-byte

    inode block containing the inode of a given file. Instead of allocating a single buffer, the kernel

    must allocate a whole page storing four buffers; these buffers will contain the data of a group of

    four adjacent blocks on the block device, including the requested inode block.

    All the block buffers within a single buffer page must have the same size; hence, on

    the 80 x 86 architecture, a buffer page can include from one to eight buffers, depending on the

    block size.

    When a page acts as a buffer page, all buffer heads associated with its block buffers

    are collected in a singly linked circular list. The private field of the descriptor of the buffer page

    points to the buffer head of the first block in the page; every buffer head stores in the

    b_this_page field a pointer to the next buffer head in the list. Moreover, every buffer head

    stores the address of the buffer page's descriptor in the b_page field. Figure 3.2 shows a

    buffer page containing four block buffers and the corresponding buffer heads.

    Because the private field contains valid data, the PG_private flag of the page is also set; hence,

    if the page contains disk data and the PG_private flag is set, then the page is a buffer page.

    Notice, however, that other kernel components not related to the block I/O subsystem use the

    private and PG_private fields for other purposes.
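    To make this bookkeeping concrete, here is a minimal sketch (not taken from the report) of how kernel code can walk the buffer heads of a buffer page using the circular b_this_page list described above; page_has_buffers() and page_buffers() are the standard 2.6 helpers for this.

    #include <linux/kernel.h>
    #include <linux/mm.h>
    #include <linux/buffer_head.h>

    /* Walk every block buffer attached to a buffer page. */
    static void walk_buffer_page(struct page *page)
    {
            struct buffer_head *bh, *head;

            if (!page_has_buffers(page))     /* PG_private clear: no buffer heads */
                    return;

            bh = head = page_buffers(page);  /* first buffer head in the page */
            do {
                    /* b_blocknr is the logical block number of this buffer;
                     * b_page points back to the buffer page's descriptor. */
                    printk(KERN_DEBUG "block %llu in page %p\n",
                           (unsigned long long)bh->b_blocknr, bh->b_page);
                    bh = bh->b_this_page;    /* singly linked circular list */
            } while (bh != head);
    }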

    5) Writing Dirty Pages to Disk

    The kernel keeps filling the page cache with pages containing data of block devices.

    Whenever a process modifies some data, the corresponding page is marked as dirty; that is, its

    PG_dirty flag is set.


    Figure3.2 : A buffer page including four buffers and their buffer heads

    Unix systems allow the deferred writes of dirty pages into block devices, because

    this noticeably improves system performance. Several write operations on a page in cache

    could be satisfied by just one slow physical update of the corresponding disk sectors.

    Moreover, write operations are less critical than read operations, because a process is usually

    not suspended due to delayed writings, while it is most often suspended because of delayed

    reads. Thanks to deferred writes, each physical block device will service, on the average, many

    more read requests than write ones.

    A dirty page might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:

    If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates made since the system was booted are lost.

    The size of the page cache, and hence of the RAM required to contain it, would have to be huge, at least as big as the size of the accessed block devices.

    Therefore, dirty pages are flushed (written) to disk under the following conditions:

    The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.

    Too much time has elapsed since a page became dirty.

    A process requests all pending changes of a block device or of a particular file to be flushed; it does this by invoking a sync(), fsync(), or fdatasync() system call, as in the sketch below.
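    For illustration, here is a minimal userspace sketch of that third case: a process forcing its own pending changes out to disk with fsync(). The file name is only an example.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd < 0)
                    return 1;
            write(fd, "hello\n", 6);   /* dirties a page in the page cache */
            fsync(fd);                 /* blocks until the file's dirty pages
                                          and metadata reach the disk */
            close(fd);
            return 0;
    }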

    Buffer pages introduce a further complication. The buffer heads associated with each buffer

    page allow the kernel to keep track of the status of each individual block buffer. The PG_dirty


    flag of the buffer page should be set if at least one of the associated buffer heads has the

    BH_Dirty flag set. When the kernel selects a dirty buffer page for flushing, it scans the

    associated buffer heads and effectively writes to disk only the contents of the dirty blocks. As

    soon as the kernel flushes all dirty blocks in a buffer page to disk, it clears the PG_dirty flag of

    the page.

    6) Layout of the Ext2 File System:

    The first block in each Ext2 partition is never managed by the Ext2 file system,

    because it is reserved for the partition boot sector. The rest of the Ext2 partition is split into

    block groups, each of which has the layout shown in Figure 3.3. As you will

    notice from the figure, some data structures must fit in exactly one block, while others may

    require more than one block. All the block groups in the file system have the same size and are

    stored sequentially, thus the kernel can derive the location of a block group in a disk simply

    from its integer index.

    Figure 3.3 : Layouts of an Ext2 partition and of an Ext2 block group

    Block groups reduce file fragmentation, because the kernel tries to keep the data blocks

    belonging to a file in the same block group, if possible. Each block in a block group contains

    one of the following pieces of information:

    A copy of the file system's superblock

    A copy of the group of block group descriptors

    A data block bitmap

    An inode bitmap

    A table of inodes


    A chunk of data that belongs to a file; i.e., data blocks

    If a block does not contain any meaningful information, it is said to be free. As

    seen from Figure 3.3 , both the superblock and the group descriptors are duplicated in each

    block group. Only the superblock and the group descriptors included in block group 0 are used by the kernel, while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel doesn't even look at them. When the e2fsck program executes a consistency check on the file system status, it refers to the superblock and the group descriptors stored in block group 0, and then copies them into all other block groups. If data corruption occurs and the main superblock or the main group descriptors in block group 0 become invalid, the system administrator can instruct e2fsck to refer to the old copies of the superblock and the group descriptors stored in a block group other than the first. Usually, the redundant copies store enough information to allow e2fsck to bring the Ext2 partition back to a consistent state.

    Figure 3.4 shows the actual mapping of the inode to the corresponding data

    blocks in a single group.

    |--Inode table---| |---Indirect blocks pointing to data blks---------| |---Data Blks----|

    Fig 3.4 Inode pointers in Ext2 filesystem

    As shown in the figure above, each entry in the inode table points to a specific data block, and the contents of the data blocks are never examined. Therefore, multiple copies of the same information can exist in many data blocks on the disk, and space is wasted on them.


    7) Data Block Addressing in Ext2:

    Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (their logical block number).

    Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:

    1. Derive from the offset f the file block number, i.e., the index of the block that contains the character at offset f.

    2. Translate the file block number to the corresponding logical block number.

    Because Unix files do not include any control characters, it is quite easy to derive the file block number containing the f-th character of a file: simply take the quotient of f and the file system's block size and round down to the nearest integer.

    For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.
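    The first step is therefore a simple integer division; the small sketch below (not from the report) illustrates it for a 4 KB block size.

    #include <stdio.h>

    /* File block number containing byte offset f, for block size blksize. */
    static unsigned long file_block_number(unsigned long long f, unsigned int blksize)
    {
            return (unsigned long)(f / blksize);   /* quotient, rounded down */
    }

    int main(void)
    {
            /* With 4 KB blocks: offset 4095 is in file block 0, offset 4096 in block 1. */
            printf("%lu %lu\n", file_block_number(4095, 4096),
                                file_block_number(4096, 4096));
            return 0;
    }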

    This is fine as far as file block numbers are concerned. However, translating a file

    block number into the corresponding logical block number is not nearly as straightforward,

    because the data blocks of an Ext2 file are not necessarily adjacent on disk.

    The Ext2 file system must therefore provide a method to store the connection

    between each file block number and the corresponding logical block number on disk. This

    mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside

    the inode. It also involves some specialized blocks that contain extra pointers, which are an

    inode extension used to handle large files.

    The i_block field in the disk inode is an array of EXT2_N_BLOCKS components

    that contain logical block numbers. In the following discussion, we assume that

    EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a

    larger data structure, which is illustrated in Figure 3.5. As can be seen in the figure, the 15

    components of the array are of 4 different types:

    The first 12 components yield the logical block numbers corresponding to the first 12

    blocks of the file to the blocks that have file block numbers from 0 to 11.


    The component at index 12 contains the logical block number of a block, called

    indirect block, that represents a second-order array of logical block numbers. They

    correspond to the file block numbers ranging from 12 to b/4+11, where b is the file

    system's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the

    formula). Therefore, the kernel must look in this component for a pointer to a block, and

    then look in that block for another pointer to the ultimate block that contains the file

    contents.

    The component at index 13 contains the logical block number of an indirect block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from b/4 + 12 to (b/4)^2 + (b/4) + 11.

    Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from (b/4)^2 + (b/4) + 12 to (b/4)^3 + (b/4)^2 + (b/4) + 11.

    Figure 3.5 : Data structures used to address the file's data blocks

    In Figure 3.5, the number inside a block represents the corresponding file block number. The

    arrows, which represent logical block numbers stored in array components, show how the


    kernel finds its way through indirect blocks to reach the block that contains the actual contents

    of the file.

    Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, every piece of data can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed to access the required block. In practice, this is a worst-case estimate, because dentry, inode, and page caches contribute significantly to reducing the number of real disk accesses.

    Notice also how the block size of the file system affects the addressing mechanism, because a

    larger block size allows the Ext2 to store more logical block numbers inside a single block.

    Table 1 shows the upper limit placed on a file's size for each block size and each addressing

    mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of

    data, the first 12 KB of a file can be accessed through direct mapping and the remaining 13-268

    KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-

    bit architectures by specifying the O_LARGEFILE opening flag.

    Table 1. File-size upper limits for data block addressing

    Block size Direct 1-Indirect 2-Indirect 3-Indirect

    1,024 12 KB 268 KB 64.26 MB 16.06 GB

    2,048 24 KB 1.02 MB 513.02 MB 256.5 GB

    4,096 48 KB 4.04 MB 4 GB ~ 4 TB
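    The limits in Table 1 follow directly from the addressing scheme: with 4-byte block pointers, the direct entries cover 12*b bytes, and each additional level of indirection multiplies the number of reachable blocks by b/4. The short userspace sketch below (not part of the report) reproduces the cumulative limits for the three block sizes.

    #include <stdio.h>

    int main(void)
    {
            unsigned long long sizes[] = { 1024, 2048, 4096 };
            int i;

            for (i = 0; i < 3; i++) {
                    unsigned long long b      = sizes[i];
                    unsigned long long ptrs   = b / 4;             /* pointers per block */
                    unsigned long long direct = 12 * b;            /* 12 direct blocks   */
                    unsigned long long ind1 = direct + ptrs * b;   /* + single indirect  */
                    unsigned long long ind2 = ind1 + ptrs * ptrs * b;        /* + double */
                    unsigned long long ind3 = ind2 + ptrs * ptrs * ptrs * b; /* + triple */

                    printf("block %4llu: direct %llu KB, 1-ind %llu KB, "
                           "2-ind %llu MB, 3-ind %llu GB\n",
                           b, direct >> 10, ind1 >> 10, ind2 >> 20, ind3 >> 30);
            }
            return 0;
    }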


    THE CONTENT-BASED FILE SYSTEM :

    The content-based file system employs a technique called content hashing to compare the contents of blocks and find whether they are duplicates. The Message Digest Algorithm is used to compute the hash value of each block. For every data block on the disk, the corresponding hash value is calculated. At any given instant, the hash table has one entry for every valid data block on the disk. There are two hash table structures: the checksum hash table, indexed by the checksum field, and the block hash table, indexed by the block number field. Both structures point to a single copy of the hash node.

    In the context of the file system, there is no change to any of the inode data structures. The only difference is that the inode table may have multiple entries pointing to the same block number on disk. For example, if a 50 MB file is created in a partition holding the content-based file system, and the file contains just 40 MB of unique information while the remaining 10 MB is duplicate data, then the inode table for that file has the same number of pointers as a fully unique 50 MB file. But the inode table contains duplicate pointers for the remaining 10 MB, and therefore the space actually occupied by the file is just 40 MB. In this way, disk space can be saved to a good extent. When the file is read, the inode table is accessed in the normal way and the corresponding data blocks are read. When a data block has already been read from the disk and is present in the buffer cache, a subsequent access to the same block does not need to invoke another disk read; it simply uses the copy of the data in the buffer cache. In this way, the better cache hit rate improves read performance. Therefore, for any operation such as file create, file overwrite, file append, or truncation, the checksum values are compared before the disk write is invoked; only if there is no duplicate block is the new block written. Otherwise, no new data block is written and the disk usage remains the same. As a result, at any given instant the disk holds only a single copy of any data and there is no room for redundancy. In addition to efficient utilization of disk space, content hashing also helps in maintaining data integrity. In many cases, this technique can save a considerable amount of disk space and also improve the performance of file operations.


    Such error-detection codes add little overhead and are thus not too inefficient, but they are not collision resistant. This means there may be duplicate blocks that are easily missed by the error-detection codes, so this technique cannot be relied upon to detect duplicate blocks. Considering these cases, collision-resistant hashing proves to be a more viable method for accomplishing the goal of detecting duplicate blocks.

    Collision Resistant Hashing :

    Collision-Resistant hashing is a technique by which, a unique hash value is generated

    for each unique content of the block. The Message Digest Algorithm 5(MD5) is used for this

    purpose. The MD5 algorithm is widely used in many cryptographic applications and is found to be more collision resistant than its predecessors. It takes input of arbitrary length and produces an MD5 hash of 128 bits. For any unique input, a unique MD5 hash is produced. This is a one-way hashing technique: from given data, the corresponding hash value can easily be calculated, but given an MD5 hash, it is not possible to derive the input data. We employ this technique to compare the contents of blocks. Since the MD5 hashing algorithm gives a unique hash for every unique input, only identical blocks of data give rise to the same hash value. Therefore, if the hash value is calculated for every data block, comparing the hash values amounts to comparing the actual blocks of data.
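    As a userspace illustration of the idea (the project itself computes checksums inside the kernel, as shown in Section 5), two blocks can be compared by their 128-bit MD5 digests, for example with OpenSSL's MD5() routine:

    #include <string.h>
    #include <openssl/md5.h>

    /* Returns 1 when the two len-byte blocks have identical MD5 digests. */
    int blocks_identical(const unsigned char *a, const unsigned char *b, size_t len)
    {
            unsigned char da[MD5_DIGEST_LENGTH], db[MD5_DIGEST_LENGTH];

            MD5(a, len, da);                 /* 16-byte (128-bit) digest of a */
            MD5(b, len, db);                 /* 16-byte (128-bit) digest of b */
            return memcmp(da, db, MD5_DIGEST_LENGTH) == 0;
    }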

    4.3 TRACKING CONTENT INFORMATION:

    The hash table is used to track the content information pertaining to the various disk

    data blocks. For every unique content in the disk, there is an entry in the hash table. In the

    hash table, there is a checksum->block number mapping available. This mapping is used to

    locate the duplicate blocks (if any) before any write operation takes place on the disk. The

    overall structure of the hash table is illustrated by the following figure.


    Figure 4.3: Hash table structure (the checksum hash table and the block hash table both point to entries describing the disk blocks)

    4.4 ONLINE DUPLICATE ELIMINATION:

    WRITE SCENARIOS :

    1) Unique Data (New Block):

    This happens in the case of file creation or a file append. Basically, a new block is allocated and the data is copied to it. Before the data is written to the block, the hash table is checked and the data is found to be new. Therefore, the new block number, its checksum value, and a reference count of 1 are added to the hash table, and the normal write operation continues.

    2) Unique Data (Existing Block):

    This happens in the case of file modification. When a file is modified, one or more blocks belonging to the file are changed, and before they are written to disk, the routine redundancy check is performed. In this case, the new checksum of the modified data does not find an entry in the hash table, but there is already an entry for that block with a different checksum. Therefore, the reference count of that block is decremented, a new block is allocated, the new contents are copied to it, and the hash table is updated for the new block.



    3) Duplicate Data (New block):

    This happens in the case of file creation or a file append, where the efficiency of the content-based file system is exploited. In this case, before writing to disk, the hash table is checked for a duplicate block, and that existing block is mapped to the corresponding inode pointer. The reference count of the block is incremented.

    4) Duplicate Data (Existing Block):

    This is a case where an existing block is modified and the now modified

    contents are found to be duplicate. In this case, the reference count of the old block is

    decremented and the reference count of the new block is incremented. Then, the

    corresponding inode pointers are updated with the new block.

    5) Delete Data:

    This happens when a file is modified by deleting a part of its contents, or during

    a file remove. In this case, the reference count alone is decremented and when the

    reference count is 0, the block gets freed and the hash entry is removed.

    DISCUSSION :

    Comparison with LBFS :

    LBFS is a network file system which conserves communication bandwidth

    between clients and servers. It takes advantage of cross-file similarities. When transferring a

    file between the client and server, LBFS identifies chunks of data that the recipient already has

    in other files and avoids transmitting the redundant data over the network. Here, there is no

    block-level redundancy check performed. Instead, some chunks of data are compared to check

    duplication of data. Chunks are variable-sized blocks. Here, when a modification is made to shared data, the chunk size can grow, and the necessary changes must be made in the kernel to handle that. This is more complex to implement than the block-level redundancy check performed in this project.

    File-Level Redundancy check :

    Another technique, analogous to block-level hashing, is file-level redundancy checking. In this case, whole files are compared and the redundant files are


    eliminated. The usefulness of this technique is limited by the availability of redundant files in a file system: only if there is more than one copy of a file in the disk partition can the advantage be felt. In the block-level redundancy check, by contrast, duplicate blocks are eliminated across files. Even when two files differ as a whole, duplicate elimination can still be applied to the blocks the two files share. The benefit therefore arises much more frequently, since redundant blocks may be spread over many different files.

    Therefore, block-level content hashing and redundancy elimination is a beneficial method that brings about efficient disk space usage as well as better performance, by reducing the number of disk accesses and achieving a better cache hit rate.


    5. IMPLEMENTATION:

    Figure 5.0: Control logic of CBFS

    [Flowchart nodes: check duplicate in checksum table (found / not found); check block hash table (found / not found); change page mapping; change inode pointer; update hash table; allocate new block; add hash entry; remove hash entry; exit with success.]


    5.1 DATA STRUCTURES :

    1) Hash Table Structure :

    There are two hash table structures called checksum hash table (indexed by the

    checksum field) and a block hash table (indexed by the block number field). Both these

    hash table structures point to a single copy of the hash node. The following figure

    illustrates the overall structure of the hash table.

    Figure 5.1 : Structure of hash tables

    2) Components of the hash table :

    A hash table entry comprises three fields: a checksum field of 128 bits, a block number field (the logical block number on disk), and the reference count of the block. The block number field helps in locating the physical block on the disk. Therefore, for every valid data block on the disk, there is a hash entry in the hash table.
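    The report does not reproduce the declarations behind these tables. The following is a sketch of definitions consistent with how the code later in this section uses the fields; exact types (for example of len) and field order are assumptions.

    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct cbfs_hash_node {
            char            *checksum;      /* 128-bit MD5 checksum of the block      */
            long             block_number;  /* logical block number on disk           */
            int              ref_count;     /* number of inode pointers sharing it    */
            struct list_head checksum_ptr;  /* link in the checksum hash table        */
            struct list_head block_ptr;     /* link in the block hash table           */
    };

    struct cbfs_hash_bucket {
            struct list_head node_list;     /* nodes hashing to this bucket           */
    };

    struct cbfs_hash_table {
            struct cbfs_hash_bucket *buckets;
            int        (*hash)(void *key);  /* maps a key to a bucket index           */
            spinlock_t   lock;
            long         len;               /* number of entries in the table         */
    };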



    5.2 HASH TABLE OPERATIONS:

    There are two hash table pointers, namely checksum_htable and block_htable. The operations associated with the hash tables are the following:

    1) Initializing the hash table:

    Initializing the hash table involves creating the hash table structure and the hash buckets. This is done at the time of mounting the content-based file system, in the kernel function cbfs_fill_super(). The code for initializing the two hash tables is given below:

    struct cbfs_hash_table *cbfs_init_checksum_hash_table(int (*hash)(void *))
    {
            int i;
            struct cbfs_hash_table *newtable;

            newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
            BUG_ON(!newtable);
            memset(newtable, 0, sizeof(struct cbfs_hash_table));

            newtable->buckets = (struct cbfs_hash_bucket *)
                    __get_free_pages(GFP_KERNEL,
                            get_order(NUM_CHECKSUM_BUCKETS *
                                      sizeof(struct cbfs_hash_bucket)));
            BUG_ON(!newtable->buckets);
            memset(newtable->buckets, 0,
                   NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket));

            /* Loop header reconstructed from the truncated source: initialize
             * the node list of every bucket. */
            for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++)
                    INIT_LIST_HEAD(&newtable->buckets[i].node_list);

            newtable->hash = hash;
            spin_lock_init(&newtable->lock);
            return newtable;
    }


    struct cbfs_hash_table *cbfs_init_block_hash_table(int (*hash)(void *))
    {
            int i;
            struct cbfs_hash_table *newtable;

            newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
            BUG_ON(!newtable);
            memset(newtable, 0, sizeof(struct cbfs_hash_table));

            newtable->buckets = (struct cbfs_hash_bucket *)
                    __get_free_pages(GFP_KERNEL,
                            get_order(NUM_CHECKSUM_BUCKETS *
                                      sizeof(struct cbfs_hash_bucket)));
            BUG_ON(!newtable->buckets);
            memset(newtable->buckets, 0,
                   NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket));

            /* Loop header reconstructed from the truncated source: initialize
             * the node list of every bucket. */
            for (i = 0; i < NUM_BLOCK_BUCKETS; i++)
                    INIT_LIST_HEAD(&newtable->buckets[i].node_list);

            newtable->hash = hash;
            spin_lock_init(&newtable->lock);
            return newtable;
    }
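    The hash callback passed to these initializers is not shown in the report. A plausible, purely hypothetical example folds the 128-bit checksum into a bucket index, assuming NUM_CHECKSUM_BUCKETS is a power of two:

    /* Hypothetical hash callback: fold a 16-byte checksum into a bucket index. */
    static int checksum_bucket_hash(void *key)
    {
            unsigned char *c = key;
            unsigned int h = 0;
            int i;

            for (i = 0; i < 16; i++)         /* 16 bytes = 128-bit checksum */
                    h = h * 31 + c[i];
            return h & (NUM_CHECKSUM_BUCKETS - 1);
    }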

    2) Check duplicate:

    This function takes the contents of a block and returns whether another block with the same contents already exists. It internally invokes the cbfs_compute_checksum() function to calculate the checksum and then compares it with the checksum fields in the hash table. Given below is the code for the cbfs_check_duplicate_block() function:

    long cbfs_check_duplicate_block(struct cbfs_hash_table *checksum_htable,
                                    struct cbfs_hash_table *block_htable,
                                    char *data, long new_blk_no)
    {
            long err = 0;
            char *checksum;
            long blk_no;
            struct cbfs_hash_node *node_c, *node_b;

            checksum = cbfs_compute_checksum(data);
            spin_lock(&checksum_htable->lock);
            node_c = cbfs_node_lookup_by_checksum(checksum_htable, checksum);
            node_b = cbfs_node_lookup_by_block(block_htable, new_blk_no);

            if (!node_c && !node_b) {
                    /* Unique data in a new block: record it and keep the block. */
                    cbfs_add_hash_entry(checksum_htable, block_htable,
                                        checksum, new_blk_no);
                    err = new_blk_no;
                    goto out;
            } else if (!node_c && node_b) {
                    /* Unique data overwriting an existing block. */
                    err = -1;
            } else {
                    /* Duplicate data: reuse the block that already holds it. */
                    node_c->ref_count++;
                    blk_no = node_c->block_number;
                    err = blk_no;
                    goto out;
            }
    out:
            spin_unlock(&checksum_htable->lock);
            return err;
    }

    char *cbfs_compute_checksum(char *data)
    {
            char *err = NULL;
            char *checksum;

            checksum = kmalloc(CHECKSUM_SIZE, GFP_KERNEL);
            hmac(data, DATA_SIZE, KEY, KEY_SIZE, (void *)checksum);
            err = checksum;
            return err;
    }

    3) Add hash entry

    This function takes the checksum value and the block number, adds the entry to the hash tables, and initializes the reference count to 1. It is invoked when a new block is allocated and is not found to be a duplicate. The code for the cbfs_add_hash_entry() function is given below:

    void cbfs_add_hash_entry(struct cbfs_hash_table *checksum_htable,
                             struct cbfs_hash_table *block_htable,
                             char *checksum, long blk_no)
    {
            struct cbfs_hash_node *newnode;
            int checksum_bucket;
            int block_bucket;
            long *blk = &blk_no;

            newnode = kmem_cache_alloc(cacheptr, GFP_KERNEL);
            newnode->checksum = checksum;
            newnode->block_number = blk_no;
            newnode->ref_count = 1;
            INIT_LIST_HEAD(&newnode->block_ptr);
            INIT_LIST_HEAD(&newnode->checksum_ptr);

            checksum_bucket = checksum_htable->hash((void *)checksum);
            block_bucket = block_htable->hash((void *)blk);

            list_add_tail(&newnode->checksum_ptr,
                          &checksum_htable->buckets[checksum_bucket].node_list);
            checksum_htable->len++;
            list_add_tail(&newnode->block_ptr,
                          &block_htable->buckets[block_bucket].node_list);
            block_htable->len++;
    }

    4) Remove hash entry


    This function is for removing the hash entry if its reference count is 1 and for

    decrementing the reference count otherwise. This is called inside the cbfs_free_block()

    function. Therefore, for every block free in the file system, this function is invoked and

    thus the hash table is kept updated at every instant. The function

    cbfs_remove_hash_entry() is given below :

    int cbfs_remove_hash_entry(struct cbfs_hash_table *checksum_htable,
                               struct cbfs_hash_table *block_htable,
                               long blk_no)
    {
            int err = 0;
            int checksum_bucket;
            int block_bucket;
            char *checksum = NULL;
            long *blk = &blk_no;
            struct list_head *pos1, *pos2;
            struct cbfs_hash_node *node;

            block_bucket = block_htable->hash((void *)blk);
            spin_lock(&block_htable->lock);
            list_for_each(pos1, &block_htable->buckets[block_bucket].node_list) {
                    node = list_entry(pos1, struct cbfs_hash_node, block_ptr);
                    if (node == NULL) {
                            printk("\nNo node to free !!");
                            err = 0;
                            goto out;
                    }
                    if (node->block_number == blk_no) {
                            if (node->ref_count == 1) {
                                    /* Last reference: drop the block entry and go on
                                     * to remove the checksum entry as well. */
                                    checksum = node->checksum;
                                    list_del(pos1);
                                    block_htable->len--;
                                    goto cs;
                            } else {
                                    /* Block is still shared: just drop one reference. */
                                    node->ref_count--;
                                    err = -1;
                                    goto out;
                            }
                    }
            }
            goto out;
    cs:
            checksum_bucket = checksum_htable->hash((void *)checksum);
            list_for_each(pos2, &checksum_htable->buckets[checksum_bucket].node_list) {
                    node = list_entry(pos2, struct cbfs_hash_node, checksum_ptr);
                    if (memcmp((void *)node->checksum, (void *)checksum,
                               CHECKSUM_SIZE) == 0) {
                            list_del(pos2);
                            kmem_cache_free(cacheptr, node);
                            checksum_htable->len--;
                            err = 0;
                            goto out;
                    }
            }
    out:
            spin_unlock(&block_htable->lock);
            return err;
    }

    5) Freeing the hash table:

    This removes the entire hash table from memory. It is done when the CBFS module is removed from the kernel, and is invoked from cbfs_module_exit(), which is called at module removal time. Given below is the code for the cbfs_hash_free() function:


    void cbfs_hash_free(struct cbfs_hash_table *checksum_htable,
                        struct cbfs_hash_table *block_htable)
    {
            int i;
            struct cbfs_hash_node *node;
            struct list_head *pos1, *pos2, *n;

            /* The loop headers below are reconstructed from the truncated source;
             * the safe iterator is assumed because entries are deleted inside
             * the loops. */
            for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++) {
                    list_for_each_safe(pos1, n, &checksum_htable->buckets[i].node_list) {
                            node = list_entry(pos1, struct cbfs_hash_node, checksum_ptr);
                            list_del(pos1);
                    }
            }
            for (i = 0; i < NUM_BLOCK_BUCKETS; i++) {
                    list_for_each_safe(pos2, n, &block_htable->buckets[i].node_list) {
                            node = list_entry(pos2, struct cbfs_hash_node, block_ptr);
                            list_del(pos2);
                            kmem_cache_free(cacheptr, node);
                    }
            }

            free_pages((unsigned long)checksum_htable->buckets,
                       get_order(NUM_CHECKSUM_BUCKETS *
                                 sizeof(struct cbfs_hash_bucket)));
            free_pages((unsigned long)block_htable->buckets,
                       get_order(NUM_CHECKSUM_BUCKETS *
                                 sizeof(struct cbfs_hash_bucket)));
            printk("\nBuckets freed");
            kfree(checksum_htable);
            kfree(block_htable);
            printk("\nHash tables freed");
    }


    5.3 READ DATA FLOW :

    Figure 5.3: Read data flow (sys_read() -> VFS layer -> generic_file_read() -> inode pointer lookup -> disk block reads)

    As illustrated in the flow diagram above, the read request travels through a series of layers and functions and finally ends up consulting the inode pointers for the disk blocks to read. The inode pointers are identical for redundant blocks, so for those blocks the disk read is made only once; every subsequent read of such a block is satisfied from the cache. Performance is therefore improved by the content-based file system. Also, no extra complexity is added to the read path or to the inode structure, which makes the content-based file system a more viable option.

    5.4 WRITE DATA FLOW :

    1) Unique Data (New block):

    This is the most common write scenario in a system, in which a file is either created or appended. Here, a new block is allocated, the block address is added to the appropriate place in the inode pointers, and the content to be written is copied from the user's address space



    to the page that is mapped to that block. At this stage, the hash value of the new content in the page is calculated. This hash value is looked up in the checksum hash table. Since this is unique data, the lookup returns a miss. Therefore, the newly allocated block's number, the hash value of the newly written content, and a reference count of 1 are added to the hash table. After this, the buffer is marked dirty and the disk write is carried out. These steps are performed by invoking the cbfs_check_duplicate_block() function.

    [Write data flow: sys_write() -> VFS layer -> generic_file_write() -> generic_file_buffered_write() -> cbfs_commit_write(), where the hash table is checked for duplicate blocks; if a duplicate is found the write is not performed, otherwise a normal write is done -> exit with success.]


    2) Unique Data (Overwrite existing block):

    Once the block is found in the block hash table, the next step is to find whether the old version of the block had the same contents as the current version. To find this, the checksum is computed on the current contents and compared with the checksum stored in the hash table against that block number. If the checksums don't match, the block now contains a different piece of data and can no longer be identified by the old checksum.

    3) Duplicate Data (New block):

    Another write scenario is where a new block is to be written and the content of the to-be-written block is already present on the disk. Here, a new block is allocated, its address is added to the inode pointers, and the new content is copied to the page in memory that is mapped to the newly allocated block. At this stage, the hash value of the new content is calculated and a lookup is made in the checksum hash table. Since the content is already present on the disk, the checksum hash table finds the block holding the content and returns its block number; this block number is the mapped_block. After this, the cbfs_free_branches() function is called, which removes the old block number from the inode pointers and then frees the already allocated block. Then the mapped block number is added to the inode pointer, and the reference count of the mapped block is incremented in the hash table.

    BEFORE: [hash table entries (block number, reference count, checksum) and the inode block pointers before the duplicate write]


    AFTER: [the same hash table and inode pointers after the write: the existing duplicate block's reference count is incremented and the inode points to it]

    4) Duplicate Data (Overwrite existing block):

    This is a comparatively rare and more complex write case. Here, an already existing block is to be modified, and the modified content happens to be already present on the disk. The overwrite may or may not require allocation of a new block; this can be found by checking for the block in the block hash table. If it is an already existing block, its content cannot be changed straight away, because it might be a shared block. So the hash table is looked up and the reference count is checked. If the reference count is 1, the new checksum value is computed and the hash table is updated with the new checksum for the same block. If the reference count is greater than 1, the duplicate block check is made by calculating the new checksum and looking it up in the checksum hash table. Since the content is already present on the disk, the checksum hash table finds the block holding the content and returns its block number; this block number is the mapped_block. Then the old block number is removed from the inode pointers and the mapped block is spliced in. The reference count of the mapped block is also incremented.



    BEFORE / AFTER: [hash table entries and inode pointers before and after the duplicate overwrite: the old block's reference count is decremented and the inode pointer is switched to the existing duplicate block]

    5) Delete Data:

    This is invoked in the function cbfs_free_block() which is responsible for freeing

    the blocks of data. This will be called at the time of file truncation or a block remove.

    The cbfs_free_block() invokes the cbfs_remove_hash_entry() function. This function

    takes in the block number as argument and looks up the block hash table for the entry. It



    finds the entry and checks the reference count. If the reference count is 1, then, the entry is

    removed from the hash table and a 0 is returned to the cbfs_free_block() function. If the

    reference count is greater than 1, then it is decremented and a 1 is returned to the

    cbfs_free_block() function. The cbfs_free_block() function on receiving a 0, proceeds

    with actually freeing the block by clearing the bit in the block bitmap. Similarly when a 1 is

    received, the execution is stopped and the block is not actually freed.

    5.5 DUPLICATE ELIMINATED CACHE:

    In the block-level duplicate elimination done here, the hash table maintained is a global table, and therefore blocks belonging to all files in the disk partition are present in the hash table. This means two processes accessing two different files that share one or more blocks have only a single copy of each shared block in the page cache. When the first file accesses the shared block, it is read from disk and made available in the page cache. When another process subsequently accesses that shared block, it first checks the page cache; since a page corresponding to the shared block is already there, a disk read is saved. In this way, duplicate elimination is carried out in the page cache as well, which helps increase performance by reducing disk accesses.

    6) EVALUATION :

    6.1) Correctness :

    The correctness of the project is checked with the help of a testing program, which exhaustively exercises the transactions a file system can perform. The testing program checks the file system state in different situations, such as copying a duplicate file, copying a file with duplicate content, and modifying a file which has shared blocks. The testing program returned the expected results and the file system remained stable. Given below is the testing program and its output:

    tester.c :

    /* Headers reconstructed: the transcript lost the file names. The program
     * uses open(), write(), close(), malloc(), printf(), perror(), exit(),
     * and sync(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #define NUM_DATA 256

    #define NUM_FILES 20

    #define STAGE_SIZE 4096

    #define FILE_SIZE 256

    int main(int argc, char **argv) {

    char **data;

    int i=0, j=0,err=0,fp;

    char filename[32];

    data = (char **)malloc(sizeof(char*) * NUM_DATA);

    for (i=0; i


    printf("\nERROR: Write failed");

    exit(1);

    }

    }

    close(fp);

    }

    fp = open("nondup.dat", O_CREAT|O_WRONLY, 0644);

    if (fp < 0) {

    perror("tester");

    printf("\nERROR: Cannot open file");

    exit(1);

    }

    for (i=0; i


    printf("\nERROR: Cannot open file");

    exit(1);

    }

    for (j=0; j


    printf("\nERROR: Write failed");

    exit(1);

    }

    close(fp);

    }

    sync();

    printf("\nPerforming non-duplicate overwrites ... (dup-

    >nondup)");

    fflush(stdout);

    for (i=0; i


    printf("\nPerforming duplicate overwrites ... (nondup-

    >dup)");

    fflush(stdout);

    fp = open("nondup.dat", O_CREAT|O_RDWR);

    if (fp < 0) {

    perror("tester");

    printf("\nERROR: Cannot open file");

    exit(1);

    }

    for (i=0; i


    6.2) Performance :

    We conducted all tests on a virtual machine running on a 2.8 GHz Celeron processor with 1 GB of RAM and an 80 GB Western Digital Caviar IDE disk. The operating system was Fedora Core 4 running a 2.6.15 kernel.

    We tested the Content based file system using the postmark benchmark.

    Postmark As an I/O-intensive benchmark that tests the worst-case I/O performance of

    the file system, we ran Postmark [12]. Postmark stresses the file system by performing a series

    of operations such as directory lookups, creations, and deletions on small files. Postmark has

    three phases:

    The file creation phase which creates a working set offiles,

    The transactions phase, which involves creations, deletions, appends, and reads, and

    The file deletion phase removes all files in the working set.

    We configured Postmark to create 20,000 files (between 512 bytes and 10 KB) and
    perform 200,000 transactions. Figure 6.1 shows the results of Postmark on Ext2 and CBFS.
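    This workload corresponds roughly to a Postmark configuration script along the
    following lines (a sketch only; the mount-point path is illustrative):

    set location /mnt/cbfs
    set number 20000
    set size 512 10240
    set transactions 200000
    run
    quit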

    Postmark results :

    [Figure 6.1: Postmark total time and transaction time (in seconds) for Ext2 and CBFS]


    CODE SNIPPETS :

    1) __CBFS_COMMIT_WRITE() :

    /*
     * Commit a write: before the buffers are marked dirty, the page contents are
     * hashed and the logical block is either left on its freshly allocated block
     * (unique data) or remapped onto an existing block with the same contents
     * (duplicate data).
     */
    static int
    __cbfs_commit_write(struct inode *inode, struct page *page,
                        unsigned from, unsigned to)
    {
            unsigned block_start, block_end;
            int part = 0;
            unsigned blocksize;
            struct buffer_head *bh, *head;
            sector_t iblock, block;
            unsigned bbits;
            long int allocated_block, mapped_block;
            void *paddr;
            int err = -EIO;
            unsigned long goal;
            int offsets[4];
            Indirect chain[4];
            Indirect *partial;
            int boundary = 0;
            int depth = 0;


            blocksize = 1 << inode->i_blkbits;
            bbits = inode->i_blkbits;
            iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
            if (unlikely(page->index != iblock)) {
                    /* sanity check: with 4 KB blocks each page maps to one block */
                    printk(KERN_WARNING "\nCalculation wrong!!");
                    BUG();
            }
            paddr = page_address(page);

            for (bh = head = page_buffers(page), block_start = 0;
                 bh != head || !block_start;
                 iblock++, block_start = block_end, bh = bh->b_this_page) {
                    block_end = block_start + blocksize;
                    if (block_end <= from || block_start >= to) {
                            if (!buffer_uptodate(bh))
                                    part = 1;
                    } else if (S_ISREG(inode->i_mode)) {
                            if (buffer_new(bh))
                                    clear_buffer_new(bh);
    recheck:
                            /* Look up the block currently allocated to this logical block. */
                            depth = cbfs_block_to_path(inode, iblock, offsets, &boundary);
                            cbfs_get_branch(inode, depth, offsets, chain, &err);
                            allocated_block = (long) chain[depth-1].key;

                            /* Hash the page contents and check for a duplicate. */
                            mapped_block = cbfs_check_duplicate_block(c_htable, b_htable,
                                            (char *) paddr, allocated_block);
                            if (mapped_block == allocated_block) {
                                    /* Unique data: keep the allocated block. */
                                    goto out;
                            } else if (mapped_block < 0) {
                                    /* Need a fresh allocation: free the branch and retry. */
                                    cbfs_free_branches(inode, chain[depth-1].p,
                                                       chain[depth-1].p + 1, 0);
                                    cbfs_get_block(inode, iblock, bh, 1);
                                    goto recheck;
                            } else {
                                    /* Duplicate found: release our block and remap the
                                     * logical block onto the existing copy. */
                                    goal = mapped_block;
                                    cbfs_free_branches(inode, chain[depth-1].p,
                                                       chain[depth-1].p + 1, 0);
                                    cbfs_get_block_direct(inode, iblock, bh, 1, goal);
                            }
                    } else {
    out:
                            set_buffer_uptodate(bh);
                            mark_buffer_dirty(bh);
                    }
            }

            if (bh->b_blocknr != pno_to_blockno(inode, page->index)) {
                    printk("\nBUG: page mapping is screwed up! %ld, and %ld",
                           (long int)bh->b_blocknr,
                           (long int)pno_to_blockno(inode, page->index));
            }

            /*
             * If this is a partial write which happened to make all buffers
             * uptodate then we can optimize away a bogus readpage() for
             * the next read(). Here we 'discover' whether the page went
             * uptodate as a result of this (potentially partial) write.
             */
            if (!part)
                    SetPageUptodate(page);
            return 0;
    }

    2) CBFS_GET_BLOCK_DIRECT() :

    int cbfs_get_block_direct(struct inode *inode, sector_t iblock,
                              struct buffer_head *bh, int create, unsigned long goal)
    {
            int err = -EIO;
            int offsets[4];
            Indirect chain[4];
            Indirect *partial;
            int boundary = 0;
            int depth = 0;
            int left;

            depth = cbfs_block_to_path(inode, iblock, offsets, &boundary);
            if (depth == 0)
                    goto out;

    reread:
            partial = cbfs_get_branch(inode, depth, offsets, chain, &err);

            /* Simplest case - block found, no allocation needed */
            if (!partial) {
    got_it:
                    map_bh(bh, inode->i_sb, le32_to_cpu(chain[depth-1].key));
                    if (boundary)
                            set_buffer_boundary(bh);
                    /* Clean up and exit */
                    partial = chain + depth - 1;    /* the whole chain */
                    goto cleanup;
            }

            /* Next simple case - plain lookup or failed read of indirect block */
            if (err == -EIO) {
    cleanup:
                    while (partial > chain) {
                            brelse(partial->bh);
                            partial--;
                    }
    out:
                    return 0;
            }

            /*
             * Indirect block might be removed by truncate while we were
             * reading it. Handling of that case (forget what we've got and
             * reread) is taken out of the main path.
             */
            if (err == -EAGAIN)
                    goto changed;

            left = (chain + depth) - partial;
            err = cbfs_alloc_branch_direct(inode, left, goal,
                                           offsets + (partial - chain), partial);
            if (err)
                    goto cleanup;

            if (cbfs_use_xip(inode->i_sb)) {
                    /*
                     * we need to clear the block
                     */
                    err = cbfs_clear_xip_target(inode,
                                                le32_to_cpu(chain[depth-1].key));
                    if (err)
                            goto cleanup;
            }

            if (cbfs_splice_branch_direct(inode, iblock, chain, partial, left) < 0)
                    goto changed;

            set_buffer_new(bh);
            goto got_it;

    changed:
            while (partial > chain) {
                    brelse(partial->bh);
                    partial--;
            }
            goto reread;
    }

    3) CBFS_ALLOC_BRANCH_DIRECT() :

    static int cbfs_alloc_branch_direct(struct inode *inode,
                                        int num, unsigned long goal,
                                        int *offsets, Indirect *branch)
    {
            int blocksize = inode->i_sb->s_blocksize;
            int n = 0;
            int err = 0;
            int i;
            int parent;

            parent = goal;
            branch[0].key = cpu_to_le32(parent);
            if (parent) for (n = 1; n < num; n++) {
                    struct buffer_head *bh;
                    /* Allocate the next block */
                    int nr = cbfs_alloc_block(inode, parent, &err);
                    if (!nr)
                            break;
                    branch[n].key = cpu_to_le32(nr);
                    /*
                     * Get buffer_head for parent block, zero it out and set
                     * the pointer to new one, then send parent to disk.
                     */
                    bh = sb_getblk(inode->i_sb, parent);
                    if (!bh) {
                            err = -EIO;
                            break;
                    }
                    lock_buffer(bh);
                    branch[n].bh = bh;
                    branch[n].p = (__le32 *) bh->b_data + offsets[n];
                    *branch[n].p = branch[n].key;
                    set_buffer_uptodate(bh);
                    unlock_buffer(bh);
                    mark_buffer_dirty_inode(bh, inode);
                    /* We used to sync bh here if IS_SYNC(inode).
                     * But we now rely upon generic_osync_inode()
                     * and b_inode_buffers. But not for directories.
                     */
                    if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
                            sync_dirty_buffer(bh);
                    parent = nr;
            }
            if (n == num)
                    return 0;

            /* Allocation failed, free what we already allocated */
            for (i = 1; i < n; i++)
                    bforget(branch[i].bh);
            return err;
    }

    4) CBFS_SPLICE_BRANCH_DIRECT() :

    static inline int cbfs_splice_branch_direct(struct inode *inode,
                                                long block,
                                                Indirect chain[4],
                                                Indirect *where,
                                                int num)
    {
            struct cbfs_inode_info *ei = CBFS_I(inode);
            int i;

            /* Verify that place we are splicing to is still there and vacant */
            write_lock(&ei->i_meta_lock);
            // if (!verify_chain(chain, where-1) || *where->p)
            //         goto changed;

            /* That's it */
            *where->p = where->key;
            ei->i_next_alloc_goal = le32_to_cpu(where[0].key);
            write_unlock(&ei->i_meta_lock);

            /* We are done with atomic stuff, now do the rest of housekeeping */
            inode->i_ctime = CURRENT_TIME_SEC;

            /* had we spliced it onto indirect block? */
            if (where->bh)
                    mark_buffer_dirty_inode(where->bh, inode);
            mark_inode_dirty(inode);
            return 0;

    changed:
            write_unlock(&ei->i_meta_lock);
            for (i = 1; i < num; i++)
                    bforget(where[i].bh);
            for (i = 0; i < num; i++)
                    cbfs_free_blocks(inode, le32_to_cpu(where[i].key), 1);
            return -EAGAIN;
    }

    5) CBFS_FREE_BLOCKS() :

    void cbfs_free_blocks (struct inode * inode, unsigned long block,
                           unsigned long count)
    {
            int i;

            for (i = 0; i < count; i++) {
                    cbfs_free_block(inode, block, 1);
                    block++;
            }
    }

    6) CBFS_FREE_BLOCK() :

    void cbfs_free_block (struct inode * inode, unsigned long block,
                          unsigned long count)
    {
            struct buffer_head *bitmap_bh = NULL;
            struct buffer_head * bh2;
            unsigned long block_group;
            unsigned long bit;
            unsigned long i;
            unsigned long overflow;
            struct super_block * sb = inode->i_sb;
            struct cbfs_sb_info * sbi = CBFS_SB(sb);
            struct cbfs_group_desc * desc;
            struct cbfs_super_block * es = sbi->s_es;
            struct cbfs_hash_node *node;
            unsigned freed = 0, group_freed = 0;
            int err = 0;

            /* Remove the block's entry from the hash table before freeing it. */
            err = cbfs_remove_hash_entry(c_htable, b_htable, (long) block);
            if (err < 0) {
                    err = 1;
                    goto error_return;
            }

            if (block < le32_to_cpu(es->s_first_data_block) ||
                block + count < block ||
                block + count > le32_to_cpu(es->s_blocks_count)) {
                    cbfs_error (sb, "cbfs_free_blocks",
                                "Freeing blocks not in datazone - "
                                "block = %lu, count = %lu", block, count);
                    goto error_return;
            }

            cbfs_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);

    do_more:
            overflow = 0;
            block_group = (block - le32_to_cpu(es->s_first_data_block)) /
                            CBFS_BLOCKS_PER_GROUP(sb);
            bit = (block - le32_to_cpu(es->s_first_data_block)) %
                            CBFS_BLOCKS_PER_GROUP(sb);
            /*
             * Check to see if we are freeing blocks across a group
             * boundary.
             */
            if (bit + count > CBFS_BLOCKS_PER_GROUP(sb)) {
                    overflow = bit + count - CBFS_BLOCKS_PER_GROUP(sb);
                    count -= overflow;
            }
            brelse(bitmap_bh);
            bitmap_bh = read_block_bitmap(sb, block_group);
            if (!bitmap_bh)
                    goto error_return;

            desc = cbfs_get_group_desc (sb, block_group, &bh2);
            if (!desc)
                    goto error_return;

            if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
                in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
                in_range (block, le32_to_cpu(desc->bg_inode_table),
                          sbi->s_itb_per_group) ||
                in_range (block + count - 1, le32_to_cpu(desc->bg_inode_table),
                          sbi->s_itb_per_group))
                    cbfs_error (sb, "cbfs_free_blocks",
                                "Freeing blocks in system zones - "
                                "Block = %lu, count = %lu",
                                block, count);

            for (i = 0, group_freed = 0; i < count; i++) {
                    if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
                                               bit + i, bitmap_bh->b_data)) {
                            cbfs_error(sb, __FUNCTION__,
                                       "bit already cleared for block %lu", block + i);
                    } else {
                            group_freed++;
                    }
                    block++;
            }

            mark_buffer_dirty(bitmap_bh);
            if (sb->s_flags & MS_SYNCHRONOUS)
                    sync_dirty_buffer(bitmap_bh);

            group_release_blocks(sb, block_group, desc, bh2, group_freed);
            freed += group_freed;

            if (overflow) {
                    block += count;
                    count = overflow;
                    printk("\nOverflow");
                    goto do_more;
            }

    error_return:
            brelse(bitmap_bh);
            release_blocks(sb, freed);
            DQUOT_FREE_BLOCK(inode, freed);
    }

    7) INIT_CBFS_FS() :

    static int __init init_cbfs_fs(void)
    {
            int err = init_cbfs_xattr();

            /* Set up the global checksum and block hash tables and the
             * slab cache used for hash nodes. */
            c_htable = cbfs_init_checksum_hash_table(myhash_cs);
            b_htable = cbfs_init_block_hash_table(myhash_b);
            cacheptr = kmem_cache_create("nodespace",
                            sizeof(struct cbfs_hash_node), 0, 0, NULL, NULL);
            if (err)
                    return err;
            err = init_inodecache();
            if (err)
                    goto out1;
            err = register_filesystem(&cbfs_fs_type);
            if (err)
                    goto out;
            return 0;
    out:
            destroy_inodecache();
    out1:
            exit_cbfs_xattr();
            return err;
    }

    8) EXIT_CBFS_FS() :

    static void __exit exit_cbfs_fs(void)
    {
            int result;

            cbfs_hash_free(c_htable, b_htable);
            unregister_filesystem(&cbfs_fs_type);
            destroy_inodecache();
            kmem_cache_destroy(cacheptr);
            exit_cbfs_xattr();
    }

    SCREENSHOTS :
