Linux File Management - MATALA, IVAN G.
7/27/2019 Linux File Management - MATALA, IVAN G.
File Management
A Case Study
Submitted to the Faculty of the
Computer Engineering Department
Engr. Joshua Cuesta
Mapua Institute of Technology
In Partial Fulfillment of the Requirements
For the Degree of BS Computer Engineering
Matala, Ivan G.
September 9, 2013
The Linux Virtual File System
The Virtual Filesystem (sometimes called the Virtual File Switch or more commonly simply the
VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces
provided to user-space programs. All filesystems rely on the VFS to enable them not only to
coexist, but also to interoperate. This enables programs to use standard Unix system calls to read
and write to different filesystems, even on different media, as shown in the figure below.
(Love, 2010)
The VFS in action: Using the cp(1) utility to move data from a hard disk mounted as ext3 to a
removable disk mounted as ext2. Two different filesystems, two different media, one VFS.
Common Filesystem Interface
The VFS is the glue that enables system calls such as open(), read(), and write() to work regardless
of the filesystem or underlying physical medium. These days, that might not sound novel (we have
long taken such a feature for granted), but it is a non-trivial feat for such generic system calls to
work across many diverse filesystems and varying media. Moreover, because the system calls work
between these different filesystems and media, we can use standard system calls to copy or move
files from one filesystem to another. In older operating systems, such as DOS, this would never
have worked; any access to a nonnative filesystem required special tools. It is only because modern
operating systems, such as Linux, abstract access to the filesystems via a virtual interface that such
interoperation and generic access are possible.
New filesystems and new varieties of storage media can find their way into Linux, and programs
need not be rewritten or even recompiled. In this chapter, we will discuss the VFS, which provides
the abstraction allowing myriad filesystems to behave as one. In the next chapter, we will discuss
the block I/O layer, which supports various storage devices, from CDs and Blu-ray discs to hard
drives to CompactFlash. Together, the VFS and the block I/O layer provide the abstractions, interfaces, and
glue that allow user-space programs to issue generic system calls to access files via a uniform
naming policy on any filesystem, which itself exists on any storage medium. (Love, 2010)
Filesystem Abstraction Layer
Such a generic interface for any type of filesystem is feasible only because the kernel implements
an abstraction layer around its low-level filesystem interface. This abstraction layer enables Linux
to support different filesystems, even if they differ in supported features or behavior. This is
possible because the VFS provides a common file model that can represent any filesystem's
general feature set and behavior. Of course, it is biased toward Unix-style filesystems. (You will see
what constitutes a Unix-style filesystem later in this chapter.) Regardless, wildly differing
filesystem types are still supportable in Linux, from DOS's FAT to Windows's NTFS to many
Unix-style and Linux-specific filesystems.
The abstraction layer works by defining the basic conceptual interfaces and data structures that all
filesystems support. The filesystems mold their view of concepts such as "this is how I open files"
"and this is what a directory is to me" to match the expectations of the VFS. The actual filesystem
code hides the implementation details. To the VFS layer and the rest of the kernel, however, each
filesystem looks the same. They all support notions such as files and directories, and they all
support operations such as creating and deleting files.
The result is a general abstraction layer that enables the kernel to support many types of filesystems
easily and cleanly. The filesystems are programmed to provide the abstracted interfaces and data
structures the VFS expects; in turn, the kernel easily works with any filesystem and the exported
user-space interface seamlessly works on any filesystem.
In fact, nothing in the kernel needs to understand the underlying details of the filesystems, except
the filesystems themselves. For example, consider a simple user-space program that does
ret = write(fd, buf, len);
This system call writes the len bytes pointed to by buf into the current position in the file
represented by the file descriptor fd. This system call is first handled by a generic sys_write()
system call that determines the actual file writing method for the filesystem on which fd resides.
The generic write system call then invokes this method, which is part of the filesystem
implementation, to write the data to the media (or whatever this filesystem does on write). The
figure below shows the flow from a user-space write() call through the data arriving on the physical
media. On one side of the system call is the generic VFS interface, providing the frontend to user-
space; on the other side of the system call is the filesystem-specific backend, dealing with the
implementation details. The rest of this chapter looks at how the VFS achieves this abstraction and
provides its interfaces. (Love, 2010)
The flow of data from user-space issuing a write() call, through the VFS's generic system call,
into the filesystem's specific write method, and finally arriving at the physical media.
Linux File System
Linux retains UNIX's standard file-system model. In UNIX, a file does not have to be an object
stored on disk or fetched over a network from a remote file server. Rather, UNIX files can be
anything capable of handling the input or output of a stream of data. Device drivers can appear as
files, and inter-process communication channels or network connections also look like files to the
user.
The Linux kernel handles all these types of files by hiding the implementation details of any single
file type behind a layer of software, the virtual file system (VFS). Here, we first cover the virtual
file system and then discuss the standard Linux file system, ext3. (Silberschatz, 2013)
The Virtual File System
The Linux VFS is designed around object-oriented principles. It has two components: a set of
definitions that specify what file-system objects are allowed to look like and a layer of software to
manipulate the objects. The VFS defines four main object types:
An inode object represents an individual file.
A file object represents an open file.
A superblock object represents an entire filesystem.
A dentry object represents an individual directory entry.
For each of these four object types, the VFS defines a set of operations. Every object of one of
these types contains a pointer to a function table. The function table lists the addresses of the actual
functions that implement the defined operations for that object. For example, an abbreviated API
for some of the file object's operations includes:
int open(...): Open a file.
ssize_t read(...): Read from a file.
ssize_t write(...): Write to a file.
int mmap(...): Memory-map a file.
The complete definition of the file object is specified in struct file_operations, which is located
in the file /usr/include/linux/fs.h. An implementation of the file object (for a specific file type) is
required to implement each function specified in the definition of the file object.
VFS and Processes Interaction
Besides providing a common interface to all filesystem implementations, the VFS has another
important role related to system performance. The most recently used dentry objects are contained
in a disk cache named the dentry cache, which speeds up the translation from a file pathname to
the inode of the last pathname component.
Interaction between processes and VFS objects
Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM
some information that is normally stored on a disk, so that further accesses to that data can be
quickly satisfied without a slow access to the disk itself. Besides the dentry cache, Linux uses other
disk caches, like the buffer cache and the page cache, which will be described in forthcoming
chapters. (Bovet & Cesati, 2000)
The VFS software layer can perform an operation on one of the file-system objects by calling the
appropriate function from the object's function table, without having to know in advance exactly
what kind of object it is dealing with. The VFS does not know, or care, whether an inode represents
a networked file, a disk file, a network socket, or a directory file. The appropriate function for that
file's read() operation will always be at the same place in its function table, and the VFS software
layer will call that function without caring how the data are actually read. (Silberschatz, 2013)
The inode and file objects are the mechanisms used to access files. An inode object is a data
structure containing pointers to the disk blocks that contain the actual file contents, and a file object
represents a point of access to the data in an open file. A process cannot access an inode's contents
without first obtaining a file object pointing to the inode. The file object keeps track of where in
the file the process is currently reading or writing, to keep track of sequential file I/O. It also
remembers the permissions (for example, read or write) requested when the file was opened and
tracks the process's activity if necessary to perform adaptive read-ahead, fetching file data into
memory before the process requests the data, to improve performance.
File objects typically belong to a single process, but inode objects do not. There is one file object
for every instance of an open file, but always only a single inode object. Even when a file is no
longer in use by any process, its inode object may still be cached by the VFS to improve
performance if the file is used again in the near future. All cached file data are linked onto a list in
the file's inode object. The inode also maintains standard information about each file, such as the
owner, size, and time most recently modified.
Directory files are dealt with slightly differently from other files. The UNIX programming
interface defines a number of operations on directories, such as creating, deleting, and renaming a
file in a directory. The system calls for these directory operations do not require that the user open
the files concerned, unlike the case for reading or writing data. The VFS therefore defines these
directory operations in the inode object, rather than in the file object.
The superblock object represents a connected set of files that form a self-contained file system.
The operating-system kernel maintains a single superblock object for each disk device mounted as
a file system and for each networked file system currently connected. The main responsibility of
the superblock object is to provide access to inodes. The VFS identifies every inode by a unique
file-system/inode number pair, and it finds the inode corresponding to a particular inode number
by asking the superblock object to return the inode with that number. (Silberschatz, 2013)
Finally, a dentry object represents a directory entry, which may include the name of a directory in
the path name of a file (such as /usr) or the actual file (such as stdio.h). For example, the file
/usr/include/stdio.h contains the directory entries (1) /, (2) usr, (3) include, and (4) stdio.h. Each of
these values is represented by a separate dentry object. (Bovet & Cesati, 2000)
As an example of how dentry objects are used, consider the situation in which a process wishes to
open the file with the pathname /usr/include/stdio.h using an editor. Because Linux treats directory
names as files, translating this path requires first obtaining the inode for the root, /. The operating
system must then read through this file to obtain the inode for the file include. It must continue this
process until it obtains the inode for the file stdio.h. Because path-name translation can be a time-
consuming task, Linux maintains a cache of dentry objects, which is consulted during path-name
translation. Obtaining the inode from the dentry cache is considerably faster than having to read
the on-disk file. (Silberschatz, 2013)
The Linux ext3 File System
The standard on-disk file system used by Linux is called ext3, for historical reasons. Linux was
originally programmed with a Minix-compatible file system, to ease exchanging data with the
Minix development system, but that file system was severely restricted by 14-character file-name
limits and a maximum file-system size of 64 MB. The Minix file system was superseded by a new
file system, which was christened the extended file system (extfs). A later redesign to improve
performance and scalability and to add a few missing features led to the second extended file
system (ext2). Further development added journaling capabilities, and the system was renamed the
third extended file system (ext3). Linux kernel developers are working on augmenting ext3 with
modern file-system features such as extents. This new file system is called the fourth extended file
system (ext4). The rest of this section discusses ext3, however, since it remains the most-deployed
Linux file system. Most of the discussion applies equally to ext4. Linux's ext3 has much in
common with the BSD Fast File System (FFS). It uses a similar mechanism for locating the data
blocks belonging to a specific file, storing data-block pointers in indirect blocks throughout the file
system with up to three levels of indirection. As in FFS, directory files are stored on disk just like
normal files, although their contents are interpreted differently. Each block in a directory file
consists of a linked list of entries. In turn, each entry contains the length of the entry, the name of
a file, and the inode number of the inode to which that entry refers. (Silberschatz, 2013)
The main differences between ext3 and FFS lie in their disk-allocation policies. In FFS, the disk
is allocated to files in blocks of 8 KB. These blocks are subdivided into fragments of 1 KB for
storage of small files or partially filled blocks at the ends of files. In contrast, ext3 does not use
fragments at all but performs all its allocations in smaller units. The default block size on ext3
varies as a function of the total size of the file system. Supported block sizes are 1, 2, 4, and 8 KB.
(Bovet & Cesati, 2000)
To maintain high performance, the operating system must try to perform I/O operations in large
chunks whenever possible by clustering physically adjacent I/O requests. Clustering reduces the
per-request overhead incurred by device drivers, disks, and disk-controller hardware. A block-
sized I/O request size is too small to maintain good performance, so ext3 uses allocation policies
designed to place logically adjacent blocks of a file into physically adjacent blocks on disk, so that
it can submit an I/O request for several disk blocks as a single operation. (Silberschatz, 2013)
The ext3 allocation policy works as follows: As in FFS, an ext3 file system is partitioned into
multiple segments. In ext3, these are called block groups. FFS uses the similar concept of cylinder
groups, where each group corresponds to a single cylinder of a physical disk. (Note that modern
disk-drive technology packs sectors onto the disk at different densities, and thus with different
cylinder sizes, depending on how far the disk head is from the center of the disk. Therefore,
fixed-sized cylinder groups do not necessarily correspond to the disk's geometry.) (Bovet & Cesati,
2000)
When allocating a file, ext3 must first select the block group for that file. For data blocks, it
attempts to allocate the file to the block group to which the file's inode has been allocated. For
inode allocations, it selects the block group in which the files parent directory resides for
nondirectory files. Directory files are not kept together but rather are dispersed throughout the
available block groups. These policies are designed not only to keep related information within
the same block group but also to spread out the disk load among the disk's block groups to reduce
the fragmentation of any one area of the disk. (Silberschatz, 2013)
Within a block group, ext3 tries to keep allocations physically contiguous if possible, reducing
fragmentation if it can. It maintains a bitmap of all free blocks in a block group. When allocating
the first blocks for a new file, it starts searching for a free block from the beginning of the block
group. When extending a file, it continues the search from the block most recently allocated to the
file. The search is performed in two stages. First, ext3 searches for an entire free byte in the bitmap;
if it fails to find one, it looks for any free bit. The search for free bytes aims to allocate disk space
in chunks of at least eight blocks where possible. (Bovet & Cesati, 2000)
Once a free block has been identified, the search is extended backward until an allocated block is
encountered. When a free byte is found in the bitmap, this backward extension prevents ext3 from
leaving a hole between the most recently allocated block in the previous nonzero byte and the zero
byte found. Once the next block to be allocated has been found by either bit or byte search, ext3
extends the allocation forward for up to eight blocks and preallocates these extra blocks to the file.
This preallocation helps to reduce fragmentation during interleaved writes to separate files and
also reduces the CPU cost of disk allocation by allocating multiple blocks simultaneously. The
preallocated blocks are returned to the free-space bitmap when the file is closed. (Silberschatz,
2013)
Ext3 block-allocation policies.
The figure below illustrates the allocation policies. Each row represents a sequence of set and
unset bits in an allocation bitmap, indicating used and free blocks on disk. In the first case, if we
can find any free blocks sufficiently near the start of the search, then we allocate them no matter
how fragmented they may be. The fragmentation is partially compensated for by the fact that the
blocks are close together and can probably all be read without any disk seeks. Furthermore, allocating them all to
one file is better in the long run than allocating isolated blocks to separate files once large free
areas become scarce on disk. In the second case, we have not immediately found a free block close
by, so we search forward for an entire free byte in the bitmap. If we allocated that byte as a whole,
we would end up creating a fragmented area of free space between it and the allocation preceding
it. Thus, before allocating, we back up to make this allocation flush with the allocation preceding
it, and then we allocate forward to satisfy the default allocation of eight blocks. (Silberschatz,
2013)
System Calls Handled by the VFS
The table below illustrates the VFS system calls that refer to filesystems, regular files, directories,
and symbolic links. A few other system calls handled by the VFS, such as ioperm(), ioctl(), pipe(),
and mknod(), refer to device files and pipes and hence will be discussed in later chapters. A last
group of system calls handled by the VFS, such as socket(), connect(), bind(), and protocols(),
refer to sockets and are used to implement networking; they will not be covered in this book.
Some System Calls Handled by the VFS
We said earlier that the VFS is a layer between application programs and specific filesystems.
However, in some cases a file operation can be performed by the VFS itself, without invoking a
lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't
usually need to be touched, and hence the VFS simply releases the corresponding file object.
Similarly, when the lseek( ) system call modifies a file pointer, which is an attribute related to the
interaction between an opened file and a process, the VFS needs to modify only the corresponding
file object without accessing the file on disk and therefore does not have to invoke a specific
filesystem procedure. In some sense, the VFS could be considered as a "generic" filesystem that
relies, when necessary, on specific ones. (Bovet & Cesati, 2000)
Journaling
The ext3 file system supports a popular feature called journaling, whereby modifications to the file
system are written sequentially to a journal. A set of operations that performs a specific task is a
transaction. Once a transaction is written to the journal, it is considered to be committed.
Meanwhile, the journal entries relating to the transaction are replayed across the actual filesystem
structures. As the changes are made, a pointer is updated to indicate which actions have completed
and which are still incomplete. When an entire committed transaction is completed, it is removed
from the journal. The journal, which is actually a circular buffer, may be in a separate section of
the file system, or it may even be on a separate disk spindle. It is more efficient, but more complex,
to have it under separate read-write heads, thereby decreasing head contention and seek times.
(Silberschatz, 2013)
If the system crashes, some transactions may remain in the journal. Those transactions were never
completed to the file system even though they were committed by the operating system, so they
must be completed once the system recovers. The transactions can be executed from the pointer
until the work is complete, and the file-system structures remain consistent. The only problem
occurs when a transaction has been aborted; that is, it was not committed before the system
crashed. Any changes from those transactions that were applied to the file system must be undone,
again preserving the consistency of the file system. This recovery is all that is needed after a crash,
eliminating all problems with consistency checking. (Bovet & Cesati, 2000)
Journaling file systems may perform some operations faster than non-journaling systems, as
updates proceed much faster when they are applied to the in-memory journal rather than directly
to the on-disk data structures. The reason for this improvement is found in the performance
advantage of sequential I/O over random I/O. Costly synchronous random writes to the file system
are turned into much less costly synchronous sequential writes to the file system's journal. Those
changes, in turn, are replayed asynchronously via random writes to the appropriate structures. The
overall result is a significant gain in performance of file-system metadata-oriented operations, such
as file creation and deletion. Due to this performance improvement, ext3 can be configured to
journal only metadata and not file data. (Silberschatz, 2013)
VFS Data Structures
Each VFS object is stored in a suitable data structure, which includes both the object attributes
and a pointer to a table of object methods. The kernel may dynamically modify the methods of the
object, and hence it may install specialized behavior for the object. The following sections explain
the VFS objects and their interrelationships in detail.
The Fields of the Superblock Object
All superblock objects (one per mounted filesystem) are linked together in a circular doubly linked
list. The addresses of the first and last elements of the list are stored in the next and prev fields,
respectively, of the s_list field in the super_blocks variable. This field has the data type struct
list_head, which is also found in the s_dirty field of the superblock and in a number of other places
in the kernel; it consists simply of pointers to the next and previous elements of a list. Thus, the
s_list field of a superblock object includes the pointers to the two adjacent superblock objects in
the list. (Bovet & Cesati, 2000)
The Linux Process File System
The flexibility of the Linux VFS enables us to implement a file system that does not store data
persistently at all but rather provides an interface to some other functionality. The Linux process
file system, known as the /proc file system, is an example of a file system whose contents are not
actually stored anywhere but are computed on demand according to user file I/O requests.
(Silberschatz, 2013)
A /proc file system is not unique to Linux. SVR4 UNIX introduced a /proc file system as an
efficient interface to the kernel's process debugging support. Each subdirectory of the file system
corresponded not to a directory on any disk but rather to an active process on the current system.
A listing of the file system reveals one directory per process, with the directory name being the
ASCII decimal representation of the process's unique process identifier (PID).
Linux implements such a /proc file system but extends it greatly by adding a number of extra
directories and text files under the file system's root directory. These new entries correspond to
various statistics about the kernel and the associated loaded drivers. The /proc file system provides
a way for programs to access this information as plain text files; the standard UNIX user
environment provides powerful tools to process such files. For example, in the past, the traditional
UNIX ps command for listing the states of all running processes has been implemented as a
privileged process that reads the process state directly from the kernel's virtual memory. Under
Linux, this command is implemented as an entirely unprivileged program that simply parses and
formats the information from /proc. (Silberschatz, 2013)
The /proc file system must implement two things: a directory structure and the file contents within.
Because a UNIX file system is defined as a set of file and directory inodes identified by their inode
numbers, the /proc file system must define a unique and persistent inode number for each directory
and the associated files. Once such a mapping exists, the file system can use this inode number to
identify just what operation is required when a user tries to read from a particular file inode or to
perform a lookup in a particular directory inode. When data are read from one of these files, the
/proc file system will collect the appropriate information, format it into textual form, and place it
into the requesting process's read buffer.
The mapping from inode number to information type splits the inode number into two fields. In
Linux, a PID is 16 bits in size, but an inode number is 32 bits. The top 16 bits of the inode number
are interpreted as a PID, and the remaining bits define what type of information is being requested
about that process.
A PID of zero is not valid, so a zero PID field in the inode number is taken to mean that this inode
contains global, rather than process-specific, information. Separate global files exist in /proc to
report information such as the kernel version, free memory, performance statistics, and drivers
currently running. (Silberschatz, 2013)
Not all the inode numbers in this range are reserved. The kernel can allocate new /proc inode
mappings dynamically, maintaining a bitmap of allocated inode numbers. It also maintains a tree
data structure of registered global /proc file-system entries. Each entry contains the file's inode
number, file name, and access permissions, along with the special functions used to generate the
file's contents. Drivers can register and deregister entries in this tree at any time, and a special
section of the tree, appearing under the /proc/sys directory, is reserved for kernel variables.
Files under this tree are managed by a set of common handlers that allow both reading and writing
of these variables, so a system administrator can tune the value of kernel parameters simply by
writing out the new desired values in ASCII decimal to the appropriate file.
To allow efficient access to these variables from within applications, the /proc/sys subtree is made
available through a special system call, sysctl(), that reads and writes the same variables in binary,
rather than in text, without the overhead of the file system. sysctl() is not an extra facility; it simply
reads the /proc dynamic entry tree to identify the variables to which the application is referring.
(Silberschatz, 2013)
Disk Data Structures
Layouts of an Ext2 partition and of an Ext2 block group
The first block in any Ext2 partition is never managed by the Ext2 filesystem, since it is reserved
for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block
groups, each of which has the layout shown in the figure above. As you will notice from the figure,
some data structures must fit in exactly one block while others may require more than one block.
All the block groups in the filesystem have the same size and are stored sequentially, so the kernel
can derive the location of a block group in a disk simply from its integer index.
(Bovet & Cesati, 2000)
Block groups reduce file fragmentation, since the kernel tries to keep the data blocks belonging to
a file in the same block group if possible. Each block in a block group contains one of the following
pieces of information:
A copy of the filesystem's superblock
A copy of the group of block group descriptors
A data block bitmap
A group of inodes
An inode bitmap
A chunk of data belonging to a file; that is, a data block
If a block does not contain any meaningful information, it is said to be free. As can be seen from
the figure above, both the superblock and the group descriptors are duplicated in each block group.
Only the superblock and the group descriptors included in block group 0 are used by the kernel,
while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel
doesn't even look at them. When the /sbin/e2fsck program executes a consistency check on the
filesystem status, it refers to the superblock and the group descriptors stored in block group 0, then
copies them into all other block groups. If data corruption occurs and the main superblock or the
main group descriptors in block group 0 become invalid, the system administrator can instruct
/sbin/e2fsck to refer to the old copies of the superblock and the group descriptors stored in block
groups other than the first. Usually, the redundant copies store enough information to allow
/sbin/e2fsck to bring the Ext2 partition back to a consistent state. (Bovet & Cesati, 2000)
How many block groups are there? Well, that depends both on the partition size and on the block
size. The main constraint is that the block bitmap, which is used to identify the blocks that are used
and free inside a group, must be stored in a single block. Therefore, in each block group there can
be at most 8 × b blocks, where b is the block size in bytes. Thus, the total number of block groups
is roughly s/(8 × b), where s is the partition size in blocks.
As an example, let's consider an 8 GB Ext2 partition with a 4 KB block size. In this case, each 4
KB block bitmap describes 32 K data blocks, that is, 128 MB. Therefore, at most 64 block groups
are needed. Clearly, the smaller the block size, the larger the number of block groups. (Bovet &
Cesati, 2000)
Comparison of UNIX File Management and MS-DOS File Management
The UNIX V7 Filesystem
Even early versions of UNIX had a fairly sophisticated multiuser file system since it was derived
from MULTICS. Below we will discuss the V7 file system, the one for the PDP-11 that made
UNIX famous. We will examine a modern UNIX file system in the context of Linux in Chap. 10.
The file system is in the form of a tree starting at the root directory, with the addition of links,
forming a directed acyclic graph. File names are up to 14 characters and can contain any ASCII
characters except / (because that is the separator between components in a path) and NUL (because
that is used to pad out names shorter than 14 characters). NUL has the numerical value of 0.
A UNIX directory contains one entry for each file in that directory. Each entry is extremely
simple because UNIX uses the i-node scheme. A directory entry contains only two fields: the file
name (14 bytes) and the number of the i-node for that file (2 bytes). These parameters limit the
number of files per file system to 64K.
Like the i-nodes discussed above, UNIX i-nodes contain some attributes. The attributes include the file
size, three timestamps (creation, last access, and last modification), owner, group, protection
information, and a count of the number of directory entries that point to the i-node. The latter field
is needed due to links. Whenever a new link is made to an i-node, the count in the i-node is
increased. When a link is removed, the count is decremented. When it gets to 0, the i-node is
reclaimed and the disk blocks are put back in the free list.
Keeping track of disk blocks is done using a generalization of the figure below in order to handle
very large files. The first 10 disk addresses are stored in the i-node
A UNIX V7 directory entry
itself, so for small files, all the necessary information is right in the i-node, which is fetched from
disk to main memory when the file is opened. For somewhat larger files, one of the addresses in
the i-node is the address of a disk block called a single indirect block. This block contains
additional disk addresses. If this still is not enough, another address in the i-node, called a double
indirect block, contains the address of a block that contains a list of single indirect blocks. Each of
these single indirect blocks points to a few hundred data blocks. If even this is not enough, a triple
indirect block can also be used. The complete picture is given in the figure below.
A UNIX i-node
When a file is opened, the file system must take the file name supplied and locate its disk blocks.
Let us consider how the path name /usr/ast/mbox is looked up. We will use UNIX as an example,
but the algorithm is basically the same for all hierarchical directory systems. First the file system
locates the root directory. In UNIX its i-node is located at a fixed place on the disk. From this i-
node, it locates the root directory, which can be anywhere on the disk, but say block 1.
Then it reads the root directory and looks up the first component of the path, usr, in the root
directory to find the i-node number of the file /usr. Locating an i-node from its number is
straightforward, since each one has a fixed location on the disk. From this i-node, the system
locates the directory for /usr and looks up the next component, ast, in it. When it has found the
entry for ast, it has the i-node for the directory /usr/ast. From this i-node it can find the directory
itself and look up mbox. The i-node for this file is then read into memory and kept there until the
file is closed. The lookup process is illustrated in figure below.
The steps in looking up /usr/ast/mbox
Relative path names are looked up the same way as absolute ones, only starting from the working
directory instead of starting from the root directory. Every directory has entries for . and .. which
are put there when the directory is created. The entry . has the i-node number for the current
directory, and the entry for .. has the i-node number for the parent directory. Thus, a procedure
looking up ../dick/prog.c simply looks up .. in the working directory, finds the i-node number for
the parent directory, and searches that directory for dick. No special mechanism is needed to handle
these names. As far as the directory system is concerned, they are just ordinary ASCII strings, just
the same as any other names. The only bit of trickery here is that .. in the root directory points to
itself. (Tanenbaum, 2008)
Unix Filesystem
Historically, Unix has provided four basic filesystem-related abstractions: files, directory entries,
inodes, and mount points.
A filesystem is a hierarchical storage of data adhering to a specific structure. Filesystems contain
files, directories, and associated control information. Typical operations performed on filesystems
are creation, deletion, and mounting. In Unix, filesystems are mounted at a specific mount point
in a global hierarchy known as a namespace. This enables all mounted filesystems to appear as
entries in a single tree. Contrast this single, unified tree with the behavior of DOS and Windows,
which break the file namespace up into drive letters, such as C:. This breaks the namespace up
among device and partition boundaries, leaking hardware details into the filesystem abstraction.
As this delineation may be arbitrary and even confusing to the user, it is inferior to Linux's unified
namespace.
A file is an ordered string of bytes. The first byte marks the beginning of the file, and the last byte
marks the end of the file. Each file is assigned a human-readable name for identification by both
the system and the user. Typical file operations are read, write, create, and delete. The Unix
concept of the file is in stark contrast to record-oriented filesystems, such as OpenVMS's Files-11.
Record-oriented filesystems provide a richer, more structured representation of files than
Unix's simple byte-stream abstraction, at the cost of simplicity and flexibility.
Files are organized in directories. A directory is analogous to a folder and usually contains related
files. Directories can also contain other directories, called subdirectories. In this fashion,
directories may be nested to form paths. Each component of a path is called a directory entry. A
path example is /home/wolfman/butter: the root directory /, the directories home and wolfman,
and the file butter are all directory entries, called dentries. In Unix, directories are actually normal
files that simply list the files contained therein. Because a directory is a file to the VFS, the same
operations performed on files can be performed on directories.
Unix systems separate the concept of a file from any associated information about it, such as access
permissions, size, owner, creation time, and so on. This information is sometimes called file
metadata (that is, data about the files data) and is stored in a separate data structure from the file,
called the inode. This name is short for index node, although these days the term inode is much
more ubiquitous.
All this information is tied together with the filesystems own control information, which is stored
in the superblock. The superblock is a data structure containing information about the filesystem
as a whole. Sometimes the collective data is referred to as filesystem metadata. Filesystem
metadata includes information about both the individual files and the filesystem as a whole.
Traditionally, Unix filesystems implement these notions as part of their physical on-disk
layout. For example, file information is stored as an inode in a separate block on the disk;
directories are files; control information is stored centrally in a superblock, and so on. The Unix
file concepts are physically mapped onto the storage medium. The Linux VFS is designed to work
with filesystems that understand and implement such concepts. Non-Unix filesystems, such as
FAT or NTFS, still work in Linux, but their filesystem code must provide the appearance of these
concepts. For example, even if a filesystem does not support distinct inodes, it must assemble the
inode data structure in memory as if it did. Or if a filesystem treats directories as special objects,
to the VFS it must represent directories as mere files. Often, this involves some special
processing done on-the-fly by the non-Unix filesystems to cope with the Unix paradigm and the
requirements of the VFS. Such filesystems still work, however, and the overhead is not
unreasonable. (Love, 2010)
UNIX File Locking
When a file can be accessed by more than one process, a synchronization problem occurs: what
happens if two processes try to write in the same file location? Or again, what happens if a process
reads from a file location while another process is writing into it?
In traditional Unix systems, concurrent accesses to the same file location produce unpredictable
results. However, the systems provide a mechanism that allows the processes to lock a file region
so that concurrent accesses may be easily avoided. (Bovet & Cesati, 2000)
The POSIX standard requires a file-locking mechanism based on the fcntl() system call. It is
possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including
data appended in the future). Since a process can choose to lock just a part of a file, it can also hold
multiple locks on different parts of the file.
This kind of lock does not keep out another process that is ignorant of locking. Like a critical
region in code, the lock is considered "advisory" because it doesn't work unless other processes
cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks
are known as advisory locks. (Bovet & Cesati, 2000)
Traditional BSD variants implement advisory locking through the flock() system call. This call
does not allow a process to lock a file region, just the whole file. Traditional System V variants
provide the lockf() system call, which is just an interface to fcntl(). More importantly, System V
Release 3 introduced mandatory locking: the kernel checks that every invocation of the open(),
read(), and write() system calls does not violate a mandatory
lock on the file being accessed. Therefore, mandatory locks are enforced even between non-
cooperative processes. A file is marked as a candidate for mandatory locking by setting its set-
group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no
sense when the group-execute bit is off, the kernel interprets that combination as a hint to use
mandatory locks instead of advisory ones.
Whether processes use advisory or mandatory locks, they can make use of both shared read locks
and exclusive write locks. Any number of processes may have read locks on some file region, but
only one process can have a write lock on it at the same time. Moreover, it is not possible to get a
write lock when another process owns a read lock for the same file region and vice versa (see table
below). (Bovet & Cesati, 2000)
The MS-DOS File System
The MS-DOS file system is the one the first IBM PCs came with. It was the main file system up
through Windows 98 and Windows ME. It is still supported on Windows 2000, Windows XP, and
Windows Vista, although it is no longer standard on new PCs now except for floppy disks.
However, it and an extension of it (FAT-32) have become widely used for many embedded
systems. Most digital cameras use it. Many MP3 players use it exclusively. The popular Apple
iPod uses it as the default file system, although knowledgeable hackers can reformat the iPod and
install a different file system. Thus the number of electronic devices using the MS-DOS file system
is vastly larger now than at any time in the past, and certainly much larger than the number using
the more modern NTFS file system. For that reason alone, it is worth looking at in some detail.
To read a file, an MS-DOS program must first make an open system call to get a handle for it. The
open system call specifies a path, which may be either absolute or relative to the current working
directory. The path is looked up component by component until the final directory is located and
read into memory. It is then searched for the file to be opened.
Although MS-DOS directories are variable sized, they use a fixed-size 32-byte directory entry.
The format of an MS-DOS directory entry is shown in the figure below. It contains the file name,
attributes, creation date and time, starting block, and exact file size. File names shorter than 8 + 3
characters are left justified and padded with spaces on the right, in each field separately. The
Attributes field is new and contains bits to indicate that a file is read-only, needs to be archived, is
hidden, or is a system file. Read-only files cannot be written. This is to protect them from
accidental damage. The archived bit has no actual operating system function (i.e., MS-DOS does
not examine or set it). The intention is to allow user-level archive programs to clear it upon
archiving a file and to have other programs set it when modifying a file. In this way, a backup
program can just examine this attribute bit on every file to see which files to back up. The hidden
bit can be set to prevent a file from appearing in directory listings. Its main use is to avoid confusing
novice users with files they might not understand. Finally, the system bit also hides files. In
addition, system files cannot accidentally be deleted using the del command. The main components
of MS-DOS have this bit set.
The MS-DOS directory entry
The directory entry also contains the date and time the file was created or last modified. The time
is accurate only to 2 sec because it is stored in a 2-byte field, which can store only 65,536 unique
values (a day contains 86,400 seconds). The time field is subdivided into seconds (5 bits), minutes
(6 bits), and hours (5 bits).
The date counts in days using three subfields: day (5 bits), month (4 bits), and year-1980 (7 bits).
With a 7-bit number for the year and time beginning in 1980, the highest expressible year is 2107.
Thus MS-DOS has a built-in Y2108 problem. To avoid catastrophe, MS-DOS users should begin
with Y2108 compliance as early as possible. If MS-DOS had used the combined date and time
fields as a 32-bit seconds counter, it could have represented every second exactly and delayed the
catastrophe until 2116.
MS-DOS stores the file size as a 32-bit number, so in theory files can be as large as 4 GB.
However, other limits (described below) restrict the maximum file size to 2 GB or less. A
surprisingly large part of the entry (10 bytes) is unused.
MS-DOS keeps track of file blocks via a file allocation table in main memory. The directory entry
contains the number of the first file block. This number is used as an index into a 64K entry FAT
in main memory. By following the chain, all the blocks can be found.
The FAT file system comes in three versions: FAT-12, FAT-16, and FAT-32, depending on how
many bits a disk address contains. Actually, FAT-32 is something of a misnomer, since only the
low-order 28 bits of the disk addresses are used. It should have been called FAT-28, but powers
of two sound so much neater.
For all FATs, the disk block can be set to some multiple of 512 bytes (possibly different for each
partition), with the set of allowed block sizes (called cluster sizes by Microsoft) being different for
each variant. The first version of MS-DOS used FAT-12 with 512-byte blocks, giving a maximum
partition size of 2^12 × 512 bytes (actually only 4086 × 512 bytes because 10 of the disk addresses
were used as special markers, such as end of file, bad block, etc.). With these parameters, the
maximum disk partition size was about 2 MB and the size of the FAT table in memory was 4096
entries of 2 bytes each. Using a 12-bit table entry would have been too slow.
This system worked well for floppy disks, but when hard disks came out, it became a problem.
Microsoft solved the problem by allowing additional block sizes of 1 KB, 2 KB, and 4 KB. This
change preserved the structure and size of the FAT-12 table, but allowed disk partitions of up to
16 MB.
Since MS-DOS supported four disk partitions per disk drive, the new FAT-12 file system worked
up to 64-MB disks. Beyond that, something had to give. What happened was the introduction of
FAT-16, with 16-bit disk pointers. Additionally, block sizes of 8 KB, 16 KB, and 32 KB were
permitted. (32,768 is the largest power of two that can be represented in 16 bits.) The FAT-16
table now occupied 128 KB of main memory all the time, but with the larger memories by then
available, it was widely used and rapidly replaced the FAT-12 file system. The largest disk
partition that can be supported by FAT-16 is 2 GB (64K entries of 32 KB each) and the largest
disk, 8 GB, namely four partitions of 2 GB each.
For business letters, this limit is not a problem, but for storing digital video using the DV standard,
a 2-GB file holds just over 9 minutes of video. As a consequence of the fact that a PC disk can
support only four partitions, the largest video that can be stored on a disk is about 38 minutes, no
matter how large the disk is. This limit also means that the largest video that can be edited on line
is less than 19 minutes, since both input and output files are needed.
Starting with the second release of Windows 95, the FAT-32 file system, with its 28-bit disk
addresses, was introduced and the version of MS- DOS underlying Windows 95 was adapted to
support FAT-32. In this system, partitions could theoretically be 2^28 × 2^15 bytes, but they are
actually limited to 2 TB (2048 GB) because internally the system keeps track of partition sizes in
512-byte sectors using a 32-bit number, and 2^9 × 2^32 is 2 TB. The maximum partition size for
various block sizes and all three FAT types is shown in the figure below.
Maximum partition size for different block sizes. The empty boxes represent forbidden
combinations.
In addition to supporting larger disks, the FAT-32 file system has two other advantages over FAT-
16. First, an 8-GB disk using FAT-32 can be a single partition. Using FAT-16 it has to be four
partitions, which appears to the Windows user as the C:, D:, E:, and F: logical disk drives. It is up
to the user to decide which file to place on which drive and keep track of what is where.
The other advantage of FAT-32 over FAT-16 is that for a given size disk partition, a smaller block
size can be used. For example, for a 2-GB disk partition, FAT-16 must use 32-KB blocks; otherwise
with only 64K available disk addresses, it cannot cover the whole partition. In contrast, FAT-32
can use, for example, 4-KB blocks for a 2-GB disk partition. The advantage of the smaller block
size is that most files are much shorter than 32 KB. If the block size is 32 KB, a file of 10 bytes
ties up 32 KB of disk space. If the average file is, say, 8 KB, then with a 32-KB block, 3/4 of the
disk will be wasted, not a terribly efficient way to use the disk. With an 8-KB file and a 4-KB
block, there is no disk wastage, but the price paid is more RAM eaten up by the FAT. With a 4-
KB block and a 2-GB disk partition, there are 512K blocks, so the FAT must have 512K entries in
memory (occupying 2 MB of RAM).
MS-DOS uses the FAT to keep track of free disk blocks. Any block that is not currently allocated
is marked with a special code. When MS-DOS needs a new disk block, it searches the FAT for an
entry containing this code. Thus no bitmap or free list is required. (Tanenbaum, 2008)
Comparison of UNIX and the Windows NT File System
Windows Vista supports several file systems, the most important of which are FAT-16, FAT-32,
and NTFS (NT File System). FAT-16 is the old MS-DOS file system. It uses 16-bit disk addresses,
which limits it to disk partitions no larger than 2 GB. Mostly it is used to access floppy disks, for
customers that still use them. FAT-32 uses 32-bit disk addresses and supports disk partitions up to
2 TB. There is no security in FAT-32, and today it is only really used for transportable media, like
flash drives. NTFS is the file system developed specifically for the NT version of Windows.
Starting with Windows XP it became the default file system installed by most computer
manufacturers, greatly improving the security and functionality of Windows. NTFS uses 64-bit
disk addresses and can (theoretically) support disk partitions up to 2^64 bytes, although other
considerations limit it to smaller sizes.
Below we will examine the NTFS file system because it is a modern file system with many
interesting features and design innovations. It is a large and complex file system and space
limitations prevent us from covering all of its features, but the material presented below should
give a reasonable impression of it.
Fundamental Concepts
Individual file names in NTFS are limited to 255 characters; full paths are limited to 32,767
characters. File names are in Unicode, allowing people in countries not using the Latin alphabet
(e.g., Greece, Japan, India, Russia, and Israel) to write file names in their native language. For
example, a file name written entirely in Greek letters is perfectly legal. NTFS fully supports
case-sensitive names (so foo is different from Foo and FOO). The Win32 API does not fully support case-sensitivity for file names
and not at all for directory names. The support for case-sensitivity exists when running the POSIX
subsystem in order to maintain compatibility with UNIX. Win32 is not case-sensitive, but it is
case-preserving, so file names can have different case letters in them. Though case-sensitivity is a
feature that is very familiar to users of UNIX, it is largely inconvenient to ordinary users who do
not make such distinctions normally. For example, the Internet is largely case-insensitive today.
An NTFS file is not just a linear sequence of bytes, as FAT-32 and UNIX files are. Instead, a file
consists of multiple attributes, each of which is represented by a stream of bytes. Most files have
a few short streams, such as the name of the file and its 64-bit object ID, plus one long (unnamed)
stream with the data. However, a file can also have two or more (long) data streams as well. Each
stream has a name consisting of the file name, a colon, and the stream name, as in ,foo:streand .
Each stream has its own size and is lockable independently of all the other streams. The idea of
multiple streams in a file is not new in NTFS. The file system on the Apple Macintosh uses two
streams per file, the data fork and the resource fork. The first use of multiple streams for NTFS
was to allow an NT file server to serve Macintosh clients. Multiple data streams are also used to
represent metadata about files, such as the thumbnail pictures of JPEG images that are available
in the Windows GUI. But alas, the multiple data streams are fragile and frequently fall off of files
when they are transported to other file systems, transported over the network, or even when backed
up and later restored, because many utilities ignore them.
NTFS is a hierarchical file system, similar to the UNIX file system. The separator between
component names is "\", however, instead of "/", a fossil inherited from the compatibility
requirements with CP/M when MS-DOS was created. Unlike UNIX, the concept of the current
working directory, and hard links to the current directory (.) and the parent directory (..), are
implemented as conventions rather than as a fundamental part of the file system design. Hard links
are supported, but only used for the POSIX subsystem, as is NTFS support for traversal checking
on directories (the 'x' permission in UNIX).
Symbolic links in NTFS were not supported until Windows Vista. Creation of symbolic links is
normally restricted to administrators to avoid security issues like spoofing, as UNIX experienced
when symbolic links were first introduced in 4.2BSD. The implementation of symbolic links in
Vista uses an NTFS feature called reparse points (discussed later in this section). In addition,
compression, encryption, fault tolerance, journaling, and sparse files are also supported. These
features and their implementations will be discussed shortly.
Implementation of the NT File System
NTFS is a highly complex and sophisticated file system that was developed specifically for NT as
an alternative to the HPFS file system that had been developed for OS/2. While most of NT was
designed on dry land, NTFS is unique among the components of the operating system in that much
of its original design took place aboard a sailboat out on the Puget Sound (following a strict
protocol of work in the morning, beer in the afternoon). Below we will examine a number of
features of NTFS, starting with its structure, then moving on to file name lookup, file compression,
journaling, and file encryption.
Windows NT File System Structure
Each NTFS volume (e.g., disk partition) contains files, directories, bitmaps, and other data
structures. Each volume is organized as a linear sequence of blocks (clusters in Microsoft's
terminology), with the block size being fixed for each volume and ranging from 512 bytes to 64
KB, depending on the volume size. Most NTFS disks use 4-KB blocks as a compromise between
large blocks (for efficient transfers) and small blocks (for low internal fragmentation). Blocks are
referred to by their offset from the start of the volume using 64-bit numbers.
The main data structure in each volume is the MFT (Master File Table), which is a linear sequence
of fixed-size 1-KB records. Each MFT record describes one file or one directory. It contains the
file's attributes, such as its name and timestamps, and the list of disk addresses where its blocks
are located. If a file is extremely large, it is sometimes necessary to use two or more MFT records
to contain the list of all the blocks, in which case the first MFT record, called the base record,
points to the other MFT records. This overflow scheme dates back to CP/M, where each directory
entry was called an extent. A bitmap keeps track of which MFT entries are free.
The MFT is itself a file and as such can be placed anywhere within the volume, thus eliminating
the problem with defective sectors in the first track. Furthermore, the file can grow as needed, up
to a maximum size of 2^48 records. (Tanenbaum, 2008)
References
Silberschatz, A., Galvin, P., & Gagne, G. (2013). Operating system concepts. Hoboken, NJ: John
Wiley & Sons.
Bovet, D., & Cesati, M. (2000). Understanding the Linux kernel. Cambridge, MA: O'Reilly.
Tanenbaum, A. S. (2008). Modern operating systems (3rd ed.). Upper Saddle River, NJ: Prentice
Hall.
Love, R. (2010). Linux kernel development. Upper Saddle River, NJ: Addison-Wesley.