Linux File Management - MATALA, IVAN G.
7/27/2019 Linux File Management - MATALA, IVAN G.
File Management
A Case Study
Submitted to the Faculty of the
Computer Engineering Department
Engr. Joshua Cuesta
Mapua Institute of Technology
In Partial Fulfillment of the Requirements
For the Degree of BS Computer Engineering
Matala, Ivan G.
September 9, 2013
The Linux Virtual File System
The Virtual Filesystem (sometimes called the Virtual File Switch or more commonly simply the
VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces
provided to user-space programs. All filesystems rely on the VFS to enable them not only to
coexist, but also to interoperate. This enables programs to use standard Unix system calls to read
and write to different filesystems, even on different media, as shown in the figure below.
(Love, 2010)
The VFS in action: Using the cp(1) utility to move data from a hard disk mounted as ext3 to a
removable disk mounted as ext2. Two different filesystems, two different media, one VFS.
Common Filesystem Interface
The VFS is the glue that enables system calls such as open(), read(), and write() to work regardless
of the filesystem or underlying physical medium. These days, that might not sound novel (we have
long taken such a feature for granted), but it is a non-trivial feat for such generic system calls to
work across many diverse filesystems and varying media. Moreover, because the system calls work
between these different filesystems and media, we can use standard system calls to copy or move
files from one filesystem to another. In older operating systems, such as DOS, this would never
have worked; any access to a nonnative filesystem required special tools. It is only because modern
operating systems, such as Linux, abstract access to the filesystems via a virtual interface that such
interoperation and generic access are possible.
New filesystems and new varieties of storage media can find their way into Linux, and programs
need not be rewritten or even recompiled. In this chapter, we will discuss the VFS, which provides
the abstraction allowing myriad filesystems to behave as one. In the next chapter, we will discuss
the block I/O layer, which supports various storage devices, from CDs and Blu-ray discs to hard
drives to CompactFlash. Together, the VFS and the block I/O layer provide the abstractions, interfaces, and
glue that allow user-space programs to issue generic system calls to access files via a uniform
naming policy on any filesystem, which itself exists on any storage medium. (Love, 2010)
Filesystem Abstraction Layer
Such a generic interface for any type of filesystem is feasible only because the kernel implements
an abstraction layer around its low-level filesystem interface. This abstraction layer enables Linux
to support different filesystems, even if they differ in supported features or behavior. This is
possible because the VFS provides a common file model that can represent any filesystem's
general feature set and behavior. Of course, it is biased toward Unix-style filesystems. (You will see
what constitutes a Unix-style filesystem later in this chapter.) Regardless, wildly differing
filesystem types are still supportable in Linux, from DOS's FAT to Windows's NTFS to many
Unix-style and Linux-specific filesystems.
The abstraction layer works by defining the basic conceptual interfaces and data structures that all
filesystems support. The filesystems mold their view of concepts such as "this is how I open files"
"and this is what a directory is to me" to match the expectations of the VFS. The actual filesystem
code hides the implementation details. To the VFS layer and the rest of the kernel, however, each
filesystem looks the same. They all support notions such as files and directories, and they all
support operations such as creating and deleting files.
The result is a general abstraction layer that enables the kernel to support many types of filesystems
easily and cleanly. The filesystems are programmed to provide the abstracted interfaces and data
structures the VFS expects; in turn, the kernel easily works with any filesystem and the exported
user-space interface seamlessly works on any filesystem.
In fact, nothing in the kernel needs to understand the underlying details of the filesystems, except
the filesystems themselves. For example, consider a simple user-space program that does
ret = write(fd, buf, len);
This system call writes the len bytes pointed to by buf into the current position in the file
represented by the file descriptor fd. This system call is first handled by a generic sys_write()
system call that determines the actual file writing method for the filesystem on which fd resides.
The generic write system call then invokes this method, which is part of the filesystem
implementation, to write the data to the media (or whatever this filesystem does on write). The
figure below shows the flow from a user-space write() call through the data arriving on the physical
media. On one side of the system call is the generic VFS interface, providing the frontend to user-
space; on the other side of the system call is the filesystem-specific backend, dealing with the
implementation details. The rest of this chapter looks at how the VFS achieves this abstraction and
provides its interfaces. (Love, 2010)
The flow of data from user-space issuing a write() call, through the VFS's generic system call,
into the filesystem's specific write method, and finally arriving at the physical media.
Linux File System
Linux retains UNIX's standard file-system model. In UNIX, a file does not have to be an object
stored on disk or fetched over a network from a remote file server. Rather, UNIX files can be
anything capable of handling the input or output of a stream of data. Device drivers can appear as
files, and inter-process communication channels or network connections also look like files to the
user.
The Linux kernel handles all these types of files by hiding the implementation details of any single
file type behind a layer of software, the virtual file system (VFS). Here, we first cover the virtual
file system and then discuss the standard Linux file system, ext3. (Silberschatz, 2013)
The Virtual File System
The Linux VFS is designed around object-oriented principles. It has two components: a set of
definitions that specify what file-system objects are allowed to look like and a layer of software to
manipulate the objects. The VFS defines four main object types:
An inode object represents an individual file.
A file object represents an open file.
A superblock object represents an entire filesystem.
A dentry object represents an individual directory entry.
For each of these four object types, the VFS defines a set of operations. Every object of one of
these types contains a pointer to a function table. The function table lists the addresses of the actual
functions that implement the defined operations for that object. For example, an abbreviated API
for some of the file object's operations includes:
int open(...): Open a file.
ssize_t read(...): Read from a file.
ssize_t write(...): Write to a file.
int mmap(...): Memory-map a file.
The complete definition of the file object is specified in struct file_operations, which is located
in the file /usr/include/linux/fs.h. An implementation of the file object (for a specific file type) is
required to implement each function specified in the definition of the file object.
VFS and Processes Interaction
Besides providing a common interface to all filesystem implementations, the VFS has another
important role related to system performance. The most recently used dentry objects are contained
in a disk cache named the dentry cache, which speeds up the translation from a file pathname to
the inode of the last pathname component.
Interaction between processes and VFS objects
Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM
some information that is normally stored on a disk, so that further accesses to that data can be
quickly satisfied without a slow access to the disk itself. Besides the dentry cache, Linux uses other
disk caches, like the buffer cache and the page cache, which will be described in forthcoming
chapters. (Bovet & Cesati, 2000)
The VFS software layer can perform an operation on one of the file-system objects by calling the
appropriate function from the object's function table, without having to know in advance exactly
what kind of object it is dealing with. The VFS does not know, or care, whether an inode represents
a networked file, a disk file, a network socket, or a directory file. The appropriate function for that
file's read() operation will always be at the same place in its function table, and the VFS software
layer will call that function without caring how the data are actually read. (Silberschatz, 2013)
The inode and file objects are the mechanisms used to access files. An inode object is a data
structure containing pointers to the disk blocks that contain the actual file contents, and a file object
represents a point of access to the data in an open file. A process cannot access an inode's contents
without first obtaining a file object pointing to the inode. The file object keeps track of where in
the file the process is currently reading or writing, to keep track of sequential file I/O. It also
remembers the permissions (for example, read or write) requested when the file was opened and
tracks the process's activity if necessary to perform adaptive read-ahead, fetching file data into
memory before the process requests the data, to improve performance.
File objects typically belong to a single process, but inode objects do not. There is one file object
for every instance of an open file, but always only a single inode object. Even when a file is no
longer in use by any process, its inode object may still be cached by the VFS to improve
performance if the file is used again in the near future. All cached file data are linked onto a list in
the file's inode object. The inode also maintains standard information about each file, such as the
owner, size, and time most recently modified.
Directory files are dealt with slightly differently from other files. The UNIX programming
interface defines a number of operations on directories, such as creating, deleting, and renaming a
file in a directory. The system calls for these directory operations do not require that the user open
the files concerned, unlike the case for reading or writing data. The VFS therefore defines these
directory operations in the inode object, rather than in the file object.
The superblock object represents a connected set of files that form a self-contained file system.
The operating-system kernel maintains a single superblock object for each disk device mounted as
a file system and for each networked file system currently connected. The main responsibility of
the superblock object is to provide access to inodes. The VFS identifies every inode by a unique
file-system/inode number pair, and it finds the inode corresponding to a particular inode number
by asking the superblock object to return the inode with that number. (Silberschatz, 2013)
Finally, a dentry object represents a directory entry, which may include the name of a directory in
the path name of a file (such as /usr) or the actual file (such as stdio.h). For example, the file
/usr/include/stdio.h contains the directory entries (1) /, (2) usr, (3) include, and (4) stdio.h. Each of
these values is represented by a separate dentry object. (Bovet & Cesati, 2000)
As an example of how dentry objects are used, consider the situation in which a process wishes to
open the file with the pathname /usr/include/stdio.h using an editor. Because Linux treats directory
names as files, translating this path requires first obtaining the inode for the root, /. The operating
system must then read through this file to obtain the inode for the file include. It must continue this
process until it obtains the inode for the file stdio.h. Because path-name translation can be a time-
consuming task, Linux maintains a cache of dentry objects, which is consulted during path-name
translation. Obtaining the inode from the dentry cache is considerably faster than having to read
the on-disk file. (Silberschatz, 2013)
The Linux ext3 File System
The standard on-disk file system used by Linux is called ext3, for historical reasons. Linux was
originally programmed with a Minix-compatible file system, to ease exchanging data with the
Minix development system, but that file system was severely restricted by 14-character file-name
limits and a maximum file-system size of 64 MB. The Minix file system was superseded by a new
file system, which was christened the extended file system (extfs). A later redesign to improve
performance and scalability and to add a few missing features led to the second extended file
system (ext2). Further development added journaling capabilities, and the system was renamed the
third extended file system (ext3). Linux kernel developers are working on augmenting ext3 with
modern file-system features such as extents. This new file system is called the fourth extended file
system (ext4). The rest of this section discusses ext3, however, since it remains the most-deployed
Linux file system. Most of the discussion applies equally to ext4. Linux's ext3 has much in
common with the BSD Fast File System (FFS). It uses a similar mechanism for locating the data
blocks belonging to a specific file, storing data-block pointers in indirect blocks throughout the file
system with up to three levels of indirection. As in FFS, directory files are stored on disk just like
normal files, although their contents are interpreted differently. Each block in a directory file
consists of a linked list of entries. In turn, each entry contains the length of the entry, the name of
a file, and the inode number of the inode to which that entry refers. (Silberschatz, 2013)
The main differences between ext3 and FFS lie in their disk-allocation policies. In FFS, the disk
is allocated to files in blocks of 8 KB. These blocks are subdivided into fragments of 1 KB for
storage of small files or partially filled blocks at the ends of files. In contrast, ext3 does not use
fragments at all but performs all its allocations in smaller units. The default block size on ext3
varies as a function of the total size of the file system. Supported block sizes are 1, 2, 4, and 8 KB.
(Bovet & Cesati, 2000)
To maintain high performance, the operating system must try to perform I/O operations in large
chunks whenever possible by clustering physically adjacent I/O requests. Clustering reduces the
per-request overhead incurred by device drivers, disks, and disk-controller hardware. A block-
sized I/O request size is too small to maintain good performance, so ext3 uses allocation policies
designed to place logically adjacent blocks of a file into physically adjacent blocks on disk, so that
it can submit an I/O request for several disk blocks as a single operation. (Silberschatz, 2013)
The ext3 allocation policy works as follows: As in FFS, an ext3 file system is partitioned into
multiple segments. In ext3, these are called block groups. FFS uses the similar concept of cylinder
groups, where each group corresponds to a single cylinder of a physical disk. (Note that modern
disk-drive technology packs sectors onto the disk at different densities, and thus with different
cylinder sizes, depending on how far the disk head is from the center of the disk. Therefore,
fixed-sized cylinder groups do not necessarily correspond to the disk's geometry.) (Bovet & Cesati,
2000)
When allocating a file, ext3 must first select the block group for that file. For data blocks, it
attempts to allocate the file to the block group to which the file's inode has been allocated. For
inode allocations, it selects the block group in which the files parent directory resides for
nondirectory files. Directory files are not kept together but rather are dispersed throughout the
available block groups. These policies are designed not only to keep related information within
the same block group but also to spread out the disk load among the disk's block groups to reduce
the fragmentation of any one area of the disk. (Silberschatz, 2013)
Within a block group, ext3 tries to keep allocations physically contiguous if possible, reducing
fragmentation if it can. It maintains a bitmap of all free blocks in a block group. When allocating
the first blocks for a new file, it starts searching for a free block from the beginning of the block
group. When extending a file, it continues the search from the block most recently allocated to the
file. The search is performed in two stages. First, ext3 searches for an entire free byte in the bitmap;
if it fails to find one, it looks for any free bit. The search for free bytes aims to allocate disk space
in chunks of at least eight blocks where possible. (Bovet & Cesati, 2000)
Once a free block has been identified, the search is extended backward until an allocated block is
encountered. When a free byte is found in the bitmap, this backward extension prevents ext3 from
leaving a hole between the most recently allocated block in the previous nonzero byte and the zero
byte found. Once the next block to be allocated has been found by either bit or byte search, ext3
extends the allocation forward for up to eight blocks and preallocates these extra blocks to the file.
This preallocation helps to reduce fragmentation during interleaved writes to separate files and
also reduces the CPU cost of disk allocation by allocating multiple blocks simultaneously. The
preallocated blocks are returned to the free-space bitmap when the file is closed. (Silberschatz,
2013)
Ext3 block-allocation policies.
The figure below illustrates the allocation policies. Each row represents a sequence of set and
unset bits in an allocation bitmap, indicating used and free blocks on disk. In the first case, if we
can find any free blocks sufficiently near the start of the search, then we allocate them no matter
how fragmented they may be. The fragmentation is partially compensated for by the fact that the
blocks are close together and can probably all be read without any disk seeks. Furthermore, allocating them all to
one file is better in the long run than allocating isolated blocks to separate files once large free
areas become scarce on disk. In the second case, we have not immediately found a free block close
by, so we search forward for an entire free byte in the bitmap. If we allocated that byte as a whole,
we would end up creating a fragmented area of free space between it and the allocation preceding
it. Thus, before allocating, we back up to make this allocation flush with the allocation preceding
it, and then we allocate forward to satisfy the default allocation of eight blocks. (Silberschatz,
2013)
System Calls Handled by the VFS
The table below illustrates the VFS system calls that refer to filesystems, regular files, directories,
and symbolic links. A few other system calls handled by the VFS, such as ioperm(), ioctl(), pipe(),
and mknod(), refer to device files and pipes and hence will be discussed in later chapters. A last
group of system calls handled by the VFS, such as socket(), connect(), bind(), and protocols(),
refer to sockets and are used to implement networking; they will not be covered in this book.
Some System Calls Handled by the VFS
We said earlier that the VFS is a layer between application programs and specific filesystems.
However, in some cases a file operation can be performed by the VFS itself, without invoking a
lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't
usually need to be touched, and hence the VFS simply releases the corresponding file object.
Similarly, when the lseek( ) system call modifies a file pointer, which is an attribute related to the
interaction between an opened file and a process, the VFS needs to modify only the corresponding
file object without accessing the file on disk and therefore does not have to invoke a specific
filesystem procedure. In some sense, the VFS could be considered as a "generic" filesystem that
relies, when necessary, on specific ones. (Bovet & Cesati, 2000)
Journaling
The ext3 file system supports a popular feature called journaling, whereby modifications to the file
system are written sequentially to a journal. A set of operations that performs a specific task is a
transaction. Once a transaction is written to the journal, it is considered to be committed.
Meanwhile, the journal entries relating to the transaction are replayed across the actual filesystem
structures. As the changes are made, a pointer is updated to indicate which actions have completed
and which are still incomplete. When an entire committed transaction is completed, it is removed
from the journal. The journal, which is actually a circular buffer, may be in a separate section of
the file system, or it may even be on a separate disk spindle. It is more efficient, but more complex,
to have it under separate read-write heads, thereby decreasing head contention and seek times.
(Silberschatz, 2013)
If the system crashes, some transactions may remain in the journal. Those transactions were never
completed to the file system even though they were committed by the operating system, so they
must be completed once the system recovers. The transactions can be executed from the pointer
until the work is complete, and the file-system structures remain consistent. The only problem
occurs when a transaction has been aborted; that is, it was not committed before the system
crashed. Any changes from those transactions that were applied to the file system must be undone,
again preserving the consistency of the file system. This recovery is all that is needed after a crash,
eliminating all problems with consistency checking. (Bovet & Cesati, 2000)
Journaling file systems may perform some operations faster than non-journaling systems, as
updates proceed much faster when they are applied to the in-memory journal rather than directly
to the on-disk data structures. The reason for this improvement is found in the performance
advantage of sequential I/O over random I/O. Costly synchronous random writes to the file system
are turned into much less costly synchronous sequential writes to the file system's journal. Those
changes, in turn, are replayed asynchronously via random writes to the appropriate structures. The
overall result is a significant gain in performance of file-system metadata-oriented operations, such
as file creation and deletion. Due to this performance improvement, ext3 can be configured to
journal only metadata and not file data. (Silberschatz, 2013)
VFS Data Structures
Each VFS object is stored in a suitable data structure, which includes both the object attributes
and a pointer to a table of object methods. The kernel may dynamically modify the methods of the
object, and hence it may install specialized behavior for the object. The following sections explain
the VFS objects and their interrelationships in detail.
The Fields of the Superblock Object
All superblock objects (one per mounted filesystem) are linked together in a circular doubly linked
list. The addresses of the first and last elements of the list are stored in the next and prev fields,
respectively, of the s_list field in the super_blocks variable. This field has the data type struct
list_head, which is also found in the s_dirty field of the superblock and in a number of other places
in the kernel; it consists simply of pointers to the next and previous elements of a list. Thus, the
s_list field of a superblock object includes the pointers to the two adjacent superblock objects in
the list. (Bovet & Cesati, 2000)
The Linux Process File System
The flexibility of the Linux VFS enables us to implement a file system that does not store data
persistently at all but rather provides an interface to some other functionality. The Linux process
file system, known as the /proc file system, is an example of a file system whose contents are not
actually stored anywhere but are computed on demand according to user file I/O requests.
(Silberschatz, 2013)
A /proc file system is not unique to Linux. SVR4 UNIX introduced a /proc file system as an
efficient interface to the kernel's process debugging support. Each subdirectory of the file system
corresponded not to a directory on any disk but rather to an active process on the current system.
A listing of the file system reveals one directory per process, with the directory name being the
ASCII decimal representation of the process's unique process identifier (PID).
Linux implements such a /proc file system but extends it greatly by adding a number of extra
directories and text files under the file system's root directory. These new entries correspond to
various statistics about the kernel and the associated loaded drivers. The /proc file system provides
a way for programs to access this information as plain text files; the standard UNIX user
environment provides powerful tools to process such files. For example, in the past, the traditional
UNIX ps command for listing the states of all running processes has been implemented as a
privileged process that reads the process state directly from the kernel's virtual memory. Under
Linux, this command is implemented as an entirely unprivileged program that simply parses and
formats the information from /proc. (Silberschatz, 2013)
The /proc file system must implement two things: a directory structure and the file contents within.
Because a UNIX file system is defined as a set of file and directory inodes identified by their inode
numbers, the /proc file system must define a unique and persistent inode number for each directory
and the associated files. Once such a mapping exists, the file system can use this inode number to
identify just what operation is required when a user tries to read from a particular file inode or to
perform a lookup in a particular directory inode. When data are read from one of these files, the
/proc file system will collect the appropriate information, format it into textual form, and place it
into the requesting process's read buffer.
The mapping from inode number to information type splits the inode number into two fields. In
Linux, a PID is 16 bits in size, but an inode number is 32 bits. The top 16 bits of the inode number
are interpreted as a PID, and the remaining bits define what type of information is being requested
about that process.
A PID of zero is not valid, so a zero PID field in the inode number is taken to mean that this inode
contains global, rather than process-specific, information. Separate global files exist in /proc to
report information such as the kernel version, free memory, performance statistics, and drivers
currently running. (Silberschatz, 2013)
Not all the inode numbers in this range are reserved. The kernel can allocate new /proc inode
mappings dynamically, maintaining a bitmap of allocated inode numbers. It also maintains a tree
data structure of registered global /proc file-system entries. Each entry contains the file's inode
number, file name, and access permissions, along with the special functions used to generate the
file's contents. Drivers can register and deregister entries in this tree at any time, and a special
section of the tree, appearing under the /proc/sys directory, is reserved for kernel variables.
Files under this tree are managed by a set of common handlers that allow both reading and writing
of these variables, so a system administrator can tune the value of kernel parameters simply by
writing out the new desired values in ASCII decimal to the appropriate file.
To allow efficient access to these variables from within applications, the /proc/sys subtree is made
available through a special system call, sysctl(), that reads and writes the same variables in binary,
rather than in text, without the overhead of the file system. sysctl() is not an extra facility; it simply
reads the /proc dynamic entry tree to identify the variables to which the application is referring.
(Silberschatz, 2013)
Disk Data Structures
Layouts of an Ext2 partition and of an Ext2 block group
The first block in any Ext2 partition is never managed by the Ext2 filesystem, since it is reserved
for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block
groups, each of which has the layout shown in the figure above. As you will notice from the figure,
some data structures must fit in exactly one block while others may require more than one block.
All the block groups in the filesystem have the same size and are stored sequentially, so the kernel
can derive the location of a block group in a disk simply from its integer index.
(Bovet & Cesati, 2000)
Block groups reduce file fragmentation, since the kernel tries to keep the data blocks belonging to
a file in the same block group if possible. Each block in a block group contains one of the following
pieces of information:
A copy of the filesystem's superblock
A copy of the group of block group descriptors
A data block bitmap
A group of inodes
An inode bitmap
A chunk of data belonging to a file; that is, a data block
If a block does not contain any meaningful information, it is said to be free. As can be seen from
the figure above, both the superblock and the group descriptors are duplicated in each block group.
Only the superblock and the group descriptors included in block group 0 are used by the kernel,
while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel
doesn't even look at them. When the /sbin/e2fsck program executes a consistency check on the
filesystem status, it refers to the superblock and the group descriptors stored in block group 0, then
copies them into all other block groups. If data corruption occurs and the main superblock or the
main group descriptors in block group 0 become invalid, the system administrator can instruct
/sbin/e2fsck to refer to the old copies of the superblock and the group descriptors stored in block
groups other than the first. Usually, the redundant copies store enough information to allow
/sbin/e2fsck to bring the Ext2 partition back to a consistent state. (Bovet & Cesati, 2000)
How many block groups are there? Well, that depends both on the partition size and on the block
size. The main constraint is that the block bitmap, which is used to identify the blocks that are used
and free inside a group, must be stored in a single block. Therefore, in each block group there can
be at most 8 × b blocks, where b is the block size in bytes. Thus, the total number of block groups
is roughly s/(8 × b), where s is the partition size in blocks.
As an example, let's consider an 8 GB Ext2 partition with a 4 KB block size. In this case, each 4
KB block bitmap describes 32 K data blocks, that is, 128 MB. Therefore, at most 64 block groups
are needed. Clearly, the smaller the block size, the larger the number of block groups. (Bovet &
Cesati, 2000)
Comparison of UNIX File Management and MS-DOS File Management
The UNIX V7 Filesystem
Even early versions of UNIX had a fairly sophisticated multiuser file system since it was derived
from MULTICS. Below we will discuss the V7 file system, the one for the PDP-11 that made
UNIX famous. We will examine a modern UNIX file system in the context of Linux in Chap. 10.
The file system is in the form of a tree starting at the root directory, with the addition of links,
forming a directed acyclic graph. File names are up to 14 characters and can contain any ASCII
characters except / (because that is the separator between components in a path) and NUL (because
that is used to pad out names shorter than 14 characters). NUL has the numerical value of 0.
A UNIX directory contains one entry for each file in that directory. Each entry is extremely
simple because UNIX uses the i-node scheme. A directory entry contains only two fields: the file
name (14 bytes) and the number of the i-node for that file (2 bytes). These parameters limit the
number of files per file system to 64K.
Like the i-nodes discussed above, UNIX i-nodes contain some attributes. The attributes include the file
size, three timestamps (creation, last access, and last modification), owner, group, protection
information, and a count of the number of directory entries that point to the i-node. The latter field
is needed due to links. Whenever a new link is made to an i-node, the count in the i-node is
increased. When a link is removed, the count is decremented. When it gets to 0, the i-node is
reclaimed and the disk blocks are put back in the free list.
Keeping track of disk blocks is done using a generalization of the figure below in order to handle
very large files. The first 10 disk addresses are stored in the i-node
A UNIX V7 directory entry
itself, so for small files, all the necessary information is right in the i-node, which is fetched from
disk to main memory when the file is opened. For somewhat larger files, one of the addresses in
the i-node is the address of a disk block called a single indirect block. This block contains
additional disk addresses. If this still is not enough, another address in the i-node, called a double
indirect block, contains the address of a block that contains a list of single indirect blocks. Each of
these single indirect blocks points to a few hundred data blocks. If even this is not enough, a triple
indirect block can also be used. The complete picture is given in the figure below.
A UNIX i-node
When a file is opened, the file system must take the file name supplied and locate its disk blocks.
Let us consider how the path name /usr/ast/mbox is looked up. We will use UNIX as an example,
but the algorithm is basically the same for all hierarchical directory systems. First the file system
locates the root directory. In UNIX its i-node is located at a fixed place on the disk. From this i-
node, it locates the root directory, which can be anywhere on the disk, but say block 1.
Then it reads the root directory and looks up the first component of the path, usr, in the root
directory to find the i-node number of the file /usr. Locating an i-node from its number is
straightforward, since each one has a fixed location on the disk. From this i-node, the system
locates the directory for /usr and looks up the next component, ast, in it. When it has found the
entry for ast, it has the i-node for the directory /usr/ast. From this i-node it can find the directory
itself and look up mbox. The i-node for this file is then read into memory and kept there until the
file is closed. The lookup process is illustrated in figure below.
The steps in looking up /usr/ast/mbox
Relative path names are looked up the same way as absolute ones, only starting from the working
directory instead of starting from the root directory. Every directory has entries for . and .. which
are put there when the directory is created. The entry . has the i-node number for the current
directory, and the entry for .. has the i-node number for the parent directory. Thus, a procedure
looking up ../dick/prog.c simply looks up .. in the working directory, finds the i-node number for
the parent directory, and searches that directory for dick. No special mechanism is needed to handle
these names. As far as the directory system is concerned, they are just ordinary ASCII strings, just
the same as any other names. The only bit of trickery here is that .. in the root directory points to
itself. (Tanenbaum, 2008)
Unix Filesystem
Historically, Unix has provided four basic filesystem-related abstractions: files, directory entries,
inodes, and mount points.
A filesystem is a hierarchical storage of data adhering to a specific structure. Filesystems contain
files, directories, and associated control information. Typical operations performed on filesystems
are creation, deletion, and mounting. In Unix, filesystems are mounted at a specific mount point
in a global hierarchy known as a namespace. This enables all mounted filesystems to appear as
entries in a single tree. Contrast this single, unified tree with the behavior of DOS and Windows,
which break the file namespace up into drive letters, such as C:. This breaks the namespace up
among device and partition boundaries, leaking hardware details into the filesystem abstraction.
As this delineation may be arbitrary and even confusing to the user, it is inferior to Linux's unified
namespace.
A file is an ordered string of bytes. The first byte marks the beginning of the file, and the last byte
marks the end of the file. Each file is assigned a human-readable name for identification by both
the system and the user. Typical file operations are read, write, create, and delete. The Unix
concept of the file is in stark contrast to record-oriented filesystems, such as OpenVMS's Files-11.
Record-oriented filesystems provide a richer, more structured representation of files than
Unix's simple byte-stream abstraction, at the cost of simplicity and flexibility.
Files are organized in directories. A directory is analogous to a folder and usually contains related
files. Directories can also contain other directories, called subdirectories. In this fashion,
directories may be nested to form paths. Each component of a path is called a directory entry. A
path example is /home/wolfman/butter: the root directory /, the directories home and wolfman,
and the file butter are all directory entries, called dentries. In Unix, directories are actually normal
files that simply list the files contained therein. Because a directory is a file to the VFS, the same
operations performed on files can be performed on directories.
Unix systems separate the concept of a file from any associated information about it, such as access
permissions, size, owner, creation time, and so on. This information is sometimes called file
metadata (that is, data about the files data) and is stored in a separate data structure from the file,
called the inode. This name is short for index node, although these days the term inode is much
more ubiquitous.
All this information is tied together with the filesystems own control information, which is stored
in the superblock. The superblock is a data structure containing information about the filesystem
as a whole. Sometimes the collective data is referred to as filesystem metadata. Filesystem
metadata includes information about both the individual files and the filesystem as a whole.
Traditionally, Unix filesystems implement these notions as part of their physical on-disk
layout. For example, file information is stored as an inode in a separate block on the disk;
directories are files; control information is stored centrally in a superblock, and so on. The Unix
file concepts are physically mapped onto the storage medium. The Linux VFS is designed to work
with filesystems that understand and implement such concepts. Non-Unix filesystems, such as
FAT or NTFS, still work in Linux, but their filesystem code must provide the appearance of these
concepts. For example, even if a filesystem does not support distinct inodes, it must assemble the
inode data structure in memory as if it did. Or if a filesystem treats directories as special objects,
to the VFS it must represent directories as mere files. Often, this involves some special
processing done on-the-fly by the non-Unix filesystems to cope with the Unix paradigm and the
requirements of the VFS. Such filesystems still work, however, and the overhead is not
unreasonable. (Love, 2010)
UNIX File Locking
When a file can be accessed by more than one process, a synchronization problem occurs: what
happens if two processes try to write in the same file location? Or again, what happens if a process
reads from a file location while another process is writing into it?
In traditional Unix systems, concurrent accesses to the same file location produce unpredictable
results. However, the systems provide a mechanism that allows the processes to lock a file region
so that concurrent accesses may be easily avoided. (Bovet & Cesati, 2000)
The POSIX standard requires a file-locking mechanism based on the fcntl() system call. It is
possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including
data appended in the future). Since a process can choose to lock just a part of a file, it can also hold
multiple locks on different parts of the file.
This kind of lock does not keep out another process that is ignorant of locking. Like a critical
region in code, the lock is considered "advisory" because it doesn't work unless other processes
cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks
are known as advisory locks. (Bovet & Cesati, 2000)
Traditional BSD variants implement advisory locking through the flock() system call. This call
does not allow a process to lock a file region, just the whole file. Traditional System V variants
provide the lockf() system call, which is just an interface to fcntl(). More importantly, System V
Release 3 introduced mandatory locking: the kernel checks that every invocation of the open(),
read(), and write() system calls does not violate a mandatory
lock on the file being accessed. Therefore, mandatory locks are enforced even between non-
cooperative processes. A file is marked as a candidate for mandatory locking by setting its set-
group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no
sense when the group-execute bit is off, the kernel interprets that combination as a hint to use
mandatory locks instead of advisory ones.
Whether processes use advisory or mandatory locks, they can make use of both shared read locks
and exclusive write locks. Any number of processes may have read locks on some file region, but
only one process can have a write lock on it at the same time. Moreover, it is not possible to get a
write lock when another process owns a read lock for the same file region and vice versa (see table
below). (Bovet & Cesati, 2000)
The MS-DOS File System
The MS-DOS file system is the one the first IBM PCs came with. It was the main file system up
through Windows 98 and Windows ME. It is still supported on Windows 2000, Windows XP, and
Windows Vista, although it is no longer standard on new PCs now except for floppy disks.
However, it and an extension of it (FAT-32) have become widely used for many embedded
systems. Most digital cameras use it. Many MP3 players use it exclusively. The popular Apple
iPod uses it as the default file system, although knowledgeable hackers can reformat the iPod and
install a different file system. Thus the number of electronic devices using the MS-DOS file system
is vastly larger now than at any time in the past, and certainly much larger than the number using
the more modern NTFS file system. For that reason alone, it is worth looking at in some detail.
To read a file, an MS-DOS program must first make an open system call to get a handle for it. The
open system call specifies a path, which may be either absolute or relative to the current working
directory. The path is looked up component by component until the final directory is located and
read into memory. It is then searched for the file to be opened.
Although MS-DOS directories are variable sized, they use a fixed-size 32-byte directory entry.
The format of an MS-DOS directory entry is shown in the figure below. It contains the file name,
attributes, creation date and time, starting block, and exact file size. File names shorter than 8 + 3
characters are left justified and padded with spaces on the right, in each field separately. The
Attributes field is new and contains bits to indicate that a file is read-only, needs to be archived, is
hidden, or is a system file. Read-only files cannot be written. This is to protect them from
accidental damage. The archived bit has no actual operating system function (i.e., MS-DOS does
not examine or set it). The intention is to allow user-level archive programs to clear it upon
archiving a file and to have other programs set it when modifying a file. In this way, a backup
program can just examine this attribute bit on every file to see which files to back up. The hidden
bit can be set to prevent a file from appearing in directory listings. Its main use is to avoid confusing
novice users with files they might not understand. Finally, the system bit also hides files. In
addition, system files cannot accidentally be deleted using the del command. The main components
of MS-DOS have this bit set.
The MS-DOS directory entry
The directory entry also contains the date and time the file was created or last modified. The time
is accurate only to 2 sec because it is stored in a 2-byte field, which can store only 65,536 unique
values (a day contains 86,400 seconds). The time field is subdivided into seconds (5 bits), minutes
(6 bits), and hours (5 bits).
The date counts in days using three subfields: day (5 bits), month (4 bits), and year-1980 (7 bits).
With a 7-bit number for the year and time beginning in 1980, the highest expressible year is 2107.
Thus MS-DOS has a built-in Y2108 problem. To avoid catastrophe, MS-DOS users should begin
with Y2108 compliance as early as possible. If MS-DOS had used the combined date and time
fields as a 32-bit seconds counter, it could have represented every second exactly and delayed the
catastrophe until 2116.
MS-DOS stores the file size as a 32-bit number, so in theory files can be as large as 4 GB.
However, other limits (described below) restrict the maximum file size to 2 GB or less. A
surprisingly large part of the entry (10 bytes) is unused.
MS-DOS keeps track of file blocks via a file allocation table in main memory. The directory entry
contains the number of the first file block. This number is used as an index into a 64K entry FAT
in main memory. By following the chain, all the blocks can be found.
The FAT file system comes in three versions: FAT-12, FAT-16, and FAT-32, depending on how
many bits a disk address contains. Actually, FAT-32 is something of a misnomer, since only the
low-order 28 bits of the disk addresses are used. It should have been called FAT-28, but powers
of two sound so much neater.
For all FATs, the disk block can be set to some multiple of 512 bytes (possibly different for each
partition), with the set of allowed block sizes (called cluster sizes by Microsoft) being different for
each variant. The first version of MS-DOS used FAT-12 with 512-byte blocks, giving a maximum
partition size of 2^12 × 512 bytes (actually only 4086 × 512 bytes because 10 of the disk addresses
were used as special markers, such as end of file, bad block, etc.). With these parameters, the
maximum disk partition size was about 2 MB and the size of the FAT table in memory was 4096
entries of 2 bytes each. Using a 12-bit table entry would have been too slow.
This system worked well for floppy disks, but when hard disks came out, it became a problem.
Microsoft solved the problem by allowing additional block sizes of 1 KB, 2 KB, and 4 KB. This
change preserved the structure and size of the FAT-12 table, but allowed disk partitions of up to
16 MB.
Since MS-DOS supported four disk partitions per disk drive, the new FAT-12 file system worked
up to 64-MB disks. Beyond that, something had to give. What happened was the introduction of
FAT-16, with 16-bit disk pointers. Additionally, block sizes of 8 KB, 16 KB, and 32 KB were
permitted. (32,768 is the largest power of two that can be represented in 16 bits.) The FAT-16
table now occupied 128 KB of main memory all the time, but with the larger memories by then
available, it was widely used and rapidly replaced the FAT-12 file system. The largest disk
partition that can be supported by FAT-16 is 2 GB (64K entries of 32 KB each) and the largest
disk, 8 GB, namely four partitions of 2 GB each.
For business letters, this limit is not a problem, but for storing digital video using the DV standard,
a 2-GB file holds just over 9 minutes of video. As a consequence of the fact that a PC disk can
support only four partitions, the largest video that can be stored on a disk is about 38 minutes, no
matter how large the disk is. This limit also means that the largest video that can be edited on line
is less than 19 minutes, since both input and output files are needed.
Starting with the second release of Windows 95, the FAT-32 file system, with its 28-bit disk
addresses, was introduced and the version of MS- DOS underlying Windows 95 was adapted to
support FAT-32. In this system, partitions could theoretically be 2^28 × 2^15 bytes, but they are
actually limited to 2 TB (2048 GB) because internally the system keeps track of partition sizes in
512-byte sectors using a 32-bit number, and 2^9 × 2^32 is 2 TB. The maximum partition size for
various block sizes and all three FAT types is shown in the figure below.
Maximum partition size for different block sizes. The empty boxes represent forbidden
combinations.
In addition to supporting larger disks, the FAT-32 file system has two other advantages over FAT-
16. First, an 8-GB disk using FAT-32 can be a single partition. Using FAT-16 it has to be four
partitions, which appears to the Windows user as the C:, D:, E:, and F: logical disk drives. It is up
to the user to decide which file to place on which drive and keep track of what is where.
The other advantage of FAT-32 over FAT-16 is that for a given size disk partition, a smaller block
size can be used. For example, for a 2-GB disk partition, FAT-16 must use 32-KB blocks; otherwise
with only 64K available disk addresses, it cannot cover the whole partition. In contrast, FAT-32
can use, for example, 4-KB blocks for a 2-GB disk partition. The advantage of the smaller block
size is that most files are much shorter than 32 KB. If the block size is 32 KB, a file of 10 bytes
ties up 32 KB of disk space. If the average file is, say, 8 KB, then with a 32-KB block, 3/4 of the
disk will be wasted, not a terribly efficient way to use the disk. With an 8-KB file and a 4-KB
block, there is no disk wastage, but the price paid is more RAM eaten up by the FAT. With a 4-
KB block and a 2-GB disk partition, there are 512K blocks, so the FAT must have 512K entries in
memory (occupying 2 MB of RAM).
MS-DOS uses the FAT to keep track of free disk blocks. Any block that is not currently allocated
is marked with a special code. When MS-DOS needs a new disk block, it searches the FAT for an
entry containing this code. Thus no bitmap or free list is required. (Tanenbaum, 2008)
Comparison of UNIX and the Windows NT File System
Windows Vista supports several file systems, the most important of which are FAT-16, FAT-32,
and NTFS (NT File System). FAT-16 is the old MS-DOS file system. It uses 16-bit disk addresses,
which limits it to disk partitions no larger than 2 GB. Mostly it is used to access floppy disks, for
customers that still use them. FAT-32 uses 32-bit disk addresses and supports disk partitions up to
2 TB. There is no security in FAT-32, and today it is only really used for transportable media, like
flash drives. NTFS is the file system developed specifically for the NT version of Windows.
Starting with Windows XP it became the default file system installed by most computer
manufacturers, greatly improving the security and functionality of Windows. NTFS uses 64-bit
disk addresses and can (theoretically) support disk partitions up to 2^64 bytes, although other
considerations limit it to smaller sizes.
Below we will examine the NTFS file system because it is a modern file system with many
interesting features and design innovations. It is a large and complex file system and space
limitations prevent us from covering all of its features, but the material presented below should
give a reasonable impression of it.
Fundamental Concepts
Individual file names in NTFS are limited to 255 characters; full paths are limited to 32,767
characters. File names are in Unicode, allowing people in countries not using the Latin alphabet
(e.g., Greece, Japan, India, Russia, and Israel) to write file names in their native language. For
example, a file name written entirely in Greek letters is perfectly legal. NTFS fully supports
case-sensitive names (so foo is different from Foo and FOO). The Win32 API does not fully support case-sensitivity for file names
and not at all for directory names. The support for case-sensitivity exists when running the POSIX
subsystem in order to maintain compatibility with UNIX. Win32 is not case-sensitive, but it is
case-preserving, so file names can have different case letters in them. Though case-sensitivity is a
feature that is very familiar to users of UNIX, it is largely inconvenient to ordinary users who do
not make such distinctions normally. For example, the Internet is largely case-insensitive today.
An NTFS file is not just a linear sequence of bytes, as FAT-32 and UNIX files are. Instead, a file
consists of multiple attributes, each of which is represented by a stream of bytes. Most files have
a few short streams, such as the name of the file and its 64-bit object ID, plus one long (unnamed)
stream with the data. However, a file can also have two or more (long) data streams as well. Each
stream has a name consisting of the file name, a colon, and the stream name, as in ,foo:streand .
Each stream has its own size and is lockable independently of all the other streams. The idea of
multiple streams in a file is not new in NTFS. The file system on the Apple Macintosh uses two
streams per file, the data fork and the resource fork. The first use of multiple streams for NTFS
was to allow an NT file server to serve Macintosh clients. Multiple data streams are also used to
represent metadata about files, such as the thumbnail pictures of JPEG images that are available
in the Windows GUI. But alas, the multiple data streams are fragile and frequently fall off of files
when they are transported to other file systems, transported over the network, or even when backed
up and later restored, because many utilities ignore them.
NTFS is a hierarchical file system, similar to the UNIX file system. The separator between
component names is "\", however, instead of "/", a fossil inherited from the compatibility
requirements with CP/M when MS-DOS was created. Unlike UNIX, the concept of the current
working directory, and hard links to the current directory (.) and the parent directory (..), are
implemented as conventions rather than as a fundamental part of the file system design. Hard links
are supported, but only used for the POSIX subsystem, as is NTFS support for traversal checking
on directories (the 'x' permission in UNIX).
Symbolic links in NTFS were not supported until Windows Vista. Creation of symbolic links is
normally restricted to administrators to avoid security issues like spoofing, as UNIX experienced
when symbolic links were first introduced in 4.2BSD. The implementation of symbolic links in
Vista uses an NTFS feature called reparse points (discussed later in this section). In addition,
compression, encryption, fault tolerance, journaling, and sparse files are also supported. These
features and their implementations will be discussed shortly.
Implementation of the NT File System
NTFS is a highly complex and sophisticated file system that was developed specifically for NT as
an alternative to the HPFS file system that had been developed for OS/2. While most of NT was
designed on dry land, NTFS is unique among the components of the operating system in that much
of its original design took place aboard a sailboat out on the Puget Sound (following a strict
protocol of work in the morning, beer in the afternoon). Below we will examine a number of
features of NTFS, starting with its structure, then moving on to file name lookup, file compression,
journaling, and file encryption.
Windows NT File System Structure
Each NTFS volume (e.g., disk partition) contains files, directories, bitmaps, and other data
structures. Each volume is organized as a linear sequence of blocks (clusters in Microsoft's
terminology), with the block size being fixed for each volume and ranging from 512 bytes to 64
KB, depending on the volume size. Most NTFS disks use 4-KB blocks as a compromise between
large blocks (for efficient transfers) and small blocks (for low internal fragmentation). Blocks are
referred to by their offset from the start of the volume using 64-bit numbers.
The main data structure in each volume is the MFT (Master File Table), which is a linear sequence
of fixed-size 1-KB records. Each MFT record describes one file or one directory. It contains the
file's attributes, such as its name and timestamps, and the list of disk addresses where its blocks
are located. If a file is extremely large, it is sometimes necessary to use two or more MFT records
to contain the list of all the blocks, in which case the first MFT record, called the base record,
points to the other MFT records. This overflow scheme dates back to CP/M, where each directory
entry was called an extent. A bitmap keeps track of which MFT entries are free.
The MFT is itself a file and as such can be placed anywhere within the volume, thus eliminating
the problem with defective sectors in the first track. Furthermore, the file can grow as needed, up
to a maximum size of 2^48 records. (Tanenbaum, 2008)
References
Silberschatz, A., Galvin, P., & Gagne, G. (2013). Operating system concepts. Hoboken, NJ: John
Wiley & Sons.
Bovet, D., & Cesati, M. (2000). Understanding the Linux kernel. Cambridge, MA: O'Reilly.
Tanenbaum, A. S. (2008). Modern operating systems (3rd ed.). Upper Saddle River, NJ: Prentice
Hall.
Love, R. (2010). Linux kernel development. Upper Saddle River, NJ: Addison-Wesley.