Filesystem Implementation Tanenbaum Ch. 6 Silberschatz Ch. 11 Bovet Ch. 12, 18


Filesystem Implementation

Tanenbaum Ch. 6

Silberschatz Ch. 11

Bovet Ch. 12, 18


Examples

• UNIX: UFS, based on FFS
• Windows:
  – Disk: FAT, FAT32 and NTFS
  – CD, DVD, floppy-disk .. filesystems
• Linux (40+): ext2, ext3, ..
• Distributed filesystems: NFS

A modern OS must concurrently support multiple types of filesystems (fs)!


Layered Approach

HW (Disk)

Interrupt handlers

Device drivers

Handles basic reading/writing of physical blocks

Handles files (logical blocks → physical blocks)

Handles metadata and directory structures

Shared

Content / operation on files


Virtual File Systems (VFS)

• VFS provides an object-oriented way of implementing filesystems

• VFS allows the same system call interface (the API) to be used for different types of filesystems

• The API is to the VFS interface, rather than to any specific type of filesystem
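
The "object-oriented" idea can be sketched in plain C with a table of function pointers. This is a user-space toy, not kernel code: `struct vfile`, `struct vfile_ops`, `plain_read`, `upper_read` and `vfs_read` are hypothetical names invented for the illustration. The caller only ever uses `vfs_read`; which "filesystem" does the work is decided by the table stored in the object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, user-space imitation of a VFS operations table. */
struct vfile;

struct vfile_ops {
    long (*read)(struct vfile *f, char *buf, size_t count);
};

struct vfile {
    const struct vfile_ops *ops; /* set by the concrete "filesystem" */
    const char *data;            /* toy backing store */
    size_t pos;                  /* current offset */
};

/* "Filesystem A": returns the stored bytes unchanged. */
long plain_read(struct vfile *f, char *buf, size_t count)
{
    size_t left = strlen(f->data) - f->pos;
    size_t n = count < left ? count : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* "Filesystem B": returns the stored bytes upper-cased. */
long upper_read(struct vfile *f, char *buf, size_t count)
{
    long n = plain_read(f, buf, count);
    for (long i = 0; i < n; i++)
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = (char)(buf[i] - 'a' + 'A');
    return n;
}

const struct vfile_ops plain_ops = { .read = plain_read };
const struct vfile_ops upper_ops = { .read = upper_read };

/* The generic entry point: callers never name a concrete filesystem. */
long vfs_read(struct vfile *f, char *buf, size_t count)
{
    return f->ops->read(f, buf, count);
}
```

Adding a third "filesystem" means adding one more operations table; `vfs_read` and its callers stay unchanged, which is the point the slide makes about the common API.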


Schematic View of VFS

Concurrent support of multiple filesystems


VFS and Linux

• VFS introduces a “common file model”
  – vnode: File representation structure
• “implemented” by
  – FAT32, NTFS, ext2/3, AFS, NFS, ReiserFS …
• Linux
  – i-node object (a file)
  – file object (an open file)
  – superblock object (entire file system)
  – dentry object (directory entry)


The VFS Objects: Common File Model

Process → struct file → struct dentry → struct inode

• struct file: represents an open file in a process (from <fs.h>)
• struct dentry: represents a directory entry (from <dcache.h>)
• struct inode: represents a file in the filesystem (from <fs.h>)
• struct super_block: represents a filesystem

static ssize_t fifo_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
        struct inode *node = file->f_dentry->d_inode;
        unsigned int minor = iminor(node);  /* iminor: macro from /usr/include/linux/fs.h */
        …
}


Outline

• Implementing filesystems on disk
  – implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation
  – ext2, ext3


Filesystem Implementation

A possible file system layout


Files Consist of Blocks of Data

• Where to store/allocate blocks?
• How to find files/blocks?
• What is a good block size?

[Figure: blocks 1 to 12 shown by logical address (block) and, in the order 9, 5, 11, 4, 7, 2, 12, 6, 10, 8, 1, 3, by physical address (block)]


Implementing Files (1)

(a) Contiguous allocation of disk space for 7 files
(b) State of the disk after files D and E have been removed


Contiguous Allocation

+ Finding files/blocks is easy
  + Offset + number of blocks
+ Excellent read performance
– Fragmentation
  – Compaction
  – Reuse of holes
– Need to know max file size when allocating

• Where could this allocation be useful?

• What is the standard alternative to static allocation in computer science (think arrays in C)?


Implementing Files (2)

- Storing a file as a linked list of disk blocks
- Directory contains a pointer to first and last blocks

How much data can be stored in 10 blocks?


Linked List Allocation

+ No holes, no pre-allocation problem
+ Only address of first block needs to be stored
– Finding block n is expensive
  – Need to read all n-1 blocks prior to block n
– Size of the data in a block is not a power of two (2^x)
  – The pointer is not data
• Both disadvantages can be removed using a new data structure, which?
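
Both drawbacks show up in a small sketch. It assumes 1 KB blocks, 4-byte block pointers and an end marker of `UINT32_MAX`; none of these values come from the slides, and `struct block`, `linked_payload_bytes` and `linked_nth_block` are hypothetical names. Blocks are counted from 0 here, so reaching block n costs n reads.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 1024u          /* assumed block size in bytes */
#define END_OF_FILE UINT32_MAX    /* assumed end-of-chain marker */

/* One disk block: the pointer to the next block steals space from the data. */
struct block {
    uint32_t next;
    unsigned char data[BLOCK_SIZE - sizeof(uint32_t)];
};

/* Bytes of file data that fit in nblocks linked blocks. */
size_t linked_payload_bytes(size_t nblocks)
{
    return nblocks * (BLOCK_SIZE - sizeof(uint32_t));
}

/*
 * Return the disk address of the n-th block of a file (n counts from 0),
 * or END_OF_FILE if the file is shorter. *reads counts the blocks that
 * had to be read just to follow the chain.
 */
uint32_t linked_nth_block(const struct block *disk, uint32_t first,
                          size_t n, size_t *reads)
{
    uint32_t cur = first;
    *reads = 0;
    while (n > 0 && cur != END_OF_FILE) {
        cur = disk[cur].next;   /* must read the block to learn its successor */
        (*reads)++;
        n--;
    }
    return cur;
}
```

Under these assumptions 10 blocks hold 10 × 1020 = 10200 bytes rather than 10240, which is why the usable data size per block is no longer a power of two.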


Implementing Files – FAT

Idea: store the pointers in a table
• Fast random access
  – Table can be stored in RAM
• Full 2^x block size

This method is called FAT (File Allocation Table)

Disadvantage: table size
• 20 GB disk, block size 1 KB → 20M blocks → 80 MB (4-byte entries) or 60 MB (3-byte entries)

What can we do to reduce the storage requirement?

File A: blocks 4 → 7 → 2 → 10 → 12
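
A minimal sketch of the table idea, using file A's chain from the slide (4, 7, 2, 10, 12). The end marker `FAT_EOF` and the function names are assumptions made for the illustration; real FAT variants use reserved entry values instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define FAT_EOF (-1)   /* assumed end-of-chain marker */

/* Follow the chain in the in-memory table; no disk reads are needed. */
int32_t fat_nth_block(const int32_t *fat, int32_t first, size_t n)
{
    int32_t cur = first;
    while (n > 0 && cur != FAT_EOF) {
        cur = fat[cur];
        n--;
    }
    return cur;
}

/* Table size in bytes: one entry per disk block. */
uint64_t fat_table_bytes(uint64_t disk_bytes, uint64_t block_bytes,
                         uint64_t entry_bytes)
{
    return (disk_bytes / block_bytes) * entry_bytes;
}
```

With `fat[4] = 7`, `fat[7] = 2`, `fat[2] = 10`, `fat[10] = 12` and `fat[12] = FAT_EOF`, the fifth block of A (n = 4) is found by four table lookups in RAM. `fat_table_bytes(20ULL << 30, 1024, 4)` reproduces the 80 MB figure, counting in binary units.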


FAT i-nodes

• Do we actually need to have the whole table in memory all the time?
  – table size proportional to disk size!
• Actually, only open files need to be there…
• Split the table into per-file tables, called i-nodes (index nodes)


Implementing Files (4)

An example i-node

Indirect block to handle large files


Indirect Addressing

An i-node with 3 levels of indirect blocks


Directories

• Opening a file:
  – locate root directory,
  – search for desired directory,
  – directory contains info to find file blocks on disk:
    • disk address (contiguous allocation)
    • number of first block (linked list)
    • i-node number
• Directory system: maps ASCII file name onto the information needed to open it


Implementing Directories (1)

Where to store file attributes?

(a) A simple directory
    – fixed size entries (1 per file)
    – disk addresses and attributes in directory entry
(b) Directory in which each entry just refers to an i-node


Implementing Directories (2)

• Directories are files (i-node) with i-node pointers

• Directory systems should translate a name to a file (i-node)
  – dentry keeps this info in VFS


Locating /usr/ast/mbox


Shared Files

Storing attributes in i-node simplifies sharing

File system containing a shared file


Hard/Symbolic Links

• Hard links are actually the same file
  – share the same i-node
  – will be seen as the same file everywhere
    • same owner
    • same contents
    • same permissions
    • keeps a link counter
• Symbolic links are dereferenced
  – a special file
    • different owners/permissions
    • can cross filesystem boundaries
  – shortcuts in Windows, aliases on Mac


Shared Files

(a) Situation prior to linking

(b) After the link is created

(c) After the original owner removes the file


Check this under Linux. Execute as u1=user1, u2=user2 (make sure that u2 has write permissions):

1. u1: echo Hi > file-u1
2. u2: ln file-u1 file-u2
3. u2: ln -s file-u1 file-u2-s
4. u2: cat file-u2
5. u2: cat file-u2-s
6. u1: echo again >> file-u1
7. u1: rm file-u1
8. u2: cat file-u2
9. u2: cat file-u2-s

What is the output of lines 4, 5 and 8, 9? Why?
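
The same experiment can be replayed from C with the POSIX calls `link()`, `symlink()` and `unlink()`. This is a single-user sketch: it drops the u1/u2 distinction, skips the two `cat` steps before the removal, and works in a scratch directory under `/tmp` (an assumption about the machine). `link_experiment`, `append_text` and `read_text` are hypothetical helper names.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append text to the file at path; returns 0 on success. */
int append_text(const char *path, const char *text)
{
    FILE *fp = fopen(path, "a");
    if (!fp)
        return -1;
    int rc = (fputs(text, fp) == EOF) ? -1 : 0;
    if (fclose(fp) == EOF)
        rc = -1;
    return rc;
}

/* Read the file at path into buf (NUL-terminated); returns 0 on success. */
int read_text(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;              /* errno is left as set by fopen */
    size_t n = fread(buf, 1, cap - 1, fp);
    buf[n] = '\0';
    fclose(fp);
    return 0;
}

/*
 * Replay steps 1-3 and 6-9. On success (return 0): hard_text holds what the
 * hard link still contains after the original name was removed, *soft_errno
 * is the error from opening the symbolic link, and *links is the link count
 * observed while both names existed.
 */
int link_experiment(char *hard_text, size_t cap, int *soft_errno, nlink_t *links)
{
    char dir[] = "/tmp/linkdemoXXXXXX";
    char p1[64], p2[64], ps[64], scratch[16];
    struct stat st;

    if (!mkdtemp(dir))
        return -1;
    snprintf(p1, sizeof p1, "%s/file-u1", dir);
    snprintf(p2, sizeof p2, "%s/file-u2", dir);
    snprintf(ps, sizeof ps, "%s/file-u2-s", dir);

    int ok = append_text(p1, "Hi\n") == 0        /* step 1 */
          && link(p1, p2) == 0                    /* step 2: hard link */
          && symlink("file-u1", ps) == 0          /* step 3: symbolic link */
          && stat(p2, &st) == 0
          && append_text(p1, "again\n") == 0      /* step 6 */
          && unlink(p1) == 0                      /* step 7 */
          && read_text(p2, hard_text, cap) == 0;  /* step 8 */
    if (ok) {
        *links = st.st_nlink;
        errno = 0;                                /* step 9 */
        *soft_errno = (read_text(ps, scratch, sizeof scratch) == 0) ? 0 : errno;
    }
    unlink(p1);
    unlink(p2);
    unlink(ps);
    rmdir(dir);
    return ok ? 0 : -1;
}
```

On a filesystem with ordinary POSIX link semantics, the hard link is expected to still show both lines after the removal, because it names the same i-node, while opening the symbolic link fails with `ENOENT`, because it only stores the removed name.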


Mounting

[Figure: directory tree with root / containing usr, bin, tmp and windows; a second tree containing Windows, Documents and Settings, and Temp]

• The directory i-node indicates that it is a mount point


Disk Space Management

Store files in fixed-size blocks: how big should the blocks be?
- Average file size is important

[Figure: plot with axis "Block size (bytes)"; all files are 2 KB large]


Keeping Track of Free Blocks (1)

(a) Storing the free list on a linked list (32 bits / block)
(b) A bit map (1 bit per block, but for all blocks)


Keeping Track of Free Blocks (2)

• Bitmap size depends on disk and block size
• Linked list size depends on # free blocks
• Bitmaps are generally smaller
• Linked lists can use free blocks …
• Only one block of the linked list needed in main memory
  – The others are read/written on demand
  – Problems? What happens if files are deleted?
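
The size comparison can be worked through with two small functions. The sketch assumes 32-bit block numbers, as on the previous slide, and that each free-list block gives up one slot for the pointer to the next list block; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks needed for a bitmap with one bit per disk block. */
uint64_t bitmap_blocks(uint64_t disk_blocks, uint64_t block_bytes)
{
    uint64_t bits_per_block = block_bytes * 8;
    return (disk_blocks + bits_per_block - 1) / bits_per_block;
}

/*
 * Blocks needed for a free list holding free_blocks entries. Each list
 * block stores 32-bit block numbers, one slot being reserved for the
 * pointer to the next list block (assumption).
 */
uint64_t free_list_blocks(uint64_t free_blocks, uint64_t block_bytes)
{
    uint64_t per_block = block_bytes / 4 - 1;
    return (free_blocks + per_block - 1) / per_block;
}
```

For a disk of 2^24 blocks of 1 KB, the bitmap always takes 2048 blocks. A free list for a completely empty disk takes 65794 blocks, but for only 1000 free blocks it takes 4, so the list is the smaller structure only when the disk is nearly full.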


Keeping Track of Free Blocks (3)

(a) Almost-full block of pointers to free disk blocks in RAM
    - three blocks of pointers on disk
(b) Result of freeing a 3-block file
(c) Alternative strategy for handling 3 free blocks
    - shaded entries are pointers to free disk blocks


Quota

Quotas for keeping track of each user’s disk use


Backups

• Performing filesystem backups is essential for reliable systems

• Two types
  – Full
  – Incremental
• Typically a mixed algorithm is used
  – How to keep track of which files to save?


Backups

• A filesystem to be dumped
  – squares are directories, circles are files
  – shaded items, modified since last dump
  – each directory & file labeled by i-node number

[Figure label: file that has not changed]


Backups

• Commonly all modified files and directories above them are stored
  – Can restore on another filesystem
  – Individual files can be restored from incremental backup
• Bitmaps are used to find the modified i-nodes


Backups

4 phases of the algorithm

1. Recursively mark each dir and each modified i-node (a)

2. Recursively unmark non-modified dirs (b)

3. Dump all dirs (c)

4. Dump all modified i-nodes (d)


Outline

• Implementing filesystems on disk
  – Implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – Implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation
  – ext2, ext3


The Common File Model

Process → struct file → struct dentry → struct inode

• struct file: represents an open file in a process (from <fs.h>)
• struct dentry: represents a directory entry (from <dcache.h>)
• struct inode: represents a file in the filesystem (from <fs.h>)
• struct super_block: represents a filesystem


task_struct (sched.h)

struct task_struct {
        volatile long state;
        struct thread_info *thread_info;
        atomic_t usage;
        unsigned long flags;
        unsigned long ptrace;
        int lock_depth;
        int prio, static_prio;
        struct list_head run_list;
        prio_array_t *array;
        unsigned long sleep_avg;
        long interactive_credit;
        […]
        /* file system info */
        int link_count, total_link_count;
        struct tty_struct *tty; /* NULL if no tty */
        /* ipc stuff */
        struct sysv_sem sysvsem;
        /* CPU-specific state of this task */
        struct thread_struct thread;
        /* filesystem information */
        struct fs_struct *fs;
        /* open file information */
        struct files_struct *files;
        /* namespace */
        struct namespace *namespace;
        /* signal handlers */
        struct signal_struct *signal;
        struct sighand_struct *sighand;
        […]
};

struct files_struct {
        atomic_t count;
        spinlock_t file_lock;
        int max_fds;
        int max_fdset;
        int next_fd;
        struct file ** fd; /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        fd_set close_on_exec_init;
        fd_set open_fds_init;
        struct file * fd_array[NR_OPEN_DEFAULT];
};

Remember:
• Each process is represented using a task_struct
• Keeps “a list” of open files
  – files_struct


File (fs.h)

struct file {
        struct list_head f_list;
        struct dentry *f_dentry;        /* directory entry for the file! */
        struct vfsmount *f_vfsmnt;
        struct file_operations *f_op;   /* set by the OS when file loaded from inode */
        atomic_t f_count;               /* file reference count */
        unsigned int f_flags;
        mode_t f_mode;
        loff_t f_pos;                   /* current file pointer (offset) */
        struct fown_struct f_owner;
        unsigned int f_uid, f_gid;
        int f_error;
        struct file_ra_state f_ra;
        unsigned long f_version;
        void *f_security;
        [..]
};

The file object:
• Created by the OS when a file is opened
• Does not exist on disk!
  – no “dirty” bit is needed
• Several processes can use the same file object
• Contains a list of pointers to operations on this file


Operations of Files

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void __user *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long,
                                           unsigned long, unsigned long);
};


Dentry (Directory Entry)

• Dentry does not represent directories!
  – i-nodes represent directories
• Used in directory related operations
  – e.g., pathname lookup

/users/aja/crap/exam.tex

1 dentry and 1 i-node for each component


Dentry Cache

• Dentry objects are created on the fly
  – time consuming!
  – inefficient
• dentry objects are often reused soon after creation
• Store dentry objects in a SW cache
  – the dentry cache (remember dcache.h)


Software Caches

• The frequently used (created/destroyed) objects are stored/allocated in SW caches
• Basically three caches exist in Linux
  – User mode memory (VM)
  – Slab allocator (common structures/objects)
  – Page cache (inodes, disk blocks)
• Disk caches (the Page Cache) are used to cache disk accesses (not VM pages!!)
  – Crucial to system performance!
  – Must also be part of the page replacement algorithm
• Bovet, Ch. 17


Dentry Cache

• Unused dentry objects stored in a list
  – Allows easy LRU replacement
• A hash table (name → dentry)
  – Allows fast lookup
• Dentry states:
  – In use: used, and contains valid info
  – Unused: not used, but points to valid i-node
  – Negative: the i-node does not exist, kept to speed up lookups
  – Free: contains no valid info (stored in the slab cache)

Can safely be deleted by the page replacement algorithm


dentry (dcache.h)

struct dentry {
        atomic_t d_count;
        unsigned long d_vfs_flags;      /* moved here to be on same cacheline */
        spinlock_t d_lock;              /* per dentry lock */
        struct inode * d_inode;         /* Where the name belongs to - NULL is negative */
        struct list_head d_lru;         /* LRU list */
        struct list_head d_child;       /* child of parent list */
        struct list_head d_subdirs;     /* our children */
        struct list_head d_alias;       /* inode alias list */
        unsigned long d_time;           /* used by d_revalidate */
        struct dentry_operations *d_op;
        struct super_block * d_sb;      /* The root of the dentry tree */
        unsigned int d_flags;
        int d_mounted;
        void * d_fsdata;                /* fs-specific data */
        struct rcu_head d_rcu;
        struct dcookie_struct * d_cookie; /* cookie, if any */
        unsigned long d_move_count;     /* to indicated moved dentry while lockless lookup */
        struct qstr * d_qstr;           /* quick str ptr used in lockless lookup and concurrent d_move */
        struct dentry * d_parent;       /* parent directory */
        struct qstr d_name;
        struct hlist_node d_hash;       /* lookup hash list */
        struct hlist_head * d_bucket;   /* lookup hash bucket */
        unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
} ____cacheline_aligned;

dentry:
• Associates the components of a pathname to their inodes
• Does not exist on disk


inode (fs.h)

struct inode {
        struct hlist_node i_hash;
        struct list_head i_list;
        struct list_head i_sb_list;
        struct list_head i_dentry;
        unsigned long i_ino;
        atomic_t i_count;
        umode_t i_mode;
        unsigned int i_nlink;
        uid_t i_uid;
        gid_t i_gid;
        dev_t i_rdev;
        loff_t i_size;
        struct timespec i_atime;
        struct timespec i_mtime;
        struct timespec i_ctime;
        unsigned int i_blkbits;
        unsigned long i_blksize;
        unsigned long i_version;
        unsigned long i_blocks;
        unsigned short i_bytes;
        spinlock_t i_lock;
        struct semaphore i_sem;
        struct inode_operations *i_op;
        struct file_operations *i_fop;
        struct super_block *i_sb;
        struct file_lock *i_flock;
        struct address_space *i_mapping;
        struct address_space i_data;
        struct dquot *i_dquot[MAXQUOTAS];
        /* These three should probably be a union */
        struct list_head i_devices;
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
        int i_cindex;
        unsigned long i_dnotify_mask;
        struct dnotify_struct *i_dnotify;
        unsigned long i_state;
        unsigned int i_flags;
        unsigned char i_sock;
        atomic_t i_writecount;
        void *i_security;
        u32 i_generation;
        union {
                void *generic_ip;
        } u;
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t i_size_seqcount;
#endif
};

• List of operations supported on this file(system)
• There is also an inode cache (inode.c)
• Structure with pointers to the page cache


inode_operations (fs.h)

struct inode_operations {
        int (*create) (struct inode *, struct dentry *, int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *, struct dentry *, struct nameidata *);
        int (*link) (struct dentry *, struct inode *, struct dentry *);
        int (*unlink) (struct inode *, struct dentry *);
        int (*symlink) (struct inode *, struct dentry *, const char *);
        int (*mkdir) (struct inode *, struct dentry *, int);
        int (*rmdir) (struct inode *, struct dentry *);
        int (*mknod) (struct inode *, struct dentry *, int, dev_t);
        int (*rename) (struct inode *, struct dentry *,
                       struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *, int);
        int (*follow_link) (struct dentry *, struct nameidata *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
};


struct address_space

• Stores pages in the page cache as a radix tree
  – Remember digital search trees (tries)?
• Allows fast lookup and sorting
  – Retrieve all dirty blocks
• Read more on:
  – http://lwn.net/Articles/175432/
  – Bovet, Ch. 15


super_block (fs.h)

struct super_block {
        struct list_head s_list;        /* Keep this first */
        dev_t s_dev;                    /* search index; _not_ kdev_t */
        unsigned long s_blocksize;
        unsigned long s_old_blocksize;
        unsigned char s_blocksize_bits;
        unsigned char s_dirt;
        unsigned long long s_maxbytes;  /* Max file size */
        struct file_system_type * s_type;
        struct super_operations * s_op;
        struct dquot_operations * dq_op;
        struct quotactl_ops * s_qcop;
        struct export_operations * s_export_op;
        unsigned long s_flags;
        unsigned long s_magic;
        struct dentry * s_root;
        struct rw_semaphore s_umount;
        struct semaphore s_lock;
        int s_count;
        int s_syncing;
        int s_need_sync_fs;
        atomic_t s_active;
        void * s_security;
        struct list_head s_inodes;      /* all inodes */
        struct list_head s_dirty;       /* dirty inodes */
        struct list_head s_io;          /* parked for writeback */
        struct hlist_head s_anon;       /* anonymous dentries for (nfs) exporting */
        struct list_head s_files;
        struct block_device * s_bdev;
        struct list_head s_instances;
        struct quota_info s_dquot;      /* Diskquota specific options */
        char s_id[32];                  /* Informational name */
        struct kobject kobj;            /* anchor for sysfs */
        void * s_fs_info;               /* Filesystem private info */
        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct semaphore s_vfs_rename_sem; /* Kludge */
};

• Used to store filesystem specific information
• This reflects VFS’s view of the fs!


Outline

• Implementing filesystems on disk
  – Implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – Implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation
  – Ext2: popular and robust
  – Ext3: extended with journaling

Bovet, Ch. 18


Ext2

• Basic features
  – Native to Linux
  – Variable block size
  – “Related” blocks stored in Block Groups
  – Pre-allocates blocks to allow file growth
  – Supports fast symlinks


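Ext2

Ext2 partition: Boot Block | Block Group 0 | … | Block Group n

Layout of one block group (number of blocks in parentheses):
• Super Block (1)
• Block group descriptors (n)
• Data Block Bitmap (1): one bit for each block in the group
• inode Bitmap (1)
• inode Table (n)
• Data Blocks (n)

The super block and the block group descriptors: copy in every block group

Number of block groups: about s/(8b), where s = partition size (in blocks) and b = block size (bytes)

A block group descriptor contains:
• pointer to block bitmap
• pointer to inode bitmap
• pointer to inode table
• free blocks count
• free inodes count
• directory count
• pads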
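
The s/(8b) expression follows from the bitmap occupying a single block: one block of b bytes has 8b bits, so one group can cover at most 8b blocks. A small sketch of that arithmetic, assuming s is counted in blocks and rounding up for a partial last group (both are assumptions, as are the function names):

```c
#include <assert.h>
#include <stdint.h>

/* Blocks covered by one bitmap block: one bit per block. */
uint64_t blocks_per_group(uint64_t block_bytes)
{
    return 8 * block_bytes;
}

/* Approximate number of block groups for a partition of s_blocks blocks. */
uint64_t block_group_count(uint64_t s_blocks, uint64_t block_bytes)
{
    uint64_t per_group = blocks_per_group(block_bytes);
    return (s_blocks + per_group - 1) / per_group;
}
```

For example, with 4096-byte blocks a group covers 32768 blocks (128 MB), and an 8 GB partition of 2097152 blocks is split into 64 groups.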


Disk vs. Memory Structs

• There needs to be a mapping
  – VFS ↔ disk structures
  – inode ↔ ext2_inode
  – superblock ↔ ext2_super_block

• Most structures are stored in page cache

• Some operations are generic VFS and some ext2-specific


Ext2

Disk data structure for an inode (fixed size of 128 bytes)

struct ext2_inode {
        __u16 i_mode;           /* File mode */
        __u16 i_uid;            /* Low 16 bits of Owner Uid */
        __u32 i_size;           /* Size in bytes */
        __u32 i_atime;          /* Access time */
        __u32 i_ctime;          /* Creation time */
        __u32 i_mtime;          /* Modification time */
        __u32 i_dtime;          /* Deletion Time */
        __u16 i_gid;            /* Low 16 bits of Group Id */
        __u16 i_links_count;    /* Links count */
        __u32 i_blocks;         /* Blocks count */
        __u32 i_flags;          /* File flags */
        union osd1;             /* OS dependent 1 */
        __u32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
        __u32 i_generation;     /* File version (for NFS) */
        __u32 i_file_acl;       /* File ACL */
        __u32 i_dir_acl;        /* Directory ACL */
        __u32 i_faddr;          /* Fragment address */
        union osd2;             /* OS dependent 2 */
};

• i_size: effective length of file
• i_blocks: #blocks allocated to file
• i_block: pointer to the blocks
• i_file_acl: pointer to extended attributes


File Size

• i_size and i_blocks do not always match
  – internal fragmentation in blocks
    • i_size < i_blocks*512
  – File “holes”
    • i_size > i_blocks*512

echo -n "S" | dd of=hole bs=1024 seek=6

• Creates a file (hole) with zeroes and an ‘S’
• Only one block is allocated
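
The same hole can be produced from C with `lseek()` and `write()`, and `fstat()` then shows both numbers side by side. A minimal sketch, assuming a POSIX system; `make_hole` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a file with a hole: seek past the end, then write one byte.
 * Mirrors: echo -n "S" | dd of=hole bs=1024 seek=6
 * Returns 0 on success and fills *st with the file's metadata.
 */
int make_hole(const char *path, off_t skip_bytes, struct stat *st)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = -1;
    if (lseek(fd, skip_bytes, SEEK_SET) == skip_bytes &&
        write(fd, "S", 1) == 1 &&
        fstat(fd, st) == 0)
        rc = 0;
    close(fd);
    return rc;
}
```

After `make_hole("hole", 6 * 1024, &st)`, `st.st_size` is 6145. On Linux, `st.st_blocks` counts 512-byte units, so on a filesystem that supports holes `st.st_blocks * 512` is expected to be smaller than `st.st_size`; the exact block count depends on the filesystem and its block size.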


Ext2

• Ext2 supports the following file types
  – Unknown
  – Regular
  – Directory
    • Stores names and inode numbers in data blocks
  – Character and block devices, named pipes and sockets
    • Use no data blocks
  – Symbolic links
    • Stores filenames < 60 characters in inode, else in data block
    • Uses the i_block[EXT2_N_BLOCKS] field


Directory Blocks

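[Figure: layout of an ext2 directory block. Each entry has the fields inode, rec_len, name_len, file_type and name. The first entry shown has inode 21, rec_len 12, name_len 1, file_type 2 and name "." padded with \0; further entries include "..", "home1", "usr" and "file".]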

Example from Bovet, 2005. p. 749


Ext2 – How to Find a Block

Finding the block number within a file is simple:

f div b

where b is the block size, f is the position in the file

The fth character is in the f mod b position in the block
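
As a worked example of the two formulas, a short sketch (`struct file_pos` and `locate` are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

struct file_pos {
    uint64_t block;   /* file block number: f div b */
    uint64_t offset;  /* byte position inside that block: f mod b */
};

/* Split file position f into a block number and an offset, for block size b. */
struct file_pos locate(uint64_t f, uint64_t b)
{
    struct file_pos p = { f / b, f % b };
    return p;
}
```

For f = 6144 this gives block 6, offset 0 with 1024-byte blocks, and block 1, offset 2048 with 4096-byte blocks.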


Ext2 – How to Find a Block

• Blocks on disk and blocks in files are not the same

[Figure: file block numbers 0 to 4 of a file mapped onto logical block numbers 0 to 9 on disk]


Ext2 – How to Find a Block

• File block → disk block mapping is stored (partly) in the inode

• Remember the

__u32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */

• Usually fifteen 32-bit words used as indexes


Ext2 – How to Find a Block

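i_block[] index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

File block numbers reachable through each entry (b = block size in bytes):
• i_block[0..11] (direct): 0 to 11
• i_block[12] (single indirect): 12 to (b/4) + 11
• i_block[13] (double indirect): (b/4) + 12 to (b/4)^2 + (b/4) + 11
• i_block[14] (triple indirect): (b/4)^2 + (b/4) + 12 to (b/4)^3 + (b/4)^2 + (b/4) + 11

Logical block number is 4 bytes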
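
The ranges on this slide can be checked with a short function that tells which level of indirection a file block number needs. It assumes 12 direct pointers and 4-byte block numbers, as stated in these slides; `indirection_level` is a hypothetical name, and real ext2 code also computes the offsets inside each indirect block, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12   /* direct pointers in i_block[] */

/* 0 = direct, 1 = single, 2 = double, 3 = triple indirect, -1 = too large. */
int indirection_level(uint64_t n, uint64_t block_bytes)
{
    uint64_t p = block_bytes / 4;       /* 4-byte block numbers per block */
    uint64_t span = 1;                  /* blocks reachable at this level */

    if (n < EXT2_NDIR_BLOCKS)
        return 0;
    n -= EXT2_NDIR_BLOCKS;
    for (int level = 1; level <= 3; level++) {
        span *= p;                      /* p, p^2, p^3 */
        if (n < span)
            return level;
        n -= span;
    }
    return -1;
}
```

With b = 1024 (so b/4 = 256), file block 267 = (b/4) + 11 is the last one behind the single indirect block, and 268 = (b/4) + 12 is the first one that needs the double indirect block.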


Holes again

echo -n "S" | dd of=hole bs=1024 seek=6

For block size = 4096 bytes:

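[Figure: the i_block[0..14] array and the file contents for this case: zero bytes (\0) from offset 0, with the 'S' as the last of the 6145 bytes, inside the 4096-byte block that starts at offset 4096]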


Disk Space Management

• Want to avoid fragmentation
  – Block groups
• Management should be fast
  – SW Caches
• Where to create a new inode?
  – For regular files: in the block group of its parent directory
  – Spread “root” directories in different block groups
  – Other nested directories in the same BG if not too full


Disk Space Management

• Where to allocate new blocks to a file
  – Near the already allocated blocks
  – In the same block group
  – Other block groups
• Pre-allocation
  – 8 blocks are (pre-)allocated
  – Released when a file is closed.


Disk Space Management

Data Block Allocation algorithm

1. Try to use an already (pre-)allocated block
2. Find a new block in the same block group
3. Consider all block groups
   1. Find a free “byte” (8 blocks)
   2. Find a free “bit” (1 block)
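
The "byte first, then bit" search of step 3 can be sketched on a single bitmap. This is an illustration of the idea only, not the ext2 allocator: it assumes that a set bit means "in use" and that bit 0 of byte 0 is block 0, and `find_free_block` and `mark_used` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Search a block bitmap (bit set = block in use).
 * First look for a completely free byte (8 adjacent free blocks),
 * then fall back to any single free bit.
 * Returns the block index, or -1 if every block is in use.
 */
long find_free_block(const uint8_t *bitmap, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        if (bitmap[i] == 0)
            return (long)(i * 8);

    for (size_t i = 0; i < nbytes; i++)
        for (int bit = 0; bit < 8; bit++)
            if (!(bitmap[i] & (1u << bit)))
                return (long)(i * 8 + bit);

    return -1;
}

/* Mark a block as allocated. */
void mark_used(uint8_t *bitmap, long block)
{
    bitmap[block / 8] |= (uint8_t)(1u << (block % 8));
}
```

Preferring a whole free byte leaves room for the 8-block pre-allocation next to the new block, which helps keep a growing file contiguous.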


Disk Space Management


Consistency

• What if page/block caches are not written to disk before a crash?

• Inconsistency = data was partially written

• How do we know if data is inconsistent?


Consistency

• File system states
  (a) consistent
  (b) missing block
  (c) duplicate block in free list
  (d) duplicate data block

Count blocks in:
• inodes
• Free list
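
The counting check can be sketched with two counters per block: how often the block appears in some inode and how often it appears in the free list. The enum and function names are hypothetical, and the fifth state (a block that is both in use and in the free list) is not on the slide; it is included only so that every combination of counts is covered.

```c
#include <assert.h>
#include <stddef.h>

enum block_state {
    BLOCK_CONSISTENT,    /* (a) in exactly one file, or once in the free list */
    BLOCK_MISSING,       /* (b) in no file and not in the free list */
    BLOCK_DUP_FREE,      /* (c) more than once in the free list */
    BLOCK_DUP_DATA,      /* (d) used by more than one file */
    BLOCK_USED_AND_FREE  /* in a file and in the free list (not on the slide) */
};

/* Count how often each block number occurs in list[0..n-1]. */
void count_blocks(const unsigned *list, size_t n,
                  unsigned *counts, size_t nblocks)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] < nblocks)
            counts[list[i]]++;
}

/* Classify one block from its two counters. */
enum block_state classify(unsigned in_use, unsigned in_free)
{
    if (in_use > 1)
        return BLOCK_DUP_DATA;
    if (in_free > 1)
        return BLOCK_DUP_FREE;
    if (in_use == 1 && in_free == 1)
        return BLOCK_USED_AND_FREE;
    if (in_use == 0 && in_free == 0)
        return BLOCK_MISSING;
    return BLOCK_CONSISTENT;
}
```

A checker would fill the first counter array by walking all inodes and the second by walking the free list (or bitmap), then call `classify` once per block.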


Consistency Semantics

• Consistency semantics specify how multiple users are to access a shared file simultaneously
  – Unix file system (UFS) implements:
    • Writes to an open file visible immediately to other users of the same open file
    • Sharing file pointer to allow multiple users to read and write concurrently
  – AFS has session semantics
    • Writes only visible to sessions starting after the file is closed


Ext3

• Ext3 is pretty much Ext2 + Journaling

• Ext2 works fine as long as the filesystem is cleanly unmounted (remember page cache)

• Consistency checks are expensive!

• Especially for large filesystems (think servers)!


Journaling

Idea: keep the most recent updates on disk

Each filesystem change:
1. Write “change” (log) to disk (journal)
   • Called “commit log”
2. After 1., write to disk, discard the log

Consistency check (boot time)
• Crash before commit
  – Discard changes
• Crash after commit
  – Redo changes
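
The two recovery cases can be sketched with a toy journal that holds a single record. Everything here is an in-memory stand-in invented for the illustration: the "disk" is an int array, a block's contents are one int, and `journal_log`, `journal_checkpoint` and `journal_recover` are hypothetical names, not the ext3 interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DISK_BLOCKS 8   /* size of the toy disk */

struct journal_record {
    size_t block;      /* which disk block the change targets */
    int new_value;     /* toy "contents" of that block */
    bool committed;    /* set once the record is safely in the journal */
};

/* Step 1: log the change and commit it. */
void journal_log(struct journal_record *rec, size_t block, int value)
{
    assert(block < DISK_BLOCKS);
    rec->block = block;
    rec->new_value = value;
    rec->committed = true;
}

/* Step 2: apply the change to its final location and drop the record. */
void journal_checkpoint(int *disk, struct journal_record *rec)
{
    disk[rec->block] = rec->new_value;
    rec->committed = false;
}

/* Boot-time recovery; returns true if a change had to be redone. */
bool journal_recover(int *disk, struct journal_record *rec)
{
    if (!rec->committed)
        return false;               /* crash before commit: discard */
    journal_checkpoint(disk, rec);  /* crash after commit: redo */
    return true;
}
```

Redoing a committed record is safe even if the crash happened halfway through step 2, because writing the same value to the same block again gives the same result.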


Ext3

Three alternatives
• Journaling (slow)
  – Log both data and metadata blocks
• Ordering (faster)
  – Log only metadata, but data is written before metadata (default mode)
• Writeback (fastest)
  – Only metadata is logged


End of filesystems… but please read the chapters in the suggested book(s)