AOS Lab 9: File system -- Of buffers, logs, and blocks

Lab 9: File system – Of buffers, logs, and blocks

Advanced Operating Systems

Zubair Nabi

April 3, 2013

Introduction

The purpose of a file system is to:

1 Organize and store data

2 Support sharing of data among users and applications

3 Ensure persistence of data after a reboot

Challenges

• Need on-disk data structures to:
  • Represent the tree of named directories and files
  • Record the identities of the blocks that hold each file’s content
  • Keep track of the areas of the disk which are free

• The file system needs to support crash recovery
  • A restart must not corrupt the file system or leave it in an inconsistent state

• The file system can be accessed by multiple processes at the same time, and this access needs to be synchronized

• Disk access is orders of magnitude slower than memory access, so the file system must maintain an in-memory cache of popular blocks

xv6 FS layers

Layers, from top to bottom (abstraction provided: implementing layer)

System calls: File descriptors

Pathnames: Recursive lookup

Directories: Directory inodes

Files: Inodes and block allocator

Transactions: Logging

Blocks: Buffer cache

xv6 FS layers (2)

1 Buffer cache: Reads and writes blocks on the IDE disk and synchronizes access to disk blocks
  • Ensures that only one kernel process can edit any particular block at a time

2 Logging: Ensures atomicity by enabling higher layers to wrap updates to several blocks in a transaction

3 Inodes and block allocator: Provides unnamed files; each unnamed file is represented by an inode and a sequence of blocks holding the file’s content

xv6 FS layers (3)

4 Directory inodes: Implements directories as a special kind of inode
  • The content of this inode is a sequence of directory entries, each of which contains a name and a reference to the named file’s inode

5 Recursive lookup: Provides hierarchical path names, such as /foo/bar/baz.txt, via recursive lookup

6 File descriptors: Abstracts many Unix resources, such as pipes, devices, and files, using the file system interface

File system layout

• xv6 lays out inodes and content blocks on the disk by dividing the disk into several sections

boot | super | inodes ... | bitmap ... | data ... | log ...

0      1       2 ...

• Block 0 holds the boot sector

• Block 1 (called the superblock) contains metadata about the file system
  • File system size in blocks, the number of data blocks, the number of inodes, and the number of blocks in the log

• Blocks starting at 2 hold inodes, with multiple inodes per block

• Inode blocks are followed by bitmap blocks, which keep track of the data blocks in use

• Bitmap blocks are followed by data blocks, which hold file and directory contents

• Finally, the blocks at the end of the disk hold the log, which is required by the transaction layer
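The layout above can be expressed as block-number arithmetic. The following is a minimal sketch, assuming 512-byte blocks and 64-byte on-disk inodes (8 inodes per block). The struct and helper names are illustrative, loosely modelled on the superblock fields listed above, and the extra inode block in bitmap_start is an assumed rounding convention rather than something stated on the slides.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE 512                 /* assumed block size in bytes */
#define INODE_SIZE 64             /* assumed size of one on-disk inode */
#define IPB (BSIZE / INODE_SIZE)  /* inodes per block */
#define BPB (BSIZE * 8)           /* bitmap bits per block */

/* Hypothetical in-memory copy of the superblock fields. */
struct superblock {
  uint32_t size;     /* file system size in blocks */
  uint32_t nblocks;  /* number of data blocks */
  uint32_t ninodes;  /* number of inodes */
  uint32_t nlog;     /* number of log blocks */
};

/* Block holding inode number inum: inode blocks start at block 2. */
static uint32_t inode_block(uint32_t inum) {
  return 2 + inum / IPB;
}

/* First bitmap block. Assumption: one extra inode block is reserved
   so that an inode count that is not a multiple of IPB still fits. */
static uint32_t bitmap_start(const struct superblock *sb) {
  return 2 + sb->ninodes / IPB + 1;
}

/* Bitmap block holding the in-use bit for block number b. */
static uint32_t bitmap_block(const struct superblock *sb, uint32_t b) {
  return bitmap_start(sb) + b / BPB;
}

/* First log block: the log occupies the last nlog blocks of the disk. */
static uint32_t log_start(const struct superblock *sb) {
  assert(sb->nlog <= sb->size);
  return sb->size - sb->nlog;
}
```

With example values of 1024 blocks, 200 inodes, and a 10-block log, inode 9 lives in block 3, the bitmap starts at block 28, and the log starts at block 1014.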

Buffer cache layer

• Has two main jobs:
  1 Synchronize access to disk blocks
  2 Cache popular blocks

• Main interface:
  1 bread: Obtains a buffer containing a copy of a block
  2 bwrite: Writes a modified buffer
  3 brelse: Releases a buffer (after a read or write)

Buffer cache layer (2)

• Synchronizes access to each block by allowing only a single kernel thread to have a reference to the block’s buffer

• If one thread is holding a reference to a buffer, other threads that want it will sleep until it is released

• The buffer cache has a fixed number of buffers to host disk blocks

• If higher layers ask for a block that is not cached, the buffer cache recycles the least recently used buffer for this block

Buffer cache

• The buffer cache is a doubly-linked list of struct buf, with NBUF buffers, accessed via bcache.head

• A buffer has three state bits
  1 B_VALID
  2 B_DIRTY
  3 B_BUSY
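As a rough illustration of how the three state bits sit in a buffer, here is a simplified sketch modelled on xv6's struct buf. The field list is abbreviated, the block size is an assumption, and the two helper functions are hypothetical names introduced only for this example.

```c
#include <assert.h>

#define BSIZE 512        /* assumed block size in bytes */

#define B_BUSY  0x1      /* some kernel thread holds a reference to the buffer */
#define B_VALID 0x2      /* buffer holds a valid copy of the disk block */
#define B_DIRTY 0x4      /* buffer was modified and must be written to disk */

struct buf {
  int flags;                  /* combination of B_BUSY, B_VALID, B_DIRTY */
  unsigned dev;               /* device the block belongs to */
  unsigned sector;            /* sector number on that device */
  struct buf *prev;           /* neighbours in the cache list */
  struct buf *next;
  unsigned char data[BSIZE];  /* cached copy of the block */
};

/* Hypothetical helper: does the block still have to be read from disk? */
static int buf_needs_read(const struct buf *b) {
  return !(b->flags & B_VALID);
}

/* Hypothetical helper: does the buffer hold changes not yet on disk? */
static int buf_needs_write(const struct buf *b) {
  return (b->flags & B_DIRTY) != 0;
}
```

Because each flag is a separate bit, a buffer can be busy, valid, and dirty at the same time.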

bread

• Makes a call to bget() to get a buffer for the given sector

• If the buffer is not B_VALID, it makes a call to iderw to read the block from disk into the buffer

Code: bread

struct buf*
bread(uint dev, uint sector)
{
  struct buf *b;

  b = bget(dev, sector);      // get a busy buffer for this sector
  if(!(b->flags & B_VALID))
    iderw(b);                 // not cached yet: read it from disk
  return b;
}

bget

• Scans the buffer list for a buffer with the given dev and sector

  1 If such a buffer is present and B_BUSY is not set, it sets B_BUSY and returns the buffer

  2 If B_BUSY is set, it goes to sleep on the buffer
    • Important: After bget wakes up, it cannot assume that the buffer is available now (it might have been reused for a different sector), so it starts all over

  3 If the buffer is not present, it reuses an existing buffer: it edits the buffer’s metadata to record the new dev and sector, sets B_BUSY, and clears B_VALID and B_DIRTY

bwrite

• Once bread returns a buffer, the caller has exclusive use of it

• If the caller writes to the buffer, it must call bwrite

• bwrite sets B_DIRTY and makes a call to iderw

Code: bwrite

void
bwrite(struct buf *b)
{
  if((b->flags & B_BUSY) == 0)  // caller must hold the buffer
    panic("bwrite");
  b->flags |= B_DIRTY;          // mark the buffer as modified
  iderw(b);                     // write it to disk
}

brelse

• Moves the buffer from its current position to the front of the buffer cache linked list, clears the B_BUSY bit, and wakes up any processes sleeping on that particular buffer

• This move keeps the buffers ordered by how recently they were used

• Why do we need to do this?
  • It makes the scans in bget efficient. Remember it’s a doubly linked list: bget can search for a cached block starting from the most recently used end and pick a buffer to recycle starting from the least recently used end

Code: brelse

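void
brelse(struct buf *b)
{
  if((b->flags & B_BUSY) == 0)
    panic("brelse");

  acquire(&bcache.lock);

  // unlink the buffer from its current position
  b->next->prev = b->prev;
  b->prev->next = b->next;
  // reinsert it at the front (most recently used)
  b->next = bcache.head.next;
  b->prev = &bcache.head;
  bcache.head.next->prev = b;
  bcache.head.next = b;

  b->flags &= ~B_BUSY;
  wakeup(b);                    // wake threads sleeping on this buffer

  release(&bcache.lock);
}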
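To see how bget and brelse cooperate to give least-recently-used recycling, the following is a single-threaded, user-space sketch. It is a simplification under stated assumptions, not the kernel code: there is no locking or sleeping (where the kernel would sleep or panic, this version returns NULL), the cache has only three buffers, and skipping dirty buffers during recycling is an assumption about the replacement policy.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 3           /* tiny cache so that recycling is easy to observe */
#define B_BUSY  0x1
#define B_VALID 0x2
#define B_DIRTY 0x4

struct buf {
  int flags;
  unsigned dev, sector;
  struct buf *prev, *next;
};

static struct {
  struct buf buf[NBUF];
  struct buf head;       /* sentinel: head.next is the most recently used */
} bcache;

/* Link all buffers into the circular list; dev -1 marks "unused". */
static void binit(void) {
  bcache.head.prev = bcache.head.next = &bcache.head;
  for (int i = 0; i < NBUF; i++) {
    struct buf *b = &bcache.buf[i];
    b->flags = 0;
    b->dev = (unsigned)-1;
    b->sector = 0;
    b->next = bcache.head.next;
    b->prev = &bcache.head;
    bcache.head.next->prev = b;
    bcache.head.next = b;
  }
}

/* Returns a B_BUSY buffer for (dev, sector), or NULL where the kernel
   would sleep (buffer busy) or panic (no buffer to recycle). */
static struct buf *bget(unsigned dev, unsigned sector) {
  struct buf *b;

  /* Forward scan, most recently used first: is the block cached? */
  for (b = bcache.head.next; b != &bcache.head; b = b->next) {
    if (b->dev == dev && b->sector == sector) {
      if (b->flags & B_BUSY)
        return NULL;             /* kernel: sleep, then rescan from the start */
      b->flags |= B_BUSY;
      return b;
    }
  }

  /* Backward scan, least recently used first: pick a buffer to recycle. */
  for (b = bcache.head.prev; b != &bcache.head; b = b->prev) {
    if (!(b->flags & (B_BUSY | B_DIRTY))) {
      b->dev = dev;
      b->sector = sector;
      b->flags = B_BUSY;         /* also clears B_VALID and B_DIRTY */
      return b;
    }
  }
  return NULL;
}

/* Release a buffer and move it to the front of the list. */
static void brelse(struct buf *b) {
  assert(b->flags & B_BUSY);
  b->next->prev = b->prev;
  b->prev->next = b->next;
  b->next = bcache.head.next;
  b->prev = &bcache.head;
  bcache.head.next->prev = b;
  bcache.head.next = b;
  b->flags &= ~B_BUSY;
}
```

After touching sectors 10, 11, and 12 and then 10 again, a request for a fourth sector recycles the buffer that held sector 11, since it is now the least recently used.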

Logging layer

• xv6 implements file system fault tolerance through a simple logging mechanism

• System calls do not directly write file system data structures

• Instead:
  1 A system call first writes a description of all the disk writes that it wishes to perform to a log on the disk
  2 It then writes a special commit record to the log to specify that it contains a complete operation
  3 Next it copies the required writes to the on-disk file system data structures
  4 Finally, it deletes the log

Recovery

• In case of a reboot, the file system performs recovery by looking at the log

• If the log contains the commit record, the recovery code copies the required writes to the on-disk data structures

• If the log does not contain a complete operation, it is ignored and deleted

Correctness of recovery mechanism

• If the crash occurs before the commit record, the log will be ignored, and the state of the disk will stay unmodified

• If the crash occurs after the commit record, then the recovery will replay all of the operation’s writes, even repeating them if the crash occurred during the write to the on-disk data structure

• In both cases, the correctness of the file system is preserved: either all writes are reflected on the disk or none

Log design

• The log resides at a fixed location at the end of the disk

• It consists of a header block and a set of data blocks

• The header block contains
  1 An array of sector numbers, one for each of the logged data blocks
  2 Count of logged blocks

• The header block is written when a transaction commits, not before
  • The count is set to zero once all logged blocks have been reflected in the file system
  • The count will be zero in case of a crash before a commit
  • The count will be non-zero in case of a crash after a commit
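The role of the header count as the commit point can be illustrated with a small user-space simulation. This is a sketch under stated assumptions: the "disk" is an in-memory struct, blocks are 8 bytes, and all function names (log_stage, log_commit, log_install, log_clear, log_recover) are hypothetical names chosen for this example rather than xv6 functions.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16       /* size of the simulated disk, in blocks */
#define LOGSIZE 4        /* maximum number of data blocks in the log */
#define BSIZE 8          /* tiny blocks keep the simulation small */

struct logheader {
  int n;                 /* count of logged blocks; 0 means nothing committed */
  int sector[LOGSIZE];   /* home sector of each logged block */
};

struct disk {
  unsigned char block[NBLOCKS][BSIZE];    /* file system blocks */
  struct logheader head;                  /* on-disk log header */
  unsigned char logdata[LOGSIZE][BSIZE];  /* on-disk log data blocks */
};

/* Stage a write: the data goes to the log area, the header on disk is
   untouched, and the sector is recorded only in the in-memory header. */
static void log_stage(struct disk *d, struct logheader *mem,
                      int sector, const void *data) {
  assert(mem->n < LOGSIZE);
  memcpy(d->logdata[mem->n], data, BSIZE);
  mem->sector[mem->n++] = sector;
}

/* Commit: writing the header with a non-zero count is the commit point. */
static void log_commit(struct disk *d, const struct logheader *mem) {
  d->head = *mem;
}

/* Copy every logged block to its home location; safe to repeat. */
static void log_install(struct disk *d) {
  for (int i = 0; i < d->head.n; i++)
    memcpy(d->block[d->head.sector[i]], d->logdata[i], BSIZE);
}

/* Erase the log by resetting the count on disk. */
static void log_clear(struct disk *d) {
  d->head.n = 0;
}

/* Recovery after a reboot: replay whatever was committed, then clear. */
static void log_recover(struct disk *d) {
  log_install(d);
  log_clear(d);
}
```

If a crash happens before log_commit, the on-disk count is still zero and log_recover changes nothing. If it happens after log_commit, even halfway through installing, log_recover replays every logged block, so the file system ends up with all of the writes.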

Log design (2)

• Each system call marks the start and end of the sequence of writes that must be atomic; such a sequence is a transaction

• Only one system call can be in a transaction at any given time to ensure correctness

• The log holds at most one transaction at a time

• Only read system calls can execute concurrently with a transaction

• A fixed amount of space on the disk is dedicated to hold the log
  • No system call can write more distinct blocks than the size of the log
  • Large writes are broken into multiple smaller writes so that each write can fit in the log
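Breaking a large write into log-sized pieces can be sketched as a simple loop. The code below is an illustration under assumptions: the log size of 10 blocks, the per-transaction budget formula (reserving blocks for the inode, an indirect block, and slack for unaligned writes, then halving to leave room for bitmap updates), and the names chunked_write, txn_write_fn, and count_txn are all choices made for this example.

```c
#include <assert.h>

#define BSIZE 512
#define LOGSIZE 10   /* assumed log capacity in blocks */

/* Assumed budget of payload bytes per transaction. */
#define MAX_TXN_BYTES (((LOGSIZE - 1 - 1 - 2) / 2) * BSIZE)

/* Hypothetical callback that performs one transaction writing n bytes
   at offset off and returns the number of bytes written. */
typedef int (*txn_write_fn)(int off, int n, void *ctx);

/* Split a write of n bytes into transactions of at most MAX_TXN_BYTES.
   Returns the number of bytes written, or -1 if a transaction fell short. */
static int chunked_write(int n, txn_write_fn write_txn, void *ctx) {
  int i = 0;
  while (i < n) {
    int n1 = n - i;
    if (n1 > MAX_TXN_BYTES)
      n1 = MAX_TXN_BYTES;
    /* In the kernel this is where the transaction would begin,
       the data would be written, and the transaction would commit. */
    int r = write_txn(i, n1, ctx);
    if (r != n1)
      return -1;
    i += r;
  }
  return i;
}

/* Example callback: counts transactions instead of touching a disk. */
static int count_txn(int off, int n, void *ctx) {
  (void)off;
  ++*(int *)ctx;
  return n;
}
```

With these assumed constants each transaction carries at most 1536 bytes, so a 4000-byte write is split into three transactions.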

Code: Typical system call usage of log

begin_trans();
...
bp = bread(...);
bp->data[...] = ...;
log_write(bp);
...
commit_trans();

Log functions

• begin_trans: Waits until it obtains exclusive use of the log

• log_write:
  • Appends the block’s new content to the log on the disk
  • Leaves the modified block in the buffer cache so that subsequent reads of the block during the transaction will yield the updated state
  • Records the block’s sector number in memory so that, when a block is written multiple times during a transaction, the block’s previous copy in the log is overwritten

• commit_trans:
  1 Writes the log’s header block to disk, updating the count
  2 Calls install_trans to copy each block from the log to the relevant location on the disk
  3 Sets the count in the log header to zero
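The bookkeeping that lets log_write overwrite a block's earlier copy can be sketched as a slot lookup over the in-memory header. This is a minimal illustration; log_slot is a hypothetical helper name, the log size is assumed, and the actual copy of the block data into the log is left out.

```c
#include <assert.h>

#define LOGSIZE 10       /* assumed log capacity in blocks */

struct logheader {
  int n;                 /* number of distinct blocks logged so far */
  int sector[LOGSIZE];   /* sector number recorded for each log slot */
};

/* Returns the log slot to use for a sector. A sector that was already
   logged in this transaction reuses its slot, so its previous copy is
   overwritten. Returns -1 if the log is full (the kernel would panic). */
static int log_slot(struct logheader *lh, int sector) {
  for (int i = 0; i < lh->n; i++)
    if (lh->sector[i] == sector)
      return i;                  /* written before: reuse the slot */
  if (lh->n >= LOGSIZE)
    return -1;
  lh->sector[lh->n] = sector;    /* first write: take the next free slot */
  return lh->n++;
}
```

Writing sectors 40, 41, and then 40 again uses only two log slots, which is why the limit on a transaction is stated in terms of distinct blocks.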

Code snippet: filewrite

begin_trans();
ilock(f->ip);
if ((r = writei(f->ip, addr + i, f->off, n1)) > 0)
  f->off += r;
iunlock(f->ip);
commit_trans();

Recovery

Code snippet: recover_from_log

static void
recover_from_log(void)
{
  read_head();
  install_trans();  // if committed, copy from log to disk
  log.lh.n = 0;
  write_head();     // clear the log
}

Code snippet: install_trans

static void
install_trans(void)
{
  int tail;

  for (tail = 0; tail < log.lh.n; tail++) {
    // read log block
    struct buf *lbuf = bread(log.dev, log.start+tail+1);
    // read dst
    struct buf *dbuf = bread(log.dev, log.lh.sector[tail]);
    // copy block to dst
    memmove(dbuf->data, lbuf->data, BSIZE);
    bwrite(dbuf);  // write dst to disk
    brelse(lbuf);
    brelse(dbuf);
  }
}

Block allocator

• Maintains a free bitmap on disk; one bit per block

• A zero bit means that the block is free, while a one indicates that the block is in use

• The bits for the boot sector, superblock, inode blocks, and bitmap blocks are always set

• Provides two functions to allocate (balloc()) and de-allocate (bfree()) a block

balloc

• Calls readsb to read the superblock to get metadata

• Uses this metadata to traverse the entire bitmap and look for a block whose bit is zero

• If it finds a free block, it updates the bitmap and returns the block

bfree

• Finds the corresponding bitmap block

• Clears its bitmap bit
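The bitmap manipulation behind balloc and bfree can be sketched in user space. This is a simplified illustration, assuming a 64-block disk whose bitmap fits in one in-memory array; the on-disk version would read and write bitmap blocks through the buffer cache and the log. bitmap_init is a hypothetical helper that marks the metadata blocks as permanently in use.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 64                    /* assumed size of the disk in blocks */

static uint8_t bitmap[NBLOCKS / 8];   /* one bit per block: 1 = in use */

/* Clear the bitmap, then mark the first nmeta blocks (boot sector,
   superblock, inode blocks, bitmap blocks) as always in use. */
static void bitmap_init(unsigned nmeta) {
  for (unsigned i = 0; i < sizeof bitmap; i++)
    bitmap[i] = 0;
  for (unsigned b = 0; b < nmeta; b++)
    bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
}

/* Scan for the first zero bit, set it, and return the block number.
   Returns -1 if no block is free. */
static int balloc(void) {
  for (unsigned b = 0; b < NBLOCKS; b++) {
    uint8_t m = (uint8_t)(1u << (b % 8));
    if ((bitmap[b / 8] & m) == 0) {
      bitmap[b / 8] |= m;
      return (int)b;
    }
  }
  return -1;
}

/* Clear the bit of block b; freeing a block that is already free is a bug. */
static void bfree(unsigned b) {
  uint8_t m = (uint8_t)(1u << (b % 8));
  assert(b < NBLOCKS && (bitmap[b / 8] & m));
  bitmap[b / 8] &= (uint8_t)~m;
}
```

With five metadata blocks reserved, the first allocation returns block 5, and a freed block is handed out again by the next allocation because the scan always starts from the beginning.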

Today’s task

• xv6 does not allow concurrent transactions to the log, which means that if a system call performs a long write operation, all other write system calls will block

• Come up with a strategy to implement concurrent transactions to the log in terms of pseudo-code

Reading(s)

• Chapter 6, “File system”, up to section “Code: directory layer”, from “xv6: a simple, Unix-like teaching operating system”
