CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf ·...
Transcript of CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf ·...
![Page 1: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/1.jpg)
CS 318 Principles of Operating Systems
Fall 2018
Lecture 17: File System Crash ConsistencyRyan Huang
![Page 2: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/2.jpg)
Administrivia• Lab 3
• Extra office hour� Wednesday 4:30-6pm
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 2
![Page 3: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/3.jpg)
Review: File I/O Path (Reads)• read() from file
� Check if block is in cache� If so, return block to user
[1 in figure]� If not, read from disk, insert into cache,
return to user [2]
Disk
MainMemory (Cache)
1
2
Block in cache
Block Not in cache Leave c
opy i
n c
ach
e
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 3
![Page 4: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/4.jpg)
Review: File I/O Path (Writes)• write() to file
� Write is buffered in memory (“write behind”) [1]� Sometime later, OS decides to write to disk [2]
• Periodic flush or fsync call
• Why delay writes?� Implications for performance� Implications for reliability
Disk
MainMemory (Cache)
1
2
Buffer in memory
Later Write to disk
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 4
![Page 5: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/5.jpg)
The Consistent Update Problem• Atomically update file system from one consistent state to
another, which may require modifying several sectors, despite that the disk only provides atomic write of one sector at a time� What do we mean by consistent state?
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 5
![Page 6: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/6.jpg)
Example: File Creation• Initial state
Disk 01000 01000 /inodemap
blockmap inode array data blocks
Memory
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 6
![Page 7: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/7.jpg)
Example: File Creation• Read to in-memory Cache
Disk 01000 01000 /inodemap
blockmap inode array data blocks
Memory
01000 /
<‘.’, #2><‘..’, #2>
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 7
![Page 8: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/8.jpg)
<‘.’, #2><‘..’, #2>
01000
Example: File Creation• Modify metadata and blocks
Disk 01000 01000 /inodemap
blockmap inode array data blocks
Memory
01010 /
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
Dirty blocks, memory state and disk state are inconsistent: must write to disk
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 8
![Page 9: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/9.jpg)
Crash?• Disk: atomically write one sector
� Atomic: if crash, a sector is either completely written, or none of this sector is written
• An FS operation may modify multiple sectors
• Crash è FS partially updated
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 9
![Page 10: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/10.jpg)
Possible Crash Scenarios• File creation dirties three blocks
� inode bitmap (B)� inode for new file (I)� parent directory data block (D)
• Old and new contents of the blocks� B = 01000 B’= 01010� I= free I’= allocated, initialized� D = {} D’= {<‘a.txt’, 4>}
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 10
![Page 11: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/11.jpg)
Possible Crash Scenarios• Crash scenarios: any subset can be written
� BID� B’ID� B I’D� BID’� B’I’D� B’I D’� B I’D’� B’I’D’
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 11
![Page 12: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/12.jpg)
The General Problem• Writes: Have to update disk with N writes
� Disk does only a single write atomically
• Crashes: System may crash at arbitrary point� Bad case: In the middle of an update sequence
• Desire: To update on-disk structures atomically� Either all should happen or none
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 12
![Page 13: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/13.jpg)
Example: Bitmap First• Write Ordering: Bitmap (B), Inode (I), Data (D)
� But CRASH after B has reached disk, before I or D
• Result?
Disk 01010 /B I D
Memory 01010
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 13
![Page 14: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/14.jpg)
Example: Inode First• Write Ordering: Bitmap (B), Inode (I), Data (D)
� But CRASH after I has reached disk, before B or D
• Result?
Disk 01000 /B I D
Memory 01010
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 14
![Page 15: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/15.jpg)
Example: Inode First• Write Ordering: Bitmap (B), Inode (I), Data (D)
� But CRASH after I AND B have reached disk, before D
• Result?
Disk 01010 /B I D
Memory 01010
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 15
![Page 16: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/16.jpg)
Example: Inode First• Write Ordering: Bitmap (B), Inode (I), Data (D)
� But CRASH after I AND B have reached disk, before D
• Result?� What if data block is a new block for the new file (i.e., create file with data)
Disk 01010 /B I D
Memory 01010
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 16
![Page 17: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/17.jpg)
Example: Data First• Write Ordering: Data (D) , Bitmap (B), Inode (I)
� CRASH after D has reached disk, before I or B
• Result?
Disk 01000 /
Memory 01010
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 17
![Page 18: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/18.jpg)
Example: Data First• Write Ordering: Data (D) , Bitmap (B), Inode (I)
� CRASH after D has reached disk, before I or B
• Result?� What if data block is a new block for the new file (i.e., create file with data)
Disk 01000 /
Memory 01010 ‘Hello, 318’
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 18
![Page 19: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/19.jpg)
Traditional Solution: FSCK• FSCK: “file system checker”
• When system boots:� Make multiple passes over file system, looking for inconsistencies
• e.g., inode pointers and bitmaps, directory entries and inode reference counts� Try to fix automatically
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 19
![Page 20: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/20.jpg)
FSCK Example 1
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 20
inodelink_count = 1
block(number 123)
0011001100
for block 123
X1
data bitmap
![Page 21: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/21.jpg)
FSCK Example 2
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 21
Dir Entry
Dir Entry
inodelink_count = 1
2X
![Page 22: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/22.jpg)
FSCK Example 3
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 22
inodelink_count = 1
Dir Entry
ls -l /total 150drwxr-xr-x 401 18432 Dec 31 1969 afs/drwxr-xr-x. 2 4096 Nov 3 09:42 bin/drwxr-xr-x. 5 4096 Aug 1 14:21 boot/dr-xr-xr-x. 13 4096 Nov 3 09:41 lib/dr-xr-xr-x. 10 12288 Nov 3 09:41 lib64/drwx------. 2 16384 Aug 1 10:57 lost+found/...
???? How to fix?
fix
![Page 23: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/23.jpg)
Traditional Solution: FSCK• FSCK: “file system checker”
• When system boots:� Make multiple passes over file system, looking for inconsistencies� Try to fix automatically or punt to admin
• Example: B’ID, B I’D
• Problem:� Cannot fix all crash scenarios
• Can B’I D’be fixed?� Performance
• Sometimes takes hours to run on large disk volumes• Does fsck have to run upon every reboot?
� Not well-defined consistency
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 23
![Page 24: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/24.jpg)
Another Solution: Journaling• Idea: Write “intent” down to disk before updating file system
� Called the “Write Ahead Logging” or “journal”� Originated from database community
• When crash occurs, look through log to see what was going on� Use contents of log to fix file system structures
• Crash before “intent” is written è no-op• Crash after “intent” is written è redo op
� The process is called “recovery”
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 24
![Page 25: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/25.jpg)
Case Study: Linux Ext3• Physical journaling: write real block contents of the update to log
� Four totally ordered steps• Commit dirty blocks to journal as one transaction (TxBegin, I, B, D blocks)• Write commit record (TxEnd)• Copy dirty blocks to real file system (checkpointing)• Reclaim the journal space for the transaction
• Logical journaling: write logical record of the operation to log � “Add entry F to directory data block D”� Complex to implement � May be faster and save disk space
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 25
![Page 26: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/26.jpg)
Step 1: Write Blocks to Journal
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 26
Disk 01000 01000 /
Memory 01010 /
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
journal 01010TxBid=1
![Page 27: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/27.jpg)
Step 2: Write Commit Record
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 27
Disk 01000 01000 /
Memory 01010 /
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
journal 01010TxBid=1
TxEid=1
![Page 28: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/28.jpg)
Step 3: Copy Dirty Blocks to Real FS
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 28
Disk 01000 01000 /
Memory 01010 /
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
journal 01010TxBid=1
TxEid=1
![Page 29: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/29.jpg)
Step 4: Reclaim Journal Space
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 29
Disk 01000 01000 /
Memory 01010 /
<‘.’, #2><‘..’, #2>
<‘a.txt’, #4>
journal
![Page 30: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/30.jpg)
What If There Is A Crash?• Recovery: Go through log and “redo” operations that have been
successfully committed to log
• What if …� TxBegin but not TxEnd in log?� TxBegin through TxEnd are in log, but I, B, and D have not yet been
checkpointed?• How could this happen?• Why don’t we merge step 2 and step 1?
� What if Tx is in log, I, B, D have been checkpointed, but Tx has not been freed from log?
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 30
![Page 31: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/31.jpg)
Summary of Journaling Write Orders• Journal writes < FS writes
� Otherwise, crash è FS broken, but no record in journal to patch it up
• FS writes < Journal clear� Otherwise, crash è FS broken, but record in journal is already cleared
• Journal writes < commit record write < FS writes� Otherwise, crash è record appears committed, but contains garbage
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 31
![Page 32: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/32.jpg)
Ext3 Journaling Modes• Journaling has cost
� one write = two disk writes, two seeks
• Several journaling modes balance consistency and performance• Data journaling: journal all writes, including file data
� Problem: expensive to journal data
• Metadata journaling: journal only metadata� Used by most FS (IBM JFS, SGI XFS, NTFS)� Problem: file may contain garbage data
• Ordered mode: write file data to real FS first, then journal metadata� Default mode for ext3� Problem: old file may contain new data
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 32
![Page 33: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/33.jpg)
Summary• The consistent update problem
� Example of file creation and different crash scenarios
• Two approaches to crash consistency� FSCK: slow, not well-defined consistency� Journaling: well-defined consistency, different modes
• Other approach� Soft updates (advanced OS topics)
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 33
![Page 34: CS 318 Principles of Operating Systemscs.jhu.edu/~huang/cs318/fall18/lectures/lec17_journal.pdf · 2019. 1. 6. · Case Study: Linux Ext3 •Physical journaling: write real block](https://reader035.fdocuments.net/reader035/viewer/2022071604/613f12d4c500cf75ab364af0/html5/thumbnails/34.jpg)
Next Time…• Read Appendix B
11/13/18 CS 318 – Lecture 17 – File System Crash Consistency 34