Post on 15-Jul-2020
Flash-aware File System
Flash-aware Computing
Instructor: Prof. Sungjin Lee (sungjin.lee@dgist.ac.kr)
Today
- File System Basics
- Traditional Flash File Systems
- SSD-Friendly Flash File Systems
- Reference
What is a File System?
A file system provides a virtualized logical view of information stored on various storage media, such as disks, tapes, and flash-based SSDs.
Two key abstractions have developed over time in the virtualization of storage:
- File: a linear array of bytes, each of which you can read or write. Its contents are defined by its creator (e.g., text or binary). A file is often referred to by its inode number.
- Directory: a special file that is a collection of files and other directories. Its contents are quite specific: a list of (user-readable name, inode #) pairs, e.g., ("foo", 10). Directories form a hierarchical organization (e.g., tree, acyclic graph, or graph), and a directory is also identified by an inode number.
Operations on Files and Directories

POSIX Operations on Files

POSIX API     Description
creat()       Create a file
open()        Create/open a file
write()       Write bytes to a file
read()        Read bytes from a file
lseek()       Move byte position inside a file
unlink()      Remove a file
truncate()    Resize a file
close()       Close a file
…

POSIX Operations on Directories

POSIX API     Description
opendir()     Open a directory for reading
closedir()    Close a directory
readdir()     Read one directory entry
rewinddir()   Rewind a directory so it can be reread
mkdir()       Create a new directory
rmdir()       Remove a directory
…
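These calls map directly onto Python's os module wrappers for the underlying system calls, so the tables above can be exercised with a short sketch (the paths and data below are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "dir1"))                  # mkdir()
path = os.path.join(d, "dir1", "foo")

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)  # open()/creat()
os.write(fd, b"hello flash")                       # write()
os.lseek(fd, 6, os.SEEK_SET)                       # lseek(): move byte position
data = os.read(fd, 5)                              # read(): bytes 6..10
os.close(fd)                                       # close()

entries = sorted(os.listdir(os.path.join(d, "dir1")))  # readdir() loop
os.unlink(path)                                    # unlink()
os.rmdir(os.path.join(d, "dir1"))                  # rmdir()
os.rmdir(d)
print(data, entries)
```

Note that the directory itself is read like a file of entries, matching the "directory is a special file" view above.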
Virtual File System
The POSIX API talks to the VFS interface rather than to any specific type of file system: open(), close(), read(), and write() are dispatched by the VFS to file-system-specific implementations.
File System Implementation
- UNIX File System
- Journaling File System
- Log-structured or Copy-on-Write File Systems
UNIX File System
A traditional file system first developed for UNIX systems.
[Figure: on-disk layout: Boot sector | Superblock | Inode bitmap | Data bitmap | Inode table | Data (4 KiB blocks)]
- Boot sector: information loaded into RAM to boot up the OS
- Superblock: the file system's metadata (e.g., file system type, size, …)
- Inode & data bitmaps: track the allocation status of inode table entries and data blocks
- Inode table: keeps each file's metadata (e.g., size, permissions, …) and its data block pointers
- Data blocks: keep users' file data
Inode & Block Pointers
[Figure: inode #27 in the inode table holds metadata (file size, permissions, link count, last modified and last access times) plus block numbers: direct block pointers (100, 101, 102, …), an indirect block pointer (1037) to a block listing further data blocks (104, 131, 152, …), and a double-indirect block pointer (3942) to a block listing indirect blocks (1034, 94133, 14483, …)]
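To see why indirect pointers matter, one can compute the largest file reachable through this scheme. The counts below (12 direct pointers, 4-byte block numbers) are typical ext2-style assumptions for illustration, not values taken from the slide:

```python
# Maximum file size reachable via direct, indirect, and double-indirect
# pointers, assuming 4 KiB blocks, 4-byte block numbers, and 12 direct
# pointers (illustrative ext2-style parameters).
BLOCK = 4096                 # 4 KiB data blocks
PTR = 4                      # bytes per block number
PER_BLOCK = BLOCK // PTR     # block numbers per indirect block -> 1024

direct = 12
indirect = PER_BLOCK                 # 1,024 data blocks
double_indirect = PER_BLOCK ** 2     # 1,048,576 data blocks

max_blocks = direct + indirect + double_indirect
max_bytes = max_blocks * BLOCK
print(max_blocks, max_bytes)         # about 4 GiB under these assumptions
```

Almost all of the reachable capacity comes from the double-indirect level, which is why large files cost extra pointer-block lookups.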
Consistent Update Problem
What happens if a sudden power loss occurs while writing data to a file?

write(0, "foo", strlen("foo"));

[Figure: the write touches the inode table, the data bitmap, and a data block holding "foo"; if power fails after only some of these reach disk, they disagree]
The file system will be inconsistent! This is the consistent update problem.
Journaling File System
Journaling file systems address the consistent update problem by adopting the idea of write-ahead logging (journaling) from database systems. Ext3, Ext4, ReiserFS, XFS, and NTFS are based on journaling.
Double writes, however, can degrade overall write performance!
write(0, "foo", strlen("foo"));

[Figure: journal write & commit: the updated inode, bitmap, and "foo" data block are first written to the journaling space between TxB (transaction begin) and TxE (transaction end) records; checkpoint: the committed copies are then written to their home locations in the inode table, bitmaps, and data blocks]
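The journal write / commit / checkpoint protocol can be sketched as follows. The in-memory journal list and dict-based "disk" are illustrative stand-ins, not Ext3's actual on-disk structures:

```python
# Write-ahead logging sketch: updates go to a journal first (TxB ... TxE),
# and only committed transactions are checkpointed to their home
# locations. On recovery, incomplete transactions are discarded.
def commit(journal, updates):
    journal.append(("TxB", updates))     # journal write
    journal.append("TxE")                # commit record

def recover(journal, disk):
    i = 0
    while i < len(journal):
        _txb, updates = journal[i]
        if i + 1 < len(journal) and journal[i + 1] == "TxE":
            disk.update(updates)         # complete Tx: replay (checkpoint)
        i += 2                           # incomplete trailing Tx is skipped

journal, disk = [], {}
commit(journal, {"inode 0": "v2", "data 100": "foo"})
journal.append(("TxB", {"data 100": "bar"}))   # crash before TxE!
recover(journal, disk)
print(disk)   # only the committed transaction survives
```

This is exactly why the double write happens: "foo" is written once to the journal and once again at checkpoint time.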
Log-structured File System
Log-structured file systems (LFS) treat the storage space as one huge log, appending all files and directories sequentially.
Many state-of-the-art file systems are based on LFS or copy-on-write (CoW), e.g., Sprite LFS, F2FS, NetApp's WAFL, Btrfs, ZFS, …
[Figure: LFS layout: boot sector and superblock, followed by a log holding File A, the inode for A, File B, the inode for B, an inode map, and checkpoints #1 and #2]
Write all the files and inodes sequentially.
Log-structured File System (Cont.)
Advantages
(+) No consistent update problem
(+) No double writes: an LFS itself is a log!
(+) Excellent write performance: disks are optimized for sequential I/O operations
(+) Fewer disk-head movements, since related writes (e.g., an inode update and its file updates) land together

Disadvantages
(–) Expensive garbage collection cost
(–) Slow read performance
Disadvantages of LFS
- Expensive garbage collection: invalid blocks must be reclaimed for future writes; otherwise, free disk space is eventually exhausted
- Slow read performance: reads involve more head movements (e.g., when reading file A, whose blocks end up scattered across the log)
[Figure: after File A and File B are rewritten sequentially, the old copies of the files, their inodes, and the inode map remain in the log as invalid blocks; checkpoint #2 points at the new copies]
Write Cost
The write cost with GC is modeled as follows. (A segment (seg) is the unit of space allocation and GC.)

    write cost = (read segs + write live + write new) / (write new)
               = (N + N·µ + N·(1 - µ)) / (N·(1 - µ))
               = 2 / (1 - µ)

where N is the number of segments and µ is the utilization of the segments (0 ≤ µ < 1). If segments have no live data (µ = 0), they can be reused without being read, and the write cost becomes 1.0.
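The model can be evaluated directly; this sketch follows the 2 / (1 - µ) formula above, including the µ = 0 special case:

```python
# Sprite LFS write-cost model: to free a segment the cleaner reads it,
# writes back its live fraction u, and the freed space absorbs (1 - u)
# of new data, giving cost 2 / (1 - u). A segment with no live data
# (u == 0) is reused without reading, so the cost is exactly 1.0.
def write_cost(u):
    assert 0.0 <= u < 1.0
    if u == 0.0:
        return 1.0          # nothing live: no read, no copy-out
    return 2.0 / (1.0 - u)

for u in (0.0, 0.5, 0.8):
    print(u, write_cost(u))
```

Note how quickly the cost grows: at µ = 0.8 every byte of new data costs ten bytes of I/O, which is why the cleaning policy matters so much.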
Write Cost Comparison
[Figure: write cost versus utilization, comparing LFS against FFS as measured and against an improved FFS with delayed writes and sorting]
Greedy Policy
The cleaner chooses the least-utilized segments and sorts the live data by age before writing it out again.
Workloads: 4 KB files with two overwrite patterns:
(1) Uniform: no locality; every file has an equal likelihood of being overwritten
(2) Hot-and-cold: locality; 90% of writes go to 10% of the files
[Figure: the distribution of segment utilizations; with locality, the variance in segment utilization grows, and the result is worse than a system with no locality]
Cost-Benefit Policy
Under the greedy policy, hot segments are frequently selected as victims even though their utilizations would drop further if cleaning were delayed: it pays to wait and let more of their blocks die. Free space in cold segments, on the other hand, is more valuable.
Cost-benefit policy: clean the segment that maximizes

    benefit / cost = (free space gained × age of data) / cost
                   = ((1 - µ) × age) / (1 + µ)

where the cost 1 + µ covers reading the whole segment and writing back its live fraction µ.
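The two victim-selection policies can be contrasted on a toy pair of segments; the utilizations and ages below are illustrative:

```python
# Greedy vs. cost-benefit victim selection. Segments are
# (utilization, age) pairs; ages are in arbitrary units.
def greedy_victim(segments):
    return min(segments, key=lambda s: s[0])        # least utilized wins

def cost_benefit_victim(segments):
    # benefit/cost = (1 - u) * age / (1 + u)
    return max(segments, key=lambda s: (1 - s[0]) * s[1] / (1 + s[0]))

segments = [
    (0.20, 1),    # hot: nearly empty, but its blocks are still dying
    (0.75, 100),  # cold: fuller, but its free space is stable and valuable
]
print(greedy_victim(segments))        # greedy picks the hot segment
print(cost_benefit_victim(segments))  # cost-benefit picks the cold one
```

The age term is what lets the cleaner leave hot segments alone until more of their blocks have been invalidated.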
Cost-Benefit Policy (Cont.)
LFS Performance
[Figure: LFS performance results; annotated regions: inode updates, random reads]
Today
- File System Basics
- Traditional Flash File Systems
  - JFFS2: Journaling Flash File System
  - YAFFS2: Yet Another Flash File System
  - UBIFS: Unsorted Block Image File System
- SSD-Friendly File Systems
- Reference
Traditional File Systems for Flash
Traditional file systems were originally designed for block devices like HDDs, e.g., ext2/3/4, FAT32, and NTFS.
But NAND flash memory is not a block device: the FTL presents a block-device view to the outside, hiding the unique properties of NAND flash memory.
[Figure: a traditional file system for HDDs (e.g., ext2/3/4, FAT32, NTFS) issues reads and writes over a block I/O interface to a flash-based SSD, whose Flash Translation Layer turns them into read/write/erase operations on NAND flash via NAND control (e.g., ONFI)]
Flash File Systems
Flash file systems directly manage raw NAND flash memory, internally performing address mapping, garbage collection, and wear-leveling by themselves.
Representative flash file systems: JFFS2, YAFFS2, and UBIFS.
[Figure: a flash file system (e.g., JFFS2, YAFFS2, UBIFS) issues read/write/erase operations to a NAND-specific low-level device driver (e.g., MTD or UBI), which drives the NAND flash via NAND control (e.g., ONFI)]
Memory Technology Device (MTD)
MTD is the lowest level for accessing flash chips. It offers the same APIs (mtd_read(), mtd_write(), …) for different flash types and technologies, e.g., NAND, OneNAND, and NOR, translating them into device-specific commands.
JFFS2, YAFFS2, and FTL implementations run on top of MTD; a typical file system runs on top of an FTL.
[Figure: JFFS2, YAFFS2, and FTLs sit on MTD, which sits on NAND, OneNAND, NOR, …; a typical file system sits above the FTLs]
Traditional File Systems vs. Flash File Systems

File System + FTL
- Method: access a flash device via the FTL
- Pros: high interoperability; no difficulty managing recent NAND flash with new constraints
- Cons: lack of system-level information; flash-unaware storage management

Flash File System
- Method: access a flash device directly
- Pros: high-level optimization with system-level information; flash-aware storage management
- Cons: low interoperability; must be redesigned to handle new NAND constraints

Flash file systems have now become largely obsolete because of the difficulty of adapting them to new types of NAND devices.
JFFS2: Journaling Flash File System
A log-structured file system (LFS) for use with NAND flash. Unlike LFS, however, it does not allow any in-place updates!
Main features of JFFS2:
- File data and metadata stored as nodes in NAND flash memory
- An inode cache in DRAM holding the information of the nodes
- A greedy garbage collection algorithm: select the cheapest blocks as victims
- A simple wear-leveling algorithm combined with GC: consider the wearing rate of flash blocks when choosing a victim block
- Optional data compression
JFFS2: Write Operation
All data are written sequentially to a log that records all changes.
[Figure: updates on File A stored as a sequence of nodes in NAND flash, each carrying a version, offset, length, 32-bit CRC, and data:
  Ver 1: offset 0, len 200
  Ver 2: offset 200, len 200
  Ver 3: offset 75, len 50
  Ver 4: offset 0, len 75
  Ver 5: offset 125, len 75
The inode cache (DRAM) holds the resulting logical view of File A: 0-75: Ver 4, 75-125: Ver 3, 125-200: Ver 5, 200-400: Ver 2]
JFFS2: Read Operation
The latest data can be read from NAND flash by referring to the inode cache in DRAM.
[Figure: the same log of nodes for File A (Ver 1 through Ver 5); to read File A at offsets 75-200, the inode cache maps 75-125 to the Ver 3 node and 125-200 to the Ver 5 node, so exactly those nodes are fetched from flash]
JFFS2: Mount
After rebooting, JFFS2 scans the flash medium, checks the CRC of written data, marks obsolete data, and builds the inode cache.
[Figure: the mount scan replays the nodes in write order, updating the inode cache after each one:
  after Ver 1: 0-200: Ver 1
  after Ver 2: 0-200: Ver 1, 200-400: Ver 2
  after Ver 3: 0-75: Ver 1, 75-125: Ver 3, 125-200: Ver 1, 200-400: Ver 2
  after Ver 4 and Ver 5: 0-75: Ver 4, 75-125: Ver 3, 125-200: Ver 5, 200-400: Ver 2]
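The mount scan can be sketched as a replay loop over the versioned nodes. The byte-granular dict below is an illustrative stand-in for JFFS2's real fragment lists:

```python
# JFFS2-style mount scan sketch: replay nodes in write (log) order and
# keep, for each byte, the version of the last node covering it.
def replay(nodes):
    owner = {}                              # byte offset -> version
    for ver, off, length in nodes:          # nodes in log order
        for b in range(off, off + length):
            owner[b] = ver                  # later nodes win
    return owner

# The five updates on File A from the slides.
nodes = [(1, 0, 200), (2, 200, 200), (3, 75, 50), (4, 0, 75), (5, 125, 75)]
cache = replay(nodes)
print(cache[0], cache[80], cache[150], cache[300])
```

Since every node must be visited to build this map, the scan time grows with the amount of data on flash, which is exactly the slow-mount problem described next.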
JFFS2: Problems
- Slow mount time: all nodes must be scanned at mount time, so mount time grows in proportion to flash size and file system contents
- High memory consumption: all node information must be maintained in DRAM, so memory consumption depends linearly on file system contents
- Low scalability: infeasible for large flash devices, since both mount time and memory consumption grow with flash size
YAFFS2: Yet Another Flash File System
Another log-structured file system for flash memory: like JFFS2, it stores data to flash as a log with a unique sequence number, and reads and writes are performed similarly to JFFS2.
YAFFS2 mitigates the problems raised by JFFS2:
- Relatively frugal with memory
- Checkpoints for a fast file system mount
- Dynamic wear-leveling
It supports multiple platforms: Linux, WinCE, RTOSs, etc.
YAFFS2: Physical Layout
The entries in the log are all one chunk (one page) in size and hold one of two types of chunk:
- Data chunk: a chunk holding regular data file contents
- Object header: a descriptor for an object such as a directory or a regular file; similar to struct stat, but it also includes the directory entry (name)
[Figure: a block of pages (chunks): an object header (Object ID: 500, Chunk ID: 0) followed by data chunks (Object ID: 500, Chunk IDs 1-4, Seq: 1) and a newer data chunk (Object ID: 500, Chunk ID: 1, Seq: 2) that supersedes the first data chunk]
- Object ID: identifies which object the chunk belongs to
- Chunk ID: identifies where in the file this chunk belongs
- Seq: a sequence number of a data chunk
YAFFS2: File System Layout
YAFFS2 maintains information about objects and chunks in DRAM.
[Figure: NAND flash holds object headers for /, file1, dir1, and file2 plus data chunks for file1 and file2; main memory mirrors them as Object(/), Object(file1) with Tnode(file1), Object(dir1), and Object(file2) with Tnode(file2)]
YAFFS2: Block Summary
A block summary (including the object ID and chunk ID of every chunk in the block) is written to the block's last chunk. This allows all the tags for the chunks in that block to be read in one hit, avoiding a full media scan. If a block summary is not available for a block, a full scan of that block is used to get the tags of each chunk.
[Figure: the same block of chunks as before, with its last chunk holding the block summary, e.g., object header in 1st page (Object ID 500, Chunk ID 0); data chunk in 2nd page; …]
YAFFS2: Checkpoints
The DRAM structures are saved to flash at unmount and re-read at remount, avoiding the boot-time scan; lazy loading further reduces mount time.
[Figure: the log of object headers and data chunks is followed by on-flash copies of the RAM structures, saved at unmount and re-read at remount]
UBIFS: Unsorted Block Image File System
A newer flash file system developed by Nokia, considered the successor to JFFS2.
New features of UBIFS:
- Scalability: scales well with flash size; memory consumption and mount time do not depend on flash size
- Fast mount: does not have to scan the whole medium when mounting
- Write-back support: dramatically improves file system throughput in many workloads
- …
UBI: Unsorted Block Image
UBIFS runs on top of a UBI volume. UBI supports multiple volumes, bad block management, wear-leveling, and bit-flip error management, so the upper-level software can be simpler.
[Figure: UBIFS calls ubi_read(), ubi_write(), … on UBI; UBI, like JFFS2 and YAFFS2, calls mtd_read(), mtd_write(), … on MTD, which issues device-specific commands to NAND, OneNAND, NOR, …]
How UBI Works
Logical erase blocks (LEBs) are mapped to physical erase blocks (PEBs); any LEB can be mapped to any PEB.
[Figure: Volume A (LEBs 0-4) and Volume B (LEBs 0-2) in the UBI layer, each LEB mapped to one of PEBs 0-10 on the MTD device; reading an unmapped LEB simply returns 0xFFs, while writing to it maps it to a free (erased) PEB]
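The LEB-to-PEB indirection can be sketched as a toy model; UBI's real EC/VID headers, erase counters, and wear-leveling are omitted here for brevity:

```python
# Toy UBI volume: reads of unmapped LEBs return 0xFF bytes (erased
# flash), and a write lazily maps the LEB to a free PEB.
class Volume:
    def __init__(self, free_pebs, flash):
        self.map = {}               # LEB -> PEB
        self.free = free_pebs       # pool of erased PEBs
        self.flash = flash          # PEB -> data

    def read(self, leb):
        if leb not in self.map:
            return b"\xff"          # unmapped LEB reads as 0xFFs
        return self.flash[self.map[leb]]

    def write(self, leb, data):
        if leb not in self.map:
            self.map[leb] = self.free.pop()   # map on first write
        self.flash[self.map[leb]] = data

flash = {}
vol = Volume(free_pebs=[3, 7, 9], flash=flash)
print(vol.read(0))            # unmapped: erased bytes
vol.write(0, b"hello")
print(vol.read(0), vol.map)
```

Because the mapping is per-LEB, UBI can later move a LEB's contents to a different PEB (for wear-leveling or bad-block handling) without the file system above noticing.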
How UBI Works (Cont.)
UBI also has its own wear-leveling algorithm, which migrates data between physical erase blocks with high and low erase counters so that wear spreads evenly, even when some data (e.g., static read-only data) is rarely rewritten.
[Figure: static read-only data is copied from a PEB with one erase counter to a PEB with a different erase counter, and the LEB is re-mapped to the new PEB]
UBIFS: Indexing with a B+ Tree
The UBIFS index is a B+ tree stored on NAND flash (c.f., JFFS2 does not store its index on flash), with data at the leaf level, so full scanning is not needed. The B+ tree is cached in RAM and shrunk under memory pressure.
[Figure: a B+ tree whose index nodes lead down to leaf-level nodes containing data]
This realizes good scalability and fast mount.
UBIFS: Wandering Tree
How do we find the root of the tree after an out-of-place update?
- Write data node D; the old D becomes obsolete
- Write indexing node C; the old C becomes obsolete
- Write indexing node B; the old B becomes obsolete
- Write indexing node A; the old A becomes obsolete
[Figure: the index path A → B → C → D is rewritten bottom-up into the log, so the position of the tree in flash changes with every update]
UBIFS: Master Area
LEBs 1 and 2 are reserved for a master area pointing to the root index node. Since its location is fixed, the master area can be found quickly on mount, and the B+ tree can then be constructed quickly from the root index node.
The master area can hold multiple master nodes; the valid (most recent) one is found by scanning the master area.
[Figure: master nodes M (ver. 1) and M (ver. 2) in the master area (LEBs 1 and 2), the latest pointing to the root index node inside the log]
Mount Time Comparison
Summary
- JFFS2: Journaling Flash File System version 2: commonly used for low-volume flash devices; supports compression; long mount time and high memory consumption
- YAFFS2: Yet Another Flash File System version 2: fast mount time via checkpointing
- UBIFS: Unsorted Block Image File System: fast mount time and low memory consumption by adopting B+ tree indexing
Today
- File System Basics
- Traditional Flash File Systems
- SSD-Friendly Flash File Systems
  - F2FS: Flash-friendly File System
- Reference
F2FS: Flash-friendly File System
A log-structured file system for FTL devices. Unlike other flash file systems, F2FS runs atop FTL-based flash storage and is optimized for it, exploiting system-level information for better performance and reliability (e.g., better hot/cold separation, background GC, …).
[Figure: traditional flash file systems (JFFS2/YAFFS2 on MTD, UBIFS on UBI) manage NAND flash directly, while F2FS talks through a block device driver to a flash device containing an FTL and NAND flash]
Design Concept of F2FS
F2FS is designed to take advantage of both approaches: it exploits high-level system information and performs flash-aware storage management, yet it can handle new NAND flash devices without a redesign.

File System + FTL
- Method: access a flash device via the FTL
- Pros: high interoperability; no difficulty managing recent NAND flash with new constraints
- Cons: lack of system-level information; flash-unaware storage management

Flash File System
- Method: access a flash device directly
- Pros: high-level optimization with system-level information; flash-aware storage management
- Cons: low interoperability; must be redesigned to handle new constraints
Index Structure of LFS
LFS and UBIFS suffer from the wandering tree problem: an update to file data causes several extra index writes up the tree.
[Figure: an update on file data forces the index blocks above it to be rewritten as well]
Index Structure of F2FS
F2FS introduces a Node Address Table (NAT) containing the locations of all node blocks, including direct and indirect nodes. When a data block moves, only its direct node is rewritten and the corresponding NAT entry is updated in place, which eliminates the wandering tree problem.
[Figure: the NAT entry is updated in place instead of rewriting the whole index path]
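The effect of the NAT can be sketched by counting out-of-place writes per data-block update. The names and structures below are illustrative, not F2FS's on-disk format:

```python
# Why a Node Address Table (NAT) stops the wandering tree: with plain
# block pointers, moving a data block forces every ancestor up to the
# root to be rewritten out of place; with a NAT, parents refer to node
# IDs, so only the direct node moves and its NAT entry is updated
# in place.
def wandering_update(path_depth):
    # data block + every ancestor (direct node, indirect node, inode, ...)
    return 1 + path_depth

def nat_update(nat, node_id, new_addr):
    nat[node_id] = new_addr        # in-place NAT update, no ripple
    return 1 + 1                   # data block + its direct node only

nat = {"direct#7": 0x100}
print(wandering_update(path_depth=3))       # 4 blocks rewritten
print(nat_update(nat, "direct#7", 0x200))   # 2 blocks rewritten
print(nat)
```

The cost of an update becomes constant in the tree depth, at the price of keeping the NAT itself consistent (which F2FS handles via its checkpoint area).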
Logical Storage Layout
The file system's metadata is located together for locality and uses an in-place update strategy. Files are written sequentially for performance, using an out-of-place update strategy to exploit the high throughput of multiple NAND chips, with six active logs for static hot/warm/cold data separation.
[Figure: F2FS on-disk layout: Superblock 0 | Superblock 1 | Checkpoint area | Segment Info. Table (SIT) | Node Address Table (NAT) | Segment Summary Area (SSA) | Main area holding hot/warm/cold node segments and hot/warm/cold data segments; the metadata region takes random writes, the main area sequential writes]
Cleaning Process
Cleaning reclaims obsolete space in the file system, improving both the lifetime of the storage device and overall I/O performance.
- Background cleaning: a kernel thread does the cleaning job periodically at idle time
- Victim selection policies:
  - Greedy algorithm for foreground cleaning, to reduce the amount of data moved
  - Cost-benefit algorithm for background cleaning
F2FS Performance
Today
- File System Basics
- Traditional Flash File Systems
- SSD-Friendly Flash File Systems
- Reference
Reference
- Rosenblum, Mendel, and John K. Ousterhout. "The Design and Implementation of a Log-Structured File System." ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26-52.
- Woodhouse, David. "JFFS2: The Journalling Flash File System, Version 2." 2001.
- YAFFS2 Specification. http://www.yaffs.net/yaffs-2-specification
- Lee, Changman, et al. "F2FS: A New File System for Flash Storage." USENIX FAST, 2015.
- Arpaci-Dusseau, Remzi H., and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 2014.
- Hunter, Adrian. "A Brief Introduction to the Design of UBIFS." Rapport technique, March 2008.