ZFS Internals
Yupu Zhang
4/30/2014
Outline
• ZFS On‐disk Structure
– Storage Pool
– Physical Layout
– On‐disk Walk
• ZFS Architecture
• Summary
ZFS Storage Pool
• Manages physical devices the way virtual memory manages RAM
– Provides a flat space
– Shared by all file system instances
• Consists of a tree of virtual devices (vdev)
– Physical virtual device (leaf vdev)
• Writable media block device, e.g., a disk
– Logical virtual device (interior vdev)
• Conceptual grouping of physical vdevs, e.g. mirror
A simple configuration
• “root” (mirror A/B): logical vdev
– “A” (disk): physical vdev
– “B” (disk): physical vdev
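The vdev tree of this configuration can be sketched in a few lines. This is only an illustration: the `Vdev` class and its fields are hypothetical names, not ZFS structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vdev:
    """Hypothetical model of one node in the vdev tree."""
    name: str
    kind: str  # "disk" for a physical vdev, "mirror" for a logical vdev
    children: List["Vdev"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Physical vdevs are the leaves of the tree.
        return not self.children

    def leaves(self) -> List["Vdev"]:
        # Collect every physical vdev below this node.
        if self.is_leaf():
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# The simple configuration: a mirror over two disks.
root = Vdev("root", "mirror", [Vdev("A", "disk"), Vdev("B", "disk")])
leaf_names = [v.name for v in root.leaves()]
```

Here `leaf_names` lists the two disks, while `root` itself is an interior (logical) vdev.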
Vdev Label
• A 256KB structure stored on every physical vdev
– Name/value pairs
• Store information about the vdevs
• e.g., vdev id, amount of space
– Array of uberblocks
• An uberblock is like a superblock in ext2/3/4
• Provide access to a pool’s contents
• Contain information to verify a pool’s integrity
Vdev Label
• Redundancy
– Four copies on each physical vdev
– Two at the beginning, and two at the end
• Prevents one accidental overwrite of a contiguous chunk from destroying all copies
Layout: Label 0 | Label 1 | storage space for data | Label 2 | Label 3
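A minimal sketch of where the four labels fall on a device, using the 256KB label size from the slide. Rounding the device size down to a multiple of the label size is an assumption of this sketch, and `label_offsets` is a hypothetical helper name.

```python
LABEL_SIZE = 256 * 1024  # 256KB, as stated on the slide


def label_offsets(device_size: int) -> list:
    """Return the byte offsets of Label 0..3 on a device of the given size."""
    # Assumption: the usable size is rounded down to a multiple of the label size.
    usable = device_size - device_size % LABEL_SIZE
    if usable < 4 * LABEL_SIZE:
        raise ValueError("device too small to hold four labels")
    return [
        0,                        # Label 0: start of the device
        LABEL_SIZE,               # Label 1: right after Label 0
        usable - 2 * LABEL_SIZE,  # Label 2: second to last 256KB
        usable - LABEL_SIZE,      # Label 3: last 256KB
    ]


offsets = label_offsets(1 << 30)  # a 1GiB device
```

For a 1GiB device this gives two labels in the first 512KB and two in the last 512KB, so a single contiguous overwrite at either end leaves the other pair intact.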
Outline
• ZFS On‐disk Structure
– Storage Pool
– Physical Layout
– On‐disk Walk
• ZFS Architecture
• Summary
Block Addressing
• Physical block
– Contiguous sectors on disk
– 512 bytes to 128KB
– Data Virtual Address (DVA)
• vdev id + offset (in the vdev)
• Logical block
– e.g. a data block, a metadata block
– Block Pointer (blkptr)
• Up to three DVAs for replication
• A single checksum for integrity
Figure: a blkptr holds DVA 1, DVA 2, DVA 3 and one block checksum; each DVA points to a copy of the same block.
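The sketch below shows how several DVAs and a single checksum work together on a read: copies are tried in turn until one matches the checksum. `BlockPointer`, `read_block` and the in-memory `disks` dictionary are hypothetical, and SHA-256 stands in for whichever checksum the pool is configured to use.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

DVA = Tuple[int, int]  # (vdev id, offset in the vdev)


@dataclass
class BlockPointer:
    """Hypothetical blkptr: up to three DVAs plus one checksum."""
    dvas: List[DVA]
    checksum: bytes


def checksum_of(data: bytes) -> bytes:
    # Assumption: SHA-256 is used here only as a stand-in checksum.
    return hashlib.sha256(data).digest()


def read_block(bp: BlockPointer, disks: Dict[int, Dict[int, bytes]]) -> bytes:
    """Return the first copy of the block whose checksum matches the blkptr."""
    for vdev_id, offset in bp.dvas:
        data = disks.get(vdev_id, {}).get(offset)
        if data is not None and checksum_of(data) == bp.checksum:
            return data
    raise IOError("no copy of the block matches its checksum")


good = b"hello zfs"
# The copy on vdev 0 is corrupted; the copy on vdev 1 is intact.
disks = {0: {4096: b"corrupted"}, 1: {8192: good}}
bp = BlockPointer(dvas=[(0, 4096), (1, 8192)], checksum=checksum_of(good))
recovered = read_block(bp, disks)
```

Because the checksum lives in the parent's block pointer rather than next to the data, a corrupted copy cannot vouch for itself.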
Object
• Object
– A group of blocks organized by a dnode
• A block tree connected by blkptrs
– Everything in ZFS is an object
• e.g., a file, a dir, a file system …
• Dnode Structure
– Common fields
• Up to 3 blkptrs
• Block size, # of levels, …
– Bonus buffer
• Object‐specific info
Figure: a dnode with its bonus buffer.
Examples of Object
• File object
– Bonus buffer
• znode_phys_t: attributes of the file
– Block tree
• data blocks
• Directory object
– Bonus buffer
• znode_phys_t: attributes of the dir
– Block tree
• ZAP blocks (ZFS Attributes Processor)
– name‐value pairs
– dir contents: file name ‐ object id
Figure: a file object is a dnode with a znode in its bonus buffer and a tree of data blocks; a directory object is a dnode with a znode and a tree of ZAP blocks.
Object Set
• Object Set (Objset)
– A collection of related objects
• A group of “dnode blocks” managed by the metadnode
– Four types
• File system, snapshot, clone, volume
• Objset Structure
– A special dnode, called metadnode
– ZIL (ZFS Intent Log) header
• Points to a chain of log blocks
Figure: an object set block holds the metadnode and the ZIL header; the metadnode points to the dnode blocks.
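Since the dnodes of an object set are packed into dnode blocks, an object id can be turned into a block index and a slot with simple arithmetic. The sizes below (512-byte dnodes, 16KB dnode blocks) are assumptions not stated on the slides, and `locate_dnode` is a hypothetical helper.

```python
# Assumptions for this sketch: fixed-size 512-byte dnodes in 16KB dnode blocks.
DNODE_SIZE = 512
DNODE_BLOCK_SIZE = 16 * 1024
DNODES_PER_BLOCK = DNODE_BLOCK_SIZE // DNODE_SIZE


def locate_dnode(object_id: int) -> tuple:
    """Return (dnode block index, slot within that block) for an object id."""
    return divmod(object_id, DNODES_PER_BLOCK)


block_index, slot = locate_dnode(40)
```

The metadnode's block tree is then read like any other object's, with `block_index` selecting the dnode block.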
Dataset
• Dataset (it’s an object!)
– Encapsulates an object set (i.e., FS)
– Tracks its snapshots and clones
• Bonus buffer
– dsl_dataset_phys_t
• Records info about snapshots and clones
• Points to the object set block
• Block tree
– None
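Figure: in the zpool layer, the dataset is a dnode whose bonus buffer (dsl_dataset_phys_t) points to the object set block; in the zfs layer, that object set block holds the metadnode and the ZIL header, and the metadnode leads to the dnodes.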
Physical Layout
Figure: the physical layout across both layers.
• zpool layer: vdev label, uberblock, Meta Object Set, data set (a dnode in the Meta Object Set)
• zfs layer: object set (file system), object (file)
• Block types shown: object set block, dnode block, indirect block, data block
Outline
• ZFS On‐disk Structure
– Storage Pool
– Physical Layout
– On‐disk Walk
• ZFS Architecture
• Summary
On‐Disk Walkthrough (/tank/z.txt)
• zpool layer (Meta Object Set)
– vdev label, uberblock, then the object set block holding the metadnode
– Object Directory (root = 2)
– root Dataset Directory
– root Dataset Childmap (tank = 27)
– tank Dataset Directory
– tank Dataset
• zfs layer (tank Object Set)
– metadnode
– Master Node (root = 3)
– root Directory (z.txt = 4)
– z.txt File
– data
• Legend
– Links: block pointer, object reference
– Blocks: object set block, dnode block, data/ZAP block
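The walk can be mimicked with nested dictionaries standing in for objects and ZAP blocks. The ids 2, 27, 3 and 4 come from the figure; the ids 10 and 30, the use of object 1 for the Object Directory and the Master Node, and all field names are assumptions made for this sketch.

```python
# Meta Object Set (zpool layer): object id -> object.
mos = {
    1: {"kind": "Object Directory", "zap": {"root": 2}},
    2: {"kind": "root Dataset Directory", "childmap": 10},    # 10 is hypothetical
    10: {"kind": "root Dataset Childmap", "zap": {"tank": 27}},
    27: {"kind": "tank Dataset Directory", "dataset": 30},    # 30 is hypothetical
    30: {"kind": "tank Dataset", "objset": "tank"},
}

# Object sets (zfs layer), keyed by a hypothetical name instead of a block pointer.
object_sets = {
    "tank": {
        1: {"kind": "Master Node", "zap": {"root": 3}},
        3: {"kind": "root Directory", "zap": {"z.txt": 4}},
        4: {"kind": "File", "data": b"contents of z.txt"},
    }
}


def walk(fs_name: str, file_name: str) -> bytes:
    """Follow the object references from the Object Directory to the file data."""
    root_dir = mos[mos[1]["zap"]["root"]]              # Object Directory -> root Dataset Directory
    childmap = mos[root_dir["childmap"]]               # -> root Dataset Childmap
    fs_dir = mos[childmap["zap"][fs_name]]             # -> tank Dataset Directory
    dataset = mos[fs_dir["dataset"]]                   # -> tank Dataset
    objset = object_sets[dataset["objset"]]            # -> tank Object Set
    root_directory = objset[objset[1]["zap"]["root"]]  # Master Node -> root Directory
    file_obj = objset[root_directory["zap"][file_name]]
    return file_obj["data"]


data = walk("tank", "z.txt")
```

Every hop is either a ZAP lookup (name to object id) or a reference stored in an object, which matches the two kinds of links in the figure's legend.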
Read a Block
Figure: the z.txt file dnode points to an indirect block, whose entries 0, 1, 2, … point to the data blocks.
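A rough sketch of a read through one level of indirection: the file offset selects an entry in the indirect block, and that entry names the data block. The function `read_at`, the string block names and the tiny 4-byte block size are all hypothetical, chosen to keep the example readable.

```python
def read_at(indirect_block, blocks, offset, length, block_size):
    """Read `length` bytes at `offset` through a single indirect block."""
    out = bytearray()
    while length > 0:
        index, within = divmod(offset, block_size)  # which entry, and where inside it
        if index >= len(indirect_block):
            break                                   # past the end of the file
        data = blocks[indirect_block[index]]        # follow the pointer to the data block
        chunk = data[within:within + length]
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        length -= len(chunk)
    return bytes(out)


# Entries 0, 1, 2 of the indirect block point to three data blocks.
blocks = {"b0": b"abcd", "b1": b"efgh", "b2": b"ij"}
indirect = ["b0", "b1", "b2"]
result = read_at(indirect, blocks, offset=2, length=5, block_size=4)
```

A file with more levels would repeat the same index arithmetic once per level before reaching the data block.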
Write a Block
Figure: the block tree from the Physical Layout slide, spanning the zpool and zfs layers.
• Never overwrite
• For every dirty block
– New block is allocated
– Checksum is generated
– Block pointer must be updated
– Its parent block is thus dirtied
• Updates to low‐level blocks are propagated up to the uberblock
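The propagation described above can be sketched with a toy block store in which nothing is ever overwritten. Everything here is hypothetical: JSON stands in for the on-disk format, SHA-256 for the checksum, and the root pointer plays the role of the uberblock.

```python
import hashlib
import json

store = {}  # address -> serialized block; existing entries are never overwritten


def allocate(payload) -> dict:
    """Write payload to a fresh address and return a pointer carrying its checksum."""
    raw = json.dumps(payload).encode()
    addr = len(store)  # always a new address
    store[addr] = raw
    return {"addr": addr, "checksum": hashlib.sha256(raw).hexdigest()}


def load(ptr):
    """Read a block and verify it against the checksum held in its pointer."""
    raw = store[ptr["addr"]]
    if hashlib.sha256(raw).hexdigest() != ptr["checksum"]:
        raise ValueError("checksum mismatch")
    return json.loads(raw)


def cow_write(ptr, path, new_data):
    """Return a pointer to a new tree in which the leaf at `path` holds new_data."""
    if not path:
        return allocate(new_data)  # new block for the dirty data
    children = load(ptr)           # the parent holds a list of child pointers
    children[path[0]] = cow_write(children[path[0]], path[1:], new_data)
    return allocate(children)      # the parent is dirtied, so it is rewritten too


def read(ptr, path):
    for index in path:
        ptr = load(ptr)[index]
    return load(ptr)


# Root -> one indirect block -> two data blocks.
indirect = allocate([allocate("a"), allocate("b")])
old_root = allocate([indirect])
new_root = cow_write(old_root, [0, 1], "B")
```

After the write, `new_root` sees the new data while `old_root` still describes the previous, consistent tree; three new blocks were allocated (data, indirect, root) and none was modified in place.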
Outline
• ZFS On‐disk Structure
– Storage Pool
– Physical Layout
– On‐disk Walk
• ZFS Architecture
• Summary
Overview
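Figure: the layers, top to bottom, are VFS (Virtual File System), ZPL (ZFS POSIX Layer), DMU (Data Management Unit) and ZIO (ZFS I/O Pipeline), with the ZIL (ZFS Intent Log) alongside. The labels on the figure show how a write is expressed on its way to disk:
• write(file, offset, length)
• write to blk Z of obj X in dataset Y
• TX start, dmu_write, TX end
• disk write to blk N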
ZIL (ZFS Intent Log)
• Why does ZFS need a log?
• NOT for consistency
– COW transaction model guarantees consistency
• For performance of synchronous writes
– Waiting seconds for TXG commit is not acceptable
– Just flush changes to the log and return
– Replay the log upon a crash or power failure
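A toy model of this behaviour, with a dictionary standing in for the pool and a list for the intent log. The `Pool` class and its method names are hypothetical and only mirror the three bullets above: log and return, commit later, replay after a crash.

```python
class Pool:
    """Hypothetical model of synchronous writes with an intent log."""

    def __init__(self):
        self.on_disk = {}     # state as of the last committed TXG
        self.pending = {}     # in-memory changes, lost on a crash
        self.intent_log = []  # log records already flushed to stable storage

    def sync_write(self, key, value):
        self.intent_log.append((key, value))  # flush the change to the log
        self.pending[key] = value
        # Return to the caller without waiting for the TXG commit.

    def commit_txg(self):
        self.on_disk.update(self.pending)
        self.pending.clear()
        self.intent_log.clear()  # committed changes no longer need their log records

    def crash(self):
        self.pending.clear()     # memory is gone; on_disk and intent_log survive

    def replay(self):
        for key, value in self.intent_log:
            self.pending[key] = value
        self.commit_txg()


pool = Pool()
pool.sync_write("a", 1)
pool.commit_txg()
pool.sync_write("b", 2)   # acknowledged, but not yet in a committed TXG
pool.crash()
state_after_crash = dict(pool.on_disk)
pool.replay()
state_after_replay = dict(pool.on_disk)
```

The committed state is consistent both before and after replay; the log only restores the acknowledged write that had not reached a TXG commit.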
Outline
• ZFS On‐disk Structure
– Storage Pool
– Physical Layout and Logical Organization
– On‐disk Walk
• ZFS Architecture
• Summary
Summary
• ZFS is more than a file system
– Storage manageability: zpool
– Data integrity: checksum, replication
– Data consistency: COW, transactional model
• More about ZFS
– Wiki: http://en.wikipedia.org/wiki/ZFS
– ZFS on Linux: http://zfsonlinux.org
– ZFS on FreeBSD: https://wiki.freebsd.org/ZFS