Optimizing ZFS for Block Storage
Transcript of Optimizing ZFS for Block Storage

Optimizing ZFS for Block Storage
Will Andrews and Justin Gibbs, Spectra Logic Corporation

Talk Outline
• Quick Overview of ZFS
• Motivation for our Work
• Three ZFS Optimizations
  – COW Fault Deferral and Avoidance
  – Asynchronous COW Fault Resolution
  – Asynchronous Read Completions
• Validation of the Changes
• Performance Results
• Commentary
• Further Work
• Acknowledgements

ZFS Feature Overview
• File system/object store + volume manager + RAID
• Data integrity via RAID, checksums stored independently of data, and metadata duplication
• Changes are committed via transactions, allowing fast recovery after an unclean shutdown
• Snapshots
• Deduplication
• Encryption
• Synchronous write journaling
• Adaptive, tiered caching of hot data

Simplified ZFS Block Diagram
[Diagram: the ZFS layers. On top, the ZFS POSIX Layer, ZFS Volumes, Lustre, and the CAM Target Layer form the presentation layer (file, block, or object access), alongside configuration and control via zfs(8) and zpool(8). Below them, the Data Management Unit (objects and caching; transaction management and object coherency) sits atop the Storage Pool Allocator (layout policy; volumes, RAID, snapshots, and the I/O pipeline). The diagram marks where the Spectra optimizations described in this talk apply.]

ZFS Records or Blocks
• ZFS's unit of allocation and modification is the ZFS record.
• Records range from 512B to 128KB.
• The checksum for each record is verified when the record is read, to ensure data integrity.
• Checksums for a record are stored in the parent record (indirect block or DMU node) that references it, which is itself checksummed.
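The verify-on-read idea can be sketched in code. ZFS's default data checksum is a Fletcher-4 variant; the sketch below captures the four-running-sums structure but deliberately omits the on-disk byte-order and truncation details, so treat it as illustrative rather than the real algorithm.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Fletcher-4-style checksum: four running sums over 32-bit words.
 * Sketch of the idea behind ZFS's default data checksum; the
 * on-disk algorithm differs in detail. */
typedef struct { uint64_t a, b, c, d; } cksum4_t;

static cksum4_t
fletcher4_sketch(const uint32_t *words, size_t nwords)
{
    cksum4_t ck = { 0, 0, 0, 0 };
    for (size_t i = 0; i < nwords; i++) {
        ck.a += words[i];
        ck.b += ck.a;
        ck.c += ck.b;
        ck.d += ck.c;
    }
    return ck;
}

/* Verify-on-read: recompute and compare against the checksum the
 * parent block stored for this record. */
static int
record_checksum_ok(const uint32_t *words, size_t nwords, cksum4_t expected)
{
    cksum4_t ck = fletcher4_sketch(words, nwords);
    return ck.a == expected.a && ck.b == expected.b &&
           ck.c == expected.c && ck.d == expected.d;
}
```

Because the expected checksum lives in the (itself checksummed) parent block, a corrupt record cannot vouch for itself: verification always uses metadata from one level up the tree.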

Copy-on-Write, Transactional Semantics
• ZFS never overwrites a currently allocated block:
  – A new version of the storage pool is built in free space
  – The pool is atomically transitioned to the new version
  – Free space from the old version is eventually reused
• Atomicity of the version update is guaranteed by transactions, just like in databases.
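The three steps above (build in free space, atomic transition, deferred reuse) can be sketched with a single atomic root-pointer swap. All names here are illustrative stand-ins, not ZFS's actual structures.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a COW commit: the new pool version is assembled off to
 * the side, then one atomic pointer store transitions readers to it.
 * Readers see either the old version or the new one, never a mix. */
typedef struct pool_version {
    int txg;           /* transaction group this version commits */
    int root_block;    /* stand-in for the block tree rooted here */
} pool_version_t;

static _Atomic(pool_version_t *) active_uberblock;

/* Build the new version entirely in "free space" (fresh memory here),
 * leaving the currently active version untouched. */
static pool_version_t *
build_new_version(pool_version_t *scratch, int txg, int root_block)
{
    scratch->txg = txg;
    scratch->root_block = root_block;
    return scratch;
}

/* The commit itself is a single atomic store. */
static void
commit_version(pool_version_t *nv)
{
    atomic_store(&active_uberblock, nv);
}

static pool_version_t *
current_version(void)
{
    return atomic_load(&active_uberblock);
}
```

The old version's blocks stay valid after the swap, which is why crash recovery is fast: an interrupted commit simply leaves the previous root in place.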

ZFS Transactions
• Each write is assigned a transaction.
• Transactions are written in batches called "transaction groups" (TXGs) that aggregate the I/O into sequential streams for optimum write bandwidth.
• TXGs are pipelined to keep the I/O subsystem saturated:
  – Open TXG: current version of objects; most changes happen here.
  – Quiescing TXG: waiting for writers to finish changes to in-memory buffers.
  – Syncing TXG: buffers being committed to disk.
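The three-stage pipeline above can be modeled as a simple shift register: advancing it retires the syncing group and opens a new one. This is a toy model for exposition, not ZFS's txg.c.

```c
#include <assert.h>

/* Toy TXG pipeline: at any instant one group is open, one may be
 * quiescing, and one may be syncing. */
enum { TXG_OPEN = 0, TXG_QUIESCING = 1, TXG_SYNCING = 2, TXG_STAGES = 3 };

typedef struct {
    int txg[TXG_STAGES];   /* txg number in each stage; 0 = empty */
    int next_txg;
} txg_pipeline_t;

static void
txg_init(txg_pipeline_t *p)
{
    p->txg[TXG_OPEN] = 1;
    p->txg[TXG_QUIESCING] = 0;
    p->txg[TXG_SYNCING] = 0;
    p->next_txg = 2;
}

/* Shift every group one stage down the pipeline and open a new one.
 * In real ZFS this only happens once the syncing group has finished
 * committing to disk. */
static void
txg_advance(txg_pipeline_t *p)
{
    p->txg[TXG_SYNCING] = p->txg[TXG_QUIESCING]; /* quiesced -> syncing */
    p->txg[TXG_QUIESCING] = p->txg[TXG_OPEN];    /* open -> quiescing */
    p->txg[TXG_OPEN] = p->next_txg++;            /* new open group */
}
```

Because all three stages can hold work at once, the disk never sits idle waiting for the open group to fill: that is the "keep the I/O subsystem saturated" claim in pipeline form.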

Copy-on-Write in Action
[Diagram: the überblock (root of the storage pool) points to a DMU node (root of an object, e.g. a file), which points through indirect blocks (indirect linkage for object expansion) to data blocks. A write allocates a new data block plus new copies of every block on the path up to a new überblock, leaving the old version intact.]

Tracking Transaction Groups
• DMU Buffer (DBUF): metadata for ZFS blocks being modified
• Dirty Record: syncer information for committing the data
[Diagram: one DMU buffer with three dirty records, one per transaction group (open, quiescing, syncing), each referencing its own copy of the record data; the open TXG holds the current object version, and time advances left to right.]
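The one-dirty-record-per-TXG structure can be sketched as follows. The slot arithmetic and field names are illustrative; real ZFS chains dirty records off the dbuf rather than using a fixed array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy DMU buffer: one dirty record per in-flight TXG, each carrying
 * its own copy of the record data, so the syncer can commit an older
 * version while the open TXG keeps modifying the newest. */
#define TXG_CONCURRENT 3   /* open, quiescing, syncing */

typedef struct {
    int txg;        /* 0 = slot unused */
    int data;       /* stand-in for this TXG's copy of the record */
} dirty_record_t;

typedef struct {
    dirty_record_t dr[TXG_CONCURRENT];
} dmu_buf_toy_t;

/* Find (or create) the dirty record for this txg and update its data.
 * A new TXG reuses the slot of a long-retired one. */
static dirty_record_t *
dbuf_dirty_toy(dmu_buf_toy_t *db, int txg, int data)
{
    dirty_record_t *dr = &db->dr[txg % TXG_CONCURRENT];
    if (dr->txg != txg) {
        dr->txg = txg;
        dr->data = 0;
    }
    dr->data = data;
    return dr;
}
```

The key property the diagram shows, and the toy preserves, is isolation: writes landing in the open TXG never disturb the copy the syncer is committing.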
Performance Demo

Performance Analysis
When we write an existing block, we must mark it dirty…

```c
void
dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
{
	int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;

	ASSERT(tx->tx_txg != 0);
	ASSERT(!refcount_is_zero(&db->db_holds));

	DB_DNODE_ENTER(db);
	if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
		rf |= DB_RF_HAVESTRUCT;
	DB_DNODE_EXIT(db);

	(void) dbuf_read(db, NULL, rf);
	(void) dbuf_dirty(db, tx);
}
```

Doctor, it hurts when I do this…
• Why does ZFS read on writes?
  – ZFS records are never overwritten directly
  – Any missing old data must be read before the new version of the record can be written
  – This behavior is a COW fault
• Observations
  – Block consumers (databases, disk images, FC LUNs, etc.) are always overwriting existing data
  – Why read data in a sequential workload when you are destined to discard it?
  – Why force the writer to wait on that read?
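The observation above reduces to a small decision: a read before write (COW fault resolution) is only needed when the write leaves part of the old record visible. This is an illustrative helper, not a ZFS function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Does this write require the record's old contents before the new
 * version can be assembled? A full-record overwrite, or a record
 * already in cache, needs no read at all. */
static bool
write_needs_old_data(uint64_t write_off, uint64_t write_len,
    uint64_t rec_off, uint64_t rec_len, bool record_cached)
{
    if (record_cached)
        return false;                        /* old data already in memory */
    if (write_off <= rec_off &&
        write_off + write_len >= rec_off + rec_len)
        return false;                        /* full overwrite: old data dead */
    return true;                             /* partial write: must resolve */
}
```

Aligned sequential block workloads hit the full-overwrite case on every record, which is exactly why forcing a read there (as stock ZFS did) hurts so much.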

Optimization #1: Deferred Copy-on-Write Faults
How Hard Can It Be? (Famous Last Words)

DMU Buffer State Machine (Before)
[Diagram: states UNCACHED, READ, FILL, CACHED, and EVICT, with transitions labeled Read Issued, Read Complete, Full Block Write, Truncate, Copy Complete, and Teardown.]
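One plausible encoding of the "before" state machine follows; the exact transition edges are our reading of the diagram's labels, so take them as an assumption rather than the authoritative dbuf.c behavior.

```c
#include <assert.h>

/* States and events from the diagram above. */
typedef enum { DB_UNCACHED, DB_READ, DB_FILL, DB_CACHED, DB_EVICTING } db_state_t;

typedef enum {
    EV_READ_ISSUED, EV_READ_COMPLETE, EV_FULL_BLOCK_WRITE,
    EV_TRUNCATE, EV_COPY_COMPLETE, EV_TEARDOWN
} db_event_t;

/* Returns the next state, or -1 for an invalid transition. */
static int
db_next_state(db_state_t s, db_event_t ev)
{
    switch (s) {
    case DB_UNCACHED:
        if (ev == EV_READ_ISSUED) return DB_READ;
        if (ev == EV_FULL_BLOCK_WRITE || ev == EV_TRUNCATE) return DB_FILL;
        break;
    case DB_READ:
        if (ev == EV_READ_COMPLETE) return DB_CACHED;
        break;
    case DB_FILL:
        if (ev == EV_COPY_COMPLETE) return DB_CACHED;
        break;
    case DB_CACHED:
        if (ev == EV_TEARDOWN) return DB_EVICTING;
        break;
    default:
        break;
    }
    return -1;
}
```

The structural point is that every path to CACHED either reads the whole old record or fills the whole new one; there is no state for "partially new, old part pending", which is precisely what the deferred-COW work adds.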
DMU Buffer State Machine (After)

[Animation, "Tracking Transaction Groups": with deferred COW faults, a DMU buffer starts UNCACHED with a dirty record in the open TXG. A partial write moves it through PARTIAL|FILL to PARTIAL, and each new TXG adds another dirty record with its own record data as the groups advance through open, quiescing, and syncing. When the syncer processes the record, the buffer enters the READ state and dispatches a synchronous read into a separate read buffer; when the read returns, the old data is merged into the dirty records' data and the buffer ends in the CACHED state.]
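The merge step in the animation above can be sketched concretely: when the resolving read returns, old bytes are copied into the dirty buffer only where the writer has not already put new data. A per-byte validity map stands in for ZFS's actual dirty-range tracking.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Merge the resolving read's data under the writer's data: bytes the
 * writer already supplied win; holes are filled from the old version. */
static void
merge_resolved_read(uint8_t *dirty_buf, const uint8_t *written,
    const uint8_t *read_buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!written[i])               /* hole: fill from old version */
            dirty_buf[i] = read_buf[i];
}
```

The order matters: merging must never clobber new data, so the dirty map is consulted byte-for-byte (range-for-range in a real implementation) rather than copying the old record wholesale.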

Optimization #2: Asynchronous Fault Resolution

Issues with Implementation #1
• The syncer stalls due to its synchronous resolve behavior
• Resolving reads that are known to be needed are delayed
  – Example: a modified version of the record is created in a new TXG
• Writers should be able to cheaply start the resolve process without blocking
• The syncer should operate on multiple COW faults in parallel

Complications
• Split brain
  – A ZFS record can have multiple-personality disorder
    • Example: write, truncate, write again, all in flight at the same time with a resolving read
  – The term reflects how dealing with this issue made us feel
• Chaining the syncer's write to the resolving read
  – This read may have been started in advance of syncer processing, due to a writer noticing that resolution was necessary

Optimization #3: Asynchronous Reads

ZFS – Block Diagram
[Diagram: the same layered block diagram as before (POSIX Layer, Presentation Layer, Configuration & Control, DMU, SPA), annotated to contrast thread-blocking semantics with callback semantics across the layers.]

Asynchronous DMU I/O
• Goal: get as much I/O in flight as possible
• Uses thread-local storage (TLS)
  – Avoids lock-order reversals
  – Avoids modifying APIs just to pass down a queue
  – No lock overhead, since the storage is per-thread
• Refcounting while issuing I/Os ensures the callback is not called until the entire I/O completes
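The refcounting pattern described above is common in I/O stacks: the issuer takes a reference before dispatching child I/Os and drops it when dispatch finishes, so the completion callback fires exactly once, when the last reference (issuer or child) goes away. This is an illustrative, single-threaded stand-in for the DMU's machinery, not its actual API.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct async_io {
    atomic_int refs;
    void (*done)(struct async_io *);
    int completed_children;
} async_io_t;

static int aio_done_count;          /* observable side effect for the demo */

static void
aio_mark_done(async_io_t *io)
{
    (void)io;
    aio_done_count++;
}

static void
aio_init(async_io_t *io, void (*done)(async_io_t *))
{
    atomic_init(&io->refs, 1);      /* issuer holds a ref during dispatch */
    io->done = done;
    io->completed_children = 0;
}

static void
aio_rele(async_io_t *io)
{
    if (atomic_fetch_sub(&io->refs, 1) == 1)
        io->done(io);               /* last ref gone: whole I/O complete */
}

static void
aio_child_issue(async_io_t *io)     /* one ref per child I/O dispatched */
{
    atomic_fetch_add(&io->refs, 1);
}

static void
aio_child_done(async_io_t *io)      /* a child completed: drop its ref */
{
    io->completed_children++;
    aio_rele(io);
}
```

The issuer's own reference is what prevents the race the slide alludes to: without it, a fast-completing first child could fire the callback while later children were still being dispatched.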
Results

Bugs, bugs, bugs…
• Deadlocks
• Data corruption
• Missed events
• Wrong arguments to bcopy
• Page faults
• Bad comments
• Invalid state machine transitions
• Split-brain conditions
• Incorrect refcounting
• Unprotected critical sections
• Sleeping while holding non-sleepable locks
• Memory leaks
• Insufficient interlocking
Disclaimer: this is not a complete list.

Validation
• ZFS has many complex moving parts
• Simply thrashing a ZFS pool is not a sufficient test
  – Many hidden parts make use of the DMU layer without being directly involved in data I/O, or involved in it at all
• Extensive modifications of the DMU layer require thorough verification
  – Every object in ZFS uses the DMU layer to support its transactional nature

Testing, testing, testing…
• Many more asserts added
• Solaris Test Framework (STF) ZFS test suite
  – Extensively modified to (mostly) pass on FreeBSD
  – Has ~300 tests; needs more
• ztest: unit(-ish) test suite
  – Its element of randomization requires multiple test runs
  – Some test frequencies increased to verify fixes
• xdd: performance tests
  – Finds bugs involving high workloads

Cleanup & Refactoring
• DMU I/O APIs rewritten to allow issuing async I/Os, minimize hold/release cycles, and unify the API for all callers
• DBUF dirty path restructured
  – Now looks more like a checklist than an organically grown process
  – Broken apart to reduce complexity and ease understanding of its many nuances

Performance Results
• It goes 3-10X faster! Without breaking (almost) anything!
• Results that follow are for the following configuration:
  – RAIDZ2 of four 2TB SATA drives on a 6Gb LSI SAS HBA
  – Xen HVM DomU with 4GB RAM and 4 cores of a 2GHz Xeon
  – 10GB ZVOL, 128KB record size
  – Care taken to avoid cache effects

1 Thread Performance Results
[Bar chart, before vs. after, y-axis 0-450: Aligned 128K Sequential Write (MB/s), Aligned 16K Sequential Write (MB/s), Unaligned 128K Sequential Write (MB/s), 16K Random Write (IOPS), 16K Random Read (IOPS).]

10 Thread Performance Results
[Bar chart, before vs. after, y-axis 0-400: Aligned 128K Sequential Write (MB/s), Aligned 16K Sequential Write (MB/s), Unaligned 128K Sequential Write (MB/s), 16K Random Write (IOPS), 16K Random Read (IOPS).]

Commentary
• Commercial consumption of open source works best when the code is well written and documented
  – Drastically improved comments and code readability
• Community differences & development choices
  – Sun had a small ZFS team that stayed together
  – FreeBSD has a large group of people who will frequently work on one area and then move on to another
  – A clear coding style, naming conventions, and test cases are required for long-term maintainability

Further Work
• Apply the deferred COW fault optimization to indirect blocks
  – Uncached metadata still blocks writers, and this can cut write performance in half
• Fetch required indirect blocks asynchronously
• Eliminate copies and allow larger I/O cluster sizes in the SPA clustered I/O implementation
• Improve read prefetch performance for sequential read workloads
• Hybrid RAIDZ and/or a more standard RAID 5/6 transform
• All the other things that have kept Kirk working on file systems for 30 years

Acknowledgments
• Sun's original ZFS team for developing ZFS
• Pawel Dawidek for the FreeBSD port
• HighCloud Security for the FreeBSD port of the STF ZFS test suite
• Illumos for continuing open source ZFS development
• Spectra Logic for funding our work

Questions?
Preliminary patch set: http://people.freebsd.org/~will/zfs/