Fast and Reliable Stream Storage through Differential Data Journaling
Andromachi Hatzieleftheriou
MSc Thesis
Supervisor: Stergios Anastasiadis
2 of 40
Thesis Motivation
• We study the real-time storage of massive stream data
▫ real-time or retrospective processing
▫ e.g. monitoring applications
   continuous data received from sensors in real time
   video and audio streams of high quality at high rates
   environmental measurements at much lower rates
• Traditional file and database systems are insufficient
▫ excessive resource requirements in the case of high-volume streaming traffic
▫ need for system facilities for the storage of heterogeneous streams
   different rate and content characteristics
• General-purpose file systems use journaling to synchronously move data or metadata from memory to disk with sequential throughput
▫ data journaling incurs high disk overhead
3 of 40
• Data journaling should be enabled for random writes but disabled for large sequential writes
• Need to efficiently and reliably store multiple concurrent streams
▫ individual stream appends are perfectly sequential
▫ the aggregate workload is random-access
▫ unclear what is the most appropriate way to handle the incoming data
• We examine the possibility of employing data journaling techniques
▫ combine sequential throughput with low latency during synchronous writes
• We introduce differential data journaling to minimize the cost of data journaling
▫ only the actually modified bytes are logged, not the entire corresponding blocks
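The saving can be sketched with a hypothetical traffic comparison: full-block logging pays a whole block per touched block, while differential logging pays only the modified bytes plus a small record header. The 4 KB block size matches ext3 defaults; the 12-byte header is an assumption for illustration, not a value from the thesis.

```python
# Hypothetical sketch: journal traffic of full-block logging vs. logging
# only the modified bytes. BLOCK_SIZE matches ext3 defaults; the 12-byte
# per-record header (offset + length bookkeeping) is an assumption.
BLOCK_SIZE = 4096

def blocks_touched(offset, length):
    """Number of file-system blocks a write of `length` bytes at `offset` spans."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1

def full_block_traffic(writes):
    """Default data journaling: every touched block is logged in full."""
    return sum(blocks_touched(off, n) * BLOCK_SIZE for off, n in writes)

def differential_traffic(writes, header=12):
    """Differential data journaling: only the modified bytes are logged,
    plus a small per-record header."""
    return sum(n + header for _, n in writes)
```

For example, a single 128-byte synchronous append costs a full 4096-byte journal block under full-block logging, but only 140 bytes under differential logging.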
Thesis Motivation
4 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
5 of 40
Fast and Reliable Storage
• File system operations can be:
▫ data operations that update user data
▫ metadata operations that modify the structure of the file system
• Several techniques have been proposed to achieve high performance during data and metadata updates
• Operating systems are susceptible to hardware and power failures that damage their efficiency and reliability
▫ a special utility is needed during reboot to recover the file system
▫ the system remains offline while the disk is scanned and repaired
6 of 40
Synchronous Writes & Soft Updates
• Synchronous writes
▫ pending writes must complete before the next ones can be submitted
▫ significant performance loss
• Soft updates
▫ ordering between metadata writes
▫ list of metadata dependencies per disk block
▫ after a crash
   system mounted and used immediately
   remaining inconsistencies corrected in the background
7 of 40
Log-Structured File Systems
• Data and metadata updates
▫ initially buffered in the cache
▫ then written sequentially to a continuous stream
• Main features
▫ disk treated as a segmented append-only log
▫ indexing information needed for efficient reads
▫ costly seeks avoided, maximizing disk write throughput
• After a crash
▫ the system reconstructs its state from the last consistent point in the log
• Log space needs to be constantly reclaimed
▫ garbage collection
8 of 40
Journaling File Systems
• Metadata updates are written to a circular append-only journal before being committed to the main file system
▫ batching opportunities
▫ synchronous writes complete faster
   sequential throughput
• Logging of data modifications is also supported
▫ performance improvement for synchronous writes
▫ significant journal throughput
   full blocks are logged even for small writes, instead of only the modified parts
• After a crash
▫ replay the last updates from the journal
9 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
10 of 40
General Features
• Each high-level change to the file system is performed in two steps:
1. the modified blocks are copied into the journal
2. the modified blocks are sent to their final disk location
• Journal features:
▫ treated as a circular buffer
▫ a file within the same file system or a separate disk partition
11 of 40
Journaling Modes
• Ordered Mode
▫ only metadata logged
▫ data writes forced to the fixed location right before metadata is written to the journal
▫ strong consistency semantics
• Data Mode
▫ both data and metadata logged
▫ data blocks written twice
▫ strong consistency semantics
• Writeback Mode
▫ only metadata logged
▫ data blocks written directly to final location
▫ no ordering between the journal and fixed-location data writes
▫ weak consistency guarantees
[Diagram: per-mode timelines of sync, commit, and checkpoint events — journal writes (metadata only, or data + metadata) followed by checkpoint writes of data and metadata to their final locations]
12 of 40
Journal Structure
• Journal superblock
▫ tracks summary information for the journal
• Journal descriptor block
▫ marks the beginning of a transaction
▫ describes the subsequent journaled blocks
• Journal data and metadata blocks
• Journal commit block
▫ written at the end of a transaction
▫ marks that data and metadata are safe on disk
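These block types can be distinguished by a small common header. Below is a minimal sketch loosely modeled on the Linux JBD layer's journal header (magic number, block type, transaction sequence number); the field order and sizes here are illustrative, not the real ext3 on-disk layout.

```python
import struct

# Minimal sketch of a common journal block header: a magic number, a
# block-type code, and the sequence number of the owning transaction.
# The magic value is the one used by the Linux JBD layer; the packed
# layout itself is illustrative.
JOURNAL_MAGIC = 0xC03B3998
DESCRIPTOR, COMMIT, SUPERBLOCK = 1, 2, 3   # illustrative type codes

HEADER = struct.Struct(">III")             # magic, blocktype, sequence

def pack_header(blocktype, sequence):
    return HEADER.pack(JOURNAL_MAGIC, blocktype, sequence)

def parse_header(raw):
    magic, blocktype, sequence = HEADER.unpack_from(raw)
    if magic != JOURNAL_MAGIC:
        raise ValueError("not a journal block")
    return blocktype, sequence
```

During recovery, checking the magic and sequence number of each header is how complete transactions are identified in the log.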
13 of 40
Kernel Buffers
• Page cache
▫ keeps page copies from recently accessed disk files in memory
• Block buffer
▫ in-memory buffer of each disk block
▫ allocated in units called buffer pages
• Buffer head descriptor
▫ specifies all the handling information required by the kernel to locate the corresponding block on disk
14 of 40
Flushing Dirty Buffers to Disk
• Goal: dirty pages that accumulate in memory need to be written to disk
• pdflush kernel threads
▫ systematically scan the page cache for dirty pages to flush, every writeback period
▫ ensure that no page remains dirty for longer than the expiration period
• kjournald kernel thread
▫ commits the current state of the file system every commit interval
▫ flushes the dirty buffers of the committed transactions to their final location
   checkpoint process
• fsync system call
▫ forces all dirty data and metadata buffers of a specified file descriptor to disk
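The fsync path can be exercised from user space with a small sketch: on ext3 the call does not return until the journal commit covering the file's dirty buffers has reached disk, which is exactly the synchronous-write pattern studied in this thesis.

```python
import os
import tempfile

def append_record(fd, payload):
    """Append one record and force it to stable storage. On ext3,
    os.fsync() does not return until the corresponding journal commit
    (data and/or metadata, depending on the mode) is on disk."""
    written = os.write(fd, payload)
    os.fsync(fd)   # synchronous write: triggers a journal commit
    return written

# usage: every record is durable before the next one is issued
fd, path = tempfile.mkstemp()
try:
    for sample in (b"sensor-1:23.5\n", b"sensor-2:19.1\n"):
        append_record(fd, sample)
finally:
    os.close(fd)
    os.unlink(path)
```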
15 of 40
Commit Policy
• Process of writing to the journal the dirty buffers modified by a transaction
• Commit is initiated when:
▫ the commit interval expires
▫ write updates need to be synchronously written to disk
• For each journal block buffer:
▫ a buffer head specifies the respective block number in the journal and points to the original copy of the block buffer
▫ a journal head points to the corresponding transaction
16 of 40
Commit Process
• A journal descriptor block is allocated
▫ contains tags that map block buffers to their final location
• When it fills up:
1. it is written to the journal
2. the corresponding block buffers follow
3. a journal commit block is synchronously written to the journal
• Additional journal descriptor blocks can be allocated
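The ordering above can be sketched as follows. The per-tag size, and hence the descriptor capacity, is an assumption for illustration, not the real ext3 on-disk value.

```python
# Sketch of the commit ordering: a descriptor block listing final
# locations, the journaled buffers themselves, then a commit block.
# TAG_SIZE (and hence descriptor capacity) is an assumption.
BLOCK_SIZE = 4096
TAG_SIZE = 16
TAGS_PER_DESCRIPTOR = BLOCK_SIZE // TAG_SIZE

def commit_sequence(dirty_block_numbers):
    """Return the ordered list of journal writes for one transaction."""
    writes = []
    for i in range(0, len(dirty_block_numbers), TAGS_PER_DESCRIPTOR):
        chunk = dirty_block_numbers[i:i + TAGS_PER_DESCRIPTOR]
        writes.append(("descriptor", chunk))            # step 1
        writes.extend(("data", blk) for blk in chunk)   # step 2
    writes.append(("commit", None))  # step 3: written synchronously last
    return writes
```

Writing the commit block last is what makes the transaction atomic: a crash before it leaves an incomplete transaction that recovery simply ignores.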
17 of 40
Recovery Policy
• Recovery process
▫ automatically started after an unclean shutdown
▫ scans the log for complete transactions that need to be replayed
• Three phases are needed:
▫ PASS_SCAN scans the end of the journal
▫ PASS_REVOKE prevents older journal records from being replayed on top of newer data using the same block
▫ PASS_REPLAY writes the newest versions of all the blocks that need to be replayed to their final disk location
• The system can crash before the recovery finishes
▫ the same journal can be reused
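The three passes can be sketched over an in-memory model of the journal; this is a simplified illustration of the logic, not the JBD code.

```python
def recover(journal):
    """Simplified three-pass recovery over a journal modeled as an
    ordered list of records: ("descriptor", [block_numbers]),
    ("data", payload), ("revoke", [block_numbers]), ("commit", seq).
    Returns the block images that would go to the final location."""
    # PASS_SCAN: find the last record of a committed transaction
    last = max((i for i, (kind, _) in enumerate(journal) if kind == "commit"),
               default=-1)
    committed = journal[:last + 1]
    # PASS_REVOKE: collect blocks whose older journal copies must not be
    # replayed on top of newer data
    revoked = {blk for kind, payload in committed if kind == "revoke"
               for blk in payload}
    # PASS_REPLAY: keep the newest version of every non-revoked block
    final, pending = {}, []
    for kind, payload in committed:
        if kind == "descriptor":
            pending = list(payload)
        elif kind == "data" and pending:
            blk = pending.pop(0)
            if blk not in revoked:
                final[blk] = payload
    return final
```

Note that an uncommitted tail transaction is discarded by PASS_SCAN, and re-running `recover` is idempotent, which is why a crash during recovery is harmless.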
18 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
19 of 40
Design Goals
• We investigate the performance characteristics of data journaling in the context of synchronous writes
• Data journaling features:
▫ synchronous writes complete faster
   take advantage of the sequential journal throughput
▫ significant amount of traffic sent to the journal
   high journal device throughput
▫ traffic changes sublinearly as a function of the write rate
▫ substantial overhead even with small write requests
   due to the full-block logging scheme
• Proposal: a new journaling mode
▫ accumulation of multiple write modifications in a single journal block
[Figure: Total Journal Traffic (MB) vs. Request Size (KB), log-log axes; curves for data, writeback, and ordered journaling]
20 of 40
Design Goals
• Partial Block
▫ new journal block type
▫ accumulates the modifications from multiple writes
• Commit Policy
▫ only the modified part of individual data blocks should be journaled
▫ for fully modified data or metadata blocks, entire blocks can be logged
• Recovery Policy
▫ whole blocks are read from the journal and written back to their final location
▫ for partially modified blocks
   the original disk block should first be read from the final location
   then written back updated with the difference retrieved from the journal
21 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
22 of 40
Partial Blocks
• Used to gather the partial updates of data blocks
• Two different types of journal blocks are used:
▫ partial blocks, which store multiple data writes smaller than the default block size
▫ non-partial blocks, which correspond to metadata or fully written data buffers
23 of 40
Journal Heads & Tags
• Journal heads
▫ we use them to prepare the blocks that are actually sent to the journal
▫ we added two new fields for partial modifications
   offset and length of the partially modified block
• Tags
▫ allocated during commit, one per block buffer
▫ contain the following fields:
   final disk location of the modified block
   four flags for journal-specific block properties
   a flag indicating whether the corresponding block is partially modified or not (added)
   length of the new bytes (added)
   starting offset in the data block of the final disk location (added)
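The extended tag can be modeled directly. Field names here are illustrative, and the last three fields are the additions described above for partial modifications.

```python
from dataclasses import dataclass

@dataclass
class JournalTag:
    """Per-buffer tag stored in the descriptor block. Field names are
    illustrative; the last three fields model the thesis's additions."""
    final_block: int   # final disk location of the modified block
    flags: int         # four journal-specific block-property flags
    partial: bool      # added: is the block only partially modified?
    length: int = 0    # added: number of new bytes
    offset: int = 0    # added: starting offset within the final data block

def make_partial_tag(final_block, offset, length):
    return JournalTag(final_block, flags=0, partial=True,
                      length=length, offset=offset)
```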
24 of 40
Commit Process
• A descriptor block and a partial data block are allocated
• Partially modified data blocks
▫ modifications copied consecutively into the partial data block
• Metadata or fully written data blocks
▫ the corresponding full blocks are logged
• When the descriptor fills up:
1. it is written to the journal
2. all the corresponding block buffers follow
3. a journal commit block is written to the journal
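The packing step can be sketched as follows; overflow into additional partial blocks is omitted to keep the sketch short.

```python
BLOCK_SIZE = 4096

def pack_partial_block(updates):
    """Copy the modified bytes of several small writes consecutively into
    one journal partial block. `updates` is a list of
    (final_block, offset, payload) tuples; returns the tag fields and
    the block image. Overflow into a second partial block is omitted."""
    tags, image, cursor = [], bytearray(BLOCK_SIZE), 0
    for final_block, offset, payload in updates:
        assert cursor + len(payload) <= BLOCK_SIZE, "partial block full"
        image[cursor:cursor + len(payload)] = payload
        tags.append((final_block, offset, len(payload)))  # tag fields
        cursor += len(payload)
    return tags, bytes(image)
```

Because the payloads are packed back to back, many small synchronous writes share one journal block instead of each consuming a full block.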
25 of 40
Recovery Policy
• Data modifications are retrieved from the journal and applied to the final blocks
• From each retrieved journal descriptor block the included tags are extracted
▫ describe partial or full write or metadata modifications
• Non-partial modification
▫ the next block is retrieved from the journal and written to the final location
• Partial modification
▫ the next block is retrieved from the journal partial data block
▫ the original disk block is read into a kernel buffer
▫ the modification is copied from the journal block buffer to the proper final buffer
   the starting offset and length tag fields are used
▫ when the end of the current partial block is exceeded, the next one is retrieved from the journal
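Replaying one partial record is thus a read-modify-write of the final block; a minimal sketch:

```python
def apply_partial(original_block, partial_block, cursor, offset, length):
    """Replay one partial modification: `original_block` is the block
    read from the final location, `partial_block` the journal partial
    data block, and `cursor` the read position inside it. Returns the
    patched block image and the advanced cursor."""
    patched = bytearray(original_block)
    patched[offset:offset + length] = partial_block[cursor:cursor + length]
    return bytes(patched), cursor + length
```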
26 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
27 of 40
Performance Measurements
• We examine the disk throughput requirements and the average latency of each write under streaming workloads
• We measure performance in an environment of temporary small files
▫ investigate the benefit of data journaling in applications other than streaming
• We examine the possible overheads of our implementation
▫ recovery time
▫ CPU load
28 of 40
Streaming Workloads
• Massive numbers of streams synchronously written to the same disk facility
• We examine the performance characteristics of streams with different rates, while varying the degree of concurrency
▫ data rate: the amount of data that is stored per unit of time
• At each execution:
▫ a sequence of write updates is synchronously applied to the system for a specified amount of time
▫ different record sizes are used according to the rate
   low rates: small request sizes
   high rates: large request sizes
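The rate-to-request-size mapping can be sketched as below. The thresholds and sizes are assumptions for illustration, not the parameters actually used in the evaluation.

```python
# Hypothetical sketch of the workload driver: each stream issues
# synchronous writes at a fixed data rate, and the record size grows
# with the rate. Thresholds and sizes are illustrative assumptions.
def record_size(rate_bps):
    if rate_bps <= 1_000:        # e.g. environmental measurements
        return 128
    if rate_bps <= 10_000:
        return 1_024
    return 65_536                # e.g. high-quality video streams

def writes_per_second(rate_bps):
    """How many synchronous write calls per second one stream issues."""
    return (rate_bps / 8) / record_size(rate_bps)
```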
29 of 40
Flushing Policy
• Manual tuning of the dirty-page flush timers according to the rate and the number of streams
• Low-rate streams
▫ accumulation of multiple write updates in memory for a long period
   batching opportunities
▫ long expiration interval and frequent writeback period
   avoid filling up the journal and the memory
• High-rate streams
▫ high volumes of data fill up the journal and the memory rather soon
▫ default or slightly reduced expiration and writeback periods
   according to the generated amount of data
30 of 40
Journal Device Throughput
• Low-rate streams
▫ low for the writeback and ordered modes
▫ much higher for default data journaling
▫ comparable to the metadata-only modes for differential data journaling
• High-rate streams
▫ significantly high for both data journaling modes
[Figure: Journal Throughput (MB/s) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
31 of 40
Final Location Throughput
• Low-rate streams
▫ low for both data journaling modes
▫ several factors higher for the metadata-only journaling modes
• High-rate streams
▫ comparable across all four modes
[Figure: File System Throughput (MB/s) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
32 of 40
Write Response Time
• Much higher for the metadata-only journaling modes compared to the data journaling modes
• Data journaling benefits from the journal's sequential throughput
▫ fast and reliable storage opportunities
[Figure: Write Latency (ms, log scale) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
33 of 40
CPU Utilization
• Expected CPU overhead for differential data journaling
▫ memory copy of the modified parts to the appropriate journal partial block
• For both high- and low-rate streams
▫ CPU load less than 10%
▫ mostly idle
• Insignificant extra CPU cost for differential data journaling
34 of 40
Postmark Benchmark
• Postmark is used to study the performance of small writes
▫ typical of electronic mail, newsgroups and web-based commerce
• Significant improvement of the supported transaction rate for the data journaling modes
▫ low write latency: more transactions served per second
[Figure: Postmark Transactions/s vs. Request Size (128, 1024, 4096, 16384 bytes); bars for data, differential data, writeback, and ordered journaling]
35 of 40
Recovery Time
• Scan phase
▫ high latency for default data journaling
▫ low latency for the metadata-only modes
▫ latency of differential data journaling comparable to the metadata-only modes
• Revoke phase
▫ equal across all four modes
• Replay phase
▫ comparable latency for the two data journaling modes
   despite the extra block reads of differential data journaling
▫ much lower latency for the metadata-only modes
36 of 40
Experimental Results
• Streaming workloads
▫ differential data journaling substantially reduces the journal traffic of data journaling
   especially for low-rate streams
▫ significant reduction of the write latency for the data journaling modes with respect to metadata-only journaling
• Typical small-write workload
▫ substantial improvement in the supported transaction rate
37 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
38 of 40
Conclusions
• Emerging need for the real-time storage of massive stream data
▫ fresh look at file systems that support data journaling
• New journaling mode: differential data journaling
▫ accumulation of multiple updates into a single journal block
▫ fast and reliable storage at relatively low disk throughput requirements
39 of 40
Future Work
• Many directions for future work, mainly regarding the performance evaluation of our implementation
▫ Investigation of the automatic tuning of system parameters related to the timing of dirty page flushes
▫ Direct comparison with log-structured or other journaling file systems in order to demonstrate the benefits of our architecture
▫ Further examination of the performance of differential data journaling under heterogeneous workloads
▫ Examination of the behavior of differential data journaling under some database workload
▫ Experimentation in a real streaming environment
40 of 40
Thank you..!
41 of 40
42 of 40
Journaling Objects
• Log record
▫ corresponds to a low-level operation that updates a disk block
▫ represented as full blocks
• Atomic operation handle
▫ corresponds to a high-level operation
   multiple low-level operations
▫ during recovery
   either the whole high-level operation is applied or none of its low-level operations
• Transaction
▫ consists of multiple atomic operation handles
43 of 40
Stream Archival Servers
• Design can be based on two possible architectures:
▫ a relational database
   not designed for the rapid and continuous loading of individual data items
   ill-equipped to handle numerous continuous queries over data streams
   insufficient for real-time requirements
▫ a conventional file system
   mainly cares to maintain integrity across crashes without compromising performance
   should not compromise the playback performance
   should exploit the particular I/O characteristics of individual streams
   e.g. StreamFS, used for the storage of high-volume streams
44 of 40
Checkpoint Policy
• Limited amount of journal space that needs to be reclaimed
• Process of ensuring that a section of the log is committed fully to disk, so that that portion of the log can be reused
• Checkpoint occurs when:
▫ there is not enough journal space left
   free space is between 1/4 and 1/2 of the journal size
▫ the journal is being flushed to disk
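The low-space trigger can be sketched as a simple predicate; the slide bounds the threshold between 1/4 and 1/2 of the journal size, and this sketch checks the 1/4 bound as an illustration.

```python
def needs_checkpoint(free_blocks, journal_blocks):
    """Checkpoint trigger on low journal space. The threshold lies
    between 1/4 and 1/2 of the journal size; the 1/4 bound is used
    here for illustration."""
    return free_blocks < journal_blocks // 4
```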
45 of 40
Enabling/Disabling the Disk Write Cache
• Synchronous write operations return as soon as the data reaches the on-disk write cache rather than the storage media
• Disabling the write cache scales down the performance of the different modes
• Significant advantage of data journaling with respect to the ordered mode
▫ small writes