Fast and Reliable Stream Storage through Differential Data Journaling
Andromachi Hatzieleftheriou
MSc Thesis
Supervisor: Stergios Anastasiadis
2 of 40
Thesis Motivation
• We study the real-time storage of massive stream data
▫ real-time or retrospective processing
▫ e.g. monitoring applications
   continuous data received from sensors in real time
   video and audio streams of high quality at high rates
   environmental measurements at much lower rates
• Traditional file and database systems are insufficient
▫ excessive resource requirements in the case of high-volume streaming traffic
▫ need for system facilities for the storage of heterogeneous streams
   different rate and content characteristics
• General-purpose file systems use journaling to synchronously move data or metadata from memory to disk with sequential throughput
▫ data journaling incurs high disk overhead
3 of 40
• Data journaling should be enabled for random writes but disabled for large sequential writes
• Need to efficiently and reliably store multiple concurrent streams
▫ individual stream appends are perfectly sequential
▫ the aggregate workload is random-access
▫ unclear what is the most appropriate way to handle the incoming data
• We examine the possibility of employing data journaling techniques
▫ combine sequential throughput with low latency during synchronous writes
• We introduce differential data journaling to minimize the cost of data journaling
▫ only the actually modified bytes are logged, not the entire corresponding blocks
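The saving can be sketched with a hypothetical traffic comparison: full-block logging pays a whole block per touched block, while differential logging pays only the modified bytes plus a small record header. The 4 KB block size matches ext3 defaults; the 12-byte header is an assumption for illustration, not a value from the thesis.

```python
# Hypothetical sketch: journal traffic of full-block logging vs. logging
# only the modified bytes. BLOCK_SIZE matches ext3 defaults; the 12-byte
# per-record header (offset + length bookkeeping) is an assumption.
BLOCK_SIZE = 4096

def blocks_touched(offset, length):
    """Number of file-system blocks a write of `length` bytes at `offset` spans."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1

def full_block_traffic(writes):
    """Default data journaling: every touched block is logged in full."""
    return sum(blocks_touched(off, n) * BLOCK_SIZE for off, n in writes)

def differential_traffic(writes, header=12):
    """Differential data journaling: only the modified bytes are logged,
    plus a small per-record header."""
    return sum(n + header for _, n in writes)
```

For example, a single 128-byte synchronous append costs a full 4096-byte journal block under full-block logging, but only 140 bytes under differential logging.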
Thesis Motivation
4 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
5 of 40
Fast and Reliable Storage
• File system operations can be:
▫ data operations that update user data
▫ metadata operations that modify the structure of the file system
• Several techniques have been proposed to achieve high performance during data and metadata updates
• Operating systems are susceptible to hardware and power failures that damage their efficiency and reliability
▫ a special utility is needed during reboot to recover the file system
▫ the system remains offline while the disk is scanned and repaired
6 of 40
Synchronous Writes & Soft Updates
• Synchronous writes
▫ pending writes must complete before the next ones can be submitted
▫ significant performance loss
• Soft updates
▫ ordering between metadata writes
▫ list of metadata dependencies per disk block
▫ after a crash
   system mounted and used immediately
   remaining inconsistencies corrected in the background
7 of 40
Log-Structured File Systems
• Data and metadata updates
▫ initially buffered in the cache
▫ then written sequentially to a continuous stream
• Main features
▫ disk treated as a segmented append-only log
▫ indexing information needed for efficient reads
▫ costly seeks avoided, maximizing disk write throughput
• After a crash
▫ the system reconstructs its state from the last consistent point in the log
• Log space needs to be constantly reclaimed
▫ garbage collection
8 of 40
Journaling File Systems
• Metadata updates are written to a circular append-only journal before being committed to the main file system
▫ batching opportunities
▫ synchronous writes complete faster
   sequential throughput
• Logging of data modifications is also supported
▫ performance improvement for synchronous writes
▫ significant journal throughput
   full blocks are logged even for small writes, instead of only the modified parts
• After a crash
▫ replay the last updates from the journal
9 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
10 of 40
General Features
• Each high-level change to the file system is performed in two steps:
1. the modified blocks are copied into the journal
2. the modified blocks are sent to their final disk location
• Journal features:
▫ treated as a circular buffer
▫ a file within the same file system or a separate disk partition
11 of 40
Journaling Modes
• Ordered Mode
▫ only metadata logged
▫ data writes forced to the fixed location right before metadata is written to the journal
▫ strong consistency semantics
• Data Mode
▫ both data and metadata logged
▫ data blocks written twice
▫ strong consistency semantics
• Writeback Mode
▫ only metadata logged
▫ data blocks written directly to final location
▫ no ordering between the journal and fixed-location data writes
▫ weak consistency guarantees
[Diagram: per-mode timelines of sync, commit, and checkpoint events — journal writes (metadata only, or data + metadata) followed by checkpoint writes of data and metadata to their final locations]
12 of 40
Journal Structure
• Journal superblock
▫ tracks summary information for the journal
• Journal descriptor block
▫ marks the beginning of a transaction
▫ describes the subsequent journaled blocks
• Journal data and metadata blocks
• Journal commit block
▫ written at the end of a transaction
▫ marks that data and metadata are safe on disk
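These block types can be distinguished by a small common header. Below is a minimal sketch loosely modeled on the Linux JBD layer's journal header (magic number, block type, transaction sequence number); the field order and sizes here are illustrative, not the real ext3 on-disk layout.

```python
import struct

# Minimal sketch of a common journal block header: a magic number, a
# block-type code, and the sequence number of the owning transaction.
# The magic value is the one used by the Linux JBD layer; the packed
# layout itself is illustrative.
JOURNAL_MAGIC = 0xC03B3998
DESCRIPTOR, COMMIT, SUPERBLOCK = 1, 2, 3   # illustrative type codes

HEADER = struct.Struct(">III")             # magic, blocktype, sequence

def pack_header(blocktype, sequence):
    return HEADER.pack(JOURNAL_MAGIC, blocktype, sequence)

def parse_header(raw):
    magic, blocktype, sequence = HEADER.unpack_from(raw)
    if magic != JOURNAL_MAGIC:
        raise ValueError("not a journal block")
    return blocktype, sequence
```

During recovery, checking the magic and sequence number of each header is how complete transactions are identified in the log.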
13 of 40
Kernel Buffers
• Page cache
▫ keeps page copies from recently accessed disk files in memory
• Block buffer
▫ in-memory buffer of each disk block
▫ allocated in units called buffer pages
• Buffer head descriptor
▫ specifies all the handling information required by the kernel to locate the corresponding block on disk
14 of 40
Flushing Dirty Buffers to Disk
• Goal: dirty pages that accumulate in memory need to be written to disk
• pdflush kernel threads
▫ systematically scan the page cache for dirty pages to flush, every writeback period
▫ ensure that no page remains dirty for longer than the expiration period
• kjournald kernel thread
▫ commits the current state of the file system every commit interval
▫ flushes the dirty buffers of the committed transactions to their final location
   checkpoint process
• fsync system call
▫ forces all dirty data and metadata buffers of a specified file descriptor to disk
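The fsync path can be exercised from user space with a small sketch: on ext3 the call does not return until the journal commit covering the file's dirty buffers has reached disk, which is exactly the synchronous-write pattern studied in this thesis.

```python
import os
import tempfile

def append_record(fd, payload):
    """Append one record and force it to stable storage. On ext3,
    os.fsync() does not return until the corresponding journal commit
    (data and/or metadata, depending on the mode) is on disk."""
    written = os.write(fd, payload)
    os.fsync(fd)   # synchronous write: triggers a journal commit
    return written

# usage: every record is durable before the next one is issued
fd, path = tempfile.mkstemp()
try:
    for sample in (b"sensor-1:23.5\n", b"sensor-2:19.1\n"):
        append_record(fd, sample)
finally:
    os.close(fd)
    os.unlink(path)
```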
15 of 40
Commit Policy
• Process of writing to the journal the dirty buffers modified by a transaction
• Commit is initiated when:
▫ the commit interval expires
▫ write updates need to be synchronously written to disk
• For each journal block buffer:
▫ a buffer head specifies the respective block number in the journal and points to the original copy of the block buffer
▫ a journal head points to the corresponding transaction
16 of 40
Commit Process
• A journal descriptor block is allocated
▫ contains tags that map block buffers to their final location
• When it fills up:
1. it is written to the journal
2. the corresponding block buffers follow
3. a journal commit block is synchronously written to the journal
• Additional journal descriptor blocks can be allocated
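The ordering above can be sketched as follows. The per-tag size, and hence the descriptor capacity, is an assumption for illustration, not the real ext3 on-disk value.

```python
# Sketch of the commit ordering: a descriptor block listing final
# locations, the journaled buffers themselves, then a commit block.
# TAG_SIZE (and hence descriptor capacity) is an assumption.
BLOCK_SIZE = 4096
TAG_SIZE = 16
TAGS_PER_DESCRIPTOR = BLOCK_SIZE // TAG_SIZE

def commit_sequence(dirty_block_numbers):
    """Return the ordered list of journal writes for one transaction."""
    writes = []
    for i in range(0, len(dirty_block_numbers), TAGS_PER_DESCRIPTOR):
        chunk = dirty_block_numbers[i:i + TAGS_PER_DESCRIPTOR]
        writes.append(("descriptor", chunk))            # step 1
        writes.extend(("data", blk) for blk in chunk)   # step 2
    writes.append(("commit", None))  # step 3: written synchronously last
    return writes
```

Writing the commit block last is what makes the transaction atomic: a crash before it leaves an incomplete transaction that recovery simply ignores.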
17 of 40
Recovery Policy
• Recovery process
▫ automatically started after an unclean shutdown
▫ scans the log for complete transactions that need to be replayed
• Three phases are needed:
▫ PASS_SCAN scans the end of the journal
▫ PASS_REVOKE prevents older journal records from being replayed on top of newer data using the same block
▫ PASS_REPLAY writes the newest versions of all the blocks that need to be replayed to their final disk location
• The system can crash before the recovery finishes
▫ the same journal can be reused
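The three passes can be sketched over an in-memory model of the journal; this is a simplified illustration of the logic, not the JBD code.

```python
def recover(journal):
    """Simplified three-pass recovery over a journal modeled as an
    ordered list of records: ("descriptor", [block_numbers]),
    ("data", payload), ("revoke", [block_numbers]), ("commit", seq).
    Returns the block images that would go to the final location."""
    # PASS_SCAN: find the last record of a committed transaction
    last = max((i for i, (kind, _) in enumerate(journal) if kind == "commit"),
               default=-1)
    committed = journal[:last + 1]
    # PASS_REVOKE: collect blocks whose older journal copies must not be
    # replayed on top of newer data
    revoked = {blk for kind, payload in committed if kind == "revoke"
               for blk in payload}
    # PASS_REPLAY: keep the newest version of every non-revoked block
    final, pending = {}, []
    for kind, payload in committed:
        if kind == "descriptor":
            pending = list(payload)
        elif kind == "data" and pending:
            blk = pending.pop(0)
            if blk not in revoked:
                final[blk] = payload
    return final
```

Note that an uncommitted tail transaction is discarded by PASS_SCAN, and re-running `recover` is idempotent, which is why a crash during recovery is harmless.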
18 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
19 of 40
Design Goals
• We investigate the performance characteristics of data journaling in the context of synchronous writes
• Data journaling features:
▫ synchronous writes complete faster
   take advantage of the sequential journal throughput
▫ significant amount of traffic sent to the journal
   high journal device throughput
▫ traffic changes sublinearly as a function of the write rate
▫ substantial overhead even with small write requests
   due to the full-block logging scheme
• Proposal: a new journaling mode
▫ accumulation of multiple write modifications in a single journal block
[Figure: Total Journal Traffic (MB) vs. Request Size (KB), log-log axes; curves for data, writeback, and ordered journaling]
20 of 40
Design Goals
• Partial Block
▫ new journal block type
▫ accumulates the modifications from multiple writes
• Commit Policy
▫ only the modified part of individual data blocks should be journaled
▫ for fully modified data or metadata blocks, entire blocks can be logged
• Recovery Policy
▫ whole blocks are read from the journal and written back to their final location
▫ for partially modified blocks
   the original disk block should first be read from the final location
   then written back updated with the difference retrieved from the journal
21 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
22 of 40
Partial Blocks
• Used to gather the partial updates of data blocks
• Two different types of journal blocks are used:
▫ partial blocks, which store multiple data writes smaller than the default block size
▫ non-partial blocks, which correspond to metadata or fully written data buffers
23 of 40
Journal Heads & Tags
• Journal heads
▫ we use them to prepare the blocks that are actually sent to the journal
▫ we added two new fields for partial modifications
   offset and length of the partially modified block
• Tags
▫ allocated during commit, one per block buffer
▫ contain the following fields:
   final disk location of the modified block
   four flags for journal-specific block properties
   a flag indicating whether the corresponding block is partially modified or not (added)
   length of the new bytes (added)
   starting offset in the data block of the final disk location (added)
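The extended tag can be modeled directly. Field names here are illustrative, and the last three fields are the additions described above for partial modifications.

```python
from dataclasses import dataclass

@dataclass
class JournalTag:
    """Per-buffer tag stored in the descriptor block. Field names are
    illustrative; the last three fields model the thesis's additions."""
    final_block: int   # final disk location of the modified block
    flags: int         # four journal-specific block-property flags
    partial: bool      # added: is the block only partially modified?
    length: int = 0    # added: number of new bytes
    offset: int = 0    # added: starting offset within the final data block

def make_partial_tag(final_block, offset, length):
    return JournalTag(final_block, flags=0, partial=True,
                      length=length, offset=offset)
```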
24 of 40
Commit Process
• A descriptor block and a partial data block are allocated
• Partially modified data blocks
▫ modifications copied consecutively into the partial data block
• Metadata or fully written data blocks
▫ the corresponding full blocks are logged
• When the descriptor fills up:
1. it is written to the journal
2. all the corresponding block buffers follow
3. a journal commit block is written to the journal
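The packing step can be sketched as follows; overflow into additional partial blocks is omitted to keep the sketch short.

```python
BLOCK_SIZE = 4096

def pack_partial_block(updates):
    """Copy the modified bytes of several small writes consecutively into
    one journal partial block. `updates` is a list of
    (final_block, offset, payload) tuples; returns the tag fields and
    the block image. Overflow into a second partial block is omitted."""
    tags, image, cursor = [], bytearray(BLOCK_SIZE), 0
    for final_block, offset, payload in updates:
        assert cursor + len(payload) <= BLOCK_SIZE, "partial block full"
        image[cursor:cursor + len(payload)] = payload
        tags.append((final_block, offset, len(payload)))  # tag fields
        cursor += len(payload)
    return tags, bytes(image)
```

Because the payloads are packed back to back, many small synchronous writes share one journal block instead of each consuming a full block.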
25 of 40
Recovery Policy
• Data modifications are retrieved from the journal and applied to the final blocks
• From each retrieved journal descriptor block the included tags are extracted
▫ describe partial or full write or metadata modifications
• Non-partial modification
▫ the next block is retrieved from the journal and written to the final location
• Partial modification
▫ the next block is retrieved from the journal partial data block
▫ the original disk block is read into a kernel buffer
▫ the modification is copied from the journal block buffer to the proper final buffer
   the starting offset and length tag fields are used
▫ when the end of the current partial block is exceeded, the next one is retrieved from the journal
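Replaying one partial record is thus a read-modify-write of the final block; a minimal sketch:

```python
def apply_partial(original_block, partial_block, cursor, offset, length):
    """Replay one partial modification: `original_block` is the block
    read from the final location, `partial_block` the journal partial
    data block, and `cursor` the read position inside it. Returns the
    patched block image and the advanced cursor."""
    patched = bytearray(original_block)
    patched[offset:offset + length] = partial_block[cursor:cursor + length]
    return bytes(patched), cursor + length
```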
26 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
27 of 40
Performance Measurements
• We examine the disk throughput requirements and the average latency of each write under streaming workloads
• We measure performance in an environment of temporary small files
▫ investigate the benefit of data journaling in applications other than streaming
• We examine the possible overheads of our implementation
▫ recovery time
▫ CPU load
28 of 40
Streaming Workloads
• Massive numbers of streams synchronously written to the same disk facility
• We examine the performance characteristics of streams with different rates, while varying the degree of concurrency
▫ data rate: the amount of data that is stored per unit of time
• At each execution:
▫ a sequence of write updates is synchronously applied to the system for a specified amount of time
▫ different record sizes are used according to the rate
   low rates: small request sizes
   high rates: large request sizes
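The rate-to-request-size mapping can be sketched as below. The thresholds and sizes are assumptions for illustration, not the parameters actually used in the evaluation.

```python
# Hypothetical sketch of the workload driver: each stream issues
# synchronous writes at a fixed data rate, and the record size grows
# with the rate. Thresholds and sizes are illustrative assumptions.
def record_size(rate_bps):
    if rate_bps <= 1_000:        # e.g. environmental measurements
        return 128
    if rate_bps <= 10_000:
        return 1_024
    return 65_536                # e.g. high-quality video streams

def writes_per_second(rate_bps):
    """How many synchronous write calls per second one stream issues."""
    return (rate_bps / 8) / record_size(rate_bps)
```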
29 of 40
Flushing Policy
• Manual tuning of the dirty-page flush timers according to the rate and the number of streams
• Low-rate streams
▫ accumulation of multiple write updates in memory for a long period
   batching opportunities
▫ long expiration interval and frequent writeback period
   avoid filling up the journal and the memory
• High-rate streams
▫ high volumes of data fill up the journal and the memory rather soon
▫ default or slightly reduced expiration and writeback periods
   according to the generated amount of data
30 of 40
Journal Device Throughput
• Low-rate streams
▫ low for the writeback and ordered modes
▫ much higher for default data journaling
▫ comparable to the metadata-only modes for differential data journaling
• High-rate streams
▫ significantly high for both data journaling modes
[Figure: Journal Throughput (MB/s) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
31 of 40
Final Location Throughput
• Low-rate streams
▫ low for both data journaling modes
▫ several factors higher for the metadata-only journaling modes
• High-rate streams
▫ comparable across all four modes
[Figure: File System Throughput (MB/s) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
32 of 40
Write Response Time
• Much higher for the metadata-only journaling modes compared to the data journaling modes
• Data journaling benefits from the journal's sequential throughput
▫ fast and reliable storage opportunities
[Figure: Write Latency (ms, log scale) vs. Number of Streams, three panels — 1 Kbps/stream, 10 Kbps/stream, 1 Mbps/stream; curves for data, differential data, writeback, and ordered journaling]
33 of 40
CPU Utilization
• Expected CPU overhead for differential data journaling
▫ memory copy of the modified parts to the appropriate journal partial block
• For both high- and low-rate streams
▫ CPU load less than 10%
▫ mostly idle
• Insignificant extra CPU cost for differential data journaling
34 of 40
Postmark Benchmark
• Postmark is used to study the performance of small writes
▫ typical of electronic mail, newsgroups and web-based commerce
• Significant improvement of the supported transaction rate for the data journaling modes
▫ low write latency: more transactions served per second
[Figure: Postmark Transactions/s vs. Request Size (128, 1024, 4096, 16384 bytes); bars for data, differential data, writeback, and ordered journaling]
35 of 40
Recovery Time
• Scan phase
▫ high latency for default data journaling
▫ low latency for the metadata-only modes
▫ latency of differential data journaling comparable to the metadata-only modes
• Revoke phase
▫ equal across all four modes
• Replay phase
▫ comparable latency for the two data journaling modes
   despite the extra block reads of differential data journaling
▫ much lower latency for the metadata-only modes
36 of 40
Experimental Results
• Streaming workloads
▫ differential data journaling substantially reduces the journal traffic of data journaling
   especially for low-rate streams
▫ significant reduction of the write latency for the data journaling modes with respect to metadata-only journaling
• Typical small-write workload
▫ substantial improvement in the supported transaction rate
37 of 40
Outline
• Related Work
• Ext3
• Architectural Definition
• Prototype Implementation
• Performance Evaluation
• Conclusions & Future Work
38 of 40
Conclusions
• Emerging need for the real-time storage of massive stream data
▫ fresh look at file systems that support data journaling
• New journaling mode: differential data journaling
▫ accumulation of multiple updates into a single journal block
▫ fast and reliable storage at relatively low disk throughput requirements
39 of 40
Future Work
• Many directions for future work, mainly regarding the performance evaluation of our implementation
▫ Investigation of the automatic tuning of system parameters related to the timing of dirty page flushes
▫ Direct comparison with log-structured or other journaling file systems in order to demonstrate the benefits of our architecture
▫ Further examination of the performance of differential data journaling under heterogeneous workloads
▫ Examination of the behavior of differential data journaling under some database workload
▫ Experimentation in a real streaming environment
40 of 40
Thank you..!
41 of 40
42 of 40
Journaling Objects
• Log record
▫ corresponds to a low-level operation that updates a disk block
▫ represented as full blocks
• Atomic operation handle
▫ corresponds to a high-level operation
   multiple low-level operations
▫ during recovery
   either the whole high-level operation is applied or none of its low-level operations
• Transaction
▫ consists of multiple atomic operation handles
43 of 40
Stream Archival Servers
• Design can be based on two possible architectures:
▫ a relational database
   not designed for the rapid and continuous loading of individual data items
   ill-equipped to handle numerous continuous queries over data streams
   insufficient for real-time requirements
▫ a conventional file system
   mainly cares to maintain integrity across crashes without compromising performance
   should not compromise the playback performance
   should exploit the particular I/O characteristics of individual streams
   e.g. StreamFS, used for the storage of high-volume streams
44 of 40
Checkpoint Policy
• Limited amount of journal space that needs to be reclaimed
• Process of ensuring that a section of the log is committed fully to disk, so that that portion of the log can be reused
• Checkpoint occurs when:
▫ there is not enough journal space left
   free space is between 1/4 and 1/2 of the journal size
▫ the journal is being flushed to disk
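The low-space trigger can be sketched as a simple predicate; the slide bounds the threshold between 1/4 and 1/2 of the journal size, and this sketch checks the 1/4 bound as an illustration.

```python
def needs_checkpoint(free_blocks, journal_blocks):
    """Checkpoint trigger on low journal space. The threshold lies
    between 1/4 and 1/2 of the journal size; the 1/4 bound is used
    here for illustration."""
    return free_blocks < journal_blocks // 4
```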
45 of 40
Enabling/Disabling the Disk Write Cache
• Synchronous write operations return as soon as the data reaches the on-disk write cache rather than the storage media
• Disabling the write cache scales down the performance of the different modes
• Significant advantage of data journaling with respect to the ordered mode
▫ small writes