Data Storage Lab
Benchmarking for Observability: The Case of Diagnosing Storage Failures
Duo Zhang, Mai Zheng
Department of Electrical and Computer Engineering, Iowa State University, USA
Outline
• Background & Motivation
• Understanding Real-World Storage Failures
• Deriving BugBenchk
• Measuring Observability
• Conclusion and Future Work
The Storage Stack & Failures
• The storage stack in the OS kernel is too complex to be bug-free
• It may contain many classic bugs
  • E.g., data races, deadlocks, dangling pointers, buffer overflows, crash-consistency bugs, …
  • EXPLODE@OSDI'06, Lu et al.@FAST'13, HYDRA@SOSP'19, KRace@S&P'20, …
• It is being optimized aggressively for non-volatile memories (NVM), which may introduce new bugs
  • E.g., Zhang et al.@SYSTOR'21
The Storage Stack & Failures
• Bugs may lead to various storage failures in practice, which are often difficult and time-consuming to diagnose
• E.g., the Algolia data center incident:
  • Servers crashed and files were corrupted for unknown reasons
  • After weeks of diagnosis, Samsung SSDs were mistakenly blamed
  • After one month, a Linux kernel bug was identified as the root cause
More advanced debugging support is needed!
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (1) Interactive debuggers
  • Support fine-grained manual inspection
  • Set breakpoints and inspect variable values
  • Step forward/backward to examine the state before/after a point of interest
  • E.g., GDB/KDB (see the sketch below)
Debugging with GDB: https://www.gnu.org/software/gdb/
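As an illustration (not from the original deck), a minimal GDB session might look like the following; the binary name a.out and the function/variable names are hypothetical:

    $ gdb ./a.out                 # start the interactive debugger
    (gdb) break do_write          # set a breakpoint at a function of interest
    (gdb) run                     # execute until the breakpoint is hit
    (gdb) record                  # start recording, enabling reverse execution
    (gdb) print buf               # check a variable's value at this point
    (gdb) next                    # go forward one step
    (gdb) reverse-next            # go backward to re-check the previous state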
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (2) Tracers
  • Collect various events from a running system automatically
  • Software tracers, e.g., FTrace (https://blog.selectel.com/kernel-tracing-ftrace/; see the sketch below)
  • Hardware tracers, e.g., a storage protocol analyzer (Teledyne LeCroy Summit T34 Protocol Analyzer)
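For illustration, FTrace is typically driven through the tracefs interface from a shell; a minimal sketch (the traced function vfs_fsync is just an example):

    cd /sys/kernel/debug/tracing          # or /sys/kernel/tracing on newer kernels
    echo function_graph > current_tracer  # select the call-graph tracer
    echo vfs_fsync > set_graph_function   # restrict tracing to this call subtree
    echo 1 > tracing_on                   # start collecting events
    # ... run the failing workload ...
    echo 0 > tracing_on                   # stop collecting
    cat trace                             # inspect the recorded functions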
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (3) Record & replay tools
  • Record program executions and replay segments of the past execution
  • Instructions, non-deterministic inputs, and snapshots
  • E.g., TTVM@USENIX ATC'05, PANDA (https://panda-re.mit.edu/; see the sketch below)
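As a rough sketch of the record & replay workflow with PANDA (exact binary names and flags vary across versions; the recording name here is an assumption):

    # record: boot the guest under PANDA and capture an execution segment
    panda-system-x86_64 -m 2G -hda guest.qcow2 -monitor stdio
    #   (monitor) begin_record storage_failure   <- start recording
    #   ... trigger the bug workload inside the guest ...
    #   (monitor) end_record                     <- stop recording

    # replay: deterministically re-execute the segment with analysis plugins
    panda-system-x86_64 -m 2G -replay storage_failure -panda osi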
Limitations of Existing Tools?
• Little measurement of these tools' limitations for debugging exists!
  • Existing efforts simply measure the tools' runtime overhead
  • They cannot tell how effective the tools are at diagnosing the root causes of failures
More benchmarking effort is needed!
What Should We Measure?
• Quinn et al.@HotOS'19 define observability as:
  • "The observations that they (debugging tools) allow developers to make"
• Three high-level properties: visibility, repeatability, expressibility
• But there are no qualitative/quantitative metrics for measurement
More concrete metrics are needed!
Understanding Real-World Storage Failures
Methodology
• Collect data-corruption incidents reported on the kernel Bugzilla (https://bugzilla.kernel.org/; see the query sketch below)
• Focus on storage-related components
  • E.g., file systems, IO/Storage
Advanced search on Bugzilla
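The same filter can also be expressed programmatically; a sketch against Bugzilla's REST interface (the exact product/component values here are assumptions):

    # fetch storage-related bug reports from kernel Bugzilla as JSON
    curl -s 'https://bugzilla.kernel.org/rest/bug?product=File%20System&limit=50' \
        -o incidents.json        # fields include creation_time, component, etc.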
• Collected information: reported time, last comment time, kernel version, component(s) involved, number of comments, number of participants
• Calculated information: duration (from report to last comment)
An Example of an Incident Page
Characteristics of Failure Incidents
• Observation 1
  • On average, the incidents took multiple months to resolve and required multiple rounds of discussion among multiple participants
  • This quantitatively shows the difficulty of diagnosing storage failures

Table: Summary of Incidents
Group        Count (%)      Avg. Days   Avg. Comments/Participants
Resolved     136 (49.1%)    146.9       8/3
Unresolved   141 (50.9%)    1444.2      5/2
Overall      277            807.3       6/2
Characteristics of Failure Incidents
• Observation 2
  • The incidents involve all major storage components
  • Consistent with previous studies (e.g., Zhang et al.@SYSTOR'21)
  • Debugging tools need to provide full-stack observability!
Characteristics of Failure Incidents
• Observation 3
  • The average debugging time is consistently long across different components
  • Better debugging support is needed for all components!
Characteristics of Failure Incidents
• Observation 4
  • 37 out of 136 (26.3%) resolved issues involve multiple OS distributions or kernel versions
  • Bugs may elude intensive testing and sneak into new releases
  • Testing in the development environment is not enough; debugging support for failure diagnosis is always needed in practice
• Observation 5
  • Only 5 out of 136 (3.7%) resolved issues were caused by hardware
  • Software bugs remain the dominant cause of storage failures
  • Observing the behavior of the storage software stack is important!
Deriving BugBenchk
• A collection of realistic, reproducible storage failure cases
• Includes all necessary information for reproducing the cases
  • Bug-triggering workloads: C/Bash programs based on the report
  • Environmental information: OS distribution, specific kernel version, system configurations
  • Root causes: the critical functions containing the bugs
• Packaged as virtual machine (VM) images: portable and convenient
• Enables realistic benchmarking & measurement of debugging tools

Example case:
  OS: Ubuntu 18.10 with kernel 4.19.0
  Configuration: GRUB_CMDLINE_LINUX_DEFAULT="fsck.mode=force fsck.repair=no emergency scsi_mod.use_blk_mq=1"
  Workload: install and uninstall multiple packages multiple times (sketched below)
  Other requirements: run with small memory, e.g., 300MB
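A sketch of how such a workload might be replayed inside the packaged VM (the package names and iteration count are illustrative, not the exact BugBenchk workload):

    # per the case description: low-memory VM (e.g., 300MB) + forced fsck on boot
    for i in $(seq 1 10); do
        apt-get install -y tree htop      # install multiple packages
        apt-get remove  -y tree htop      # then uninstall them
    done
    reboot                                # forced fsck runs against the dirtied fs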
• The current prototype includes 9 cases, covering 4 different file systems as well as the block I/O layer

Case ID   Critical Function(s) (partial)                        Bug Type    Bug Size   Workload Size
1-EX4     ext4_do_update_inode, ext4_clear_inode_state          Semantics   8          70 (C)
2-EX4     parse_options                                         Semantics   6          55 (C & Bash)
3-BTRFS   btrfs_ioctl_snap_destroy, btrfs_set_log_full_commit   Semantics   71         54 (C)
4-BTRFS   btrfs_log_trailing_hole                               Semantics   121        61 (C)
5-BTRFS   btrfs_log_all_parents, btrfs_record_unlink_dir        Semantics   13         58 (C)
6-F2FS    f2fs_submit_page_bio, f2fs_is_valid_blkaddr           Memory      94         141 (C & Bash)
7-GFS     gfs2_check_sb, fs_warn                                Memory      18         2 (Bash)
8-BLK     blkdev_fsync, sync_blkdev                             Semantics   12         43 (C)
9-BLK     __blk_mq_issue_directly, blk_mq_requeue_request       Semantics   9          17 (Bash)
Measuring Observability
Experiment Overview
• Evaluated two state-of-the-art debugging tools via BugBenchk
  • FTrace: a tracing utility built directly into the Linux kernel
  • PANDA: a VM-based record & replay platform for program analysis
• Measured a concrete set of metrics according to the tools' features and identified limitations in both tools
  • E.g., zero observability in case of kernel panics
• Enhanced the observability of both tools by adding command-level information
Experiment Result: FTrace
• Key Feature Measured: kernel function tracing

    # tracer: function
    #  TASK-PID      CPU#      TIMESTAMP        EXECUTION TIME   FUNCTION
       <idle>-0      [000]  2965.191280657: funcgraph_entry:            |  switch_mm_irqs_off() {
       <idle>-0      [000]  2965.191281919: funcgraph_entry:  0.283 us  |    load_new_mm_cr3();
       <idle>-0      [000]  2965.191282543: funcgraph_exit:   2.175 us  |  }
       fsync1-5526   [000]  2965.191283332: funcgraph_entry:            |  finish_task_switch() {
       fsync1-5526   [000]  2965.191283719: funcgraph_entry:            |    smp_irq_work_interrupt() {

Case ID   Still Reproducible?   Total # of Func. Traced   Total # of Unique Func.   Critical Func. Observed   Shortest Distance
1-EX4     Yes                   12,506 (±4.1%)            1,152 (±0.7%)             1/7                       -
2-EX4     Yes                   54,796 (±2.3%)            1,436 (±15.9%)            0/1                       2
3-BTRFS   Yes                   46,370 (±5.6%)            1,339 (±1.5%)             3/6                       -
4-BTRFS   Yes                   92,476 (±5.5%)            1,381 (±1.0%)             0/1                       1
5-BTRFS   Yes                   30,528 (±3.6%)            1,419 (±1.5%)             3/4                       -
6-F2FS    Yes                   0                         0                         0/7                       -
7-GFS     Yes                   0                         0                         0/2                       -
8-BLK     Yes                   6,876 (±2.7%)             901 (±4.3%)               1/2                       -
9-BLK     Yes                   110,772,722 (±6.4%)       1,165 (±0.8%)             2/3                       -
• Key Observations
  • All 9 cases in BugBenchk can still be reproduced when applying FTrace, i.e., FTrace is non-intrusive for failure diagnosis
  • FTrace provides rich function-level information for diagnosis; the number of unique functions is much smaller than the total number of functions traced
  • FTrace cannot trace all critical functions: it is limited to the functions listed in the available_filter_functions file in debugfs (see the sketch below)
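One can check up front whether a given function is traceable at all; a small sketch using the critical function of case 2-EX4:

    # FTrace can only attach to functions listed in available_filter_functions
    grep -w parse_options /sys/kernel/debug/tracing/available_filter_functions \
        || echo 'parse_options cannot be traced by FTrace'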
  • FTrace can trace closely related functions (e.g., parent functions) even when it misses the critical function itself
  • FTrace cannot help at all in cases "6-F2FS" and "7-GFS" (i.e., zero observability): both cases cause kernel panics, which take down the in-kernel FTrace itself
A built-in tracer is fundamentally limited for diagnosing severe storage failures!
  • FTrace may generate substantial traces, which can dilute the debugging focus (e.g., over 110 million function entries in case "9-BLK")
A more intelligent debugging method is needed to reason about the root cause automatically!
Experiment Result: PANDA
• Key Features Measured: record & replay, plus 4 plugins (the last four columns below)

Case ID   Still Reproducible?   Record & Replay   Show Instructions   Taint Analysis   Identify Processes   Process-Block Relationship
1-EX4     Yes                   √                 √                   √                √                    √
2-EX4     Yes                   √                 √                   √                √                    √
3-BTRFS   Yes                   √                 √                   √                √                    √
4-BTRFS   Yes                   √                 √                   √                √                    √
5-BTRFS   Yes                   √                 √                   √                √                    √
6-F2FS    Yes                   √                 √                   √                √                    √
7-GFS     Yes                   √                 √                   √                √                    √
8-BLK     Yes                   √                 √                   √                √                    √
9-BLK     No                    N/A               N/A                 N/A              N/A                  N/A
• Key Observations
  • PANDA can be applied to diagnose 8 out of the 9 cases in BugBenchk
  • PANDA can handle severe failures (e.g., kernel panics) that FTrace cannot deal with, by leveraging VM-level isolation
Isolating the target storage software stack from the tool itself is important!
  • PANDA failed in case "9-BLK": heavy non-deterministic events overwhelm the full-stack recording
Reducing the overhead of VM-based full-stack tools is critically important!
Experiment Result: Extensions
• Neither tool can provide complete, full-stack observability
  • Both miss the lowest level of the storage software stack, i.e., the communication between the OS kernel and the storage device
  • Collecting such information typically relies on special hardware support (e.g., a storage protocol analyzer)
Teledyne LeCroy Summit T34 Protocol Analyzer Trace View Software
• FTrace Extension
  • Intercept device commands via a customized iSCSI driver
  • Stitch kernel functions together with device commands based on timestamps (see the sketch below)
  • Enhance observability by combining function-level information with command-level information
    • E.g., a missing SYNC_CACHE command in an fsync system call code path is problematic
FTrace with extended command-level information
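A simplified sketch of the timestamp-based stitching (the log file names and formats here are assumptions; the real extension merges FTrace output with the iSCSI driver's command log):

    # ftrace.log: '<timestamp> <kernel function>'
    # cmds.log:   '<timestamp> <device command, e.g., SYNC_CACHE>'
    sort -n -k1,1 ftrace.log cmds.log > merged.log   # interleave by timestamp
    grep -n 'SYNC_CACHE' merged.log                  # is the flush on the fsync path?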
• PANDA Extension
  • Intercept device commands through a customized QEMU (see the sketch below)
  • Record device commands together with the instruction trace
  • Enhance observability by combining instruction-level information with command-level information
QEMU with extended command-level information
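For reference, stock QEMU's tracing framework can already log low-level storage events, which hints at how the customized QEMU captures device commands (trace event names and availability depend on the QEMU build):

    # emit SCSI-layer trace events to a file alongside the guest run
    qemu-system-x86_64 -m 2G -hda guest.qcow2 \
        -trace enable='scsi_*',file=device_cmds.log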
Conclusion
• Derived BugBenchk from real-world storage failures
  • Enables realistic benchmarking of debugging tools
• Measured the observability of representative debugging tools via BugBenchk
  • Tracers relying on built-in tracepoints/probes/instrumentation may be fundamentally limited
    • Zero observability when the storage failure is too severe, which is exactly when debugging support is needed the most
  • VM-based record & replay tools may be more viable thanks to isolation
    • But heavy overhead can make a failure unreproducible
  • Both methods may generate substantial information
    • Much human effort is still needed to understand the root cause
  • Both methods may miss low-level information
    • This can be remedied by extensions
• Less intrusive and more intelligent solutions are needed to enhance debugging observability
  • More benchmarking effort can guide the design of new solutions
Future Work
• Enrich BugBenchk
  • Add more diverse cases, e.g., driver bugs and persistent-memory-specific bugs
• Derive more metrics and evaluate more tools
  • E.g., quantify the human effort needed to understand the root cause given the limited observability provided by existing tools
• Build intelligent debugging tools to reduce debugging effort
  • E.g., kernel-level data provenance tracking & querying
Thanks!