Data Storage Lab
Benchmarking for Observability: The Case of Diagnosing Storage Failures
Duo Zhang, Mai Zheng
Department of Electrical and Computer Engineering, Iowa State University, USA
Outline
• Background & Motivation
• Understanding Real-World Storage Failures
• Deriving BugBenchk
• Measuring Observability
• Conclusion and Future Work
The Storage Stack & Failures
• The storage stack in the OS kernel is too complex to be bug-free
• It may contain many classic bugs
  • E.g., data races, deadlocks, dangling pointers, buffer overflows, crash-consistency bugs, …
  • EXPLODE@OSDI'06, Lu et al.@FAST'13, HYDRA@SOSP'19, KRace@S&P'20, …
• It is being optimized aggressively for non-volatile memories (NVM), which may introduce new bugs
  • E.g., Zhang et al.@SYSTOR'21
The Storage Stack & Failures
• Bugs may lead to various storage failures in practice, which are often difficult and time-consuming to diagnose
• E.g., the Algolia data center incident:
  • Servers crashed and files were corrupted for unknown reasons
  • After weeks of diagnosis, Samsung SSDs were mistakenly blamed
  • After one month, a Linux kernel bug was identified as the root cause
More advanced debugging support is needed!
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (1) Interactive debuggers
  • Support fine-grained manual inspection
  • Set breakpoints and inspect variable values
  • Step forward/backward to examine the state before/after a point of interest
  • E.g., GDB/KDB (see the sketch below)
Debugging with GDB: https://www.gnu.org/software/gdb/
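As an illustration (not from the original deck), a minimal GDB session might look like the following; the binary name a.out and the function/variable names are hypothetical:

    $ gdb ./a.out                 # start the interactive debugger
    (gdb) break do_write          # set a breakpoint at a function of interest
    (gdb) run                     # execute until the breakpoint is hit
    (gdb) record                  # start recording, enabling reverse execution
    (gdb) print buf               # check a variable's value at this point
    (gdb) next                    # go forward one step
    (gdb) reverse-next            # go backward to re-check the previous state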
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (2) Tracers
  • Collect various events from a running system automatically
  • Software tracers, e.g., FTrace (https://blog.selectel.com/kernel-tracing-ftrace/; see the sketch below)
  • Hardware tracers, e.g., a storage protocol analyzer (Teledyne LeCroy Summit T34 Protocol Analyzer)
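For illustration, FTrace is typically driven through the tracefs interface from a shell; a minimal sketch (the traced function vfs_fsync is just an example):

    cd /sys/kernel/debug/tracing          # or /sys/kernel/tracing on newer kernels
    echo function_graph > current_tracer  # select the call-graph tracer
    echo vfs_fsync > set_graph_function   # restrict tracing to this call subtree
    echo 1 > tracing_on                   # start collecting events
    # ... run the failing workload ...
    echo 0 > tracing_on                   # stop collecting
    cat trace                             # inspect the recorded functions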
Limitations of Existing Tools?
• Three categories of debugging tools exist for failure diagnosis
• (3) Record & replay tools
  • Record program executions and replay segments of the past execution
  • Instructions, non-deterministic inputs, and snapshots
  • E.g., TTVM@USENIX ATC'05, PANDA (https://panda-re.mit.edu/; see the sketch below)
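As a rough sketch of the record & replay workflow with PANDA (exact binary names and flags vary across versions; the recording name here is an assumption):

    # record: boot the guest under PANDA and capture an execution segment
    panda-system-x86_64 -m 2G -hda guest.qcow2 -monitor stdio
    #   (monitor) begin_record storage_failure   <- start recording
    #   ... trigger the bug workload inside the guest ...
    #   (monitor) end_record                     <- stop recording

    # replay: deterministically re-execute the segment with analysis plugins
    panda-system-x86_64 -m 2G -replay storage_failure -panda osi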
Limitations of Existing Tools?
• Little measurement of these tools' limitations for debugging exists!
  • Existing efforts simply measure the tools' runtime overhead
  • They cannot tell how effective the tools are at diagnosing the root causes of failures
More benchmarking effort is needed!
What Should We Measure?
• Quinn et al.@HotOS'19 define observability as:
  • "The observations that they (debugging tools) allow developers to make"
• Three high-level properties: visibility, repeatability, expressibility
• But there are no qualitative/quantitative metrics for measurement
More concrete metrics are needed!
Understanding Real-World Storage Failures
Methodology
• Collect data-corruption incidents reported on the kernel Bugzilla (https://bugzilla.kernel.org/; see the query sketch below)
• Focus on storage-related components
  • E.g., file systems, IO/Storage
Advanced search on Bugzilla
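The same filter can also be expressed programmatically; a sketch against Bugzilla's REST interface (the exact product/component values here are assumptions):

    # fetch storage-related bug reports from kernel Bugzilla as JSON
    curl -s 'https://bugzilla.kernel.org/rest/bug?product=File%20System&limit=50' \
        -o incidents.json        # fields include creation_time, component, etc.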
• Collected information: reported time, last comment time, kernel version, component(s) involved, number of comments, number of participants
• Calculated information: duration (from report to last comment)
An Example of an Incident Page
Characteristics of Failure Incidents
• Observation 1
  • On average, the incidents took multiple months to resolve and required multiple rounds of discussion among multiple participants
  • This quantitatively shows the difficulty of diagnosing storage failures

Table: Summary of Incidents
Group        Count (%)      Avg. Days   Avg. Comments/Participants
Resolved     136 (49.1%)    146.9       8/3
Unresolved   141 (50.9%)    1444.2      5/2
Overall      277            807.3       6/2
Characteristics of Failure Incidents
• Observation 2
  • The incidents involve all major storage components
  • Consistent with previous studies (e.g., Zhang et al.@SYSTOR'21)
  • Debugging tools need to provide full-stack observability!
Characteristics of Failure Incidents
• Observation 3
  • The average debugging time is consistently long across different components
  • Better debugging support is needed for all components!
Characteristics of Failure Incidents
• Observation 4
  • 37 out of 136 (26.3%) resolved issues involve multiple OS distributions or kernel versions
  • Bugs may elude intensive testing and sneak into new releases
  • Testing in the development environment is not enough; debugging support for failure diagnosis is always needed in practice
• Observation 5
  • Only 5 out of 136 (3.7%) resolved issues were caused by hardware
  • Software bugs remain the dominant cause of storage failures
  • Observing the behavior of the storage software stack is important!
Deriving BugBenchk
• A collection of realistic, reproducible storage failure cases
• Includes all necessary information for reproducing the cases
  • Bug-triggering workloads: C/Bash programs based on the report
  • Environmental information: OS distribution, specific kernel version, system configurations
  • Root causes: the critical functions containing the bugs
• Packaged as virtual machine (VM) images: portable and convenient
• Enables realistic benchmarking & measurement of debugging tools

Example case:
  OS: Ubuntu 18.10 with kernel 4.19.0
  Configuration: GRUB_CMDLINE_LINUX_DEFAULT="fsck.mode=force fsck.repair=no emergency scsi_mod.use_blk_mq=1"
  Workload: install and uninstall multiple packages multiple times (sketched below)
  Other requirements: run with small memory, e.g., 300MB
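A sketch of how such a workload might be replayed inside the packaged VM (the package names and iteration count are illustrative, not the exact BugBenchk workload):

    # per the case description: low-memory VM (e.g., 300MB) + forced fsck on boot
    for i in $(seq 1 10); do
        apt-get install -y tree htop      # install multiple packages
        apt-get remove  -y tree htop      # then uninstall them
    done
    reboot                                # forced fsck runs against the dirtied fs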
• The current prototype includes 9 cases, covering 4 different file systems as well as the block I/O layer

Case ID   Critical Function(s) (partial)                        Bug Type    Bug Size   Workload Size
1-EX4     ext4_do_update_inode, ext4_clear_inode_state          Semantics   8          70 (C)
2-EX4     parse_options                                         Semantics   6          55 (C & Bash)
3-BTRFS   btrfs_ioctl_snap_destroy, btrfs_set_log_full_commit   Semantics   71         54 (C)
4-BTRFS   btrfs_log_trailing_hole                               Semantics   121        61 (C)
5-BTRFS   btrfs_log_all_parents, btrfs_record_unlink_dir        Semantics   13         58 (C)
6-F2FS    f2fs_submit_page_bio, f2fs_is_valid_blkaddr           Memory      94         141 (C & Bash)
7-GFS     gfs2_check_sb, fs_warn                                Memory      18         2 (Bash)
8-BLK     blkdev_fsync, sync_blkdev                             Semantics   12         43 (C)
9-BLK     __blk_mq_issue_directly, blk_mq_requeue_request       Semantics   9          17 (Bash)
Measuring Observability
Experiment Overview
• Evaluated two state-of-the-art debugging tools via BugBenchk
  • FTrace: a tracing utility built directly into the Linux kernel
  • PANDA: a VM-based record & replay platform for program analysis
• Measured a concrete set of metrics according to the tools' features and identified limitations in both tools
  • E.g., zero observability in case of kernel panics
• Enhanced the observability of both tools by adding command-level information
Experiment Result: FTrace
• Key Feature Measured: kernel function tracing

    # tracer: function
    #  TASK-PID      CPU#      TIMESTAMP        EXECUTION TIME   FUNCTION
       <idle>-0      [000]  2965.191280657: funcgraph_entry:            |  switch_mm_irqs_off() {
       <idle>-0      [000]  2965.191281919: funcgraph_entry:  0.283 us  |    load_new_mm_cr3();
       <idle>-0      [000]  2965.191282543: funcgraph_exit:   2.175 us  |  }
       fsync1-5526   [000]  2965.191283332: funcgraph_entry:            |  finish_task_switch() {
       fsync1-5526   [000]  2965.191283719: funcgraph_entry:            |    smp_irq_work_interrupt() {

Case ID   Still Reproducible?   Total # of Func. Traced   Total # of Unique Func.   Critical Func. Observed   Shortest Distance
1-EX4     Yes                   12,506 (±4.1%)            1,152 (±0.7%)             1/7                       -
2-EX4     Yes                   54,796 (±2.3%)            1,436 (±15.9%)            0/1                       2
3-BTRFS   Yes                   46,370 (±5.6%)            1,339 (±1.5%)             3/6                       -
4-BTRFS   Yes                   92,476 (±5.5%)            1,381 (±1.0%)             0/1                       1
5-BTRFS   Yes                   30,528 (±3.6%)            1,419 (±1.5%)             3/4                       -
6-F2FS    Yes                   0                         0                         0/7                       -
7-GFS     Yes                   0                         0                         0/2                       -
8-BLK     Yes                   6,876 (±2.7%)             901 (±4.3%)               1/2                       -
9-BLK     Yes                   110,772,722 (±6.4%)       1,165 (±0.8%)             2/3                       -
• Key Observations
  • All 9 cases in BugBenchk can still be reproduced when applying FTrace, i.e., FTrace is non-intrusive for failure diagnosis
  • FTrace provides rich function-level information for diagnosis; the number of unique functions is much smaller than the total number of functions traced
  • FTrace cannot trace all critical functions: it is limited to the functions listed in the available_filter_functions file in debugfs (see the sketch below)
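One can check up front whether a given function is traceable at all; a small sketch using the critical function of case 2-EX4:

    # FTrace can only attach to functions listed in available_filter_functions
    grep -w parse_options /sys/kernel/debug/tracing/available_filter_functions \
        || echo 'parse_options cannot be traced by FTrace'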
  • FTrace can trace closely related functions (e.g., parent functions) even when it misses the critical function itself
  • FTrace cannot help at all in cases "6-F2FS" and "7-GFS" (i.e., zero observability): both cases cause kernel panics, which take down the in-kernel FTrace itself
A built-in tracer is fundamentally limited for diagnosing severe storage failures!
  • FTrace may generate substantial traces, which can dilute the debugging focus (e.g., over 110 million function entries in case "9-BLK")
A more intelligent debugging method is needed to reason about the root cause automatically!
Experiment Result: PANDA
• Key Features Measured: record & replay, plus 4 plugins (the last four columns below)

Case ID   Still Reproducible?   Record & Replay   Show Instructions   Taint Analysis   Identify Processes   Process-Block Relationship
1-EX4     Yes                   √                 √                   √                √                    √
2-EX4     Yes                   √                 √                   √                √                    √
3-BTRFS   Yes                   √                 √                   √                √                    √
4-BTRFS   Yes                   √                 √                   √                √                    √
5-BTRFS   Yes                   √                 √                   √                √                    √
6-F2FS    Yes                   √                 √                   √                √                    √
7-GFS     Yes                   √                 √                   √                √                    √
8-BLK     Yes                   √                 √                   √                √                    √
9-BLK     No                    N/A               N/A                 N/A              N/A                  N/A
• Key Observations
  • PANDA can be applied to diagnose 8 out of the 9 cases in BugBenchk
  • PANDA can handle severe failures (e.g., kernel panics) that FTrace cannot deal with, by leveraging VM-level isolation
Isolating the target storage software stack from the tool itself is important!
  • PANDA failed in case "9-BLK": heavy non-deterministic events overwhelm the full-stack recording
Reducing the overhead of VM-based full-stack tools is critically important!
Experiment Result: Extensions
• Neither tool can provide complete, full-stack observability
  • Both miss the lowest level of the storage software stack, i.e., the communication between the OS kernel and the storage device
  • Collecting such information typically relies on special hardware support (e.g., a storage protocol analyzer)
Teledyne LeCroy Summit T34 Protocol Analyzer Trace View Software
• FTrace Extension
  • Intercept device commands via a customized iSCSI driver
  • Stitch kernel functions together with device commands based on timestamps (see the sketch below)
  • Enhance observability by combining function-level information with command-level information
    • E.g., a missing SYNC_CACHE command in an fsync system call code path is problematic
FTrace with extended command-level information
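A simplified sketch of the timestamp-based stitching (the log file names and formats here are assumptions; the real extension merges FTrace output with the iSCSI driver's command log):

    # ftrace.log: '<timestamp> <kernel function>'
    # cmds.log:   '<timestamp> <device command, e.g., SYNC_CACHE>'
    sort -n -k1,1 ftrace.log cmds.log > merged.log   # interleave by timestamp
    grep -n 'SYNC_CACHE' merged.log                  # is the flush on the fsync path?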
• PANDA Extension
  • Intercept device commands through a customized QEMU (see the sketch below)
  • Record device commands together with the instruction trace
  • Enhance observability by combining instruction-level information with command-level information
QEMU with extended command-level information
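For reference, stock QEMU's tracing framework can already log low-level storage events, which hints at how the customized QEMU captures device commands (trace event names and availability depend on the QEMU build):

    # emit SCSI-layer trace events to a file alongside the guest run
    qemu-system-x86_64 -m 2G -hda guest.qcow2 \
        -trace enable='scsi_*',file=device_cmds.log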
Conclusion
• Derived BugBenchk from real-world storage failures
  • Enables realistic benchmarking of debugging tools
• Measured the observability of representative debugging tools via BugBenchk
  • Tracers relying on built-in tracepoints/probes/instrumentation may be fundamentally limited
    • Zero observability when the storage failure is too severe, which is exactly when debugging support is needed the most
  • VM-based record & replay tools may be more viable thanks to isolation
    • But heavy overhead can make a failure unreproducible
  • Both methods may generate substantial information
    • Much human effort is still needed to understand the root cause
  • Both methods may miss low-level information
    • This can be remedied by extensions
• Less intrusive and more intelligent solutions are needed to enhance debugging observability
  • More benchmarking effort can guide the design of new solutions
Future Work
• Enrich BugBenchk
  • Add more diverse cases, e.g., driver bugs and persistent-memory-specific bugs
• Derive more metrics and evaluate more tools
  • E.g., quantify the human effort needed to understand the root cause given the limited observability provided by existing tools
• Build intelligent debugging tools to reduce debugging effort
  • E.g., kernel-level data provenance tracking & querying
Thanks!