Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason...

40
Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella

Transcript of Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason...

Page 1: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

Sledgehammer: Cluster-Fueled Debugging

Andrew Quinn, Jason Flinn, and Michael Cafarella

Page 2: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

2

• What was your most challenging bug?

Debugging

Page 3: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

3

Tracing Tools

[1] “Where is the Bug and How Is It Fixed? An Experiment with Practitioners” (FSE ‘17)[2] “Characterizing Logging Practices in Open-Source Software” (ICSE ‘12)

SourceCode Invariant

Log

• Inject logic into software to track program state [1][2]

• Re-create failing execution and analyze tracing output

Inject

Execute

Analyze Result

Tracing output

Page 4: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

4

Tracing Tools• Tradeoff between amount of tracing and overhead

Amount of Tracing

Trac

ing

Over

head

Simple Bugs Challenging Bugs

Low Overhead

High Overhead

Page 5: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

5

Tracing Tools• Tradeoff between amount of tracing and overhead

Amount of Tracing

Trac

ing

Over

head

Simple Bugs Challenging Bugs

Low Overhead

High Overhead

Page 6: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

6

Cluster-Fueled Debugging• Accelerate complex debugging queries • Shared cluster => on-demand debugging tool

Page 7: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

7

Tracing Tools

Analyzer

AnalysisOutput

TracingTracing

Execution

• But, the work of these tools are sequential!

??

?

Page 8: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

8

Sledgehammer

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

Tracing

• Replay-Based: queries run over past execution

Page 9: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

9

Sledgehammer

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

ToolScales to 1024 cores (speedup 416x)

• Replay-Based: queries run over past execution

Page 10: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

10

Outline

• Debugging Scenario• Parallelizing a Query• Evaluation

Page 11: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

11

What’s the root cause?What line of code?

Scenario - memcached

“corrupt cache”

Developer

Page 12: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

12

Scenario – Today’s Tools

“corrupt cache”

//walk cache data-structureChar *checkCache ();

// find invalidion points: void invalidInsts(int fd, int out);

Analyzers

checkCache();

Instrumentation

Tracing

Developer

Page 13: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

13

Scenario – Today’s Tools

“corrupt cache”

//walk cache data-structureChar *checkCache ();

// find invalidion points: void invalidInsts(int fd, int out);

Analyzers

checkCache();

Instrumentation

Tracing

Developer

Amount of Tracing

Trac

ing

Ove

rhea

d

AcceptableOverhead

Limited Tracing Full Tracing

Infeasible Overhead

Continuous Function

Evaluation

Page 14: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

14

Scenario - Sledgehammer

“corrupt cache”

//walk cache data-structureChar *checkCache ();

// find invalidion points: void invalidInsts(int fd, int out);

Analyzers

SH_Cont(checkCache)

Annotations

Tracing

(New Debugging Tool)Continuous Function

Evaluation: execute tracing after

each instruction

SledgehammerDeveloper

CFE Log:inst1, valid

inst2, invalid…

Page 15: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

15

Scenario - Sledgehammer

“corrupt cache”

//walk cache data-structureChar *checkCache ();

// find invalidion points: void invalidInsts(int fd, int out);

Analyzers

SH_Cont(checkCache)

Annotations

Tracing

(New Debugging Tool)Continuous Function

Evaluation: execute tracing after

each instruction

SledgehammerDeveloper

CFE Log:inst1, valid

inst2, invalid…

inst1inst2

Page 16: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

16

Scenario - Sledgehammer

“corrupt cache”

//walk cache data-structureChar *checkCache ();

// find invalidion points: void invalidInsts(int fd, int out);

Analyzers

SH_Cont(checkCache)

Annotations

Tracing

(New Debugging Tool)Continuous Function

Evaluation: execute tracing after

each instruction

SledgehammerDeveloper

CFE Log:inst1, valid

inst2, invalid…

inst1inst2

Does the lock protect the cache?

Page 17: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

17

Scenario - memcached

“corrupt cache”

//walk cache data-structureChar *checkCache ();void lockAcquire(mutex *m);void lockRelease(mutex *m);

// find invalid and unlocked: void badUpdate(int fd, int out);

Analyzers

SH_Cont(checkCache)SH_Hook(lock, lockAcquire)SH_Hook(unlock, lockRelease)

Annotations

Tracing

(New Debugging Tool)Continuous Function

Evaluation: Track tracing output

after every instruction

(Cutting-edge Tool)Retro-Logging:

Inject new logging into past execution

SledgehammerDeveloper

Introvirt SOSP ‘05

Page 18: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

18

Scenario - Sledgehammer

“corrupt cache”

//walk cache data-structureChar *checkCache ();void lockAcquire(mutex *m);void lockRelease(mutex *m);

// find invalid and unlocked: void badUpdate(int fd, int out);

Analyzers

SH_Cont(checkCache)SH_Hook(lock, lockAcquire)SH_Hook(unlock, lockRelease)

Annotations

Tracing

(New Debugging Tool)Continuous Function

Evaluation: Track tracing output

after every instruction

(Cutting-edge Tool)Retro-Logging:

Inject new logging into past execution

SledgehammerDeveloper

Inst1 (initialization)Inst2 (BUG)

Introvirt SOSP ‘05

Page 19: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

19

Outline

• Parallelizing a Query• Evaluation

Page 20: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

20

Parallelizing a Query

Analyzer

AnalysisOutput

Execution

TracingTracing Tracing TracingTracingTracing

Tracing

• Epoch parallelism – leverage deterministic replay

Doubleplay (ASPLOS ‘12)

Page 21: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

21

TracingTracing

Parallelizing a Query

Analyzer

AnalysisOutput

Execution

• Unconstrained tracing code causes divergences

TracingTracing

Crash or corrupt output!

Page 22: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

22

Tracing Isolation

Heavyweight Lightweight

Poor Scalability Pin / Valgrind

Good Scalability

Checkpointing(Introvirt SOSP ‘05)

Compiler-basedIsolation

• Prevent tracing code from updating replay state• Requires scalability and lightweight technique

Page 23: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

23

Compiler-based Isolation• Instrument tracing code to log updates• Revert updates using log

Undo-logWriteWrite

WriteWrite

Page 24: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

24

Compiler-based Isolation• Instrument tracing code to log updates• Revert updates using log

Undo-logWriteWriteScalable and Lightweight!

Page 25: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

25

Parallelizing a query

Analyzer

AnalysisOutput

Tracing

Execution

• Multi-tiered approach to parallelize analysis

Tracing Tracing

Tracing

Page 26: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

26

Parallelizing a query

Analyzer

AnalysisOutput

Tracing

Execution

• Multi-tiered approach to parallelize analysis

Tracing Tracing

Analyzer Analyzer

Tracing

Page 27: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

27

Analysis

Tree

• Developers construct…• Local runs on each core in parallel• Stream passes information over chain• Tree combines input from multiple analyzers

Local Local Local

Stream StreamStream

Output

Tracing Tracing Tracing

Page 28: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

28

Analysis

Tree

• Developers construct…• Local runs on each core in parallel• Stream passes information over chain• Tree combines input from multiple analyzers

Local Local Local

Stream StreamStream

Output

Tracing Tracing TracingParallel faster by up to 4x (mean 2x)

Page 29: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

29

Parallelizing a query

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

Tracing

Page 30: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

30

Sledgehammer Tools

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

Tool

• Continuous Function Evaluation: Run after every instruction

• Retro-Timing: Query timing information

• Retro-Logging: Inject new logging code

Page 31: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

31

Continuous Function Evaluation• Logically, execute function after each instruction• Instead, force determinism and trigger on input changes

Valid ValidInvalid

Page 32: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

32

Continuous Function Evaluation• Static instrumention produces input set• Memory page protections track updates to input set

Valid Valid

Input Set

Input Set

Input Set

Invalid

Protect Protect Protect

Page 33: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

33

Outline

• Evaluation

Page 34: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

34

ScenariosName Benchmark Replay

TimeTracing

(millions)1-CoreLatency

1024-CoreLatency

Data Corruption

Nginx 2s 8 5m25s 1s

Wild Store MongoDB 30s 3 2h8m8s 17s

Atomicity Violation

Memcached 1m38s 43 2h10m52s 14s

Memory Leak Nginx 1m16s 4 26m15s 3s

Lock Contention

Memcached 1m33s 76 54m42s 11s

Apache 45605 Apache 51s 2 4m10s 1s

Apache 25520 Apache 60s 4 11m57s 1s

Page 35: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

35

Scalability

1

10

100

1000

1 2 4 8 16 32 64 128 256 512 1024

Nor

mal

ized

Spee

dup

Number of Cores

IdealData corruptionWild storeAtomicity violationMemory leakLock contentionApache 45605Apache 25520

92x 416x

Page 36: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

36

• Why does Sledgehammer not scale perfectly?

• Initialization (tracer injection, restoring epoch start)• Outliers (latency of slowest vs. average epoch)

• More analysis in the paper

Scalability

Page 37: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

37

Cluster-Fueled Debugging

• Accelerate complex queries to interactive latencies

Page 38: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

38

Sledgehammer

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

Tracing

Page 39: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

39

Sledgehammer

Analyzer

AnalysisOutput

Tracing

Execution

Tracing Tracing

Analyzer Analyzer

Tool

Scales to 1024 cores (speedup 416x)Faster than without debugging!

Page 40: Sledgehammer: Cluster-Fueled Debugging...Sledgehammer: Cluster-Fueled Debugging Andrew Quinn, Jason Flinn, and Michael Cafarella. 2 •What was your most challenging bug? Debugging

40

Questions

Poster #13