Carnegie Mellon
Towards Fingerpointing in the Emulab Dynamic Distributed System
Michael P. Kasick, Priya Narasimhan
Carnegie Mellon University
Kevin Atkinson, Jay Lepreau
University of Utah
November 5, 2006
Michael P. Kasick (Carnegie Mellon) Towards Fingerpointing in Emulab November 5, 2006 1 / 37
Introduction to Emulab Classic
University of Utah:
    Flux Research Group
    Network emulation testbed
    1300 users
    430 local nodes
    740 distributed nodes
    In service for 6 years
Emulab’s Experiments
Users upload an experiment configuration (NS file)
Configuration specifies virtual node topology
Users granted full, exclusive access to nodes
Nodes automatically redelegated when experiments go idle
Emulab Software Infrastructure
Off-the-shelf components
    Database, OS, etc.
Custom developed components
    Web interface
    Testbed setup & management
    490,000 lines of code
Swap-* procedures
    swap-in, swap-out, swap-modify
    40+ script invocations
Emulab’s Difficulties
System errors reported to operator mailing list
    Average of 82 failure emails per day (April 2006)
    Swap-* procedures are the largest source of errors
Each mail is 100+ lines long
    Problem is not always obvious
    Many underlying causes
    Only a few errors require operator attention
Example Swap-* Failure
TIMESTAMP: 11:12:38:659154 assign started
assign foo-bar-5987.ptop foo-bar-5987.top
ASSIGN FAILED:
Type precheck passed.
Node mapping precheck:
Node mapping precheck succeeded
Annealing.
Fixed node: Could not map srs-101 to pc106
Trying assign on an empty testbed.
TIMESTAMP: 11:12:40:663660 ptopgen started
ptopargs -p foo -e bar -a
TIMESTAMP: 11:12:41:576498 ptopgen finished
TIMESTAMP: 11:12:41:576719 assign started
assign -n foo-bar-5987.ptop foo-bar-5987.top
Precheck succeeded.
Assign succeeded on an empty testbed.
*** /usr/testbed/libexec/assign_wrapper:
Unretriable error. Giving up.
*** Failed (65) to map to reality.
Recovering virtual and physical state.
Doing a recovery swap-in of old state.
236 line email
111 lines of log
4 errors
1 root cause
Problem Summary
Need to automate swap-* failure analysis
    Isolate errors
    Determine error relevance
Can be done with post-processing analysis
    This problem solved by concurrent work
        Emulab’s new tblog logging mechanism
Can we do better?
Local vs. Global Analysis
Analysis of a single swap-* failure:
    Considers a single, local, error domain
    Scope limits precision of fingerpointing
Concurrent analysis of many swap-* failures:
    Considers the entire, global, error domain
    Error correlation increases precision of fingerpointing
The Bigger Problem
Current error reporting does not facilitate global analysis
We propose a new structured error reporting mechanism that does facilitate global analysis
Outline
1 Introduction
2 Initial Attempt at Fingerpointing
    tblog Logging Mechanism
    Lessons Learned
3 Structured Error Reporting
    Ingredients of a Solution
    Development & Deployment
    Initial Results
4 Summary
tblog Logging Mechanism
Perl module interfaces scripts with an error-log database
Automatically logs diagnostic messages
    stdout, stderr, die(), warn(), etc.
Provides an API for scripts to write messages
Records script execution context
    Time stamp, script invocation #, parent script #, etc.
tblog Analysis
Reconstructs script call-chain
Ascertains most recent error at greatest depth
Flags script and its errors as relevant
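The three analysis steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not tblog's actual Perl implementation: the `LogEntry` fields and the function names are hypothetical stand-ins for the recorded execution context, and "most recent error at greatest depth" is read here as "greatest depth first, ties broken by latest time stamp".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    invocation: int          # script invocation number (hypothetical field)
    parent: Optional[int]    # parent invocation number, None for the root
    script: str
    stamp: float             # time stamp of the message
    error: Optional[str] = None


def depth(entry, by_id):
    """Count ancestors by walking parent links up to the root."""
    d = 0
    while entry.parent is not None:
        entry = by_id[entry.parent]
        d += 1
    return d


def relevant_errors(entries):
    """Return the errors of the script that holds the deepest, latest error."""
    # One record per invocation is enough to follow the parent links.
    by_id = {e.invocation: e for e in entries}
    errors = [e for e in entries if e.error is not None]
    if not errors:
        return []
    deepest = max(errors, key=lambda e: (depth(e, by_id), e.stamp))
    return [e for e in errors if e.invocation == deepest.invocation]


# Toy log modelled on the failure messages shown in these slides.
log = [
    LogEntry(1, None, "swapexp", 0.0),
    LogEntry(2, 1, "tbswap", 1.0),
    LogEntry(3, 2, "assign_wrapper", 2.0),
    LogEntry(4, 3, "assign", 3.0, "Fixed node: Could not map srs-101 to pc106"),
    LogEntry(3, 2, "assign_wrapper", 4.0, "Unretriable error. Giving up."),
    LogEntry(2, 1, "tbswap", 5.0, "Failed (65) to map to reality."),
    LogEntry(1, None, "swapexp", 6.0, "Update aborted; old state restored."),
]
flagged = relevant_errors(log)
```

In this toy log only the `assign` message is flagged; the messages from the parent scripts sit at shallower depths and are treated as secondary.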
tblog Example
Script call chain:
swapexp
    tbswap
        assign_wrapper
            ptopgen, assign, ptopgen, assign

Error messages:
Fixed node: Could not map srs-101 to pc106
*** /usr/testbed/libexec/assign_wrapper:
    Unretriable error. Giving up.
*** Failed (65) to map to reality.
*** /usr/testbed/bin/swapexp:
    Update aborted; old state restored.
Opaque Failure Messages
Designed to be human interpretable
Vague and lacking in context details
    Need context for spatial correlation
    “Unretriable error. Giving up.”
Messages are cumbersome to parse
Different messages may describe the same error
    “Invalid OS FOO in project bar!”
    “[tb-set-node-os] Invalid osid FOO”
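To illustrate why free-form messages are cumbersome to parse, the sketch below maps the two quoted messages onto one error type. It is only an illustration: the error type name `invalid os`, the pattern list, and the `classify` function are assumptions made for this example, not part of Emulab or tblog.

```python
import re

# One hand-written pattern per message variant; every new wording of the
# same error would need another entry here.
PATTERNS = [
    re.compile(r"^Invalid OS (?P<osid>\S+) in project (?P<project>\S+)!$"),
    re.compile(r"^\[tb-set-node-os\] Invalid osid (?P<osid>\S+)$"),
]


def classify(message):
    """Return (error type, context) for a known message, else None."""
    for pattern in PATTERNS:
        match = pattern.match(message)
        if match:
            return "invalid os", match.groupdict()
    return None


first = classify("Invalid OS FOO in project bar!")
second = classify("[tb-set-node-os] Invalid osid FOO")
opaque = classify("Unretriable error. Giving up.")
```

Both variants resolve to the same type, but the second carries no project name, and the vague "Unretriable error" message yields no context at all.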
Absence of Error Context
tblog only captures a general context
    time stamp, script name, etc.
No provision for capturing error-specific context
    nodes, OS images, configuration, etc.
Reporting must preserve the error-specific context
    Required for error correlation
    Facilitates global analysis
Discrete Error Types
Identify distinct errors
Defined by a specification for each error type
    named error type, definition, associated specific context
Node-boot failure example:
    Error type: node boot failed
    Definition: Node failed to boot twice
    Context: node, type, osid
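A specification of this shape could be represented roughly as below. This is a hedged Python sketch, not the prototype Perl module: the `ErrorType` and `ErrorReport` classes are hypothetical, and the context values in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorType:
    name: str             # named error type
    definition: str       # human-readable definition
    context_keys: tuple   # error-specific context that every report must carry


@dataclass
class ErrorReport:
    etype: ErrorType
    context: dict

    def __post_init__(self):
        # Reject reports that omit part of the specified context.
        missing = [k for k in self.etype.context_keys if k not in self.context]
        if missing:
            raise ValueError(f"missing context for {self.etype.name}: {missing}")


NODE_BOOT_FAILED = ErrorType(
    name="node boot failed",
    definition="Node failed to boot twice",
    context_keys=("node", "type", "osid"),
)

# Illustrative values only; the node/type pairing is not taken from real data.
report = ErrorReport(
    NODE_BOOT_FAILED,
    {"node": "pc297", "type": "pc3000", "osid": "cust_os1"},
)
```

Validating the context at report time keeps every instance of an error type comparable, which is what later correlation depends on.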
Error Context & Propagation
Context distinguishes between errors of the same type
    Node boot failures across different nodes
    Node boot failures with different OSes
Propagation centers focus on relevant errors
    Nested scripts should propagate the primary error
    Otherwise parent scripts generate “me-too” errors
    Secondary (“me-too”) errors add noise
    Achievable with exceptions (RPC, middleware)
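Propagation with exceptions might look like the sketch below. It is an assumption-laden Python illustration rather than Emulab code: `ReportedError`, `boot_node`, `setup_nodes`, and `swap_in` are hypothetical names, and the failing node and OS are example values.

```python
class ReportedError(Exception):
    """Carries a structured primary error up the call chain."""

    def __init__(self, etype, **context):
        super().__init__(etype)
        self.etype = etype
        self.context = context


def boot_node(node, osid):
    # Innermost step: raises the primary error with its specific context.
    raise ReportedError("node boot failed", node=node, osid=osid)


def setup_nodes():
    # Intermediate step: no try/except, so the primary error passes through
    # unchanged instead of being replaced by a "me-too" error.
    boot_node("pc297", "cust_os1")


def swap_in(reports):
    # Outermost step: records the primary error exactly once.
    try:
        setup_nodes()
    except ReportedError as err:
        reports.append((err.etype, err.context))
        return False
    return True


reports = []
ok = swap_in(reports)
```

The single recorded entry is the primary error with its context intact; none of the enclosing steps contributes an extra message.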
Research Phase
Used tblog to identify a set of target errors
    Goal was not to obtain 100% coverage
    System functionality is always expanding
    Small portion of possible errors actually observed
Drafted error specifications and error types
    Required significant knowledge of errors and meaning
    Eliminated error ambiguities
    Identified relevant error specific context
Development Phase
Developed a prototype Perl reporting module
    Structured error reporting function
    Error parsers for C++ & TCL language components
Added reporting hooks for the target errors
    Problem: Emulab provides no error propagation
    Nested scripts return success or failure only
    Fix: severity-level assignment
    Alternative: tblog post-processing analysis
Testing & Deployment Phases
Tested prototype in elabinelab
Integrated prototype into tblog framework
    New local analysis engine: tbreport
Deployed on the production Emulab testbed
    750 lines of added or changed code
Initial Results
Data collected August 16-24, 2006
681 swap-* sessions started
    108 (17.3%) reported at least one error
283 total fatal errors reported
    Many errors repeated for each node in a session
    118 unique instances of errors in a given session
Error Statistics
Occurrences      Error Type
 31   26.3%      assign violation/feasible (resource shortage)
 24   20.3%      assign type precheck/feasible (node shortage)
 22   18.6%      node boot failed
 10    8.5%      ns parse failed (bad experiment configuration)
...    ...       ...

Normalized errors (unique in a session) grouped by error type.
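The normalization behind these figures can be sketched as below: per-node repeats within a session collapse into one entry, and shares are then taken over the normalized total. The tuple layout, function names, and toy data are assumptions for illustration; only the counts in the final check (31, 24, 22, and 10 out of 118) come from the slides.

```python
from collections import Counter


def normalize(raw_errors):
    """Collapse per-node repeats: keep one (session, error type) pair."""
    return {(session, etype) for session, etype, _node in raw_errors}


def type_shares(normalized):
    """Map each error type to (count, percentage of all normalized errors)."""
    counts = Counter(etype for _session, etype in normalized)
    total = sum(counts.values())
    return {etype: (n, round(100 * n / total, 1)) for etype, n in counts.items()}


# Toy data: session s1 reports the same error once per failed node.
raw = [
    ("s1", "node boot failed", "pc297"),
    ("s1", "node boot failed", "pc301"),
    ("s2", "node boot failed", "pc297"),
    ("s3", "ns parse failed", None),
]
shares = type_shares(normalize(raw))

# The slide percentages follow from the reported counts over 118 unique errors.
slide_percentages = [round(100 * n / 118, 1) for n in (31, 24, 22, 10)]
```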
Node Shortage Failures
Second most common error (20.3%)
    Insufficient free nodes for experiment swap-in
    Current node availability is listed on website
Illustrates user demand
    48% due to lack of pc3000
Other Resource Shortage Failures
Most common error (26.3%)
Sufficient free nodes to swap-in, but:
    Attempted assignment violated mapping constraints
    Often due to oversubscribed switch bandwidth
Assignment algorithm is non-deterministic
User cannot predict when these errors might occur
Later attempts may succeed w/o topology change
    Frequent resubmissions lead to further errors
Node-Boot Failures
Third most common error (18.6%)
Node status daemon
    Reports boot success
    Timeout results in error
Many underlying causes
    Faulty hardware, broken user-contributed OS, etc.
Motivating scenario for our future research
Node-Boot Failure Example (I)
pc297, cust_os1
Single node, one session
Unknown culprit
Node-Boot Failure Example (II)
pc297, cust_os1
pc301, cust_os1
Two nodes, two sessions, same OS
Suggests bad OS
Node-Boot Failure Example (III)
[Figure: nodes pc297 and pc301; OS images cust_os1 and cust_os2]
Same two nodes, four sessions, different OS
Strongly suggests bad OS
Node-Boot Failures: What’s Next?
Cannot diagnose root cause from a single trace
Operator dilemma:
    Assume node is faulty and quarantine?
    Assume OS is faulty and leave node as is?
Motivates global fingerpointing (future work)
    Correlation of multiple error instances
    Reliably fingerpoints the culprit
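As a rough picture of what correlating multiple error instances could mean, the sketch below applies a deliberately naive heuristic to node-boot failure reports. It is not the authors' method (the slides leave global analysis as future work); the `suspects` function and its "fails in more than one pairing" rule are assumptions made for this example.

```python
from collections import defaultdict


def suspects(failures):
    """Naive correlation over (node, osid) node-boot failure reports.

    An OS image is suspected when it fails on more than one node; a node is
    suspected when it fails with more than one OS image.
    """
    nodes_per_os = defaultdict(set)
    oses_per_node = defaultdict(set)
    for node, osid in failures:
        nodes_per_os[osid].add(node)
        oses_per_node[node].add(osid)
    bad_oses = sorted(o for o, nodes in nodes_per_os.items() if len(nodes) > 1)
    bad_nodes = sorted(n for n, oses in oses_per_node.items() if len(oses) > 1)
    return bad_oses, bad_nodes


# Example (I): one node, one session.
example_1 = suspects([("pc297", "cust_os1")])
# Example (II): two nodes, two sessions, same OS.
example_2 = suspects([("pc297", "cust_os1"), ("pc301", "cust_os1")])
```

A single trace leaves both suspect lists empty, matching the "unknown culprit" case, while two nodes failing with the same image points at the OS.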
Summary
Manual diagnosis of system errors is costly
tblog-style analysis aids in message filtering
Opaque failure messages limit error usefulness
Structured error reports enable global analysis
Global analysis fingerpoints errors with fine granularity
Future work:
    Develop a global analysis engine for Emulab
    Start by targeting the identified node-boot failure scenario
    Target other real-world systems for error analysis
Further Reading
Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, and Jay Lepreau.
Towards fingerpointing in the Emulab dynamic distributed system.
In Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS ’06), Seattle, WA, November 2006.

Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar.
An integrated experimental environment for distributed systems and networks.
In Proceedings of the Fifth Symposium on Operating System Design and Implementation (OSDI ’02), pages 255–270, Boston, MA, December 2002.