Why Recovery Should Be Free, And Often Can Be
![Page 1: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/1.jpg)
Why Recovery Should Be Free, And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat
![Page 2: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/2.jpg)
© 2003 Armando Fox
Recovery Should Be Free, and Can Be
- Already espouse arguments about lowering MTTR:
  - Mitigates impact on service as a whole [Fox & Patterson, 2002]
  - Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002]
  - etc.
- Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)... if we understand it, we can optimize its speed
![Page 3: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/3.jpg)
Real win: Recovery management is hard
- Determining when to recover is hard
  - How to detect that something’s wrong?
  - How do you know when recovery is really necessary? (fail-stutter, etc.)
  - Will recovery make things worse? (cascading recovery)
- Knowing what happens when you recover is hard
  - Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  - What is the effect on online performance? (recovery can be expensive)
  - What if you needlessly “over-recover”? (cost of making a mistake is high)
- If recovery were predictable and fast, it would simplify both failure detection and recovery management.
![Page 4: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/4.jpg)
Simplifying Recovery Management: Crash-Only Software
- Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
- Crash-only component provides PWR switch: stop = crash
  - clean shutdown = loss of power = kernel panic = ...
  - One way to go down, one way to come up: start = recover
- Power switch is external; uniform behavior
  - kill -9, “turning off” (process kill) a VM, pull power cord
  - Intuition: the “infrastructure” supporting the power switch is usually simpler than the applications using it, and common across all those applications
- Can crash-only software actually be built, and if so, how?
  - (a) provide building blocks
  - (b) formalize C/O definition and provide developer
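The “stop = crash, start = recover” invariant can be illustrated with a toy component. This is a minimal sketch, not code from any of the systems named in the talk: the class `CrashOnlyCounter`, its methods, and the in-memory `journal` list (standing in for durable storage outside the component) are all hypothetical.

```python
class CrashOnlyCounter:
    """Hypothetical component whose durable state lives in an external journal."""

    def __init__(self, journal):
        self.journal = journal  # stands in for durable storage outside the component
        self.total = None       # volatile state; None means "not running"

    def start(self):
        # start = recover: the only way up is to rebuild state from the journal.
        self.total = sum(self.journal)

    def add(self, n):
        if self.total is None:
            raise RuntimeError("component is down")
        self.journal.append(n)  # make the update durable first
        self.total += n         # then apply it to volatile state
        return self.total

    def crash(self):
        # stop = crash: no cleanup and no flush; volatile state simply vanishes.
        self.total = None


journal = []
c = CrashOnlyCounter(journal)
c.start()
c.add(2)
c.add(3)
c.crash()       # the external "power switch"
c.start()       # same code path as the very first start
print(c.total)  # 5
```

Because there is no separate shutdown path, the recovery code in `start` runs on every start, not only after rare failures.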
![Page 5: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/5.jpg)
Crash-only Building Blocks
- JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]
  - Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  - Significantly improves performability relative to whole-app redeploy
- SSM: a CO session state manager [Ling, Fox, AMS 2003]
- DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]
  - Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
- Common features of both SSM and DStore:
  - Redundancy used for persistence
  - Workload semantics exploited to simplify consistency model & recovery
  - Recovery = restart, safe to reboot any node at any time
  - Safe to coerce any failure to a crash (fail-stop) at any time
![Page 6: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/6.jpg)
Building blocks, cont.
- Pinpoint, statistical-anomaly-based failure detection
  - Standard tension: accuracy vs. precision (false positives problem)
  - Different clustering techniques seem to be good at detecting different kinds of problems
  - Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
  - Mostly integrated with JAGR and SSM
  - On burner: discussions with BEA Systems for integrating into WebLogic Server
- Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
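To make the character-frequency idea concrete, here is one possible way to compare a response against a known-good page. The slides do not say how the CS241 project built or compared its histograms, so the L1 distance used here, the function names, and the sample pages are all assumptions made for illustration.

```python
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def histogram_distance(a, b):
    """L1 distance between the character histograms of two strings (0.0 to 2.0)."""
    ha, hb = char_histogram(a), char_histogram(b)
    return sum(abs(ha.get(ch, 0.0) - hb.get(ch, 0.0)) for ch in set(ha) | set(hb))


# Hypothetical pages: a known-good response, a similar good response, a failure.
ok = "<html><body><h1>Your cart</h1><p>3 items, total $42.00</p></body></html>"
ok2 = "<html><body><h1>Your cart</h1><p>5 items, total $97.50</p></body></html>"
err = "java.lang.NullPointerException\n\tat com.example.Cart.render(Cart.java:17)"

print(histogram_distance(ok, ok2) < histogram_distance(ok, err))  # True
```

A detector built this way would flag responses whose distance from the reference exceeds some threshold. Choosing that threshold is where the false-positive tension shows up, which is why a low cost of over-recovery matters.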
![Page 7: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/7.jpg)
Toward a crash-only formalism
- Component frameworks force you into certain app-writing patterns
  - Inter-EJB calls through runtime-managed level of indirection
  - Restrictions on how persistent state mgt can be expressed
  - Restrictions on state sharing: difficult to do without using explicit external store
  - Hypothesis: these are the elements that allow C/O to work
- Ongoing work: formalize crash-only SW
  - One possibility: observational equivalence with respect to a request stream
  - Can be expressed using a design pattern or denotational semantics
  - Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-only
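One possible reading of observational equivalence with respect to a request stream is sketched below: a component passes if injecting a crash and restart before any request leaves the sequence of responses unchanged. The formalism is described as ongoing work, so this is only an illustration; the helper names, the two toy key-value components, and the choice to inject crashes only between requests (never mid-request) are assumptions.

```python
def responses(make_component, requests, crash_before=None):
    """Run a request stream, optionally crashing and restarting before one request."""
    comp = make_component()
    comp.start()
    out = []
    for i, req in enumerate(requests):
        if i == crash_before:
            comp.crash()
            comp.start()
        out.append(comp.handle(req))
    return out


def looks_crash_only(make_component, requests):
    """True if no single injected crash changes the observable responses."""
    baseline = responses(make_component, requests)
    return all(
        responses(make_component, requests, crash_before=i) == baseline
        for i in range(len(requests))
    )


class DurableKV:
    """Hypothetical store: writes go to an external dict, restart rebuilds the cache."""

    def __init__(self, store):
        self.store = store

    def start(self):
        self.cache = dict(self.store)  # start = recover from the external store

    def crash(self):
        self.cache = None              # volatile state is lost

    def handle(self, req):
        op, key, *val = req
        if op == "put":
            self.store[key] = val[0]
            self.cache[key] = val[0]
            return "ok"
        return self.cache.get(key)


class VolatileKV(DurableKV):
    """Same interface, but restart forgets everything, so it is not crash-only."""

    def start(self):
        self.cache = {}


stream = [("put", "k", 1), ("get", "k"), ("put", "k", 2), ("get", "k")]
print(looks_crash_only(lambda: DurableKV({}), stream))   # True
print(looks_crash_only(lambda: VolatileKV({}), stream))  # False
```

A check like this only samples one request stream, so passing it is evidence rather than proof.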
![Page 8: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/8.jpg)
Summary: Toward a Crash-only World
- Goal: simplify recovery management
  - diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  - recovery: crash-only enforces invariants about what happens when recovery is attempted
  - allows aggressive use of fault model enforcement [Martin et al 2002]
- Good progress on providing building blocks for app writers
  - JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  - SSM: a crash-only session state store (in process of integrating with JAGR)
  - DStore: a crash-only persistent single-key store
  - Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)
![Page 9: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/9.jpg)
Xie et al: MTTR and End-User Availability
- Let A_U = user-perceived unavailability, A_S = system unavailability
- Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability
  - When retries are sufficiently frequent, A_U approaches A_S (for a system that is 99.3% available, this threshold is 200-300 sec)
- Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
- Finding: Given 2 systems with same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
- Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
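The phrase “same A_S, shorter MTTR, lower MTTF” follows from the standard steady-state relation availability = MTTF / (MTTF + MTTR). The sketch below only works through that arithmetic; it does not reproduce the Markov model of user retries, and the two MTTR values (30 s and 300 s) are hypothetical examples.

```python
def availability(mttf, mttr):
    """Steady-state availability of a system that alternates between up and down."""
    return mttf / (mttf + mttr)


def mttf_for(target_availability, mttr):
    """MTTF needed to hit a target availability at a given MTTR."""
    return mttr * target_availability / (1.0 - target_availability)


# Two hypothetical systems with the same 99.3% availability.
for mttr in (30.0, 300.0):
    mttf = mttf_for(0.993, mttr)
    print(f"MTTR={mttr:.0f}s  MTTF={mttf:.0f}s  availability={availability(mttf, mttr):.3f}")
```

At fixed availability, MTTF scales linearly with MTTR, so the system that recovers ten times faster also fails ten times as often.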
![Page 10: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/10.jpg)
User perceived unavailability vs retry rate
[Figure annotation] Beyond the “sweet spot”, higher user retry rates yield little improvement in perceived availability.
![Page 11: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/11.jpg)
Surprise! MTTF eventually catches up with you
- Variable MTTR, but fixed system availability (low MTTR -> low MTTF)
- “Sweet spot”: at low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!
![Page 12: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/12.jpg)
Optimization Choices
[Figure: labels “Fixed MTTF” and “Fixed MTTR”; “System Unavailability” and “User Perceived Unavailability”]
![Page 13: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/13.jpg)
Results Summary
- We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates yield little benefit.
- For two systems of a given availability, the one with lower MTTR does not always yield better user-perceived availability.
- For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.
![Page 14: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/14.jpg)
“Clean” shutdown vs. restart?
- Impractical to guarantee zero crashes, so robust systems must be crash-safe anyway
- In that case, why support any other kind of shutdown?
  - Historically, for performance (avoid synchronous writes, do buffering/caching, etc.); this leads to replicated/mirrored state, more code, special recovery code paths...
- Crash-only software must: (a) be crash-safe & (b) recover quickly
- Total recovery time may be shorter even if crash is forced
  - WinXP can be (mostly) crash-rebooted for upgrades
  - VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)
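As one small illustration of what “crash-safe” can mean for a single piece of persistent state, the sketch below uses the common write-to-temporary-file, `fsync`, then atomic-rename idiom. This technique is not taken from the slides; the function name `atomic_write` and the file name in the usage lines are hypothetical, and the sketch does not sync the containing directory, which some filesystems need for the rename itself to be durable.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace the file at path so that a crash leaves either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # the synchronous write that buffering tries to avoid
        os.replace(tmp, path)      # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)         # do not leave a half-written temp file behind
        raise


workdir = tempfile.mkdtemp()
state_path = os.path.join(workdir, "state.bin")
atomic_write(state_path, b"v1")
atomic_write(state_path, b"v2")
```

The cost is exactly the one the slide names: each update pays for a synchronous write. In exchange, no separate shutdown step is needed to leave the file in a usable state.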
![Page 15: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/15.jpg)
Why Crash-Only Simplifies Recovery
- “Hardware works, software doesn’t”
  - Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  - Crash-only PWR switch is a way to approach that same property for software
- Crash-only makes recovery policies easier to reason about
  - Opportunity to aggressively apply SW rejuvenation
  - “Recovery” code exercised on every restart; no exotic-but-rarely-used code paths
  - “Over-recovery” may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as a normal aspect of operation
![Page 16: Why Recovery Should Be Free, And Often Can Be](https://reader035.fdocuments.net/reader035/viewer/2022081512/5681683b550346895dde06d7/html5/thumbnails/16.jpg)
Towards a Crash-Only World
- Existing software that is crash-only or near-crash-only
  - Stateless apps: most Web servers
  - Most RDBMS’s: crash-safe, but long recovery
  - Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main codepath
  - Some appliance storage devices: separate but pretty fast recovery path
- Our goals...
  - Focus on Internet (“3 tier”) applications; already “crash-mostly” except for persistence tier(s)
  - Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  - Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  - Quantify improvement (we hope!) in performability resulting from these changes
  - By doing it in the middleware, any app on that middleware can benefit