Why Recovery Should Be Free, And Often Can Be

Armando Fox, Stanford University

June 2003 ROC Retreat

© 2003 Armando Fox

Recovery Should Be Free, and Can Be

- Already espouse arguments about lowering MTTR:
  - Mitigates impact on service as a whole [Fox & Patterson, 2002]
  - Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002]
  - etc.
- Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)... if we understand it, we can optimize its speed

Real win: Recovery management is hard

- Determining when to recover is hard
  - How to detect that something’s wrong?
  - How do you know when recovery is really necessary? (fail-stutter, etc.)
  - Will recovery make things worse? (cascading recovery)
- Knowing what happens when you recover is hard
  - Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  - What is the effect on online performance? (recovery can be expensive)
  - What if you needlessly “over-recover”? (cost of making a mistake is high)

If recovery were predictable and fast, it would simplify both failure detection and recovery management.

Simplifying Recovery Management: Crash-Only Software

- Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
- Crash-only component provides PWR switch: stop = crash: clean shutdown = loss of power = kernel panic = ...
  - One way to go down -> one way to come up: start = recover
- Power switch is external -> uniform behavior
  - kill -9, “turning off” (process kill) a VM, pull power cord
  - Intuition: the “infrastructure” supporting the power switch is usually simpler than the applications using it, and common across all those applications
- Can crash-only software actually be built, and if so, how?
  - (a) provide building blocks
  - (b) formalize C/O definition and provide developer
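To make "stop = crash, start = recover" concrete, here is a minimal Python sketch of a component with exactly one way to come up. The class name `CrashOnlyStore`, the JSON-lines log format, and the per-write `fsync` are all illustrative assumptions, not something taken from the talk or from any of the systems it cites. The component has no shutdown method: stopping it means abandoning the process, and every start replays the log.

```python
import json
import os


class CrashOnlyStore:
    """Hypothetical key-value store whose only startup path is recovery."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()  # start = recover; there is no other way to come up
        self._log = open(self.log_path, "ab")

    def _recover(self):
        """Replay the append-only log and cut off any torn final record."""
        if not os.path.exists(self.log_path):
            return
        good_bytes = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # record was cut short by a crash mid-write
                try:
                    key, value = json.loads(raw.decode("utf-8"))
                except (ValueError, TypeError):
                    break  # unreadable record: stop replaying here
                self.data[key] = value
                good_bytes += len(raw)
        # Drop everything after the last complete record so that new
        # appends never get glued onto a partial line.
        os.truncate(self.log_path, good_bytes)

    def put(self, key, value):
        """Make the update durable before acknowledging it."""
        record = json.dumps([key, value]) + "\n"
        self._log.write(record.encode("utf-8"))
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)
```

Because recovery runs on every start, a `kill -9` followed by constructing a new `CrashOnlyStore` on the same path exercises the same code as an ordinary restart. The synchronous write in `put` is the price paid for that simplicity in this sketch.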

Crash-only Building Blocks

- JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]
  - Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  - Significantly improves performability relative to whole-app redeploy
- SSM: a CO session state manager [Ling, Fox, AMS 2003]
- DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]
  - Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
- Common features of both SSM and DStore:
  - Redundancy used for persistence
  - Workload semantics exploited to simplify consistency model & recovery
  - Recovery = restart, safe to reboot any node at any time
  - Safe to coerce any failure to a crash (fail-stop) at any time
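The shared features listed above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the SSM or DStore protocol: the names `Brick` and `RedundantStore`, the write threshold `w`, and the single global version counter are all hypothetical. It shows only the general idea that, when state lives on several replicas and each write replaces the whole value, a rebooted replica has nothing to repair.

```python
class Brick:
    """Hypothetical in-memory replica; it holds no durable state of its own."""

    def __init__(self):
        self.mem = {}
        self.up = True

    def crash(self):
        self.mem = {}   # any failure is coerced to a crash: memory is lost
        self.up = False

    def reboot(self):
        self.mem = {}   # recovery = restart, with nothing to replay or repair
        self.up = True

    def write(self, key, version, value):
        if not self.up:
            return False
        self.mem[key] = (version, value)
        return True

    def read(self, key):
        return self.mem.get(key) if self.up else None


class RedundantStore:
    """Persistence through redundancy: a write must reach at least w bricks."""

    def __init__(self, n=3, w=2):
        self.bricks = [Brick() for _ in range(n)]
        self.w = w
        self.version = 0

    def put(self, key, value):
        # Assumed workload semantics: each write replaces the whole value,
        # so there is no partial update to reconcile after a failure.
        self.version += 1
        acks = sum(b.write(key, self.version, value) for b in self.bricks)
        if acks < self.w:
            raise RuntimeError("too few replicas acknowledged the write")

    def get(self, key):
        replies = [r for r in (b.read(key) for b in self.bricks) if r is not None]
        if not replies:
            raise KeyError(key)
        # The newest version wins; older copies are simply ignored.
        return max(replies, key=lambda r: r[0])[1]
```

For example, after `store.put("sess", {"cart": [1]})`, calling `store.bricks[0].reboot()` leaves `store.get("sess")` unchanged, because the other bricks still hold the value. A real system would also need to re-replicate data onto rebooted nodes, which this sketch leaves to the next write.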

Building blocks, cont.

- Pinpoint, statistical-anomaly-based failure detection
  - Standard tension: accuracy vs. precision (false positives problem)
  - Different clustering techniques seem to be good at detecting different kinds of problems
  - Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
  - Mostly integrated with JAGR and SSM
  - On burner: discussions with BEA Systems for integrating into WebLogic Server
- Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
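The character-frequency idea can be sketched in a few lines of Python. The details of the CS241 project are not given in the slides, so everything here is an assumption made for illustration: the function names, the use of total variation distance, and the `threshold` value of 0.3. The sketch builds a baseline histogram from known-good responses and flags a response whose character distribution drifts too far from it.

```python
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def distance(h1, h2):
    """Total variation distance between two histograms, between 0 and 1."""
    chars = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in chars)


def build_baseline(good_pages):
    """Pool known-good responses into one reference histogram."""
    return char_histogram("".join(good_pages))


def looks_failed(page, baseline, threshold=0.3):
    """Flag a response whose character mix is far from the baseline.

    The threshold is an arbitrary illustrative value; a real deployment
    would have to tune it against observed false positives.
    """
    return distance(char_histogram(page), baseline) > threshold
```

The detector knows nothing about the application: a stack trace or an empty body simply has a different character mix than a normal HTML page. It will sometimes flag healthy responses, which is tolerable only if the recovery it triggers is cheap.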

Toward a crash-only formalism

- Component frameworks force you into certain app-writing patterns
  - Inter-EJB calls through runtime-managed level of indirection
  - Restrictions on how persistent state mgt can be expressed
  - Restrictions on state sharing: difficult to do without using explicit external store
  - Hypothesis: these are the elements that allow C/O to work
- Ongoing work: formalize crash-only SW
  - One possibility: observational equivalence with respect to a request stream
  - Can be expressed using a design pattern or denotational semantics
  - Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-only
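One way to read "observational equivalence with respect to a request stream" is as a testable property: the responses a client sees should not depend on whether the component was crashed and restarted along the way. The Python sketch below is an informal illustration of that reading, not the formalism the talk has in mind. The two counter classes, the `run` harness, and the restriction to crashes between requests are all assumptions.

```python
class ExternalStateCounter:
    """Keeps its running total in an external store that outlives restarts."""

    def __init__(self, store):
        self.store = store

    def handle(self, request):
        self.store["n"] = self.store.get("n", 0) + request
        return self.store["n"]


class InMemoryCounter:
    """Keeps its running total inside the instance, so a crash loses it."""

    def __init__(self, store):
        self.n = 0

    def handle(self, request):
        self.n += request
        return self.n


def run(stream, component_cls, crash_before=None):
    """Feed the stream to a component, optionally crashing it once."""
    store = {}  # stands in for an external store that survives restarts
    component = component_cls(store)
    responses = []
    for i, request in enumerate(stream):
        if i == crash_before:
            # Crash + restart: discard the instance and construct a new one.
            component = component_cls(store)
        responses.append(component.handle(request))
    return responses


def crash_equivalent(stream, component_cls):
    """True if a crash before any request leaves the responses unchanged."""
    reference = run(stream, component_cls)
    return all(
        run(stream, component_cls, crash_before=i) == reference
        for i in range(len(stream))
    )
```

With the stream `[1, 2, 3]`, `crash_equivalent` holds for `ExternalStateCounter` and fails for `InMemoryCounter`, which matches the hypothesis that pushing state into an explicit external store is what lets crash-only work. Crashes in the middle of a request are outside what this harness checks.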

Summary: Toward a Crash-only World

- Goal: simplify recovery management
  - diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  - recovery: crash-only enforces invariants about what happens when recovery is attempted
  - allows aggressive use of fault model enforcement [Martin et al 2002]
- Good progress on providing building blocks for app writers
  - JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  - SSM: a crash-only session state store (in process of integrating with JAGR)
  - DStore: a crash-only persistent single-key store
  - PinPoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)

Xie et al: MTTR and End-User Availability

- Let A_U = user-perceived unavailability, A_S = system unavailability
- Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability
  - When retry rate is sufficiently frequent, A_U approaches A_S (for A_S = 99.3%, this threshold is 200-300 sec)
- Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
- Finding: Given 2 systems with same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
- Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
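The intuition behind the finding can be checked with a much simpler model than the one Xie et al. solve numerically. The Python sketch below assumes a two-state continuous-time Markov chain (up and down, with exponential failure and repair times) and a user who retries exactly once after a fixed wait. These modeling choices and all function names are assumptions made for illustration. It uses the standard results that steady-state availability is MTTF / (MTTF + MTTR) and that a system found down is still down `wait` seconds later with probability U + (1 - U) * exp(-(lam + mu) * wait), where U is the steady-state unavailability, lam = 1/MTTF and mu = 1/MTTR.

```python
import math


def mttf_for(availability, mttr):
    """MTTF that yields the given steady-state availability at this MTTR."""
    # availability = MTTF / (MTTF + MTTR), solved for MTTF
    return availability * mttr / (1.0 - availability)


def p_still_down(mttf, mttr, wait):
    """Probability the system is down at time `wait`, given it is down now."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    unavailability = lam / (lam + mu)
    return unavailability + (1.0 - unavailability) * math.exp(-(lam + mu) * wait)


def perceived_unavailability(mttf, mttr, wait):
    """Probability that a request and its single retry both fail."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    unavailability = lam / (lam + mu)
    return unavailability * p_still_down(mttf, mttr, wait)
```

Comparing two systems with the same 0.993 availability, one with an MTTR of 60 seconds and one with 600 seconds, and a retry after 30 seconds, `perceived_unavailability` comes out lower for the 60-second system. This toy model cannot speak to cases where a lower MTTF starts to hurt; it only shows why faster recovery makes a retry more likely to succeed.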

User perceived unavailability vs retry rate

[Figure annotation] “sweet spot”: higher user retry rates yield little improvement in perceived availability.

Surprise! MTTF eventually catches up with you

[Figure annotations] Variable MTTR, but fixed system availability (low MTTR -> low MTTF)

“sweet spot”: At low MTTR, lowering MTTR and MTTF at the same time results in worse user perceived unavailability!

Optimization Choices

[Figure labels: Fixed MTTF; Fixed MTTR; System Unavailability; User Perceived Unavailability]

Results Summary

- We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates yield little benefit.
- For two systems of a given availability, the one with lower MTTR does not always yield better user perceived availability.
- For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.

“Clean” shutdown vs. restart?

- Impractical to guarantee zero crashes -> robust systems must be crash-safe anyway
- In that case, why support any other kind of shutdown?
  - Historically, for performance (avoid synchronous writes, do buffering/caching, etc) - leads to replicated/mirrored state, more code, special recovery code paths...
- Crash-only software must: (a) be crash-safe & (b) recover quickly
- Total recovery time may be shorter even if crash is forced
  - WinXP can be (mostly) crash-rebooted for upgrades
  - VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)

Why Crash-Only Simplifies Recovery

- “Hardware works, software doesn’t”
  - Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  - Crash-only PWR switch is a way to approach that same property for software
- Crash-only makes recovery policies easier to reason about
  - Opportunity to aggressively apply SW rejuvenation
  - “Recovery” code exercised on every restart; no exotic-but-rarely-used code paths
  - “Over-recovery” may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as normal aspect of operation
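With an external power switch, rejuvenation and failure recovery collapse into the same operation. The Python sketch below illustrates this with the standard `subprocess` module. The class name `PowerSwitch` and its methods are hypothetical, and the sketch assumes the supervised program is itself crash-only, so that killing it at an arbitrary moment is safe.

```python
import subprocess
import sys


class PowerSwitch:
    """Hypothetical external supervisor: the only stop it offers is a crash."""

    def __init__(self, argv):
        self.argv = argv
        self.proc = None

    def start(self):
        # Start = recover: the component runs its one startup path.
        self.proc = subprocess.Popen(self.argv)

    def crash(self):
        # Popen.kill() sends SIGKILL on POSIX, so the component gets no
        # chance to run shutdown code; the behavior is uniform by design.
        if self.proc is not None and self.proc.poll() is None:
            self.proc.kill()
            self.proc.wait()

    def restart(self):
        """Used identically for failure recovery and proactive rejuvenation."""
        self.crash()
        self.start()


if __name__ == "__main__":
    # Illustrative stand-in for a long-running crash-only component.
    switch = PowerSwitch([sys.executable, "-c", "import time; time.sleep(60)"])
    switch.start()
    switch.restart()   # rejuvenate: same path a failure detector would take
    switch.crash()
```

A failure detector and a rejuvenation timer can both call `restart` without coordinating, since neither has to reason about what state the component is in when the switch is thrown.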

Towards a Crash-Only World

- Existing software that is crash-only or near-crash-only
  - Stateless apps: most Web servers
  - Most RDBMS’s: crash-safe, but long recovery
  - Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main codepath
  - Some appliance storage devices: separate but pretty fast recovery path
- Our goals...
  - Focus on Internet (“3 tier”) applications; already “crash-mostly” except for persistence tier(s)
  - Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  - Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  - Quantify improvement (we hope!) in performability resulting from these changes
  - By doing it in the middleware, any app on that middleware can benefit
