1
A Theory of Redo Recovery
David Lomet, Microsoft Research, Redmond
Mark Tuttle, HP Research, Cambridge
2
Big Picture
Redo recovery requires
  A good db state
  Replay of the right operations
Good state updates:
  Conflict order is not required
  Write-read conflicts can be ignored
  Some db "variables" are irrelevant (no need to update them)
Synchronize state update and ops replayed
  Captured in the recovery invariant
  We prove that maintaining the invariant implies recovery
Current recovery methods maintain the invariant
  Show how current methods work (e.g. ARIES redo)
  Show how "new" methods could work
Much simpler than our VLDB'95 paper
3
Conflict State Graph (CSG)
Conflict graph ("borrowed" from concurrency control)
  Nodes are log operations; edges are conflicts (RW, WR, WW)
State graph SG
  Add writes(node): {<name, value>, …} of the variables updated
  State for SG: {<x,v> | <x,v> in writes(n) and n is the last node in the state graph with x in vars(n)}
Final state S_final of the CSG is the desired recovered state
Any prefix of a state graph is a state graph
  Prefix: node in prefix implies predecessor in prefix
State of any prefix of the CSG can be recovered by replaying the operations in the suffix in conflict graph order
We will relax CSG requirements
4
Conflict State Graph & States
[Figure: conflict state graph over three operations, with the state of each prefix]
  O: readset{x}, writes{<x,1>}
  P: readset{x}, writes{<y,2>}
  Q: readset{x}, writes{<x,3>}
  Prefix states: x=0,y=0; x=1,y=0; x=1,y=2; S_final: x=3,y=2
  Edges: O to P is a write-read edge; P to Q is a read-write edge; O to Q is a write-read & write-write & read-write edge
5
Installation Graph
Example:
  Initial stable state: {<x,0>,<y,0>}
  O: x ← x+1
  P: y ← x+1
  After O, P the state is {<x,1>,<y,2>}
  Flush y (written by P) to disk: stable state is {<x,0>,<y,2>}
  Replay O: generates the correct state {<x,1>,<y,2>}
    O's readset x is unchanged by P's installation
    Even though the write-read edge orders P after O
Installation graph: conflict graph without write-read edges
Installation state graph (ISG):
  Same writes(n) for node n as the conflict state graph
  State of any prefix of the ISG can be recovered
  More prefixes (states) because of fewer edges
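The flush-then-replay example above can be traced in a few lines. This is a minimal sketch, assuming states are plain dictionaries and operations are small functions; the names `OPS` and `replay` are ours, not the paper's.

```python
# Hypothetical encoding of the slide's two operations (illustration only).
OPS = {
    "O": lambda s: {**s, "x": s["x"] + 1},   # O: x <- x + 1
    "P": lambda s: {**s, "y": s["x"] + 1},   # P: y <- x + 1
}

def replay(state, op_names):
    """Apply the named operations, in the given order, to a copy of state."""
    for name in op_names:
        state = OPS[name](state)
    return state

initial = {"x": 0, "y": 0}
final = replay(initial, ["O", "P"])     # volatile state after O, P
stable = {"x": 0, "y": final["y"]}      # only y (written by P) reached disk
recovered = replay(stable, ["O"])       # redo O alone against the stable state

print(final, stable, recovered)
```

Replaying O alone reproduces the final state because installing P's write to y does not change x, the only variable O reads.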
6
Installation State Graph & States
[Figure: installation state graph for O, P, Q, with the state of each prefix]
  O: readset{x}, writes{<x,1>}
  P: readset{x}, writes{<y,2>}
  Q: readset{x}, writes{<x,3>}
  Prefix states: x=0,y=0; x=1,y=0; x=0,y=2 (ISG recoverable state); x=1,y=2; x=3,y=2
  Edges: write-read edge removed; read-write edge retained; write-write & read-write edge retained
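To see why dropping write-read edges yields more recoverable states, the prefixes of both graphs can be enumerated directly. The sketch below assumes the three-operation example from the slides; the edge sets are derived from the readsets and writes shown, and the helper names (`prefixes`, `state_of`) are hypothetical.

```python
from itertools import combinations

# Writes of each operation, taken from the three-operation example.
WRITES = {"O": {"x": 1}, "P": {"y": 2}, "Q": {"x": 3}}
LOG_ORDER = ["O", "P", "Q"]

# Conflict edges: O->P (write-read), O->Q (WR, WW, RW), P->Q (read-write).
CONFLICT_EDGES = {("O", "P"), ("O", "Q"), ("P", "Q")}
# The installation graph drops the edge that is write-read only.
INSTALL_EDGES = {("O", "Q"), ("P", "Q")}

def prefixes(edges):
    """All node sets that contain the predecessors of each of their nodes."""
    found = []
    for size in range(len(LOG_ORDER) + 1):
        for subset in combinations(LOG_ORDER, size):
            chosen = set(subset)
            if all(a in chosen for a, b in edges if b in chosen):
                found.append(subset)
    return found

def state_of(prefix, initial):
    """State of a prefix: the last write to each variable by a prefix node."""
    state = dict(initial)
    for name in LOG_ORDER:
        if name in prefix:
            state.update(WRITES[name])
    return state

initial = {"x": 0, "y": 0}
csg_states = [state_of(p, initial) for p in prefixes(CONFLICT_EDGES)]
isg_states = [state_of(p, initial) for p in prefixes(INSTALL_EDGES)]
print(len(csg_states), len(isg_states))
```

The extra installation-graph prefix is {P}, whose state x=0, y=2 is the one reached by flushing y first.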
7
Exposed Variables
Example:
  O1: x ← z+1
  O2: x ← 25
  After O2, we don't care about the x value of O1
Variable x is unexposed after ops I ({O1} here) if the minimum op (in conflict order) in Ops(log) - I writes x without reading it
  x's value is a "don't care" when x is unexposed
  This is an example of physical logging
A prefix of the installation graph explains state S if the values of the exposed variables in S are the same as their values in the state of that prefix of the ISG
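The exposed-variable test and the "explains" relation can be sketched as follows. This is an illustration under stated assumptions: the log order is taken to be consistent with conflict order, a variable that no remaining operation touches is treated as exposed, and the names `is_exposed` and `explains` are ours.

```python
# Each log entry: (name, readset, writeset). Hypothetical encoding of the example.
LOG = [
    ("O1", {"z"}, {"x"}),    # O1: x <- z + 1
    ("O2", set(), {"x"}),    # O2: x <- 25 (blind write)
]

def is_exposed(var, installed, log):
    """A variable is unexposed if the first uninstalled op touching it
    writes it without reading it; otherwise its current value matters."""
    for name, reads, writes in log:
        if name in installed:
            continue
        if var in reads:
            return True
        if var in writes:
            return False
    return True   # nothing left overwrites it, so its value must be right

def explains(prefix_state, state, installed, log):
    """The prefix explains `state` if the two agree on every exposed variable."""
    return all(
        state[var] == prefix_state[var]
        for var in prefix_state
        if is_exposed(var, installed, log)
    )

# With O1 installed and z = 4, the prefix state has x = 5; any x value is acceptable.
ok = explains({"x": 5, "z": 4}, {"x": 999, "z": 4}, {"O1"}, LOG)
bad = explains({"x": 5, "z": 4}, {"x": 5, "z": 0}, {"O1"}, LOG)
print(ok, bad)
```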
8
Potentially Recoverable State
Potentially recoverable state: a state from which the replay of a subset of the operations of the conflict graph, in conflict order, will produce the recovered state S_final
Theorem: If S is a state explained by a prefix of the installation graph, then S is potentially recoverable
9
REDO Test & Recovery Procedure
REDO: tests ops in conflict order during the log scan
  Yes (true): replay operation
  No (false): bypass operation
redo_set = {O | REDO(O..) & O on scanned log}
Recover procedure:
  Set log scan point to "checkpoint"
  while not at log end
    O ← current log operation
    State = if REDO(O, State, Log, Analysis) then O(State) else State
    Advance log scan point to next operation
  end
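The recover procedure translates almost line for line into code. The sketch below is not the paper's definition: it simplifies the REDO test to a function of the operation and the state, uses a hypothetical "already installed" set as the test, and assumes Q is x ← x+2 (the slides give only the value 3 it writes).

```python
def recover(state, log, checkpoint, redo):
    """Scan the log from the checkpoint; replay each op whose REDO test is true."""
    redo_set = []
    for position in range(checkpoint, len(log)):
        op = log[position]                  # current log operation
        if redo(op, state):                 # yes: replay
            state = op["apply"](state)
            redo_set.append(op["name"])
        # no: bypass the operation and move to the next log record
    return state, redo_set

LOG = [
    {"name": "O", "apply": lambda s: {**s, "x": s["x"] + 1}},
    {"name": "P", "apply": lambda s: {**s, "y": s["x"] + 1}},
    {"name": "Q", "apply": lambda s: {**s, "x": s["x"] + 2}},   # assumed form of Q
]

INSTALLED = {"P"}                           # hypothetical: P's write is on disk

def redo(op, state):
    return op["name"] not in INSTALLED

state, redo_set = recover({"x": 0, "y": 2}, LOG, 0, redo)
print(state, redo_set)
```

Here ops(Log) - redo_set is {P}, which is a prefix of the installation graph and explains the stable state x=0, y=2.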
10
Recovery
Recoverable system: a system with
  A potentially recoverable state S_pot
  Replay of the O's in redo_set from S_pot produces S_final
Inv: ops(Log) - redo_set defines a prefix of the installation state graph that explains State
  Every system change must be an atomic transition maintaining Inv
Corollary: Given a state, log, checkpoint, and an execution of Recover (identifying redo_set), if Inv holds then the system is recoverable
  Only a specific potentially recoverable state is recoverable
11
Write Graph
Write graph: start from the installation state graph
  Collapse a set of nodes (keeping the graph acyclic): merges the nodes
  Add a new node for the next operation
  Add an edge (collapse cycles)
  Remove a write of an unexposed variable
    We do not care about the values of unexposed variables
Write graph captures the entire system state
  Prefix that is stable
  Suffix in cache
Cache manager uses the write graph
  To maintain a potentially recoverable state
  Usually by collapsing a suffix node into the stable prefix
12
Write Graph (via Node Collapse): Fewer States
[Figure: write graph obtained from the installation graph by collapsing nodes]
  O: readset{x}, writes{<x,1>}
  P: readset{x}, writes{<y,2>}
  Q: readset{x}, writes{<x,3>}
  Collapsed node n: Ops(n) = {O,Q}, Writes(n) = {<x,3>}
  States shown: x=0,y=0; x=0,y=2; x=1,y=0; x=1,y=2; x=3,y=2
  Based on the installation graph (write-read edge removed)
  Write graph remains acyclic
  Keep only one version of each variable in cache
  Retained read-write edge translates to a flush order for the cache manager
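Node collapse can be illustrated on the same three operations. This is a sketch under an assumption: the collapsed node merges the two operations that write x (O and Q), which is what Writes(n) = {<x,3>} suggests. The `collapse` helper and the graph encoding are hypothetical.

```python
NODES = {
    "O": {"ops": {"O"}, "writes": {"x": 1}},
    "P": {"ops": {"P"}, "writes": {"y": 2}},
    "Q": {"ops": {"Q"}, "writes": {"x": 3}},
}
INSTALL_EDGES = {("O", "Q"), ("P", "Q")}
CONFLICT_EDGES = INSTALL_EDGES | {("O", "P")}    # adds the write-read edge

def collapse(nodes, edges, merged, new_name):
    """Merge the nodes listed in `merged` (in log order) into one node."""
    ops, writes = set(), {}
    for name in merged:
        ops |= nodes[name]["ops"]
        writes.update(nodes[name]["writes"])     # later version replaces earlier
    new_nodes = {k: v for k, v in nodes.items() if k not in merged}
    new_nodes[new_name] = {"ops": ops, "writes": writes}

    def rename(k):
        return new_name if k in merged else k

    # Edges inside the merged set disappear; all others are redirected.
    new_edges = {(rename(a), rename(b)) for a, b in edges if rename(a) != rename(b)}
    return new_nodes, new_edges

write_nodes, write_edges = collapse(NODES, INSTALL_EDGES, ["O", "Q"], "n")
_, cyclic_edges = collapse(NODES, CONFLICT_EDGES, ["O", "Q"], "n")
print(write_nodes["n"], write_edges, cyclic_edges)
```

Starting from the installation graph leaves a single edge from P to n, so y must be flushed before x. Starting from the full conflict graph would leave edges in both directions between P and n, which is a cycle.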
13
Managing Recovery
[Figure: log, stable state, and volatile state during installation]
  Log: O1, O2, O3
  Stable state: write graph prefix, usually a single node
  Volatile state: suffix of the write graph, in cache
  Collapse to "install" X
  Atomic: updating the state and removing O3 from redo_set
14
Physiological Recovery
Physiological recovery (e.g. ARIES)
  Operation form: read A, write A
  Log op has an LSN
  Variable tagged: LSN of the last log op writing it
  REDO: op's LSN > variable LSN means "Yes" (replay)
Our explanation
  Ops writing a variable are collapsed to one cache node
  Flushing a page to the stable state (root of the write graph)
    Collapses the cache node into the stable state node
    Keeps the state potentially recoverable
    Redo test: the node's ops are removed from redo_set
    Maintains invariant Inv
  [state change; redo_set change] is atomic
Physical and logical recovery are described in the paper
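The LSN-based REDO test can be sketched as below. This is a simplified illustration rather than ARIES itself: each "page" is a single value tagged with the LSN of the last operation that wrote it, and the log contents are invented for the example.

```python
import copy

def redo_test(op, pages):
    """Replay only if the op is newer than the LSN recorded on its page."""
    return op["lsn"] > pages[op["page"]]["lsn"]

def recover(pages, log):
    for op in log:                           # log is scanned in LSN order
        if redo_test(op, pages):
            page = pages[op["page"]]
            page["value"] = op["apply"](page["value"])
            page["lsn"] = op["lsn"]          # tag page with its last writer
    return pages

# Hypothetical log: every op reads and writes a single page.
LOG = [
    {"lsn": 1, "page": "A", "apply": lambda v: v + 5},
    {"lsn": 2, "page": "A", "apply": lambda v: v * 2},
    {"lsn": 3, "page": "B", "apply": lambda v: v + 7},
    {"lsn": 4, "page": "A", "apply": lambda v: v - 1},
]

# Page A was flushed after LSN 2; page B was never flushed.
stable = {"A": {"value": 10, "lsn": 2}, "B": {"value": 0, "lsn": 0}}
recovered = recover(copy.deepcopy(stable), LOG)
again = recover(copy.deepcopy(recovered), LOG)   # second pass replays nothing
print(recovered)
```

Because the page LSN is written together with the page contents, flushing a page changes the state and shrinks redo_set in one atomic step.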
15
Extended LSN Method
Generalize physiological ops: read/write multiple variables
  Our example: ops can read X, write Y (like P)
  Also read X, write X
LSNs are still effective for the REDO test
  Flush synchronizes the change to the state and to redo_set
Cache management
  Now requires the flush of one variable before another
  Our theory captures this careful write requirement
Consider a B-tree split (B-link tree) **
  Next slide shows a "half split" graphically
  Must also post the index term for the new node
16
Extended Recovery (B-link tree Split)
[Figure: write graph for a B-link tree half split of old node X into new node Y]
  O: readset{x}, writes{<x,1>}
  P: readset{x}, writes{<y,2>}: move half to node Y (read X, write Y)
  Q: readset{x}, writes{<x,3>}
  Collapsed node n: Ops(n) = {O,Q}, Writes(n) = {<x,3>}
  Operation labels on node X: update node X; update node X, remove Y records
  States shown: x=0,y=0; x=0,y=2; x=1,y=0; x=1,y=2; x=3,y=2
  Flush Y before X
  In SQL Server 6.0
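The careful write requirement (flush Y before X) can be checked with the per-variable LSN test. The sketch assumes the O, P, Q operations of the earlier slides with Q taken to be x ← x+2 (only the value 3 is given), and a per-variable LSN tag; both are our assumptions for illustration.

```python
LOG = [
    {"lsn": 1, "writes": "x", "apply": lambda s: s["x"] + 1},   # O: x <- x + 1
    {"lsn": 2, "writes": "y", "apply": lambda s: s["x"] + 1},   # P: y <- x + 1
    {"lsn": 3, "writes": "x", "apply": lambda s: s["x"] + 2},   # Q: assumed form
]

def recover(values, lsns):
    """Replay each op whose LSN is newer than the LSN of the variable it writes."""
    values, lsns = dict(values), dict(lsns)
    for op in LOG:
        var = op["writes"]
        if op["lsn"] > lsns[var]:
            values[var] = op["apply"](values)
            lsns[var] = op["lsn"]
    return values

# Y flushed first: stable state holds P's write to y, old x.
y_first = recover({"x": 0, "y": 2}, {"x": 0, "y": 2})
# X flushed first: stable state holds Q's write to x, old y.
x_first = recover({"x": 3, "y": 0}, {"x": 3, "y": 0})
print(y_first, x_first)
```

When X reaches disk first, the replay of P reads the already-updated x and computes the wrong y. That is the read-write edge from P to the node holding x showing up as a flush order.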
17
Recoverable Systems Summary
Cache management keeps the state potentially recoverable
  Very generally via the write graph
  Derived from the installation state graph
Maintains invariant Inv so that the replayed operations are the correct set
  By synchronizing changes to redo_set with changes to state
19
Outline
Foundation
  Conflict graph, state graphs, recovered state
Abstract recovery
  Cache management: maintaining state
    Installation order: weaker update order than conflict order
  Recovery
    Recovery procedure, redo test
  Invariant: guarantees correct recovery
    Coordinating state before failure with recovery execution after failure
Recoverable systems
  Write graphs for maintaining potentially recoverable state
  Maintaining recovery invariant
  Explaining current recovery methods
20
Managing the Cache
Stable state: prefix of the write graph
  Usually a single node
  Means the stable state is potentially recoverable
Cache: usually contains the write graph suffix
  Volatile state, which is lost during a system crash
  Usually collapsing nodes so that there is one node per "variable"
State update: move a minimum write graph node in the cache to the stable state atomically
  Start with a potentially recoverable state
  Atomic transition, frequently a node collapse
  New potentially recoverable state
21
Maintaining Recovery Invariant
Potentially recoverable state is only "half" of the job
  Ops(log) - redo_set must explain the state
  Jobs need to be synchronized to enforce Inv
Examples (stable state is the root of the write graph):
  Logical recovery (in paper)
  Physical recovery (in paper)
  Physiological recovery **
  Extended recovery **
22
Logical Recovery
Logical recovery with arbitrary log ops: System R
  Quiesce and write a shadow "checkpoint" to disk
    By dumping cache contents to disk shadow pages
  Disk shadow is installed atomically
    Replacing old versions of shadow variables
Our explanation
  Shadow coalesced on disk is a single write graph node
    Encompassing all changes from the last checkpoint
    Hence it is a write graph prefix
  Shadow is "installed" atomically via a pointer swing
    Accomplished by writing a new pointer in the checkpoint record to the log
  Log is truncated with the writing of the checkpoint record
    All prior records are added to the checkpoint
    Which "installs" all earlier operations simultaneously with the stable state update, hence maintaining Inv
23
Physical Recovery
Physical recovery writes the entire page
  Pages are written back to disk
  When a prefix of the log contains only pages already written back, the log is truncated
    Via a checkpoint record indicating the redo pass start
  All records scanned during recovery are replayed
    REDO(op) is always "yes"
Our explanation
  Operations are blind writes of a single variable: the read set is empty
  All variables with operations not in the checkpoint are unexposed
    These operations are replayed during recovery
    They never read
  Writing to those variables leaves them unexposed
    However, they are now set to be installed
  Installation occurs when the checkpoint record is written
    Operations now not part of the redo scan are thus installed
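Why REDO can always answer "yes" for physical logging can be seen in a small sketch. It assumes each log record is a blind write of a full value to one variable, and that the checkpoint is simply an index into the log; the log contents are invented for the example.

```python
def recover(stable, log, redo_start):
    """Replay every record from the redo start; REDO is always "yes"."""
    state = dict(stable)
    for var, value in log[redo_start:]:
        state[var] = value          # blind write: no variable is read
    return state

LOG = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
REDO_START = 2                      # records before this are already installed

# Any mix of old and new versions of a and b may be on disk at the crash.
results = [
    recover({"a": a, "b": b}, LOG, REDO_START)
    for a in (1, 3)
    for b in (2, 4)
]
print(results)
```

Every variable written after the redo start is unexposed, so whichever version happens to be on disk, replay overwrites it with the right value.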
24
Our Goal
REDO recovery explanation (not all of recovery)
  Cache management: stage data to stable state
    Goal: fewer writes and a less constrained order
    Some methods require careful write ordering: why?
  Recovery: which ops to replay
    And how to coordinate state changes with replay changes
  Provably ensure "recoverability"
Disclaimers
  Abstract story: real recovery needs more
  Simpler operation model than past work
  Not everything is explained:
    All actually used recovery techniques are handled
    But not all recovery techniques we know of are "quite" captured
25
System Model
State: {<name, value>, …}
Operation:
  readset(O): set of variables read by O
  writeset(O): set of variables written by O
  Operations are atomic: the system must ensure atomicity
Operation sequence
  Sequence of ops O_1, O_2, … O_k … O_final
State sequence
  Sequence of states S_1, S_2, … S_k … S_final generated by the op sequence from S_0
  O_k precedes (leads to) S_k when executed "against" S_{k-1}
Recovery goal
  From some state and a record of operations (on the log)
  Reproduce the last state in the sequence, S_final
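The system model maps onto a small data structure. The sketch below is one possible encoding, not the paper's: the `Operation` class, the `compute` field, and the form of Q (x ← x+2) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

State = Dict[str, int]

@dataclass(frozen=True)
class Operation:
    name: str
    readset: FrozenSet[str]
    writeset: FrozenSet[str]
    compute: Callable[[State], State]    # new values for the writeset only

    def __call__(self, state: State) -> State:
        visible = {v: state[v] for v in self.readset}   # op sees only its readset
        updates = self.compute(visible)
        assert set(updates) == set(self.writeset)
        return {**state, **updates}

def state_sequence(s0: State, ops: List[Operation]) -> List[State]:
    """S_k is produced by executing O_k against S_{k-1}."""
    states = [s0]
    for op in ops:
        states.append(op(states[-1]))
    return states

X, Y = frozenset({"x"}), frozenset({"y"})
OPS = [
    Operation("O", X, X, lambda r: {"x": r["x"] + 1}),
    Operation("P", X, Y, lambda r: {"y": r["x"] + 1}),
    Operation("Q", X, X, lambda r: {"x": r["x"] + 2}),   # assumed form of Q
]

states = state_sequence({"x": 0, "y": 0}, OPS)
print(states)
```

The last element of `states` is S_final, the state that recovery has to reproduce from some stable state plus the log.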