A Theory of Redo Recovery

1

A Theory of Redo Recovery

David Lomet, Microsoft Research, Redmond

Mark Tuttle, HP Research, Cambridge

2

Big Picture
Redo recovery requires
  A good db state
  Replay of the right operations
Good state updates:
  Conflict order is not required
  Write-read conflicts can be ignored
  Some db “variables” are irrelevant (no need to update them)
Synchronize state update and ops replayed
  Captured in the recovery invariant
  We prove that maintaining the invariant guarantees recovery
Current recovery methods maintain the invariant
  Show how current methods work (e.g. ARIES redo)
  Show how “new” methods could work
Much simpler than our VLDB’95 paper

3

Conflict State Graph (CSG)
Conflict graph (“borrowed” from concurrency control)
  Nodes are log operations; edges are conflicts (RW, WR, WW)
State graph SG
  Add writes(node): {<name, value>…} of vars updated
  State for SG: {<x,v> | <x,v> in writes(n) and n is last node in state graph with x in vars(n)}
  Final state S_final of CSG is desired recovered state
Any prefix of a state graph is a state graph
  Prefix: node in prefix implies predecessor in prefix
  State of any prefix of CSG can be recovered by replaying operations in suffix in conflict graph order

We will relax CSG requirements
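The conflict edges described above can be derived mechanically from readsets and writesets. The following is a minimal sketch, not the authors' construction: the tuple representation and the name `conflict_edges` are illustrative, and Q's writeset is an assumption (the slides give only Q's readset).

```python
from itertools import combinations

# Log operations in log order as (name, readset, writeset).
# Q's writeset {x} is an assumption; the slides list only its readset.
LOG = [
    ("O", {"x"}, {"x"}),
    ("P", {"x"}, {"y"}),
    ("Q", {"x"}, {"x"}),
]

def conflict_edges(log):
    """Return {(earlier, later): set of conflict kinds} for conflicting pairs."""
    edges = {}
    # combinations() preserves log order, so `a` always precedes `b`.
    for (a, reads_a, writes_a), (b, reads_b, writes_b) in combinations(log, 2):
        kinds = set()
        if writes_a & reads_b:
            kinds.add("WR")   # b reads a variable that a wrote
        if reads_a & writes_b:
            kinds.add("RW")   # b overwrites a variable that a read
        if writes_a & writes_b:
            kinds.add("WW")   # both write the same variable
        if kinds:
            edges[(a, b)] = kinds
    return edges

edges = conflict_edges(LOG)
```

Under these assumptions the edge O to P is write-read only, P to Q is read-write only, and O to Q carries all three conflict kinds.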

4

Conflict State Graph & States

[Figure: conflict state graph for operations O, P, and Q, with the states of its prefixes]
Nodes: O: readset{x}, writes{<x,1>}; P: readset{x}, writes{<y,2>}; Q: readset{x}
States: x=0,y=0; x=1,y=0; x=1,y=2; S_final: x=3,y=2
Edge labels: write-read edge; read-write edge; write-read & write-write & read-write edge

5

Installation Graph
Example:
  Initial stable state: {<x,0>,<y,0>}
  O: x ← x+1
  P: y ← x+1
  After O, P, state is {<x,1>,<y,2>}
  Flush y to disk: stable state is {<x,0>,<y,2>} (y written by P)
  Replay O: generates correct state {<x,1>,<y,2>}
    O’s readset x unchanged by P’s installation
    Even though write-read edge orders P after O
Installation graph: conflict graph without write-read edges
Installation state graph (ISG):
  Same writes(n) for node n as conflict state graph
  State of any prefix of ISG can be recovered
  More prefixes (states) because of fewer edges
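The example on this slide can be checked directly. This is a small sketch under the assumption that a state is a Python dict and an operation is a function returning the variables it writes; `op_O`, `op_P`, and `apply_op` are illustrative names.

```python
def op_O(state):
    return {"x": state["x"] + 1}   # O: x <- x+1

def op_P(state):
    return {"y": state["x"] + 1}   # P: y <- x+1

def apply_op(op, state):
    """Execute op against state and return the new state (state is not mutated)."""
    new_state = dict(state)
    new_state.update(op(state))
    return new_state

initial = {"x": 0, "y": 0}
after_O = apply_op(op_O, initial)
final = apply_op(op_P, after_O)

# Flush only y: P is installed before O, against the write-read edge O -> P.
stable = {"x": initial["x"], "y": final["y"]}

# Redo recovery replays only O, the operation that is not yet installed.
recovered = apply_op(op_O, stable)
```

Replaying O from the stable state works because O reads only x, and installing P changed only y.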

6

Installation State Graph & States

[Figure: installation state graph for O, P, and Q, with the states of its prefixes]
States: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2
Annotations: removed write-read edge; retained read-write edge; retained write-write & read-write edge; ISG recoverable state

7

Exposed Variables
Example
  O1: x ← z+1
  O2: x ← 25
  After O2, we don’t care about x value of O1
Variable x is unexposed after ops I ({O1} here) if the min_conflict op (minimal in conflict order) in Ops(log) - I writes x without reading it
  x’s value is a “don’t care” when x is unexposed
  This is an example of physical logging
Prefix of installation graph explains state S if values of exposed variables in S are the same as values in state of prefix of ISG
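The two definitions above can be sketched as predicates. This sketch assumes the log order is consistent with conflict order, so the first uninstalled operation touching a variable stands in for the min_conflict op, and it treats a variable that no uninstalled operation touches as exposed. The names `Op`, `unexposed`, and `explains`, and the initial value z=0, are illustrative.

```python
from typing import NamedTuple

class Op(NamedTuple):
    name: str
    readset: frozenset
    writeset: frozenset

def unexposed(var, log, installed):
    """True if the first uninstalled op touching var writes it without reading it."""
    for op in log:
        if op.name in installed:
            continue
        if var in op.readset:
            return False          # value will be read: it matters
        if var in op.writeset:
            return True           # blind write: current value is a "don't care"
    return False                  # never touched again: value must already be right

def explains(prefix_state, state, log, installed):
    """True if state agrees with the prefix state on every exposed variable."""
    return all(
        state.get(var) == value
        for var, value in prefix_state.items()
        if not unexposed(var, log, installed)
    )

log = [
    Op("O1", frozenset({"z"}), frozenset({"x"})),   # O1: x <- z+1
    Op("O2", frozenset(), frozenset({"x"})),        # O2: x <- 25
]
installed = {"O1"}
prefix_state = {"x": 1, "z": 0}   # state of the prefix {O1}, assuming z starts at 0

ok = explains(prefix_state, {"x": 999, "z": 0}, log, installed)
bad = explains(prefix_state, {"x": 1, "z": 5}, log, installed)
```

Here `ok` holds because x is unexposed after {O1}, so its garbage value is ignored, while `bad` fails because z is exposed and differs.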

8

Potentially Recoverable State

Potentially recoverable state: state that, by the replay of a subset of operations of the conflict graph, in conflict order, will produce the recovered state S_final

Theorem: If S is a state explained by a prefix of the installation graph, then S is potentially recoverable

9

REDO Test & Recovery Procedure
REDO: tests ops in conflict order log scan
  Yes (true): replay operation
  No (false): bypass operation
redo_set = {O | REDO(O..) & O on scanned log}
Recover procedure:
  Set log scan point to “checkpoint”
  while not at log end
    O ← current log operation
    State = if REDO(O, State, Log, Analysis)
            then O(State)
            else State
    Advance log scan point to next operation
  end
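The Recover procedure above translates almost line for line into code. This is a hedged sketch: the `LogOp` type, the list-index checkpoint, and the sample REDO test `redo_if_not_installed` (which consults a hypothetical `installed` set in the analysis data) are assumptions made for illustration.

```python
from typing import Callable, Dict, NamedTuple

State = Dict[str, int]

class LogOp(NamedTuple):
    name: str
    fn: Callable[[State], State]   # returns the variables the op writes

def recover(state, log, checkpoint, redo, analysis):
    """Scan the log from the checkpoint, replaying each op that passes REDO."""
    state = dict(state)
    redo_set = []
    for op in log[checkpoint:]:
        if redo(op, state, log, analysis):
            state.update(op.fn(state))   # State = O(State)
            redo_set.append(op.name)
        # otherwise the op is bypassed and State is unchanged
    return state, redo_set

log = [
    LogOp("O", lambda s: {"x": s["x"] + 1}),
    LogOp("P", lambda s: {"y": s["x"] + 1}),
]

def redo_if_not_installed(op, state, log, analysis):
    # Hypothetical REDO test: replay exactly the ops not recorded as installed.
    return op.name not in analysis["installed"]

stable = {"x": 0, "y": 2}          # y was flushed, so P is installed
recovered, redo_set = recover(
    stable, log, 0, redo_if_not_installed, {"installed": {"P"}}
)
```

With this REDO test the scan bypasses P and replays only O, reproducing the state x=1, y=2.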

10

Recovery
Recoverable system: a system with
  A potentially recoverable state S_pot
  Replay of O’s in redo_set from S_pot produces S_final
Inv: ops(Log) - redo_set defines prefix of the installation state graph that explains State
  Every system change must be an atomic transition maintaining Inv
Corollary: Given a state, log, checkpoint, and an execution of Recover (identifying redo_set)
  If Inv holds
  Then system is recoverable
  Only a specific potentially recoverable state is recoverable

11

Write Graph
Write graph: start from installation state graph
  Collapse set of nodes (acyclic): merges nodes
  Add new node for next operation
  Add edge (collapse cycles)
  Remove a write of an unexposed variable
    We do not care about values of unexposed variables
Write graph captures entire system state
  Prefix that is stable
  Suffix in cache
Cache manager uses write graph
  To maintain potentially recoverable state
  Usually by collapsing suffix node into stable prefix

12

Write Graph {via Node Collapse}: Fewer States

[Figure: write graph obtained from the installation graph by node collapse]
States: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2
Collapsed node n: Ops(n) = {O,P}, Writes(n) = {<x,3>}
Annotations: based on installation graph; removed write-read edge; write graph remains acyclic; keep only one version of each variable in cache; retained read-write edge translates to flush order for cache manager

13

Managing Recovery

[Figure: stable state, log, and volatile state during recovery management]
Stable state: write graph prefix, usually a single node
Log: O1, O2, O3
Volatile state: suffix of write graph, in cache
Annotations: collapse to “install” X; updating state and removing O3 from redo_set are atomic

14

Physiological Recovery
Physiological recovery (e.g. ARIES)
  Operation form: read A, write A
  Log op has LSN
  Variable tagged: LSN of last log op writing it
  REDO: “Yes” (replay) when op’s LSN > variable LSN
Our explanation
  Ops writing variable collapsed to one cache node
  Flushing page to stable state (root of write graph)
    Collapses cache node into stable state node
    Keeps state potentially recoverable
    Redo test: node’s ops removed from redo_set
    Maintains invariant Inv
    [state change; redo_set change] is atomic
Physical and Logical Recovery described in paper
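The LSN-based REDO test can be sketched as follows. This is an illustration of the test as stated on the slide, not ARIES itself: the `Page` and `LogRecord` types and the additive `delta` operation (a stand-in for any "read A, write A" operation) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Page:
    value: int
    lsn: int          # LSN of the last log op whose write this page reflects

@dataclass
class LogRecord:
    lsn: int
    page_id: str
    delta: int        # hypothetical op: page.value += delta (reads A, writes A)

def redo(rec, pages):
    """REDO test: replay only if the op is newer than the page's tag."""
    return rec.lsn > pages[rec.page_id].lsn

def recover(pages, log):
    replayed = []
    for rec in log:
        if redo(rec, pages):
            page = pages[rec.page_id]
            page.value += rec.delta
            page.lsn = rec.lsn        # retag the page with the replayed op
            replayed.append(rec.lsn)
    return replayed

log = [LogRecord(1, "A", 1), LogRecord(2, "B", 10), LogRecord(3, "A", 2)]

# Stable state after a crash: page A was flushed after LSN 1, page B never.
pages = {"A": Page(value=1, lsn=1), "B": Page(value=0, lsn=0)}
replayed = recover(pages, log)
```

Because a page and its LSN tag reach disk together, flushing the page changes the state and shrinks redo_set in one atomic step, which is the synchronization the invariant asks for.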

15

Extended LSN Method
Generalize physiological ops
  Read/write multiple variables
  Our example: ops can read X, write Y (like P)
    Also read X, write X
LSNs still effective for REDO test
  Flush synchronizes change to state and redo_set
Cache management
  Now requires flush of one variable before another
  Our theory captures this careful write requirement
Consider B-tree split (Blink-tree)
  Next slide shows “half split” graphically
  Must also post index term for new node
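The careful write requirement can be sketched as a cache manager that records a flush dependency whenever an operation overwrites a variable that an earlier, still uninstalled operation read. This is a simplified illustration, not a production design: the class `CacheManager` and its methods are hypothetical names, the tracking is conservative, and cyclic dependencies (which the write graph handles by collapsing nodes) are rejected rather than resolved.

```python
class CacheManager:
    def __init__(self, stable):
        self.stable = dict(stable)   # values on disk
        self.cache = dict(stable)    # current (volatile) values
        self.dirty = set()
        self.readers = {}            # var -> vars written by ops that read var
        self.flush_first = {}        # var -> vars that must reach disk before it
        self.flush_order = []

    def execute(self, reads, write, fn):
        """Run an op that reads `reads` and writes the single variable `write`."""
        value = fn(self.cache)
        # Read-write edges: ops that read `write` must be installed before this one.
        deps = {v for v in self.readers.get(write, set())
                if v != write and v in self.dirty}
        self.flush_first.setdefault(write, set()).update(deps)
        for r in reads:
            self.readers.setdefault(r, set()).add(write)
        self.cache[write] = value
        self.dirty.add(write)

    def flush(self, var, _pending=()):
        if var in _pending:
            raise ValueError("cyclic flush dependency: needs an atomic multi-page write")
        for dep in sorted(self.flush_first.get(var, ())):
            if dep in self.dirty:
                self.flush(dep, _pending + (var,))
        if var in self.dirty:
            self.stable[var] = self.cache[var]
            self.dirty.discard(var)
            self.flush_order.append(var)

cm = CacheManager({"X": [1, 2, 3, 4], "Y": []})
# Half split: move the upper half of X to Y (read X, write Y) ...
cm.execute(["X"], "Y", lambda c: c["X"][len(c["X"]) // 2:])
# ... then remove the moved records from X (read X, write X).
cm.execute(["X"], "X", lambda c: c["X"][:len(c["X"]) // 2])
cm.flush("X")
```

Asking to flush X forces Y to disk first, so a crash between the two writes never leaves the moved records missing from both pages.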

16

Extended Recovery {Blink-tree Split}

[Figure: “half split” of old node X into new node Y]
Move half to node Y: read X, write Y
Update node X: remove Y records
Flush Y before X
In SqlServer 6.0
States: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2
Collapsed node: Ops(n) = {O,P}, Writes(n) = {<x,3>}

17

Recoverable Systems Summary

Cache management keeps state potentially recoverable
  Very generally via write graph
  Derived from installation state graph

Maintains invariant Inv so that replayed operations are correct set
  By synchronizing changes to redo_set with changes to state

18

Questions?

19

Outline
Foundation
  Conflict graph, state graphs, recovered state
Abstract Recovery
  Cache management: maintaining state
    Installation order: weaker update order than conflict order
  Recovery
    Recovery procedure, redo test
  Invariant: guarantees correct recovery
    Coordinating state before failure with recovery execution after failure
Recoverable Systems
  Write graphs for maintaining potentially recoverable state
  Maintaining recovery invariant
  Explaining current recovery methods

20

Managing the Cache
Stable state: prefix of write graph
  Usually a single node
  Means stable state potentially recoverable
Cache: usually contains write graph suffix
  Volatile state, which is lost during system crash
  Usually collapsing nodes so that one node per “variable”
State update: move a minimum write graph node in cache to stable state atomically
  Start with potentially recoverable state
  Atomic transition, frequently node collapse
  New potentially recoverable state

21

Maintaining Recovery Invariant
Potentially recoverable state only “half” of job
  Ops(log) - Redo_set must explain state
  Jobs need to be synchronized to enforce Inv
Examples: stable state is root of write graph
  Logical recovery (in paper)
  Physical recovery (in paper)
  Physiological recovery **
  Extended recovery **

22

Logical Recovery
Logical recovery with arbitrary log ops: System R
  Quiesce and write shadow “checkpoint” to disk
    By dumping cache contents to disk shadow pages
  Disk shadow is installed atomically
    Replacing old versions of shadow variables
Our explanation
  Shadow coalesced on disk is single write graph node
    Encompassing all changes from last checkpoint
    Hence is a write graph prefix
  Shadow “installed” atomically via pointer swing
    Accomplished by writing new pointer in checkpoint record to log
  Log is truncated with the writing of the checkpoint record
    All prior records are added to checkpoint
    Which “installs” all earlier operations simultaneously with stable state update, hence maintaining Inv

23

Physical Recovery
Physical recovery writes entire page
  Pages are written back to disk
  When prefix of log contains only pages already written back, log is truncated
    Via checkpoint record indicating redo pass start
  All records scanned during recovery are replayed
    REDO(op) always is “yes”
Our explanation
  Operations are blind writes of single variable: read set is empty
  All variables with operations not in checkpoint are unexposed
    These operations are replayed during recovery
    They never read
  Writing to those variables leaves them unexposed
    However, they are now set to be installed
  Installation occurs when checkpoint record is written
    Operations now not part of redo scan are thus installed

24

Our Goal
REDO recovery explanation (not all of recovery)
Cache management: stage data to stable state
  Goal: fewer writes & less constrained order
  Some methods require careful write ordering: why?
Recovery: which ops to replay
  And how to coordinate state changes with replay changes
Provably ensure “recoverability”
Disclaimers
  Abstract story: real recovery needs more
  Simpler operation model than past work
  Not everything is explained:
    All actually used recovery techniques are handled
    But not all recovery techniques we know of are “quite” captured

25

System Model
State: {<name, value>…}
Operation:
  readset(O): set of variables read by O
  writeset(O): set of variables written by O
  Operations are atomic: system must ensure atomicity
Operation sequence
  Sequence of ops O_1, O_2, … O_k … O_final
State sequence
  Sequence of states S_1, S_2, … S_k … S_final generated by op sequence from S_0
  O_k precedes (leads to) S_k when executed “against” S_{k-1}
Recovery goal
  From some state and a record of operations (on log)
  Reproduce last state in sequence S_final
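The system model can be made concrete with a few lines of code. This is a sketch under stated assumptions: the `Operation` type and the helper names are illustrative, and Q's assignment x ← x+2 is hypothetical, chosen only so that the final state matches the x=3, y=2 shown in the earlier figures.

```python
from typing import Callable, Dict, NamedTuple

State = Dict[str, int]

class Operation(NamedTuple):
    name: str
    readset: frozenset
    writeset: frozenset
    fn: Callable[[State], State]   # maps readset values to writeset values

def execute(op, state):
    """Execute op atomically against state, returning the next state."""
    visible = {v: state[v] for v in op.readset}   # op sees only its readset
    writes = op.fn(visible)
    if set(writes) != set(op.writeset):
        raise ValueError(f"{op.name} wrote outside its declared writeset")
    new_state = dict(state)
    new_state.update(writes)
    return new_state

def state_sequence(s0, ops):
    """Return [S_0, S_1, ..., S_final] generated by the op sequence from S_0."""
    states = [s0]
    for op in ops:
        states.append(execute(op, states[-1]))
    return states

X = frozenset({"x"})
ops = [
    Operation("O", X, X, lambda s: {"x": s["x"] + 1}),
    Operation("P", X, frozenset({"y"}), lambda s: {"y": s["x"] + 1}),
    Operation("Q", X, X, lambda s: {"x": s["x"] + 2}),   # hypothetical assignment
]
states = state_sequence({"x": 0, "y": 0}, ops)
s_final = states[-1]
```

Restricting each operation to its declared readset keeps the readset and writeset honest, which is what the conflict and installation graphs rely on.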
