Databases 2
The Paxos Commit Algorithm
Agenda
Paxos Commit Algorithm: Overview
The participating processes
  The resource managers
  The leader
  The acceptors
Paxos Commit Algorithm: the base version
Failure scenarios
Optimizations for Paxos Commit
Performance
Paxos Commit vs. Two-Phase Commit
Using a dynamic set of resource managers
Paxos Commit Algorithm: Overview
Paxos was applied to transaction commit by Leslie Lamport and Jim Gray in “Consensus on Transaction Commit”
One instance of Paxos (consensus algorithm) is executed for each resource manager, in order to agree upon a value (Prepared/Aborted) proposed by it
“Not-synchronous” Commit algorithm
Fault-tolerant (unlike 2PC)
Intended to be used in systems where failures are fail-stop only, for both processes and network
Safety is guaranteed (unlike 3PC)
Formally specified and checked
Can be optimized to the theoretically best performance
Participants: the resource managers
N resource managers (“RM”) execute the distributed transaction, then each chooses a value (“locally chosen value” or “LCV”): ‘p’ for prepared iff the RM is willing to commit, ‘a’ for aborted otherwise
Every RM tries to get its LCV accepted by a majority set of acceptors (“MS”: any subset with a cardinality strictly greater than half of the total).
Each RM is the first proposer in its own instance of Paxos
Participants: the leader
Coordinates the commit algorithm
All the instances of Paxos share the same leader
It is not a single point of failure (unlike 2PC)
Assumed always defined (true, many leader-(s)election algorithms exist) and unique (not necessarily true, but unlike 3PC safety does not rely on it)
Participants: the acceptors
A denotes the set of acceptors
All the instances of Paxos share the same set A of acceptors
2F+1 acceptors are involved in order to achieve tolerance to F failures
We will consider only F+1 acceptors, leaving F more for “spare” purposes (less communication overhead)
Each acceptor keeps track of its own progress in an Nx1 vector
Vectors need to be merged into an Nx|MS| table, called aState, in order to take the global decision (we want “many” p’s)
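[Figure: three instances of Paxos (1st, 2nd, 3rd), one per RM, run against the acceptors AC1..AC5 (consensus box, MS). RM1 proposes ‘a’, RM2 and RM3 propose ‘p’; each gets “Ok!”. The resulting aState table has one column per acceptor (Acc1..Acc5) and one row per instance, e.g. a a a a a and p p p p p.]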
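The merge of the acceptors' vectors into aState can be sketched as follows. This is a minimal illustration, not the slides' own code: the function names `majority_size` and `merge_vectors`, the dictionary layout and the acceptor labels are assumptions made for the example.

```python
from typing import Dict, List


def majority_size(total_acceptors: int) -> int:
    """Smallest cardinality strictly greater than half of the total."""
    return total_acceptors // 2 + 1


def merge_vectors(vectors: Dict[str, List[str]]) -> List[List[str]]:
    """Merge per-acceptor N x 1 vectors into an N x |MS| table.

    Row i of the result holds what each acceptor recorded for RM i.
    """
    columns = list(vectors.values())
    n = len(columns[0])
    return [[col[rm] for col in columns] for rm in range(n)]


F = 2
ms_size = majority_size(2 * F + 1)  # F + 1 acceptors form a majority

# Hypothetical scenario: RM1 proposed 'a', RM2 and RM3 proposed 'p'.
vectors = {
    "Acc1": ["a", "p", "p"],
    "Acc2": ["a", "p", "p"],
    "Acc3": ["a", "p", "p"],
}
a_state = merge_vectors(vectors)
```

Each row of `a_state` then collects the entries for one RM, which is the shape the global decision needs.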
Paxos Commit (base)
Notation: rm ∈ RM, acc ∈ MS ⊆ A
Not blocked iff F acceptors respond
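[Figure: message flow of the base protocol for N=5 resource managers (RM0..RM4) and F=2, with the leader L and the acceptors AC0..AC2. T1 and T2 mark the timeouts discussed under the failure scenarios; writes on log are marked in the figure. v(rm) ∈ {p, a}.]
BeginCommit, shown together with p2a ⟨0, 0, v(0)⟩: 1 x
prepare: (N-1) x
p2a ⟨rm, 0, v(rm)⟩: (N(F+1)-1) x
p2b ⟨acc, rm, 0, v(rm)⟩: F x (Opt.)
p3: N x; if (Global Commit) then ⟨commit⟩ else ⟨abort⟩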
Global Commit Condition
$$\mathit{GlobalCommit} \triangleq (\forall rm)(\exists b)(\exists MS)(\forall acc \in MS)\,(\langle \text{p2b}, acc, rm, b, \text{p} \rangle \text{ was sent/received})$$
That is: aState must have exactly one row for each RM involved in the commit; each of those rows must contain at least F+1 entries that have ‘p’ as a value and refer to the same ballot
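The condition can be checked mechanically. The sketch below is only an illustration under assumptions: a phase 2b message is modelled as a hypothetical tuple `(acc, rm, ballot, value)`, a majority is taken to be F+1 acceptors, and the function name `global_commit` is invented for the example.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

P2b = Tuple[str, str, int, str]  # (acc, rm, ballot, value), assumed layout


def global_commit(p2b_msgs: Iterable[P2b], rms: Set[str], f: int) -> bool:
    """True iff every RM has some ballot with 'p' from at least F+1 acceptors."""
    seen = set()
    counts = Counter()
    for acc, rm, ballot, value in p2b_msgs:
        key = (acc, rm, ballot)
        if value == "p" and key not in seen:
            seen.add(key)  # count each acceptor once per (rm, ballot)
            counts[(rm, ballot)] += 1
    return all(
        any(n >= f + 1 for (r, _b), n in counts.items() if r == rm)
        for rm in rms
    )


msgs = [
    ("ac1", "rm1", 0, "p"), ("ac2", "rm1", 0, "p"),
    ("ac1", "rm2", 0, "p"), ("ac2", "rm2", 1, "p"),
]
# rm2 has one 'p' in ballot 0 and one in ballot 1: not the same ballot.
split_ballots = global_commit(msgs, {"rm1", "rm2"}, f=1)
# A second 'p' in ballot 1 completes a majority for rm2.
completed = global_commit(msgs + [("ac3", "rm2", 1, "p")], {"rm1", "rm2"}, f=1)
```

The first call fails precisely because the two ‘p’ entries for rm2 refer to different ballots.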
[T1] What if some RMs do not submit their LCV?
[Figure: the value of RMj is missing; the leader runs phases 1 and 2 of RMj’s instance with one majority of acceptors. v ∈ {p, a}.]
p1a (“prepare?”), ballot bL1 > 0. Leader: «Has resource manager j ever proposed you a value?»
p1b (“promise”: promise not to answer any bL2 < bL1)
  (1) Acceptor i: «Yes, in my last session (ballot) bi with it I accepted its proposal vi»
  (2) Acceptor i: «No, never»
If (at least |MS| acceptors answered)
  If (for ALL of them case (2) holds) then V=‘a’ [FREE]
  else V=v(maximum({bi})) [FORCED]
p2a (“accept?”). Leader: «I am j, I propose V»
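The leader's choice between the FREE and FORCED cases can be sketched as a small function. The reply encoding is an assumption made for the example: `None` stands for answer (2) «No, never», and a pair `(b_i, v_i)` stands for answer (1); the name `choose_value` is hypothetical.

```python
from typing import List, Optional, Tuple

Reply = Optional[Tuple[int, str]]  # None = "No, never"; (b_i, v_i) otherwise


def choose_value(replies: List[Reply], majority: int) -> Optional[str]:
    """Value V the leader proposes on behalf of RM j, or None if it must wait."""
    if len(replies) < majority:
        return None  # fewer than |MS| acceptors answered so far
    accepted = [r for r in replies if r is not None]
    if not accepted:
        return "a"  # FREE: nobody accepted anything, the leader aborts
    # FORCED: take the value accepted in the highest ballot b_i
    return max(accepted, key=lambda r: r[0])[1]


free = choose_value([None, None, None], majority=3)
forced = choose_value([None, (0, "p"), None], majority=3)
waiting = choose_value([None], majority=3)
```

In the FORCED case the leader cannot abort freely, because the RM's own proposal may already have been chosen.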
[T2] What if the leader fails?
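If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2+ leaders) doesn’t preclude safety (unlike 3PC), but can impede progress…
Non-terminating example: infinite sequence of p1a-p1b-p2a messages from 2 leaders
[Figure: leaders L1 and L2 alternately address the majority MS with increasing ballots b1 > 0, b2 > b1, b3 > b2, b4 > b3, …, each after a timeout T; the latest ballot is trusted and the older one is ignored.]
Not really likely to happen
It can be avoided (random T?)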
Optimizations for Paxos Commit (1)
Co-Location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader
Saves 1 message phase (BeginCommit) and F+2 messages
“Real-Time assumptions”: RMs can prepare spontaneously. The prepare phase is not needed anymore; RMs just “know” they have to prepare within some amount of time
Saves 1 message phase (Prepare) and N-1 messages
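[Figure: co-location layout. RM0, AC0 and the leader L share a node, RM1 shares a node with AC1, RM2 with AC2; RM3 and RM4 have no acceptor. BeginCommit, p2a and p3 messages are shown; the prepare messages ((N-1) x) are marked “Not needed anymore!”.]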
Optimizations for Paxos Commit (2)
Phase 3 elimination: the acceptors send their phase 2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition themselves
Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
FPC + Co-location + R.T.A. = Optimal Consensus Algorithm
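[Figure: the same layout (RM0/AC0/L, RM1/AC1, RM2/AC2, RM3, RM4) before and after phase 3 elimination: p2b messages followed by p3, versus p2b messages sent directly to the RMs.]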
Performance
                               2PC                   Paxos Commit            Faster Paxos Commit
                               No coloc.  Coloc.     No coloc.   Coloc.      No coloc.   Coloc.
Message delays*                4          3          5           4           4           3
Messages*                      3N-1       3N-3       NF+F+3N-1   NF+3N-3     2NF+3N-1    2FN-2F+3N-3
Stable storage write delays**  2                     2                       2
Stable storage writes**        N+1                   N+F+1                   N+F+1
*Not assuming RMs’ concurrent preparation (slides-like scenario)
**Assuming RMs’ concurrent preparation (r.t. constraints needed)
If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC’s. Are they exactly the same protocol in that case?
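The message counts in the table can be evaluated for concrete N and F. The sketch below simply transcribes the “Messages” row; the function name `messages` and the protocol keys are assumptions made for the example.

```python
def messages(protocol: str, n: int, f: int, coloc: bool) -> int:
    """Message count from the 'Messages' row for N RMs and F tolerated failures."""
    formulas = {
        ("2pc", False): 3 * n - 1,
        ("2pc", True): 3 * n - 3,
        ("paxos", False): n * f + f + 3 * n - 1,
        ("paxos", True): n * f + 3 * n - 3,
        ("faster", False): 2 * n * f + 3 * n - 1,
        ("faster", True): 2 * f * n - 2 * f + 3 * n - 3,
    }
    return formulas[(protocol, coloc)]


N, F = 5, 2
paxos_plain = messages("paxos", N, F, coloc=False)
paxos_coloc = messages("paxos", N, F, coloc=True)
saved_by_colocation = paxos_plain - paxos_coloc  # expected to equal F + 2

# With a single acceptor (F=0) Paxos Commit sends as many messages as 2PC.
same_as_2pc = all(
    messages("paxos", N, 0, c) == messages("2pc", N, 0, c) for c in (False, True)
)
```

For N=5 and F=2 the co-location saving works out to F+2 messages, matching the figure quoted for that optimization.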
Paxos Commit vs. 2PC
Yes, but…
[Figure: two 2PC message diagrams side by side (TM, RM1, other RMs; timeouts T1 and T2): 2PC from Lamport and Gray’s paper, and 2PC from the slides of the course.]
…two slightly different versions of 2PC!
Using a dynamic set of RMs
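You add one process, the registrar, that acts just like another resource manager, except for the following: the value it proposes is not ‘p’ or ‘a’ but the set of RMs that joined the transaction
$$v_{registrar} \notin \{p, a\}, \qquad v_{registrar} = \{rm : rm \text{ joined the transaction}\}$$
RMs can join the transaction until the Commit Protocol begins
The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos:
$$\mathit{GlobalCommit}_{DynRM} \triangleq (\forall rm \in v_{registrar})(\exists b)(\exists MS)(\forall acc \in MS)\,(\langle \text{p2b}, acc, rm, b, \text{p} \rangle \text{ was sent/received})$$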
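With a dynamic set, the quantification over RMs is restricted to the set decided in the registrar's instance. A minimal sketch, assuming that set is already known and passed in as `v_registrar`, and that phase 2b messages are hypothetical tuples `(acc, rm, ballot, value)`; the function name is invented for the example.

```python
from typing import FrozenSet, Iterable, Tuple

P2b = Tuple[str, str, int, str]  # (acc, rm, ballot, value), assumed layout


def global_commit_dyn(
    p2b_msgs: Iterable[P2b], v_registrar: FrozenSet[str], f: int
) -> bool:
    """Every RM in the registrar's decided set needs F+1 'p' votes in one ballot."""
    voters = {}
    for acc, rm, ballot, value in p2b_msgs:
        if value == "p":
            voters.setdefault((rm, ballot), set()).add(acc)
    return all(
        any(r == rm and len(accs) >= f + 1 for (r, _b), accs in voters.items())
        for rm in v_registrar
    )


msgs = [
    ("ac1", "rm1", 0, "p"), ("ac2", "rm1", 0, "p"),
    ("ac1", "rm2", 0, "p"), ("ac2", "rm2", 0, "p"),
]
# rm3 never joined, so it is not in the registrar's set and is not required.
joined_only = global_commit_dyn(msgs, frozenset({"rm1", "rm2"}), f=1)
# If the registrar's set includes rm3, its missing votes block the commit.
with_rm3 = global_commit_dyn(msgs, frozenset({"rm1", "rm2", "rm3"}), f=1)
```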
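[Figure: RM1, RM2 and RM3 each send “join” to the registrar REG, whose Paxos instance proposes the value RM1;RM2;RM3 and gets “Ok!”; RM1 proposes ‘a’, RM2 and RM3 propose ‘p’, each getting “Ok!” from the majority MS of the acceptors AC1..AC5.]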
Thank You!
Questions?