Post on 03-Jan-2016
A Fusion-based Approach for Tolerating Faults in Finite State Machines
Vinit Ogale, Bharath Balasubramanian
Parallel and Distributed Systems Lab
Electrical and Computer Engineering Dept., University of Texas at Austin
Vijay K. Garg
IBM India Research Lab
Outline
Motivation
Related Work
Questions and Issues Addressed
Model
Partition Lattice
Fault Graphs
Fault Tolerance in FSMs and (f,m)-fusion
Algorithms: Generating Backups and Recovery
Implementation Results
Conclusion and Future Work
Motivation
Many real applications are modeled as FSMs
Embedded systems: traffic controllers, home appliances
Sensor networks: e.g. hundreds of sensors (temperature, pressure, etc.) need to be backed up
Problem
Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults)
[Figure: a counter counting 'a's (states a0, a1, a2) and a counter counting 'b's (states b0, b1, b2)]
Existing Solution - Replicate
n·f extra FSMs to tolerate f crash faults; 2·n·f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs)
[Figure: 1-crash fault tolerant setup. The counter counting 'a's (states a0, a1, a2) and the counter counting 'b's (states b0, b1, b2), each shown with a replica]
Related Work
Traditional approach: redundancy
  n·k backup machines to tolerate k faults in n machines
Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08)
  Exponential algorithm for generating machines that can tolerate crash faults
  Number of faults = number of machines
Fusible Data Structures (Garg, Ogale 06)
  Fuse common data structures such as linked lists, hash tables, etc.; the fused structure is smaller than the sum of the original structures
Erasure coding
  Fault tolerance in data
Reachable Cross Product
[Figure: counter A counting 'a's (states a0, a1, a2), counter B counting 'b's (states b0, b1, b2), and R(A, B), the reachable cross product of {A, B}, with states:
<a0, b0> <a0, b1> <a0, b2>
<a1, b0> <a1, b1> <a1, b2>
<a2, b0> <a2, b1> <a2, b2>]
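The reachable cross product can be sketched as a breadth-first search over tuples of machine states. This is a minimal illustration, not the authors' implementation: the function name `reachable_cross_product`, the dictionary encoding of transitions, and the assumption that each counter wraps around after its third state and ignores the other counter's event are all choices made for this example.

```python
from collections import deque

def reachable_cross_product(machines, events):
    """machines: list of (initial_state, delta), with delta[state][event] -> state.
    An event missing from a machine's table leaves that machine unchanged."""
    start = tuple(init for init, _ in machines)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for e in events:
            # Every machine sees the same event and moves independently.
            nxt = tuple(
                delta[s].get(e, s) for s, (_, delta) in zip(current, machines)
            )
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Assumed encoding: counter A advances on 'a' only, counter B on 'b' only.
A = ("a0", {"a0": {"a": "a1"}, "a1": {"a": "a2"}, "a2": {"a": "a0"}})
B = ("b0", {"b0": {"b": "b1"}, "b1": {"b": "b2"}, "b2": {"b": "b0"}})

product = reachable_cross_product([A, B], ["a", "b"])
print(len(product))  # 9 product states, matching the 3 x 3 grid on the slide
```

Because the two counters react to disjoint events, every combination of their states is reachable, which is why the product is as large as it can be here.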
Can We Do Better?
[Figure: counter counting 'a's (mod 3, states a0, a1, a2), counter counting 'b's (mod 3, states b0, b1, b2), and a fused machine F1 computing (a + b) modulo 3; example input "a a b"]
Can We Do Better?
[Figure: 2-crash fault tolerant setup. Counter counting 'a's (mod 3, states a0, a1, a2), counter counting 'b's (mod 3, states b0, b1, b2), and two fused machines: F1 computing (a + b) modulo 3 and F2 computing (a - b) modulo 3]
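To see why this setup survives two crashes, the following sketch models each machine only by the value it holds and checks by brute force that any two surviving machines determine the states of both counters. The function names `machine_states` and `recover` are hypothetical, and representing a machine by a number modulo 3 is a simplification made for this example.

```python
from itertools import combinations

MOD = 3

def machine_states(a, b):
    """States of the four machines after a 'a'-events and b 'b'-events."""
    return {"A": a % MOD, "B": b % MOD,
            "F1": (a + b) % MOD, "F2": (a - b) % MOD}

def recover(surviving):
    """All (A, B) state pairs consistent with the surviving machines."""
    return [
        (a, b)
        for a in range(MOD)
        for b in range(MOD)
        if all(machine_states(a, b)[name] == value
               for name, value in surviving.items())
    ]

# Crash every possible pair of machines; the other two must pin down (A, B).
for a in range(MOD):
    for b in range(MOD):
        full = machine_states(a, b)
        for crashed in combinations(full, 2):
            surviving = {k: v for k, v in full.items() if k not in crashed}
            assert recover(surviving) == [(a, b)]

# After the input "a a b" with both counters crashed: F1 = 0 and F2 = 1.
print(recover({"F1": 0, "F2": 1}))  # [(2, 1)]
```

The two backups have 3 states each, whereas replicating both counters twice would need four extra 3-state machines.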
Questions and Issues Addressed
Can we do better than the cross product?
How many faults can be tolerated? What is the minimum number of machines required to tolerate f crash faults?
Can these machines tolerate Byzantine faults? (For example, in the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault.)
Main aims:
  Develop theory to understand and define this problem
  Efficient algorithms based on this theory to generate backup machines
Application Scenario: Sensor Network
1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.). The sensors will be collected later and their data analyzed offline
10 sensors are expected to crash
Replication requires 1000 x 10 backup sensors to ensure fault-tolerant operation
Can we use just 10 extra sensors instead of 10,000?
Model
FSMs (machines) execute independently (in parallel)
The inputs to an FSM are not determined by any other FSM
FSMs act concurrently on the same set of events
Fail-stop (crash) faults
  Loss of current state, underlying FSM intact
Byzantine faults
  Machines can "lie" about their current state
Join of Two FSMs
Join (⊔): reachable cross product; 4 states in this case instead of 9
Less Than or Equal To Relation (≤)
Given FSMs A and B:
A ≤ B ⇔ A ⊔ B = B
If A ≤ B, then given the state of B we can determine the current state of A
Partitions
Given any FSM, we can partition its states into blocks such that the transitions for all states in a block are consistent
E.g. suppose states t0 and t3 have to be combined into one block
[Figure: FSM with states t0, t1, t2, t3 and transitions on Input 0 and Input 1]
Largest Consistent Partition Containing {t0,t3}
[Figure: the FSM with states t0, t1, t2, t3 and the resulting partition with blocks {t0,t3}, {t1}, {t2}]
Largest Consistent Partition Containing {t0,t1}
[Figure: the FSM with states t0, t1, t2, t3 and the resulting partition with blocks {t0,t1,t2}, {t3}]
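The two examples above can be reproduced with a small closure computation: merge the requested states, then keep merging successors of same-block states until the partition is consistent. This is a sketch under assumptions. The transition table `DELTA` is hypothetical (the slide's actual transitions are not recoverable from the text); it was chosen only so that it yields the same two partitions as the slides. The code returns the finest consistent partition containing the seed, which is read here as what "largest" means under the ≤ order, where finer partitions sit higher.

```python
def closed_partition(states, inputs, delta, seed):
    """Finest partition that puts the seed states in one block and is
    consistent: same-block states go to same-block successors on every input."""
    parent = {s: s for s in states}

    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[rb] = ra
        return True

    for s in seed[1:]:
        union(seed[0], s)

    changed = True
    while changed:  # repeat until no further merges are forced
        changed = False
        for a in states:
            for b in states:
                if a < b and find(a) == find(b):
                    for x in inputs:
                        if union(delta[a][x], delta[b][x]):
                            changed = True

    blocks = {}
    for s in states:
        blocks.setdefault(find(s), []).append(s)
    return sorted(blocks.values())

# Hypothetical 4-state machine, NOT the one drawn on the slide.
STATES = ["t0", "t1", "t2", "t3"]
INPUTS = [0, 1]
DELTA = {
    "t0": {0: "t1", 1: "t3"},
    "t1": {0: "t2", 1: "t3"},
    "t2": {0: "t0", 1: "t3"},
    "t3": {0: "t1", 1: "t3"},
}

print(closed_partition(STATES, INPUTS, DELTA, ["t0", "t3"]))
print(closed_partition(STATES, INPUTS, DELTA, ["t0", "t1"]))
```

In the second call, merging t0 and t1 forces t2 into the same block because their successors on input 0 (t1 and t2) must also share a block.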
Partition Lattice
The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66].
I.e., for any two FSMs A and B formed by partitioning T, there exists a unique C ≤ T such that
C = A ⊔ B (join, ⊔): A ≤ C and B ≤ C, and C is the smallest such element
C = A ⊓ B (meet, ⊓): C ≤ A and C ≤ B, and C is the largest such FSM
[Figure: partition lattice of T.
Top element ⊤: {t0} {t1} {t2} {t3}
Three-block machines F1 (A), F2 (B), F3, F4: {t0,t3} {t1} {t2}; {t0} {t1} {t2,t3}; {t0} {t1,t2} {t3}; {t0,t2} {t1} {t3}
Two-block machines S1, S2, S3, S4: {t0,t2,t3} {t1}; {t0,t3} {t1,t2}; {t0} {t1,t2,t3}; {t0,t1,t2} {t3}
Bottom element: {t0,t1,t2,t3}]
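On partitions, the join and meet can be computed directly from the blocks. The sketch below is illustrative only: the helper names `part`, `join` and `meet` are made up for this example, partitions are plain sets of frozensets, and neither function checks that its inputs are consistent partitions of the same FSM. Following the order used in these slides, the join is the common refinement (more information) and the meet is the merge of overlapping blocks (less information).

```python
def part(*blocks):
    """Build a partition as a set of frozenset blocks."""
    return {frozenset(b) for b in blocks}

def join(p, q):
    """Common refinement: all non-empty intersections of a block of p
    with a block of q."""
    return {b & c for b in p for c in q if b & c}

def meet(p, q):
    """Merge blocks of p that are linked through a block of q until every
    block of q lies inside a single block."""
    blocks = [set(b) for b in p]
    for c in q:
        touching = [b for b in blocks if b & c]
        merged = set(c).union(*touching)
        blocks = [b for b in blocks if not (b & c)] + [merged]
    return {frozenset(b) for b in blocks}

# Partitions read from the lattice figure.
F1 = part({"t0", "t3"}, {"t1"}, {"t2"})
F2 = part({"t0"}, {"t1"}, {"t2", "t3"})

print(join(F1, F2) == part({"t0"}, {"t1"}, {"t2"}, {"t3"}))  # True: the top element
print(meet(F1, F2) == part({"t0", "t2", "t3"}, {"t1"}))      # True
```

The join of F1 and F2 having four blocks agrees with the earlier remark that their reachable cross product has 4 states instead of 9.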
Top Element (⊤)
Given a set of FSMs 𝒜 = {A1, …, An}:
⊤ = A1 ⊔ A2 ⊔ … ⊔ An
All FSMs we consider henceforth are less than or equal to ⊤
Intuitively, ⊤ has information about the state of every machine in the original set 𝒜
Bottom Element of Lattice (⊥)
Single-state FSM
  Contains one block with all the states
  On any input, it transitions to itself
  Conveys no information about the current state of any machine
[Figure: partition lattice of T, from the top element ⊤ through machines F1 to F4 and S1 to S4 down to the bottom element {t0,t1,t2,t3}]
Tolerating Faults
[Figure: machines F1 and F2 with a fault marked X, and T, the reachable cross product (⊤), with states t0, t1, t2, t3]
Fault Graph: Fault tolerance indicator
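The weight of the edge (ti, tj) is the number of machines in 𝒜 that place ti and tj in different blocks.
[Figure: T, the reachable cross product (⊤), with states t0, t1, t2, t3; the original machines 𝒜 = {F1, F2}, with F1 = {t0,t3} {t1} {t2} and F2 = {t0} {t1} {t2,t3}; and the fault graph G(𝒜, T) on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2]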
[Figure: partition lattice of T with 𝒜 = {FSMs in the yellow region}, alongside the fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2]
Hamming Distance
Hamming distance d(ti, tj): weight of the edge separating the states (ti, tj) in the fault graph, e.g. d(t0, t1) = 2
Minimum Hamming distance dmin(T, 𝒜): the weight of the weakest edge in the fault graph, e.g. dmin(T, 𝒜) = 1
[Figure: fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; dmin(T, 𝒜) = 1]
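The distances above can be computed mechanically from the partitions. The sketch below assumes that an edge's weight is the number of machines that place its two endpoints in different blocks, which reproduces the weights shown for 𝒜 = {F1, F2}. The function names `block_of`, `fault_graph` and `dmin` are illustrative, and machines are represented simply as lists of blocks.

```python
from itertools import combinations

def block_of(machine, state):
    """The block of the machine's partition that contains the given state."""
    return next(b for b in machine if state in b)

def fault_graph(states, machines):
    """Edge (s, t) -> number of machines that separate s and t."""
    return {
        (s, t): sum(block_of(m, s) != block_of(m, t) for m in machines)
        for s, t in combinations(states, 2)
    }

def dmin(states, machines):
    """Weight of the weakest edge in the fault graph."""
    return min(fault_graph(states, machines).values())

STATES = ["t0", "t1", "t2", "t3"]
F1 = [{"t0", "t3"}, {"t1"}, {"t2"}]
F2 = [{"t0"}, {"t1"}, {"t2", "t3"}]

graph = fault_graph(STATES, [F1, F2])
print(graph[("t0", "t1")])       # 2
print(dmin(STATES, [F1, F2]))    # 1
```

The two weight-1 edges are (t0, t3), which F1 cannot tell apart, and (t2, t3), which F2 cannot tell apart.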
Fault Tolerance in FSMs (crash faults)
Theorem 1: A set of machines 𝒜 can tolerate up to f crash faults iff dmin(T(𝒜), 𝒜) > f
E.g. 𝒜 = {A, B, M1, M2}
- dmin(T(𝒜), 𝒜) = 3
- can tolerate 2 crash faults
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]
Fault Tolerance in FSMs (Byzantine faults)
Theorem 2: A set of machines 𝒜 can tolerate up to f Byzantine faults iff dmin(T(𝒜), 𝒜) > 2f
E.g. 𝒜 = {A, B, M1, M2}
Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3}
B and M1 are lying about their state (f = 2)
Since dmin(T(𝒜), 𝒜) = 3 < 4 = 2f, we cannot determine the state of T
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]
Fault Tolerance in FSMs (Byzantine faults)
Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3}
Only B is lying about its state (f = 1)
Since dmin(T(𝒜), 𝒜) = 3 > 2 = 2f, we can determine the state of T as t3
Henceforth, dmin(T(𝒜), 𝒜) is abbreviated to dmin(𝒜)
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]
Fault Tolerance and (f,m)-fusion
Given a set of n machines 𝒜, the set of m machines ℱ is an (f,m)-fusion of 𝒜 if dmin(𝒜 ∪ ℱ) > f
The set of machines in 𝒜 ∪ ℱ can tolerate f crash faults or ⌊f/2⌋ Byzantine faults
E.g. 𝒜 = {A, B}, ℱ = {M1, M2}, dmin(𝒜 ∪ ℱ) = 3
ℱ = {M1, M2} is a (2,2)-fusion of 𝒜
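Theorems 1 and 2 and the fusion condition reduce to simple arithmetic on dmin. The helper names below are hypothetical and exist only to make the thresholds explicit: the largest f with dmin > f for crash faults, and the largest f with dmin > 2f for Byzantine faults.

```python
def max_crash_faults(dmin):
    """Largest f with dmin > f (Theorem 1)."""
    return dmin - 1

def max_byzantine_faults(dmin):
    """Largest f with dmin > 2f (Theorem 2)."""
    return (dmin - 1) // 2

def is_fusion(dmin_of_union, f):
    """(f,m)-fusion condition: dmin of originals plus backups exceeds f."""
    return dmin_of_union > f

# The running example has dmin = 3.
print(max_crash_faults(3))      # 2
print(max_byzantine_faults(3))  # 1
print(is_fusion(3, 2))          # True
print(is_fusion(3, 3))          # False
```

With dmin = 3 this gives 2 crash faults or 1 Byzantine fault, matching the examples on the preceding slides.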
Minimal Fusion
Given a set of machines 𝒜, a fusion set ℱ is minimal if there does not exist another (f,m)-fusion ℱ' such that
∀ F ∈ ℱ, ∃ F' ∈ ℱ' : F' ≤ F and ∃ (F ∈ ℱ, F' ∈ ℱ') : F' < F
[Figure: partition lattice of T with 𝒜 = {FSMs in the yellow region}, n = 2, marking a (1,1)-fusion and a minimal (1,1)-fusion]
Minimal Fusion: Example
[Figure: ⊤ with states t0, t1, t2, t3; F1 = {t0,t3} {t1} {t2}; F2 = {t0} {t1} {t2,t3}; S4 = {t0,t1,t2} {t3}; and the fault graph G(𝒜, T) on t0, t1, t2, t3 with edge weights 2, 2, 2, 2, 2, 3]
Algorithm: Generating Backups
Aim: add the least possible number of machines that tolerate f faults
Input: set of machines 𝒜, number of faults f
Output: minimal fusion set with the least size
If |T| = N and the size of the event set is |E|, the time complexity of the algorithm is O(N^3 · |E| · f)
Algorithm overview
f: number of faults, 𝒜: given set of machines, ℱ: fusion set being built
1. While dmin(𝒜 ∪ ℱ) ≤ f
   1. M := ⊤
   2. While M ...
      1. Compute the lower cover of M, i.e. LC(M)
      2. If there is a machine F ∈ LC(M) with dmin({F} ∪ 𝒜 ∪ ℱ) > dmin(𝒜 ∪ ℱ)
            M := F
         Else ℱ := ℱ ∪ {M}
2. Return ℱ
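The descent through the lattice can be sketched on the running four-state example. Everything here is a simplification made for illustration: the closed partitions are hard-coded from the lattice figure instead of being derived from an FSM, `lower_cover` filters that fixed list, the inner loop is read as "keep descending while some machine in the lower cover still raises dmin, otherwise add M", and the first qualifying machine is taken. The names `part`, `less`, `lower_cover`, `dmin` and `gen_fusion` are hypothetical.

```python
from itertools import combinations

def part(*blocks):
    """Partition from strings such as "t0 t3"."""
    return frozenset(frozenset(b.split()) for b in blocks)

STATES = ["t0", "t1", "t2", "t3"]
TOP = part("t0", "t1", "t2", "t3")
F1, F2 = part("t0 t3", "t1", "t2"), part("t0", "t1", "t2 t3")
F3, F4 = part("t0", "t1 t2", "t3"), part("t0 t2", "t1", "t3")
S1, S2 = part("t0 t2 t3", "t1"), part("t0 t3", "t1 t2")
S3, S4 = part("t0", "t1 t2 t3"), part("t0 t1 t2", "t3")
BOTTOM = part("t0 t1 t2 t3")
# Assumption: this list is the whole lattice, as read from the figure.
LATTICE = [TOP, F1, F2, F3, F4, S1, S2, S3, S4, BOTTOM]

def less(p, q):
    """p < q: p is strictly coarser (every block of q lies in a block of p)."""
    return p != q and all(any(b <= c for c in p) for b in q)

def lower_cover(m):
    """Maximal lattice elements strictly below m."""
    below = [p for p in LATTICE if less(p, m)]
    return [p for p in below if not any(less(p, q) for q in below)]

def dmin(machines):
    """Minimum, over state pairs, of the number of machines separating them."""
    return min(
        sum(not any({s, t} <= b for b in m) for m in machines)
        for s, t in combinations(STATES, 2)
    )

def gen_fusion(originals, f):
    fusion = []
    while dmin(originals + fusion) <= f:
        m = TOP
        while True:
            current = dmin(originals + fusion)
            better = [c for c in lower_cover(m)
                      if dmin(originals + fusion + [c]) > current]
            if not better:
                break          # no smaller machine helps: keep m
            m = better[0]      # descend to the first machine that still helps
        fusion.append(m)
    return fusion

print(gen_fusion([F1, F2], 1) == [S4])       # True
print(gen_fusion([F1, F2], 2) == [S4, TOP])  # True
```

For f = 1 the sketch descends from ⊤ through F3 to S4 and stops there, which agrees with the minimal (1,1)-fusion S4 shown in the Minimal Fusion example.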
[Figure sequence: the algorithm run on the partition lattice of T with 𝒜 = {FSMs in the yellow region}. Fault graph edge weights on t0, t1, t2, t3 at each step:
1, 1, 2, 2, 2, 2 (w = 1)
2, 2, 3, 3, 3, 3 (w = 2)
2, 2, 3, 2, 3, 3 (w = 2)
2, 1, 3, 2, 2, 3 (w = 1)
2, 2, 2, 2, 3, 2 (w = 2)]
Algorithm: Recovery
Aim: recover the state of the faulty machines for f crash or ⌊f/2⌋ Byzantine faults, given the state of the remaining machines
Input: current states of all available machines in 𝒜 ∪ ℱ
Output: correct state of T
The time complexity of the algorithm is O((n + m) · f)
Algorithm overview
S: set of current states of machines in 𝒜 ∪ ℱ
count: vector of size |T|, initialized to 0
1. For all (s in S) do
   1. For all (ti in s) do
      1. ++count[i]
2. Return tc : 1 ≤ c ≤ N and count[c] is the maximal element in count
Algorithm : Example
Consider machines A, B, M1, M2: dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault
Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0}
M1 is lying about its state
The recovery algorithm will return t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2
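The voting step of the recovery algorithm is short enough to write out. In this sketch each machine reports its current state as the set of indices i of the states ti in its current block, and crashed machines are simply left out of the input. The function name `recover_state` and the use of 0-based indices (to match t0 to t3) are choices made for this example.

```python
def recover_state(reported, num_states):
    """Return (index of the state of T with the most votes, vote counts)."""
    count = [0] * num_states
    for block in reported:        # one reported block per available machine
        for i in block:           # every state in the block gets one vote
            count[i] += 1
    return count.index(max(count)), count

# Example above: A = {t0,t3}, B = {t0}, M1 = {t1,t2,t3} (lying), M2 = {t0}.
winner, count = recover_state([{0, 3}, {0}, {1, 2, 3}, {0}], 4)
print(winner, count)  # 0 [3, 1, 1, 2]

# Earlier Byzantine example: A = {t0,t3}, B = {t0} (lying), M1 = {t3}, M2 = {t3}.
print(recover_state([{0, 3}, {0}, {3}, {3}], 4))  # (3, [2, 0, 0, 3])
```

If two states tie for the maximum, `count.index` returns the lower index; when the fault bounds of Theorems 1 and 2 hold, the correct state is the unique maximum.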
Experimental Results
Original machines | f (faults) | State space for replication | State space for fusion
MESI, Counter A and B, Shift register | 2 | 7,569 | 1,521
Even and Odd Parity Checkers, Toggle Switch, Pattern Generator, MESI | 3 | 262,144 | 32,768
Counters A and B, Divider, Machine A, Machine B | 2 | 6,724 | 504
Pattern Generator, TCP, Machine A, Machine B | 2 | 3,136 | 2,464
Conclusion/Future Work
It is not always necessary to have n·f backups to tolerate f faults
Polynomial-time algorithm to generate the smallest minimal set that tolerates f faults
Implementation of this algorithm shows that many complex state machines have efficient fusions
Will machines outside the lattice give better results?
Backup machines need to be given all events; can we do better?