Hein Meling, CANOE Workshop, Toronto, August 2010
Self-repairing Replicated Systems and Dependability Evaluation
CANOE Workshop, Toronto, August 27, 2010
Hein Meling
Department of Electrical Engineering and Computer Science, University of Stavanger, Norway
Context - Multiple Data Centers
(Figure: clients connected over a wide area network to Site X, with nodes X1 and X2 hosting Services A and B, and Site Y, with nodes Y1 and Y2 hosting Service C.)
Context - Failures will occur
(Figure: the same two-site deployment, now with a network partition separating Site X from Site Y, cutting clients off from some services.)
Common Solution is Redundancy
(Figure: each service, A, B, and C, is replicated across nodes at both sites, so every service remains reachable despite a node failure or partition.)
Middleware for Fault Tolerance
- It is difficult to support fault tolerance
  - Tolerate object, node, and network failures
- Techniques
  - Redundancy
  - Masking failures (failover)
- Reuse fault tolerance mechanisms
  - Use a group communication system (e.g., Jgroup or Spread)
  - Focus on development issues
Group Communication
(Figure: clients interact with a group of servers S1, S2, S3, which appears to them as a single logical unit.)
The Group Membership Service
(Figure: timeline for servers S1, S2, S3 showing singleton views, the first full view, partitioning, and merging. When S3 crashes, a new view is installed.)
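The view installation behavior above can be sketched as a toy model. This is an illustrative Python sketch only; the `MembershipService` class and its method names are hypothetical stand-ins, not the Jgroup or Spread API:

```python
class MembershipService:
    """Toy group membership model: a new view is installed on every join or crash."""

    def __init__(self):
        self.view_id = 0
        self.members = set()

    def _install(self):
        # Each membership change produces a new, numbered view.
        self.view_id += 1
        return (self.view_id, frozenset(self.members))

    def join(self, server):
        self.members.add(server)
        return self._install()

    def crash(self, server):
        # The failure detector removes the crashed server and a new
        # view is installed at the surviving members.
        self.members.discard(server)
        return self._install()

gms = MembershipService()
gms.join("S1"); gms.join("S2")
vid, view = gms.join("S3")      # first full view: {S1, S2, S3}
vid2, view2 = gms.crash("S3")   # S3 crashes; a new view is installed
```

A real membership service must additionally agree on views among the surviving members; this sketch only models the view sequence a single member observes.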
Middleware for Fault Treatment
- Further improve the system's dependability characteristics
  - Consider deployment and operational aspects
- Autonomous Fault Treatment
  - Recovery from node, object, and network failures
  - Not just tolerate faults; repair them as well
  - Without human intervention
  - Let groups be self-healing (deal with their own internal failures)
- Goal: minimize the time spent in a state of reduced failure resilience
Evaluation Techniques
- Trivial: performance evaluation of the repair mechanism
  - For a single failure injection
- But more interesting:
  - Can we find a way to quantify/predict the improvement in availability by running experiments?
  - (Without running them for many years to get the exact numbers.)
Moving to large-scale (Cloud)
- Assume now that the number of services to deploy becomes very large
  - We need to find placements for the services to avoid bottlenecks
  - Multiple conflicting requirements/goals for these services
  - Placement is a multi-criteria optimization problem
- Placement becomes NP-hard
  - Centralized optimization techniques fall short quickly
- Also, even if it were possible to compute the optimal placement, would it still be valid when we are ready to deploy/reconfigure?
- Distributed heuristic to compute near-optimal placements
  - Based on a technique called the Cross-Entropy Ant System
Outline
- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks
Related work: Virtualization
(Figures, three slides: a virtualized deployment of db1-db3, web1-web4, DNS, and LDAP over shared Storage. On host failure, failover = reboot/start of the virtual machine elsewhere; the shared storage is a single point of failure (SPOF).)
Assumptions
- Pool of processors to host applications
- Replicated stateful applications (wide area network)
- Shared-nothing architecture
  - Neither disk nor main memory is shared by processes
  - Avoid distributed file systems
  - State of the application must be transmitted across the network
Related work: Centralized Recovery Decisions
- AQuA
  - The leader of a group affected by a failure joins the centralized dependability manager to report the failure
- FT CORBA
- Jgroup/ARM
  - Failures are reported to a centralized replication manager
ARM Overview
(Figure: clients invoke replicated servers A, B, and C across nodes in the network; each node hosts a factory, and the ARM replication manager and a dependable registry oversee the groups.)
ARM Architecture
(Figure: a replicated ReplicationManager exposes createGroup(), removeGroup(), updateGroup(), subscribe(), and unsubscribe() to a management client (GUI), receives management events and notify() callbacks, and pings the factories. Each node runs JVMs with protocol modules, a supervision module, and a Factory offering createReplica(), removeReplica(), and queryReplicas() to place replicas such as SA(1), SA(2), and SB(1).)
Failure Monitoring
(Figure: the ReplicationManager leaders receive notify(NodePresence), notify(ReplicaFailure), and notify(ViewChange) messages, plus periodic, event-driven notify(IamAlive) messages, from the SupervisionModule of each replica (SA(1), SB(1)); the ReplicationManager also issues ping() to the Factory in each node's JVM.)
Crash Failure and Recovery
(Figure: N1 crashes; the group leader among N2-N4 reports the failure to the RM via notify(ViewChange); the RM issues createReplica() on spare node N4; the replacement replica joins, and a final notify(ViewChange) confirms recovery.)
Why go distributed?
- Less infrastructure, less complexity
- No need to maintain a consistent, replicated (centralized) database of deployed groups
- Less communication overhead
DARM Overview
(Figure: clients invoke replicated groups A, B, and C across nodes in the network; each group has a group leader and the node factories have a factory leader, with no central manager.)
Spread communication
(Figure: on each of nodes A and B, a Spread client links against libspread and talks to a local Spread daemon; the daemons communicate across the network.)
DARM Components
(Figure: each node runs a Spread daemon, a DARM factory, and a DARM client, both of the latter linking libdarm on top of libspread.)
The Factory Group
- Used to install replicas of a given service
- Keeps track of
  - Node availability
  - Local load of nodes
- Interacts with the DARM library to install replacement replicas
- Does not maintain any state about deployed replicas
  - In case of failure: just restart the factory to host new replicas
Factory group installs replacement replicas
(Figure: libdarm on a surviving group member calls createReplica() on the factory leader (Node 1), which forwards createReplicaOnNode() to the factory on a chosen node among Nodes 2-6, where the replacement replica is started.)
Replica Placement Policy
- Purpose: describe how replicas should be allocated onto the set of available sites and nodes
  1. Find the site with the least number of replicas of the given type
  2. Find the node in the candidate site with the least load, ignoring nodes already running the service
- Objectives of this policy:
  - Ensure available replicas in each likely partition that may arise
  - Avoid collocating two replicas of the same service on the same node
  - Disperse replicas evenly over the available sites
  - Select the least loaded nodes in each site (the same node may host multiple distinct service types)
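The two-step placement rule above can be sketched in a few lines. This is an illustrative Python sketch; the data layout (a dict of sites mapping to node records with `load` and `services` fields) is an assumption made for the example, not DARM's internal representation:

```python
def place_replica(sites, service):
    """Pick (site, node) for a new replica of `service` per the two-step policy."""
    # Step 1: the site with the fewest replicas of this service type.
    site = min(sites, key=lambda s: sum(service in n["services"] for n in sites[s]))
    # Step 2: the least-loaded node in that site, ignoring nodes that
    # already run the service (no collocation of same-service replicas).
    candidates = [n for n in sites[site] if service not in n["services"]]
    if not candidates:
        return None  # every node in the chosen site already hosts the service
    node = min(candidates, key=lambda n: n["load"])
    node["services"].add(service)
    return site, node["name"]

sites = {
    "x": [{"name": "x1", "load": 0.2, "services": {"A"}},
          {"name": "x2", "load": 0.1, "services": set()}],
    "y": [{"name": "y1", "load": 0.5, "services": set()},
          {"name": "y2", "load": 0.0, "services": set()}],
}
first = place_replica(sites, "A")   # site y has no replica of A yet
second = place_replica(sites, "A")  # sites now tie; x2 is least loaded without A
```

Note how the rule naturally disperses replicas across sites first, and only then balances load within a site.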
Fault Treatment Policy
- KeepMinimalInPartition: maintain a minimal redundancy level in each partition
- RemovePolicy: remove excessive replicas
  - Replicas no longer needed to satisfy the fault treatment policy
- KeepMinimalInPrimaryPartition: maintain a minimal redundancy level in the primary partition only
- RedundancyFollowsLoad: increase redundancy in the loaded part of the network
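A KeepMinimalInPartition check might look roughly like this. It is a sketch under assumptions: the leader-election rule (lowest-ranked member of the view acts) and the `create_replica` callback are illustrative stand-ins, not the DARM API:

```python
def on_view_change(view, min_replicas, my_id, create_replica):
    """Sketch of KeepMinimalInPartition: after a view change, the leader of
    the current partition requests replacements until the minimal
    redundancy level is restored. Returns the number requested."""
    if my_id != min(view):        # assumed rule: lowest-ranked member leads
        return 0
    missing = max(0, min_replicas - len(view))
    for _ in range(missing):
        create_replica()          # e.g., forwarded to the factory leader
    return missing

# A partition leaves {n2, n3} alive with a minimum of 3 replicas required:
created = []
n = on_view_change({"n2", "n3"}, 3, "n2", lambda: created.append("replica"))
```

Because every partition runs the same check on its own view, each partition independently restores its redundancy level, which is exactly what distinguishes this policy from KeepMinimalInPrimaryPartition.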
Crash failure-recovery behavior
(Figure: timeline of views Vi for N1-N4. After a crash, the group leader marks fault treatment pending, calls createReplica(), and the replacement joins on N4 in the final view.)
Failure-recovery with network partitioning and merging
(Figure: timeline of views Vi for N1-N4 through a partitioning and a subsequent merge. In each partition, the leader with fault treatment pending calls createReplica(); after merging, the excess replicas leave the group.)
The DARM Library
- libdarm wraps around libspread and intercepts:
  - Connection requests to the daemon
    - To verify and finalize the runtime configuration of DARM
    - To join the DARM private group of the associated application
  - Message receives - SP_receive()
    - If the message belongs to the DARM private group, pass it to DARM
    - Otherwise, pass the message to the application
    - Call SP_receive() again, to avoid returning control to the application without a message
- libdarm also provides functions to set:
  - The minimum and maximum number of replicas for the group
  - The recovery and remove delays for the group
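The SP_receive() interception logic above amounts to a demultiplexing loop. The real libdarm wraps the C Spread API; the sketch below is an illustrative Python model in which `sp_receive`, `DARM_GROUP`, and `on_darm_message` are hypothetical stand-ins:

```python
DARM_GROUP = "#darm#app"  # stand-in name for the DARM private group

def receive(sp_receive, on_darm_message):
    """Loop until an application message arrives; DARM private-group
    traffic is diverted to DARM and the receive is retried, so control
    never returns to the application without a message."""
    while True:
        group, message = sp_receive()   # blocking receive from the daemon
        if group == DARM_GROUP:
            on_darm_message(message)    # handled internally by DARM
            continue
        return message

# Usage with a canned message source standing in for the Spread daemon:
inbox = iter([(DARM_GROUP, "view-change"), ("app", "payload")])
darm_msgs = []
msg = receive(lambda: next(inbox), darm_msgs.append)
```

The point of the loop is transparency: the application calls one receive function and only ever sees its own messages, while DARM consumes its private-group traffic in between.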
The DARM Library (continued)
- Membership messages for the DARM private group
  - Used to decide whether fault treatment is needed
- Bootstrapping applications:
  - Only a single instance of an application needs to be started
  - Assuming the application is configured with some minimum number of replicas, DARM will install the required number of replicas
Target system
(Figure: three sites x, y, and z with three nodes each (x1-x3, y1-y3, z1-z3), every node running a factory and hosting replicas E1-E6; the legend distinguishes replicas from replacement replicas. A fault injector issues Inject(xy | z) over the network to partition sites x and y from site z.)
Network Partition/Merge Experiments
- Want to determine:
  - The single-partition recovery durations
  - The corresponding merge of partitions (and removal of excessive replicas)
Fast Spread; partition with 2 live replicas
(Figure: density estimates for detection and recovery times after a partition with 2 live replicas, 1 added (N=194). Time since partition injected, in seconds: partition detection (μ=0.9, σ=0.261), replica create (μ=2.9, σ=0.209), final view (μ=3, σ=0.304).)
Fast Spread; partition with 1 live replica
(Figure: density estimates for detection and recovery times after a partition with 1 live replica, 2 added (N=136). Time since partition injected, in seconds: partition detection (μ=0.9, σ=0.284), replica create (μ=2.9, σ=0.288), final view (μ=5, σ=0.273).)
Fast Spread; merge, removing 2 replicas
(Figure: density estimates for detection and remove times after a network merge (N=600). Time since merge injected, in seconds: merge detection (μ=2, σ=0.226), replica remove (μ=4.1, σ=0.23), merged view (μ=6.1, σ=0.22).)
Objective of Evaluation
- Provide estimates for dependability attributes:
  - Unavailability
  - System failure intensity
  - Down time
Predicting Dependability Attributes
- Use stratified sampling
- A series of lab experiments is performed
  - One or more fault injections in each experiment (all faults manifest themselves as crash failures)
  - Injected according to a homogeneous Poisson process
- Strata := the number of near-coincident failure events
  - A posteriori stratification: experiments are allocated to strata after experiment completion
  - Three strata: single, double, and triple failures
Predicting Dependability Attributes (continued)
- Offline a posteriori analysis
  - Events are recorded during experiments
  - Used to construct a single global timeline of events
  - Compute trajectories on a predefined state machine
- The analysis provides strata classification and various statistics
- The statistical measures are used as input to estimators for dependability attributes:
  - Unavailability
  - System failure intensity
  - Down time
Target System Illustrated
Target System - State Machine
- Failure-recovery behavior of a service
  - Modeled as a state machine (next slide)
  - Events are as seen by the service replicas
- The state machine is only used a posteriori
  - To compute statistics of the experiment (not to control fault injections)
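The a posteriori computation over a trajectory reduces to summing the time spent in DOWN states on the event timeline. A minimal sketch, assuming trajectories are time-sorted lists of (timestamp, state) events (the concrete event format is an assumption for illustration):

```python
UP, DOWN = "up", "down"

def down_time(trajectory, t_end):
    """Total time spent in DOWN states for one recorded trajectory.
    `trajectory` is a time-sorted list of (timestamp, state) events;
    `t_end` closes the final interval."""
    total = 0.0
    events = trajectory + [(t_end, None)]
    for (t, state), (t_next, _) in zip(events, events[1:]):
        if state == DOWN:
            total += t_next - t
    return total

# A failure at t=2 with recovery at t=5 contributes 3 time units of down time:
d = down_time([(0.0, UP), (2.0, DOWN), (5.0, UP)], 10.0)
```

Per-trajectory down times like this feed directly into the unavailability and down-time estimators on the following slides.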
Partial State Machine
- Fault injection can occur in all states
  - Causes different trajectories in the state machine
- Circular states: UP
- Squared states: DOWN
Measurement Approach: Timeline of Events
- Place multiple processor failures close together on the timeline
- Examine system behavior under such rare events (determine the rate at which they cause system failure)
- Use these results to compute system unavailability (given the MTBF of a single processor)
The Failure Trajectory
The Failure Trajectory (continued)
- Characteristics obtainable from the failure trajectory:
  - Unavailability
    - Down time for trajectory i
    - Unavailability
  - Probability of failure (reliability)
    - (formulas in the paper)
Experimental Strategy
- Consider multiple near-coincident failures
- Classify experiments into strata Sk
  - If k failure events occurred in the trajectory
- Each stratum is sampled separately
- From the samples collected for each stratum, we can obtain statistics for the system in that stratum
  - E.g., the expected duration of a stratum Sk trajectory
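Combining the per-stratum statistics follows the standard stratified-sampling form E[T] = Σk πk E[T | Sk], where πk is the probability of a trajectory reaching stratum Sk (defined on the Estimators slide). The numbers below are toy values, not the measured data:

```python
def stratum_mean(samples):
    """Sample mean of the recorded durations within one stratum."""
    return sum(samples) / len(samples)

def stratified_expectation(strata, pi):
    """Standard stratified-sampling estimator:
    E[T] = sum over k of pi_k * E[T | S_k],
    where `strata` maps k -> duration samples and `pi` maps k -> pi_k."""
    return sum(pi[k] * stratum_mean(samples) for k, samples in strata.items())

# Toy inputs: durations in seconds per stratum, and assumed pi_k weights.
strata = {1: [2.5, 2.7], 2: [4.0, 4.2], 3: [6.0]}
pi = {1: 0.997, 2: 2.9e-3, 3: 1e-4}
est = stratified_expectation(strata, pi)
```

The leverage of the technique is that the rare strata (π2, π3 are tiny) are sampled heavily in the lab, while their small probabilities downweight them in the combined estimate.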
Sampling Scheme
Estimators
- In real systems, the failure intensity λ is very low; i.e., λ⁻¹ >> Tmax
- πk = probability of a trajectory reaching stratum Sk
- Unconditional probability of a sample in
  - Stratum S2
  - Stratum S3
  - (formulas in the paper)
Experimental Results
- Fault injections performed on the target system according to the sampling scheme
- 3000 (lab) experiments performed
  - Aiming for 1000 in each stratum
- An experiment is classified as stratum Sk if exactly k failures occur before its completion
ARM (top) / DARM (bottom)

Table 5.2: Statistics from recovery times (in milliseconds)
Classification  Count  θk = E(T|Sk)  sd = √σk   θk 95% conf. int.    Highest  Lowest
Strata1         2265   2569.22        478.23    (1631.89, 3506.55)   16659    1742
Strata2          591   4158.83       1039.10    (2122.18, 6195.47)   12869    2496
Strata3          110   5966.58       1550.90    (2926.82, 9006.35)   16086    3046

The experiment should be rerun with adjusted parameters: a lower Tmax and a reaction time set to 3 seconds. This would, however, differentiate it from [19], and the two would not be comparable. (A rerun of the experiment is listed as future work.)

Notice that the iteration count does not equal the corresponding strata distribution. This is, for the most part, due to iterations that do not successfully recover. For Strata1, 2 iterations are erroneous. One (the only such case in all 3000 iterations) fails to initialize properly due to a bug in which multiple replicas of a service end up on one node, defying the rule that a service should not be replicated twice on the same node. This bug does not cause the service to fall into any of the unavailable states, and it would probably resolve itself over time; yet the iteration is classified as a failure for all services. The second 1-fault error is due to improper initialization, where only a subset of the available nodes is used for the initial deployment; the latter iteration is rejected as an iteration with no fault occurrences. For Strata2 there are 15 occurrences of the U0 state, all caused by two fault injections occurring within short intervals, [0.1, 2.1] seconds, on the service with only two replicas; the short injection interval leaves no room for recovery, and this service reaches U0. For Strata3 there is one occurrence of the same bug as for Strata1, which is classified as leaving all services in U0 even though it actually leaves the services available, though not correctly. Strata3 also has two occurrences of U0 for the MS, with faults injected over intervals totaling 1 and 0.9 seconds across all three of its replicas, and 14 occurrences of U0 for the two-replica service, with intervals ranging over [0.5, 2.2] seconds. Note that the second service running three replicas (not the MS) never experiences the U0 state in these experiments.

5.1 Probability Density
First, the log files have been used to generate probability density graphs. For each iteration, the time from the first fault injection until the first full recovery is collected and stored in a set, per stratum. A full recovery is when all included services reach their specified minimum replication level (state A0).
Experimental Results (continued)
- 19 experiments (0.63%) were classified as inadequate
  - 16 experiments failed to recover
  - 3 experiments experienced additional, unintended failures
  - Of the 16: two were in S1, 6 in S2, and 11 in S3
  - These 16 are due to deficiencies in Jgroup/ARM
- The inadequate runs are accounted for as trajectories visiting a down state for 5 minutes (typically a reboot)
- For DARM there were 2 inadequate experiments
Prob. Density Function
(Figure: density estimate of Jgroup/ARM crash recovery times, plotted over 8000-26000 ms since the injection of the first crash failure. Single node failure (N=1781, BW=126.9); two nearly coincident node failures (N=793, BW=652.9); three nearly coincident node failures (N=407, BW=637.2). Maximum density 0.00116.)
Prob. Density S2 (DARM)

Compared to Jgroup/ARM, DARM generally achieves a full recovery almost 5 seconds faster than its predecessor.

Figure 5.2: Duration of the failure trajectory in Strata2 (probability density over recovery durations of 0-15 seconds).

Figure 5.2 presents the probability density graph for Strata2. The expectancy is more spread out compared to Figure 5.1 for Strata1. This is expected, as the trajectories for Strata2 are more variable, given that the two fault injections occur at different time intervals. The highest values of the Strata2 recoveries are 10.439 and 12.869 seconds; both have been checked manually against the logs on suspicion of the same cause of increased recovery times as observed for Strata1. However, the explanation for these values seems to be that the interval between failures was close to the maximum of what DARM "allows" without recovery taking place in between. The mean value of a Strata2 recovery is 4.158 seconds.

The highest value observed for a Strata2 recovery in DARM is only 0.1 seconds above the mean value of a Strata2 recovery in Jgroup/ARM. Again, the performance of DARM is shown to be better than that of the Jgroup/ARM framework; the variance, however, is roughly the same for both frameworks.

Figure 5.3 presents the probability density graph for Strata3. The graph is somewhat misleading, since its lower bound covers areas below its lowest observed value.
Applying the Equations

Table 5.3: Computed probabilities, unavailability metric, and the system MTBF.
Processor mean time between failure (pmtbf = λ⁻¹), in days:
       Experiment Recovery Period       Processor Recovery (5 min.)    Manual Processor Recovery (2 hrs.)
       100             200              100             200            100             200
π1     0.9999979184    0.9999989592     0.9997568889    0.9998784583   0.9941238281    0.9970726237
π2     2.0815438e-6    1.0407719e-6     2.4305555e-4    1.2152777e-4   5.8333333e-3    2.9166666e-3
π3     4.0903937e-12   1.0225984e-12    5.5447048e-8    1.3861762e-8   4.2838541e-5    1.0709635e-5
U      4.1317108e-17   5.1646385e-18    2.7771024e-4    1.3887200e-4   6.6274921e-3    6.6471508e-3
Λ⁻¹    212 yrs         851 yrs          -               -              -               -

Table 5.3 lists the calculations done for the fixed but comparable system. The two bad runs, one from Strata1 and one from Strata3, along with two occurrences of U0 for the MS, form the foundation for E(Y^d) and E(Y^f), presented in Equations 4.6 and 4.7, respectively.

The MTBF of 212 and 851 years indicates that DARM has become more stable than its predecessor. It also indicates that DARM should be considered a highly reliable platform for service replication, with much potential.
Concluding Remarks
- DARM supports autonomous fault treatment
  - Recovery decisions are distributed to the individual groups
  - In previous systems, recovery decisions were centralized: complex and error-prone
- DARM has been released as open source at darm.ux.uis.no
- We are performing more advanced measurements
  - Client-perceived availability
  - Longer executions, and with other parameters, to get statistically significant results
- Experimental results indicate that self-repairing systems can obtain very high availability and MTBF
- Automated fault injection tool
  - Proved very useful for uncovering a number of subtle bugs
  - Allows systematic stress and regression testing
Open Issues
- Handling full group failures
  - ARM has a centralized component to monitor all groups
  - DARM only monitors a group from within itself
  - Could let the factory handle this in some way (lease/renew or simple pinging)
- Management tasks to simplify deployment of applications
- Self-configuration: reconfiguration of the set of nodes that can host replicas
- Express policies in terms of equations
- Implement more policies
Group Failure Handling
(Figure: replicas on N1 and N2 send periodic notify(IamAlive) to the RM; both crash, no further IamAlive arrives, and a timeout at the RM triggers createReplica(); new replicas join on spare nodes, resuming notify(ViewChange) and notify(IamAlive).)
Thanks!
References
[1] Hein Meling, Alberto Montresor, Bjarne E. Helvik, and Ozalp Babaoglu. Jgroup/ARM: A Distributed Object Group Platform with Autonomous Replication Management. Software: Practice and Experience, 38(9):885-923, July 2008.
[2] Hein Meling and Joakim L. Gilje. A Distributed Approach to Autonomous Fault Treatment in Spread. In Proceedings of the 7th European Dependable Computing Conference (EDCC). IEEE Computer Society, May 2008.
[3] Bjarne E. Helvik, Hein Meling, and Alberto Montresor. An Approach to Experimentally Obtain Service Dependability Characteristics of the Jgroup/ARM System. In Proceedings of the 5th European Dependable Computing Conference (EDCC), volume 3463 of Lecture Notes in Computer Science, pages 179-198. Springer-Verlag, April 2005.
[4] Hein Meling. Adaptive Middleware Support and Autonomous Fault Treatment: Architectural Design, Prototyping and Experimental Evaluation. PhD thesis, Norwegian University of Science and Technology, Department of Telematics, May 2006.