Hein Meling, CANOE Workshop, Toronto, August 2010
Self-repairing Replicated Systems and Dependability Evaluation
CANOE Workshop, Toronto, August 27, 2010
Hein Meling
Department of Electrical Engineering and Computer Science, University of Stavanger, Norway
Context - Multiple Data Centers
(Figure: clients connected over a wide area network to Site X, with nodes X1 and X2 hosting Services A and B, and Site Y, with nodes Y1 and Y2 hosting Service C.)
Context - Failures will occur
(Figure: the same two-site deployment, now with a network partition separating Site X from Site Y, cutting clients off from some services.)
Common Solution is Redundancy
(Figure: each service, A, B, and C, is replicated across nodes at both sites, so every service remains reachable despite a node failure or partition.)
Middleware for Fault Tolerance
- It is difficult to support fault tolerance
  - Tolerate object, node, and network failures
- Techniques
  - Redundancy
  - Masking failures (failover)
- Reuse fault tolerance mechanisms
  - Use a group communication system (e.g., Jgroup or Spread)
  - Focus on development issues
Group Communication
(Figure: clients interact with a group of servers S1, S2, S3, which appears to them as a single logical unit.)
The Group Membership Service
(Figure: timeline for servers S1, S2, S3 showing singleton views, the first full view, partitioning, and merging. When S3 crashes, a new view is installed.)
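The view installation behavior above can be sketched as a toy model. This is an illustrative Python sketch only; the `MembershipService` class and its method names are hypothetical stand-ins, not the Jgroup or Spread API:

```python
class MembershipService:
    """Toy group membership model: a new view is installed on every join or crash."""

    def __init__(self):
        self.view_id = 0
        self.members = set()

    def _install(self):
        # Each membership change produces a new, numbered view.
        self.view_id += 1
        return (self.view_id, frozenset(self.members))

    def join(self, server):
        self.members.add(server)
        return self._install()

    def crash(self, server):
        # The failure detector removes the crashed server and a new
        # view is installed at the surviving members.
        self.members.discard(server)
        return self._install()

gms = MembershipService()
gms.join("S1"); gms.join("S2")
vid, view = gms.join("S3")      # first full view: {S1, S2, S3}
vid2, view2 = gms.crash("S3")   # S3 crashes; a new view is installed
```

A real membership service must additionally agree on views among the surviving members; this sketch only models the view sequence a single member observes.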
Middleware for Fault Treatment
- Further improve the system's dependability characteristics
  - Consider deployment and operational aspects
- Autonomous Fault Treatment
  - Recovery from node, object, and network failures
  - Not just tolerate faults; repair them as well
  - Without human intervention
  - Let groups be self-healing (deal with their own internal failures)
- Goal: minimize the time spent in a state of reduced failure resilience
Evaluation Techniques
- Trivial: performance evaluation of the repair mechanism
  - For a single failure injection
- But more interesting:
  - Can we find a way to quantify/predict the improvement in availability by running experiments?
  - (Without running them for many years to get the exact numbers.)
Moving to large-scale (Cloud)
- Assume now that the number of services to deploy becomes very large
  - We need to find placements for the services to avoid bottlenecks
  - Multiple conflicting requirements/goals for these services
  - Placement is a multi-criteria optimization problem
- Placement becomes NP-hard
  - Centralized optimization techniques fall short quickly
- Also, even if it were possible to compute the optimal placement, would it still be valid when we are ready to deploy/reconfigure?
- Distributed heuristic to compute near-optimal placements
  - Based on a technique called the Cross-Entropy Ant System
Outline
- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks
Related work: Virtualization
(Figures, three slides: a virtualized deployment of db1-db3, web1-web4, DNS, and LDAP over shared Storage. On host failure, failover = reboot/start of the virtual machine elsewhere; the shared storage is a single point of failure (SPOF).)
Assumptions
- Pool of processors to host applications
- Replicated stateful applications (wide area network)
- Shared-nothing architecture
  - Neither disk nor main memory is shared by processes
  - Avoid distributed file systems
  - State of the application must be transmitted across the network
Related work: Centralized Recovery Decisions
- AQuA
  - The leader of a group affected by a failure joins the centralized dependability manager to report the failure
- FT CORBA
- Jgroup/ARM
  - Failures are reported to a centralized replication manager
ARM Overview
(Figure: clients invoke replicated servers A, B, and C across nodes in the network; each node hosts a factory, and the ARM replication manager and a dependable registry oversee the groups.)
ARM Architecture
(Figure: a replicated ReplicationManager exposes createGroup(), removeGroup(), updateGroup(), subscribe(), and unsubscribe() to a management client (GUI), receives management events and notify() callbacks, and pings the factories. Each node runs JVMs with protocol modules, a supervision module, and a Factory offering createReplica(), removeReplica(), and queryReplicas() to place replicas such as SA(1), SA(2), and SB(1).)
Failure Monitoring
(Figure: the ReplicationManager leaders receive notify(NodePresence), notify(ReplicaFailure), and notify(ViewChange) messages, plus periodic, event-driven notify(IamAlive) messages, from the SupervisionModule of each replica (SA(1), SB(1)); the ReplicationManager also issues ping() to the Factory in each node's JVM.)
Crash Failure and Recovery
(Figure: N1 crashes; the group leader among N2-N4 reports the failure to the RM via notify(ViewChange); the RM issues createReplica() on spare node N4; the replacement replica joins, and a final notify(ViewChange) confirms recovery.)
Why go distributed?
- Less infrastructure, less complexity
- No need to maintain a consistent, replicated (centralized) database of deployed groups
- Less communication overhead
DARM Overview
(Figure: clients invoke replicated groups A, B, and C across nodes in the network; each group has a group leader and the node factories have a factory leader, with no central manager.)
Spread communication
(Figure: on each of nodes A and B, a Spread client links against libspread and talks to a local Spread daemon; the daemons communicate across the network.)
DARM Components
(Figure: each node runs a Spread daemon, a DARM factory, and a DARM client, both of the latter linking libdarm on top of libspread.)
The Factory Group
- Used to install replicas of a given service
- Keeps track of
  - Node availability
  - Local load of nodes
- Interacts with the DARM library to install replacement replicas
- Does not maintain any state about deployed replicas
  - In case of failure: just restart the factory to host new replicas
Factory group installs replacement replicas
(Figure: libdarm on a surviving group member calls createReplica() on the factory leader (Node 1), which forwards createReplicaOnNode() to the factory on a chosen node among Nodes 2-6, where the replacement replica is started.)
Replica Placement Policy
- Purpose: describe how replicas should be allocated onto the set of available sites and nodes
  1. Find the site with the least number of replicas of the given type
  2. Find the node in the candidate site with the least load, ignoring nodes already running the service
- Objectives of this policy:
  - Ensure available replicas in each likely partition that may arise
  - Avoid collocating two replicas of the same service on the same node
  - Disperse replicas evenly over the available sites
  - Select the least loaded nodes in each site (the same node may host multiple distinct service types)
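The two-step placement rule above can be sketched in a few lines. This is an illustrative Python sketch; the data layout (a dict of sites mapping to node records with `load` and `services` fields) is an assumption made for the example, not DARM's internal representation:

```python
def place_replica(sites, service):
    """Pick (site, node) for a new replica of `service` per the two-step policy."""
    # Step 1: the site with the fewest replicas of this service type.
    site = min(sites, key=lambda s: sum(service in n["services"] for n in sites[s]))
    # Step 2: the least-loaded node in that site, ignoring nodes that
    # already run the service (no collocation of same-service replicas).
    candidates = [n for n in sites[site] if service not in n["services"]]
    if not candidates:
        return None  # every node in the chosen site already hosts the service
    node = min(candidates, key=lambda n: n["load"])
    node["services"].add(service)
    return site, node["name"]

sites = {
    "x": [{"name": "x1", "load": 0.2, "services": {"A"}},
          {"name": "x2", "load": 0.1, "services": set()}],
    "y": [{"name": "y1", "load": 0.5, "services": set()},
          {"name": "y2", "load": 0.0, "services": set()}],
}
first = place_replica(sites, "A")   # site y has no replica of A yet
second = place_replica(sites, "A")  # sites now tie; x2 is least loaded without A
```

Note how the rule naturally disperses replicas across sites first, and only then balances load within a site.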
Fault Treatment Policy
- KeepMinimalInPartition: maintain a minimal redundancy level in each partition
- RemovePolicy: remove excessive replicas
  - Replicas no longer needed to satisfy the fault treatment policy
- KeepMinimalInPrimaryPartition: maintain a minimal redundancy level in the primary partition only
- RedundancyFollowsLoad: increase redundancy in the loaded part of the network
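A KeepMinimalInPartition check might look roughly like this. It is a sketch under assumptions: the leader-election rule (lowest-ranked member of the view acts) and the `create_replica` callback are illustrative stand-ins, not the DARM API:

```python
def on_view_change(view, min_replicas, my_id, create_replica):
    """Sketch of KeepMinimalInPartition: after a view change, the leader of
    the current partition requests replacements until the minimal
    redundancy level is restored. Returns the number requested."""
    if my_id != min(view):        # assumed rule: lowest-ranked member leads
        return 0
    missing = max(0, min_replicas - len(view))
    for _ in range(missing):
        create_replica()          # e.g., forwarded to the factory leader
    return missing

# A partition leaves {n2, n3} alive with a minimum of 3 replicas required:
created = []
n = on_view_change({"n2", "n3"}, 3, "n2", lambda: created.append("replica"))
```

Because every partition runs the same check on its own view, each partition independently restores its redundancy level, which is exactly what distinguishes this policy from KeepMinimalInPrimaryPartition.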
Crash failure-recovery behavior
(Figure: timeline of views Vi for N1-N4. After a crash, the group leader marks fault treatment pending, calls createReplica(), and the replacement joins on N4 in the final view.)
Failure-recovery with network partitioning and merging
(Figure: timeline of views Vi for N1-N4 through a partitioning and a subsequent merge. In each partition, the leader with fault treatment pending calls createReplica(); after merging, the excess replicas leave the group.)
The DARM Library
- libdarm wraps around libspread and intercepts:
  - Connection requests to the daemon
    - To verify and finalize the runtime configuration of DARM
    - To join the DARM private group of the associated application
  - Message receives - SP_receive()
    - If the message belongs to the DARM private group, pass it to DARM
    - Otherwise, pass the message to the application
    - Call SP_receive() again, to avoid returning control to the application without a message
- libdarm also provides functions to set:
  - The minimum and maximum number of replicas for the group
  - The recovery and remove delays for the group
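The SP_receive() interception logic above amounts to a demultiplexing loop. The real libdarm wraps the C Spread API; the sketch below is an illustrative Python model in which `sp_receive`, `DARM_GROUP`, and `on_darm_message` are hypothetical stand-ins:

```python
DARM_GROUP = "#darm#app"  # stand-in name for the DARM private group

def receive(sp_receive, on_darm_message):
    """Loop until an application message arrives; DARM private-group
    traffic is diverted to DARM and the receive is retried, so control
    never returns to the application without a message."""
    while True:
        group, message = sp_receive()   # blocking receive from the daemon
        if group == DARM_GROUP:
            on_darm_message(message)    # handled internally by DARM
            continue
        return message

# Usage with a canned message source standing in for the Spread daemon:
inbox = iter([(DARM_GROUP, "view-change"), ("app", "payload")])
darm_msgs = []
msg = receive(lambda: next(inbox), darm_msgs.append)
```

The point of the loop is transparency: the application calls one receive function and only ever sees its own messages, while DARM consumes its private-group traffic in between.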
The DARM Library (continued)
- Membership messages for the DARM private group
  - Used to decide whether fault treatment is needed
- Bootstrapping applications:
  - Only a single instance of an application needs to be started
  - Assuming the application is configured with some minimum number of replicas, DARM will install the required number of replicas
Target system
(Figure: three sites x, y, and z with three nodes each (x1-x3, y1-y3, z1-z3), every node running a factory and hosting replicas E1-E6; the legend distinguishes replicas from replacement replicas. A fault injector issues Inject(xy | z) over the network to partition sites x and y from site z.)
Network Partition/Merge Experiments
- Want to determine:
  - The single-partition recovery durations
  - The corresponding merge of partitions (and removal of excessive replicas)
Fast Spread; partition with 2 live replicas
(Figure: density estimates for detection and recovery times after a partition with 2 live replicas, 1 added (N=194). Time since partition injected, in seconds: partition detection (μ=0.9, σ=0.261), replica create (μ=2.9, σ=0.209), final view (μ=3, σ=0.304).)
Fast Spread; partition with 1 live replica
(Figure: density estimates for detection and recovery times after a partition with 1 live replica, 2 added (N=136). Time since partition injected, in seconds: partition detection (μ=0.9, σ=0.284), replica create (μ=2.9, σ=0.288), final view (μ=5, σ=0.273).)
Fast Spread; merge, removing 2 replicas
(Figure: density estimates for detection and remove times after a network merge (N=600). Time since merge injected, in seconds: merge detection (μ=2, σ=0.226), replica remove (μ=4.1, σ=0.23), merged view (μ=6.1, σ=0.22).)
Objective of Evaluation
- Provide estimates for dependability attributes:
  - Unavailability
  - System failure intensity
  - Down time
Predicting Dependability Attributes
- Use stratified sampling
- A series of lab experiments is performed
  - One or more fault injections in each experiment (all faults manifest themselves as crash failures)
  - Injected according to a homogeneous Poisson process
- Strata := the number of near-coincident failure events
  - A posteriori stratification: experiments are allocated to strata after experiment completion
  - Three strata: single, double, and triple failures
Predicting Dependability Attributes (continued)
- Offline a posteriori analysis
  - Events are recorded during experiments
  - Used to construct a single global timeline of events
  - Compute trajectories on a predefined state machine
- The analysis provides strata classification and various statistics
- The statistical measures are used as input to estimators for dependability attributes:
  - Unavailability
  - System failure intensity
  - Down time
Target System Illustrated
Target System - State Machine
- Failure-recovery behavior of a service
  - Modeled as a state machine (next slide)
  - Events are as seen by the service replicas
- The state machine is only used a posteriori
  - To compute statistics of the experiment (not to control fault injections)
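The a posteriori computation over a trajectory reduces to summing the time spent in DOWN states on the event timeline. A minimal sketch, assuming trajectories are time-sorted lists of (timestamp, state) events (the concrete event format is an assumption for illustration):

```python
UP, DOWN = "up", "down"

def down_time(trajectory, t_end):
    """Total time spent in DOWN states for one recorded trajectory.
    `trajectory` is a time-sorted list of (timestamp, state) events;
    `t_end` closes the final interval."""
    total = 0.0
    events = trajectory + [(t_end, None)]
    for (t, state), (t_next, _) in zip(events, events[1:]):
        if state == DOWN:
            total += t_next - t
    return total

# A failure at t=2 with recovery at t=5 contributes 3 time units of down time:
d = down_time([(0.0, UP), (2.0, DOWN), (5.0, UP)], 10.0)
```

Per-trajectory down times like this feed directly into the unavailability and down-time estimators on the following slides.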
Partial State Machine
- Fault injection can occur in all states
  - Causes different trajectories in the state machine
- Circular states: UP
- Squared states: DOWN
Measurement Approach: Timeline of Events
- Place multiple processor failures close together on the timeline
- Examine system behavior under such rare events (determine the rate at which they cause system failure)
- Use these results to compute system unavailability (given the MTBF of a single processor)
The Failure Trajectory
The Failure Trajectory (continued)
- Characteristics obtainable from the failure trajectory:
  - Unavailability
    - Down time for trajectory i
    - Unavailability
  - Probability of failure (reliability)
    - (formulas in the paper)
Experimental Strategy
- Consider multiple near-coincident failures
- Classify experiments into strata Sk
  - If k failure events occurred in the trajectory
- Each stratum is sampled separately
- From the samples collected for each stratum, we can obtain statistics for the system in that stratum
  - E.g., the expected duration of a stratum Sk trajectory
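Combining the per-stratum statistics follows the standard stratified-sampling form E[T] = Σk πk E[T | Sk], where πk is the probability of a trajectory reaching stratum Sk (defined on the Estimators slide). The numbers below are toy values, not the measured data:

```python
def stratum_mean(samples):
    """Sample mean of the recorded durations within one stratum."""
    return sum(samples) / len(samples)

def stratified_expectation(strata, pi):
    """Standard stratified-sampling estimator:
    E[T] = sum over k of pi_k * E[T | S_k],
    where `strata` maps k -> duration samples and `pi` maps k -> pi_k."""
    return sum(pi[k] * stratum_mean(samples) for k, samples in strata.items())

# Toy inputs: durations in seconds per stratum, and assumed pi_k weights.
strata = {1: [2.5, 2.7], 2: [4.0, 4.2], 3: [6.0]}
pi = {1: 0.997, 2: 2.9e-3, 3: 1e-4}
est = stratified_expectation(strata, pi)
```

The leverage of the technique is that the rare strata (π2, π3 are tiny) are sampled heavily in the lab, while their small probabilities downweight them in the combined estimate.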
Sampling Scheme
Estimators
- In real systems, the failure intensity λ is very low; i.e., λ⁻¹ >> Tmax
- πk = probability of a trajectory reaching stratum Sk
- Unconditional probability of a sample in
  - Stratum S2
  - Stratum S3
  - (formulas in the paper)
Experimental Results
- Fault injections performed on the target system according to the sampling scheme
- 3000 (lab) experiments performed
  - Aiming for 1000 in each stratum
- An experiment is classified as stratum Sk if exactly k failures occur before its completion
ARM (top) / DARM (bottom)

Table 5.2: Statistics from recovery times (in milliseconds)
Classification  Count  θk = E(T|Sk)  sd = √σk   θk 95% conf. int.    Highest  Lowest
Strata1         2265   2569.22        478.23    (1631.89, 3506.55)   16659    1742
Strata2          591   4158.83       1039.10    (2122.18, 6195.47)   12869    2496
Strata3          110   5966.58       1550.90    (2926.82, 9006.35)   16086    3046

The experiment should be rerun with adjusted parameters: a lower Tmax and a reaction time set to 3 seconds. This would, however, differentiate it from [19], and the two would not be comparable. (A rerun of the experiment is listed as future work.)

Notice that the iteration count does not equal the corresponding strata distribution. This is, for the most part, due to iterations that do not successfully recover. For Strata1, 2 iterations are erroneous. One (the only such case in all 3000 iterations) fails to initialize properly due to a bug in which multiple replicas of a service end up on one node, defying the rule that a service should not be replicated twice on the same node. This bug does not cause the service to fall into any of the unavailable states, and it would probably resolve itself over time; yet the iteration is classified as a failure for all services. The second 1-fault error is due to improper initialization, where only a subset of the available nodes is used for the initial deployment; the latter iteration is rejected as an iteration with no fault occurrences. For Strata2 there are 15 occurrences of the U0 state, all caused by two fault injections occurring within short intervals, [0.1, 2.1] seconds, on the service with only two replicas; the short injection interval leaves no room for recovery, and this service reaches U0. For Strata3 there is one occurrence of the same bug as for Strata1, which is classified as leaving all services in U0 even though it actually leaves the services available, though not correctly. Strata3 also has two occurrences of U0 for the MS, with faults injected over intervals totaling 1 and 0.9 seconds across all three of its replicas, and 14 occurrences of U0 for the two-replica service, with intervals ranging over [0.5, 2.2] seconds. Note that the second service running three replicas (not the MS) never experiences the U0 state in these experiments.

5.1 Probability Density
First, the log files have been used to generate probability density graphs. For each iteration, the time from the first fault injection until the first full recovery is collected and stored in a set, per stratum. A full recovery is when all included services reach their specified minimum replication level (state A0).
Experimental Results (continued)
- 19 experiments (0.63%) were classified as inadequate
  - 16 experiments failed to recover
  - 3 experiments experienced additional, unintended failures
  - Of the 16: two were in S1, 6 in S2, and 11 in S3
  - These 16 are due to deficiencies in Jgroup/ARM
- The inadequate runs are accounted for as trajectories visiting a down state for 5 minutes (typically a reboot)
- For DARM there were 2 inadequate experiments
Prob. Density Function
(Figure: density estimate of Jgroup/ARM crash recovery times, plotted over 8000-26000 ms since the injection of the first crash failure. Single node failure (N=1781, BW=126.9); two nearly coincident node failures (N=793, BW=652.9); three nearly coincident node failures (N=407, BW=637.2). Maximum density 0.00116.)
Prob. Density S2 (DARM)

Compared to Jgroup/ARM, DARM generally achieves a full recovery almost 5 seconds faster than its predecessor.

Figure 5.2: Duration of the failure trajectory in Strata2 (probability density over recovery durations of 0-15 seconds).

Figure 5.2 presents the probability density graph for Strata2. The expectancy is more spread out compared to Figure 5.1 for Strata1. This is expected, as the trajectories for Strata2 are more variable, given that the two fault injections occur at different time intervals. The highest values of the Strata2 recoveries are 10.439 and 12.869 seconds; both have been checked manually against the logs on suspicion of the same cause of increased recovery times as observed for Strata1. However, the explanation for these values seems to be that the interval between failures was close to the maximum of what DARM "allows" without recovery taking place in between. The mean value of a Strata2 recovery is 4.158 seconds.

The highest value observed for a Strata2 recovery in DARM is only 0.1 seconds above the mean value of a Strata2 recovery in Jgroup/ARM. Again, the performance of DARM is shown to be better than that of the Jgroup/ARM framework; the variance, however, is roughly the same for both frameworks.

Figure 5.3 presents the probability density graph for Strata3. The graph is somewhat misleading, since its lower bound covers areas below its lowest observed value.
Applying the Equations

Table 5.3: Computed probabilities, unavailability metric, and the system MTBF.
Processor mean time between failure (pmtbf = λ⁻¹), in days:
       Experiment Recovery Period       Processor Recovery (5 min.)    Manual Processor Recovery (2 hrs.)
       100             200              100             200            100             200
π1     0.9999979184    0.9999989592     0.9997568889    0.9998784583   0.9941238281    0.9970726237
π2     2.0815438e-6    1.0407719e-6     2.4305555e-4    1.2152777e-4   5.8333333e-3    2.9166666e-3
π3     4.0903937e-12   1.0225984e-12    5.5447048e-8    1.3861762e-8   4.2838541e-5    1.0709635e-5
U      4.1317108e-17   5.1646385e-18    2.7771024e-4    1.3887200e-4   6.6274921e-3    6.6471508e-3
Λ⁻¹    212 yrs         851 yrs          -               -              -               -

Table 5.3 lists the calculations done for the fixed but comparable system. The two bad runs, one from Strata1 and one from Strata3, along with two occurrences of U0 for the MS, form the foundation for E(Y^d) and E(Y^f), presented in Equations 4.6 and 4.7, respectively.

The MTBF of 212 and 851 years indicates that DARM has become more stable than its predecessor. It also indicates that DARM should be considered a highly reliable platform for service replication, with much potential.
Concluding Remarks
- DARM supports autonomous fault treatment
  - Recovery decisions are distributed to the individual groups
  - In previous systems, recovery decisions were centralized: complex and error-prone
- DARM has been released as open source at darm.ux.uis.no
- We are performing more advanced measurements
  - Client-perceived availability
  - Longer executions, and with other parameters, to get statistically significant results
- Experimental results indicate that self-repairing systems can obtain very high availability and MTBF
- Automated fault injection tool
  - Proved very useful for uncovering a number of subtle bugs
  - Allows systematic stress and regression testing
Open Issues
- Handling full group failures
  - ARM has a centralized component to monitor all groups
  - DARM only monitors a group from within itself
  - Could let the factory handle this in some way (lease/renew or simple pinging)
- Management tasks to simplify deployment of applications
- Self-configuration: reconfiguration of the set of nodes that can host replicas
- Express policies in terms of equations
- Implement more policies
Group Failure Handling
(Figure: replicas on N1 and N2 send periodic notify(IamAlive) to the RM; both crash, no further IamAlive arrives, and a timeout at the RM triggers createReplica(); new replicas join on spare nodes, resuming notify(ViewChange) and notify(IamAlive).)
Thanks!
References
[1] Hein Meling, Alberto Montresor, Bjarne E. Helvik, and Ozalp Babaoglu. Jgroup/ARM: A Distributed Object Group Platform with Autonomous Replication Management. Software: Practice and Experience, 38(9):885-923, July 2008.
[2] Hein Meling and Joakim L. Gilje. A Distributed Approach to Autonomous Fault Treatment in Spread. In Proceedings of the 7th European Dependable Computing Conference (EDCC). IEEE Computer Society, May 2008.
[3] Bjarne E. Helvik, Hein Meling, and Alberto Montresor. An Approach to Experimentally Obtain Service Dependability Characteristics of the Jgroup/ARM System. In Proceedings of the 5th European Dependable Computing Conference (EDCC), volume 3463 of Lecture Notes in Computer Science, pages 179-198. Springer-Verlag, April 2005.
[4] Hein Meling. Adaptive Middleware Support and Autonomous Fault Treatment: Architectural Design, Prototyping and Experimental Evaluation. PhD thesis, Norwegian University of Science and Technology, Department of Telematics, May 2006.