Self-repairing Replicated Systems and Dependability Evaluation

Transcript of slides (68 pages) presented by Hein Meling at the CANOE Workshop, Toronto, August 27, 2010.

Page 1: Title

Self-repairing Replicated Systems and Dependability Evaluation

Hein Meling
Department of Electrical Engineering and Computer Science
University of Stavanger, Norway

CANOE Workshop, Toronto, August 27, 2010


Page 5: Context – Multiple Data Centers

[Figure: a wide area network connecting Site X (Node X1 running Service A, Node X2 running Service B) and Site Y (Node Y1 running Service C, and Node Y2), with several clients attached.]

Page 6: Context – Failures will occur

[Figure: the same two-site deployment, now split by a network partition separating Site X from Site Y.]

Page 7: Common Solution is Redundancy

[Figure: each service is now replicated across the sites: Node X1 runs Services A and B, Node X2 runs Services C and B, Node Y1 runs Services A and C, and Node Y2 runs Services C and A.]

Page 8: Middleware for Fault Tolerance

It is difficult to support fault tolerance:
- Tolerate object, node, and network failures

Techniques:
- Redundancy
- Masking failures (failover)

Reuse fault tolerance mechanisms:
- Use a group communication system (e.g., Jgroup or Spread)
- Focus on development issues

Page 9: Group Communication

[Figure: clients interact with a group of servers S1, S2, and S3 as a single logical unit.]

Page 10: The Group Membership Service

[Figure: view timeline for servers S1, S2, and S3: singleton views, the first full view, partitioning, and merging; when S3 crashes, a new view is installed.]

Page 11: Middleware for Fault Treatment

Further improve the system's dependability characteristics. Consider deployment and operational aspects: autonomous fault treatment.
- Recovery from node, object, and network failures
- Not just tolerate faults; repair them as well
- Without human intervention
- Let groups be self-healing (deal with their own internal failures)

Goal: minimize the time spent in a state of reduced failure resilience.

Page 12: Evaluation Techniques

A trivial approach is a performance evaluation of the repair mechanism for a single failure injection. More interesting: can we find a way to quantify/predict the improvement in availability by running experiments, without running them for many years to get the exact numbers?

Page 13: Moving to large-scale (Cloud)

Assume now that the number of services to deploy becomes very large.
- We need to find placements for the services to avoid bottlenecks
- There are multiple conflicting requirements/goals for these services
- Placement is a multi-criteria optimization problem and becomes NP-hard
- Centralized optimization techniques fall short quickly

Also, even if it were possible to compute the optimal placement, would it still be valid by the time we are ready to deploy/reconfigure?

Our approach is a distributed heuristic that computes near-optimal placements, based on a technique called the Cross-Entropy Ant System (sketched below).
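To give a flavor of the cross-entropy idea, here is a minimal sketch of a centralized, single-objective cross-entropy search for placements. It is not the distributed, multi-criteria Cross-Entropy Ant System itself, and every parameter (K, ELITE, ALPHA, the max-load cost function) is an illustrative assumption:

```c
/* Cross-entropy placement sketch: place S services on N nodes to balance load. */
#include <stdio.h>
#include <stdlib.h>

#define S 8       /* services to place        */
#define N 4       /* candidate nodes          */
#define K 200     /* samples per iteration    */
#define ELITE 20  /* elite samples per update */
#define ITERS 50
#define ALPHA 0.7 /* smoothing factor         */

static double P[S][N];            /* placement probabilities */

static int sample_node(int s) {   /* draw a node for service s from P[s] */
    double u = (double)rand() / RAND_MAX, c = 0.0;
    for (int n = 0; n < N; n++) { c += P[s][n]; if (u <= c) return n; }
    return N - 1;
}

static double cost(const int *place) {  /* max node load; lower is better */
    int load[N] = {0}, max = 0;
    for (int s = 0; s < S; s++) load[place[s]]++;
    for (int n = 0; n < N; n++) if (load[n] > max) max = load[n];
    return (double)max;
}

static int cmp(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    static int place[K][S]; double c[K], csorted[K];
    for (int s = 0; s < S; s++)
        for (int n = 0; n < N; n++) P[s][n] = 1.0 / N;   /* uniform start */
    for (int it = 0; it < ITERS; it++) {
        for (int k = 0; k < K; k++) {                    /* sample placements */
            for (int s = 0; s < S; s++) place[k][s] = sample_node(s);
            csorted[k] = c[k] = cost(place[k]);
        }
        qsort(csorted, K, sizeof(double), cmp);
        double gamma = csorted[ELITE - 1];               /* elite threshold */
        for (int s = 0; s < S; s++) {                    /* frequency update */
            double f[N] = {0}; int total = 0;
            for (int k = 0; k < K; k++)
                if (c[k] <= gamma) { f[place[k][s]] += 1.0; total++; }
            for (int n = 0; n < N; n++)
                P[s][n] = (1 - ALPHA) * P[s][n] + ALPHA * f[n] / total;
        }
    }
    for (int s = 0; s < S; s++) {                        /* report the mode */
        int best = 0;
        for (int n = 1; n < N; n++) if (P[s][n] > P[s][best]) best = n;
        printf("service %d -> node %d\n", s, best);
    }
    return 0;
}
```

The ant-system variant distributes exactly this sample/score/reinforce loop: agents sample placements while traversing the network, and the probabilities play the role of pheromones.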


Page 14: Outline

- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks

Page 15: Related work: Virtualization

[Figure: a virtualized data center running Mail, DNS, LDAP, web servers web1–web4, and databases db1–db3 on shared Storage.]

Page 16: Related work: Virtualization (cont.)

[Figure: the same virtualized data center as on the previous slide.]

Page 17: Related work: Virtualization (cont.)

[Figure: the same virtualized data center. With virtualization alone, failover amounts to a reboot/start of the failed machine, and there is a single point of failure (SPOF).]

Page 18: Assumptions

- A pool of processors to host applications
- Replicated stateful applications (wide area network)
- Shared-nothing architecture:
  - Neither disk nor main memory is shared by processes
  - Avoid distributed file systems
  - The state of an application must be transmitted across the network

Page 19: Related work: Centralized Recovery Decisions

AQuA:
- The leader of a group affected by a failure joins the centralized dependability manager to report the failure

FT CORBA and Jgroup/ARM:
- Failures are reported to a centralized replication manager

Page 20: ARM Overview

[Figure: ARM overview. Servers A, B, and C run on networked nodes with per-node factories; clients locate the servers through a dependable registry, and ARM manages the replicas.]

Page 21: ARM Architecture

[Figure: ARM architecture. A management client (GUI) invokes createGroup(), removeGroup(), updateGroup(), subscribe(), and unsubscribe() on the replicated ReplicationManager and receives management events via notify() callbacks. Per-node factories (in JVMs) expose createReplica(), removeReplica(), and queryReplicas(). Replicas S_A(1), S_A(2), and S_B(1) run with protocol modules and a SupervisionModule, and the ReplicationManager issues ping() calls.]

Page 22: Failure Monitoring

[Figure: failure monitoring. Replicas S_A(1), S_A(2), and S_B(1) run in node JVMs with SupervisionModules and per-group leaders. Monitoring is both periodic and event-driven: the Factory answers ping() calls, and the supervision modules emit notify(NodePresence), notify(ReplicaFailure), notify(ViewChange), and notify(IamAlive) events to the ReplicationManager.]

Page 23: Crash Failure and Recovery

[Figure: sequence diagram. N1 crashes; the replicas on N2 and N3 send notify(ViewChange) to the RM; the leader's notification triggers createReplica() on N4, the new replica joins the group, and a final notify(ViewChange) confirms recovery.]

Page 24: Outline

- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks

Page 25: Why go distributed?

- Less infrastructure; less complexity
- No need to maintain a consistent, replicated (centralized) database of deployed groups
- Less communication overhead

Page 26: DARM Overview

[Figure: DARM overview. Groups A, B, and C are replicated over networked nodes, each node running a factory; clients access the groups. One replica in each group is the group leader, and one factory is the factory leader.]

Page 27: Spread communication

[Figure: Spread communication. On Node A and Node B, a Spread client links against libspread and talks to the local Spread daemon; the daemons communicate over the network.]

Page 28: DARM Components

[Figure: DARM components on a node: the DARM client links against libdarm, which in turn uses libspread to talk to the local Spread daemon; a DARM factory also runs on the node.]

Page 29: The Factory Group

The factory group is used to install replicas of a given service. It keeps track of:
- Node availability
- The local load of nodes

It interacts with the DARM library to install replacement replicas.

It does not maintain any state about deployed replicas; in case of failure, just restart the factory so it can host new replicas.

Page 30: Factory group installs replacement replicas

[Figure: nodes 1–6 with factories. A createReplica() request from libdarm goes to the factory leader (Node 1), which invokes createReplicaOnNode() on the factory of the chosen node, where the replacement replica is installed.]

Page 31: Replica Placement Policy

Purpose of the replica placement policy: describe how replicas should be allocated onto the set of available sites and nodes.

1. Find the site with the least number of replicas of the given type.
2. Find the node in the candidate site with the least load, ignoring nodes already running the service.

Objectives of this policy (see the sketch after this list):
- Ensure available replicas in each likely partition that may arise
- Avoid collocating two replicas of the same service on the same node
- Disperse replicas evenly over the available sites
- Select the least loaded nodes in each site (the same node may host multiple distinct service types)
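A minimal sketch of the two-step rule, assuming a simple array representation of sites and nodes; the struct layout and names are illustrative, not DARM's actual interfaces:

```c
/* Two-step placement: fewest replicas per site, then least-loaded node. */
#include <stdio.h>

#define SITES 3
#define NODES_PER_SITE 3

struct node {
    double load;       /* current local load          */
    int runs_service;  /* already hosts this service? */
};

struct site {
    int replica_count; /* replicas of the given service type */
    struct node nodes[NODES_PER_SITE];
};

/* Returns 0 on success and fills *si/*ni with the chosen site/node.
 * (A fuller version would fall back to the next-best site when the
 * candidate site has no eligible node.) */
int place_replica(struct site sites[SITES], int *si, int *ni) {
    int s_best = 0;
    for (int s = 1; s < SITES; s++)              /* step 1: fewest replicas */
        if (sites[s].replica_count < sites[s_best].replica_count)
            s_best = s;
    int n_best = -1;
    for (int n = 0; n < NODES_PER_SITE; n++) {   /* step 2: least load */
        struct node *nd = &sites[s_best].nodes[n];
        if (nd->runs_service) continue;          /* never collocate replicas */
        if (n_best < 0 || nd->load < sites[s_best].nodes[n_best].load)
            n_best = n;
    }
    if (n_best < 0) return -1;                   /* no eligible node in site */
    *si = s_best; *ni = n_best;
    return 0;
}
```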


Page 32: Fault Treatment Policy

The fault treatment policies are listed below; a sketch of the first two follows the list.
- KeepMinimalInPartition: maintain a minimal redundancy level in each partition
- RemovePolicy: remove excessive replicas, i.e., replicas no longer needed to satisfy the fault treatment policy
- KeepMinimalInPrimaryPartition: maintain a minimal redundancy level in the primary partition only
- RedundancyFollowsLoad: increase redundancy in the loaded part of the network
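A minimal sketch of how a group leader might apply KeepMinimalInPartition and RemovePolicy on a view change, assuming the per-group minimum/maximum replica counts that libdarm exposes (page 35); the types and names are illustrative:

```c
/* Decide on fault treatment when a new view is installed. */
#include <stdio.h>

struct view { int members; };

enum action { NOTHING, CREATE_REPLICA, REMOVE_REPLICA };

enum action on_view_change(struct view v, int min_replicas, int max_replicas)
{
    if (v.members < min_replicas)  /* KeepMinimalInPartition: restore    */
        return CREATE_REPLICA;     /* redundancy in this partition       */
    if (v.members > max_replicas)  /* RemovePolicy: drop replicas that   */
        return REMOVE_REPLICA;     /* are surplus, e.g. after partitions */
                                   /* merge                              */
    return NOTHING;                /* redundancy level is acceptable     */
}

int main(void) {
    struct view after_crash = { 2 };
    if (on_view_change(after_crash, 3, 4) == CREATE_REPLICA)
        printf("ask factory leader for a replacement replica\n");
    return 0;
}
```

In DARM the decision would additionally be delayed by the group's recovery and remove delays, so that transient view changes do not trigger fault treatment.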


Page 33: Crash failure–recovery behavior

[Figure: crash failure–recovery timeline for nodes N1–N4, showing the sequence of installed views. After the crash, fault treatment is pending at the leader until createReplica() installs a replacement replica on N4, which then joins the group in a new view.]

Page 34: Failure–recovery with network partitioning and merging

[Figure: failure–recovery timeline for N1–N4 under network partitioning and merging: each partition installs its own views; fault treatment is pending until createReplica() restores redundancy in the deficient partition; after the merge, surplus replicas leave and a merged view is installed.]

Page 35: The DARM Library

libdarm wraps around libspread and intercepts:
- Connection requests to the daemon
  - To verify and finalize the runtime configuration of DARM
  - To join the DARM private group of the associated application
- Message receives (SP_receive())
  - If the message belongs to the DARM private group, pass it to DARM
  - Otherwise, pass the message to the application
  - Then call SP_receive() again, to avoid returning control to the application without passing a message

libdarm also provides functions to set:
- The minimum and maximum number of replicas for the group
- The recovery and remove delays for the group

A sketch of the receive interception is shown below.
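A sketch of the receive-interception loop, assuming libdarm exports its own receive function to the application. SP_receive() is Spread's documented C call; darm_private_group and darm_handle_message() are illustrative stand-ins for libdarm internals (the real library also hooks connection requests):

```c
#include <string.h>
#include <sp.h>   /* Spread's C client API */

extern char darm_private_group[MAX_GROUP_NAME]; /* joined at connect time */
extern void darm_handle_message(const char *sender, const char *mess, int len);

/* Replacement for SP_receive(): loops until a message for the application
 * (i.e., not for DARM's private group) arrives. */
int darm_receive(mailbox mbox, service *service_type,
                 char sender[MAX_GROUP_NAME],
                 int max_groups, int *num_groups,
                 char groups[][MAX_GROUP_NAME],
                 int16 *mess_type, int *endian_mismatch,
                 int max_mess_len, char *mess)
{
    for (;;) {
        int len = SP_receive(mbox, service_type, sender, max_groups,
                             num_groups, groups, mess_type,
                             endian_mismatch, max_mess_len, mess);
        if (len < 0)
            return len;                 /* propagate Spread errors */

        /* A message addressed only to DARM's private group is consumed
         * here; membership messages for that group drive fault treatment. */
        if (*num_groups == 1 &&
            strcmp(groups[0], darm_private_group) == 0) {
            darm_handle_message(sender, mess, len);
            continue;                   /* call SP_receive() again */
        }
        return len;                     /* application message */
    }
}
```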


Page 36: The DARM Library (cont.)

Membership messages for the DARM private group are used to decide whether fault treatment is needed.

Bootstrapping applications: only a single instance of an application needs to be started; assuming the application is configured with some minimum number of replicas, DARM will install the required number of replicas.

Page 37: Outline

- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks

Page 38: Target system

[Figure: target system. Three sites x, y, and z, each with three factory nodes (x1–x3, y1–y3, z1–z3); replicas of services E1–E6 are spread across the sites. A fault injector can partition the network, e.g. inject(xy | z), separating sites x and y from site z; replacement replicas are then installed.]

Page 39: Network Partition/Merge Experiments

We want to determine:
- the single-partition recovery durations
- the corresponding merge of partitions (and the removal of excessive replicas)

Page 40: Fast Spread; partition with 2 live replicas

[Plot: Partition (2 live replicas, 1 added) – density estimates for detection and recovery times (N=194), versus time since partition injected (s). Partition detection: μ=0.9, σ=0.261; replica create: μ=2.9, σ=0.209; final view: μ=3.0, σ=0.304.]

Page 41: Fast Spread; partition with 1 live replica

[Plot: Partition (1 live replica, 2 added) – density estimates for detection and recovery times (N=136), versus time since partition injected (s). Partition detection: μ=0.9, σ=0.284; replica create: μ=2.9, σ=0.288; final view: μ=5.0, σ=0.273.]

Page 42: Fast Spread; merge, removing 2 replicas

[Plot: Network merge – density estimates for detection and remove times (N=600), versus time since merge injected (s). Merge detection: μ=2.0, σ=0.226; replica remove: μ=4.1, σ=0.23; merged view: μ=6.1, σ=0.22.]

Page 43: Outline

- Introduction and motivation
- Related work
- Distributed Autonomous Replication Management (DARM)
- Simple Network Partition Evaluation of DARM
- Dependability Evaluation Technique
- Concluding remarks


Page 45: Objective of Evaluation

Provide estimates for dependability attributes:
- Unavailability
- System failure intensity
- Down time

Page 46: Predicting Dependability Attributes

Use stratified sampling. A series of lab experiments is performed:
- One or more fault injections in each experiment (all faults manifest themselves as crash failures)
- Injections follow a homogeneous Poisson process (sketched below)

Strata := the number of near-coincident failure events.

A posteriori stratification: experiments are allocated to the different strata after experiment completion. Three strata are used: single, double, and triple failures.
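A sketch of drawing crash-injection times from a homogeneous Poisson process via exponential inter-arrival times; the rate and experiment length are illustrative values, not the ones used in the experiments:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Exponential inter-arrival time with rate lambda (inverse-CDF method). */
static double exp_sample(double lambda) {
    double u = (rand() + 1.0) / (RAND_MAX + 2.0);  /* u in (0,1) */
    return -log(u) / lambda;
}

int main(void) {
    double lambda = 1.0 / 60.0;  /* one crash per minute, on average */
    double horizon = 300.0;      /* length of one experiment (s)     */
    double t = 0.0;
    int k = 0;
    while ((t += exp_sample(lambda)) < horizon) {
        printf("inject crash failure %d at t=%.1fs\n", ++k, t);
        /* here the harness would kill a randomly chosen replica/processor */
    }
    /* the run is then classified a posteriori into stratum S_k */
    printf("experiment classified into stratum S_%d\n", k);
    return 0;
}
```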


Page 47: Predicting Dependability Attributes (cont.)

Offline a posteriori analysis:
- Events are recorded during the experiments
- The events are used to construct a single global timeline
- Trajectories are computed on a predefined state machine (a sketch of the down-time computation follows)

The analysis provides the strata classification and various statistics. These statistical measures are used as input to estimators for the dependability attributes: unavailability, system failure intensity, and down time.
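A sketch of the down-time computation over one recorded trajectory, assuming events have been reduced to timestamped UP/DOWN state entries on the global timeline; this event encoding is an assumption, not the actual log format:

```c
#include <stdio.h>

struct event {
    double t;  /* time since first fault injection (s)         */
    int up;    /* 1 if the state entered at t is UP, 0 if DOWN */
};

/* Events must be sorted by time; the trajectory starts UP at t=0 and
 * ends (recovered) at t_end. */
double down_time(const struct event *ev, int n, double t_end) {
    double down = 0.0, entered = 0.0;
    int up = 1;
    for (int i = 0; i < n; i++) {
        if (up && !ev[i].up) entered = ev[i].t;               /* went DOWN */
        else if (!up && ev[i].up) down += ev[i].t - entered;  /* back UP   */
        up = ev[i].up;
    }
    if (!up) down += t_end - entered;  /* still DOWN at end of run */
    return down;
}

int main(void) {
    /* crash at 1.2 s takes the service DOWN; recovery at 4.3 s brings it UP */
    struct event timeline[] = { {1.2, 0}, {4.3, 1} };
    printf("down time: %.1fs\n", down_time(timeline, 2, 10.0));  /* 3.1s */
    return 0;
}
```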


Page 48: Target System Illustrated

[Figure: illustration of the target system.]

Page 49: Target System – State Machine

The failure–recovery behavior of a service is modeled as a state machine (next slide). Events are as seen by the service replicas.

The state machine is only used a posteriori, to compute statistics of the experiment; it is not used to control the fault injections.

Page 50: Partial State Machine

Fault injection can occur in all states, causing different trajectories in the state machine.

Circular states are UP; squared states are DOWN.


Page 52: Measurement Approach: Timeline of Events

Place multiple processor failures close together and examine the system behavior under such rare events, to determine the rate at which they cause system failure. Use these results to compute the system unavailability, given the MTBF of a single processor.

[Figure: a timeline of events ending in a system failure.]

Page 53: The Failure Trajectory

[Figure: a failure trajectory through the state machine.]

Page 54: The Failure Trajectory (cont.)

Characteristics obtainable from the failure trajectory:
- Unavailability: the down time for trajectory i, and from it the unavailability
- Probability of failure (reliability)

(The formulas are in the paper.)
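The paper's exact estimators are not reproduced here; as an illustrative shape only, the down time of trajectory i accumulates the time spent in DOWN states, and an unavailability estimate is a ratio of expected down time to the expected time between failure-triggering events:

```latex
% Illustrative shape only, not the paper's estimators.
Y^{d}_{i} = \sum_{j \in \mathrm{DOWN}(i)} \Delta t_{j},
\qquad
\hat{U} \approx \frac{E[Y^{d}]}{E[T_{\mathrm{between\ failures}}]}
```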


Page 55: Experimental Strategy

Consider multiple near-coincident failures. Experiments are classified into stratum Sk if k failure events occurred in the trajectory, and each stratum is sampled separately. From the samples collected for a stratum we can obtain statistics for the system in that stratum, e.g., the expected duration of a stratum Sk trajectory.
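For the expected duration, the natural estimate is the sample mean over the n_k trajectories observed in stratum Sk; this matches the θk = E(T|Sk) column in Table 5.2 on page 59:

```latex
\hat{\theta}_k = \widehat{E(T \mid S_k)} = \frac{1}{n_k} \sum_{i=1}^{n_k} T^{(k)}_i
```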


Page 56: Sampling Scheme

[Figure: the sampling scheme.]

Page 57: Estimators

In real systems, the failure intensity λ is very low, i.e., λ⁻¹ ≫ Tmax.

πk = probability of a trajectory reaching stratum Sk.

The unconditional probabilities of a sample falling in stratum S2 and in stratum S3 are derived in the paper.
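As an order-of-magnitude illustration only (not the paper's expressions, but consistent with Table 5.3 on page 63, where doubling the processor MTBF roughly halves π2 and quarters π3): each additional near-coincident failure must arrive within a recovery window of length at most Tmax, so

```latex
\pi_2 = O(\lambda\, T_{\max}), \qquad
\pi_3 = O\!\left((\lambda\, T_{\max})^2\right), \qquad \lambda^{-1} \gg T_{\max}
```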


Page 58: Experimental Results

Fault injections are performed on the target system according to the sampling scheme. 3000 (lab) experiments were performed, aiming for 1000 in each stratum. An experiment is classified as stratum Sk if exactly k failures occur before its completion.

Page 59: ARM (top) / DARM (bottom)

[Excerpt, Chapter 5: Results]

Classification   Count   θk = E(T|Sk)   sd = √σk   θk, 95% conf. int.    Highest   Lowest
Strata1          2265    2569.22        478.23     (1631.89, 3506.55)    16659     1742
Strata2           591    4158.83       1039.10     (2122.18, 6195.47)    12869     2496
Strata3           110    5966.58       1550.90     (2926.82, 9006.35)    16086     3046

Table 5.2: Statistics from recovery times (in milliseconds)

The experiment should be rerun¹ with adjusted parameters: a lower Tmax and a reaction time set to 3 seconds. This would, however, differentiate it from [19], and the two would not be comparable.

Notice that the iteration count does not equal the corresponding strata distribution. This is, for the most part, due to iterations that do not successfully recover. Also, for Strata1, 2 iterations are erroneous. One (the only one in all 3000 iterations) fails to initialize properly due to a bug². This bug does not cause the service to fall into any of the unavailable states, and it would probably resolve itself over time; yet the iteration is classified as a failure for all services. The second 1-fault error is due to improper initialization, where only a subset of the available nodes is used for the initial deployment. The latter iteration is rejected as an iteration with no fault occurrences. For Strata2 there are 15 occurrences of the U0 state. All of these are caused by two fault injections occurring at short intervals, 0.1–2.1 seconds, on the service with only two replicas. The short fault-injection interval leaves no room for recovery, and this service reaches U0. For Strata3 there is one occurrence of the same bug as presented for Strata1, which is classified as leaving all services in U0 even though it actually leaves the services available, yet not correctly. Strata3 has two occurrences of U0 for the MS, where faults are injected over intervals totaling 1 and 0.9 seconds on all its three replicas. Strata3 also has 14 occurrences of U0 for the service with two replicas, with intervals ranging over 0.5–2.2 seconds. Note that the second service running three replicas (not the MS) never experiences the U0 state in these experiments.

5.1 Probability Density

First, the log files have been used to generate probability density graphs. For each iteration, the time from the first fault injection until the time of the first full recovery is collected and stored in a set. This is done for each stratum collection. A full recovery is when all included services reach their specified minimum replication (state A0). Recovery …

¹ A rerun of the experiment is listed in future work.
² Multiple service replicas on one node is a bug presented in future work; it defies the rule that a service should not be replicated twice on the same node.


Page 60: Experimental Results (cont.)

19 experiments (0.63%) were classified as inadequate:
- 16 experiments failed to recover
- 3 experiments experienced additional, unintended failures

Of the 16, two were for S1, 6 for S2, and 11 for S3; these 16 are due to deficiencies in Jgroup/ARM. The inadequate runs are accounted for as trajectories visiting a down state for 5 minutes (typically a reboot).

For DARM there were 2 inadequate experiments.

Page 61: Prob. Density Function

[Plot: density estimate of Jgroup/ARM crash recovery times versus time since injection of the first crash failure (ms). Single node failure: N=1781, BW=126.9; two nearly coincident node failures: N=793, BW=652.9; three nearly coincident node failures: N=407, BW=637.2. Maximum density: 0.00116.]

Page 62: Prob. Density S2 (DARM)

[Excerpt, Chapter 5: Results]

Compared to Jgroup/ARM, DARM generally achieves a full recovery almost 5 seconds faster than its predecessor.

[Figure 5.2: probability density for Strata2 – duration of recovery (seconds).]

Figure 5.2 presents the probability density graph for Strata2. We see that the expectancy is more spread out compared to Figure 5.1 for Strata1. This is expected, as the different trajectories for Strata2 are more variable, given that the two fault injections occur at different time intervals. The highest values of the Strata2 recoveries are 10.439 and 12.869 seconds. Both have been checked manually against the logs, on suspicion of the same cause of increased recovery times as observed for Strata1. However, the explanation for these values seems to be that the interval between the failures was close to the maximum of what DARM "allows" without recovery taking place in between. The mean value of a Strata2 recovery is 4.158 seconds.

The highest value observed for a Strata2 recovery in DARM is only 0.1 seconds above the mean value of a Strata2 recovery in Jgroup/ARM. Again the performance of DARM proves better than that of the Jgroup/ARM framework; the variance, however, is about the same for both frameworks.

Figure 5.3 presents the probability density graph for Strata3. The graph is somewhat misleading, since its lower bound covers areas below its lowest observed value. Notice …


Page 63: Applying the Equations

[Excerpt, Chapter 5: Results]

                Experiment Recovery Period      Processor Recovery (5 min.)     Manual Processor Recovery (2 hrs.)
pmtbf = λ⁻¹     100 days        200 days        100 days        200 days        100 days        200 days
π1              0.9999979184    0.9999989592    0.9997568889    0.9998784583    0.9941238281    0.9970726237
π2              2.0815438e-06   1.0407719e-06   2.4305555e-04   1.2152777e-04   5.8333333e-03   2.9166666e-03
π3              4.0903937e-12   1.0225984e-12   5.5447048e-08   1.3861762e-08   4.2838541e-05   1.0709635e-05
U               4.1317108e-17   5.1646385e-18   2.7771024e-04   1.3887200e-04   6.6274921e-03   6.6471508e-03
Λ⁻¹             212 yrs         851 yrs         -               -               -               -

(Processor Mean Time Between Failure, pmtbf = λ⁻¹, given in days.)

Table 5.3: Computed probabilities, unavailability metric and the system MTBF.

Table 5.3 lists the calculations done for the fixed but comparable system. The two bad runs, one from Strata1 and one from Strata3, along with two occurrences of U0 for the MS, make up the foundation for E(Y^d) and E(Y^f), presented in Equations 4.6 and 4.7 respectively.

The MTBF of 212 and 851 years indicates that DARM has become more stable than its predecessor. It also indicates that DARM should be considered a highly reliable platform for service replication, with much potential.


Page 64: Concluding Remarks

DARM supports autonomous fault treatment:
- Recovery decisions are distributed to the individual groups
- In previous systems, recovery decisions were centralized, which is complex and error-prone

DARM has been released as open source at darm.ux.uis.no.

We are performing more advanced measurements:
- Client-perceived availability
- Longer executions, with other parameters, to get statistically significant results

Experimental results indicate that self-repairing systems can obtain very high availability and MTBF.

The automated fault injection tool proved very useful for uncovering a number of subtle bugs, and it allows for systematic stress and regression testing.

Page 65: Open Issues

Handling full group failures:
- ARM has a centralized component that monitors all groups
- DARM only monitors a group from within itself
- The factory could handle this in some way (lease/renew or simple pinging)

Management tasks to simplify the deployment of applications:
- Self-configuration
- Reconfiguration of nodes that can host replicas

Express policies in terms of equations; implement more policies.

Page 66: Group Failure Handling

[Figure: sequence diagram for group failure handling. N1 and N2 crash; the RM detects the group failure through a timeout on the periodic notify(IamAlive) messages and calls createReplica(); new replicas join, and notify(ViewChange) and notify(IamAlive) messages resume.]

Page 67: Thanks!

Page 68: References


References[1] Hein Meling, Alberto Montresor, Bjarne E. Helvik, and Ozalp Babaoglu. Jgroup/ARM: a distributed object group platform with autonomous replication management. Software: Practice and Experience, 38(9):885-923, July 2008.

[2] Hein Meling and Joakim L. Gilje. A Distributed Approach to Autonomous Fault Treatment in Spread. In Proceedings of the 7th European Dependable Computing Conference (EDCC). IEEE Computer Society, May 2008.

[3] Bjarne E. Helvik, Hein Meling, and Alberto Montresor. An Approach to Experimentally Obtain Service Dependability Characteristics of the Jgroup/ARM System. In Proceedings of the Fifth European Dependable Computing Conference (EDCC), volume 3463 of Lecture Notes in Computer Science, pages 179-198. Springer-Verlag, April 2005.

[4] Hein Meling. Adaptive Middleware Support and Autonomous Fault Treatment: Architectural Design, Prototyping and Experimental Evaluation. PhD thesis, Norwegian University of Science and Technology, Department of Telematics, May 2006.
