Transcript of Chapter 3, Building Dependable Distributed Systems
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org

Building Dependable Distributed Systems, Copyright Wenbing Zhao 1


Outline
Recovery-oriented computing:
- Overview
- Application-level fault detection
- Structural behavior monitoring
- Path shape analysis

Building Dependable Distributed Systems, Copyright Wenbing Zhao 2


Recovery-Oriented Computing
On the availability of soft real-time systems:
- Availability = MTTF / (MTTF + MTTR)
- MTTF: mean time to failure
- MTTR: mean time to recover
- Availability can be improved by increasing MTTF as well as by reducing MTTR (a small numeric example follows below)
Recovery-oriented computing focuses on reducing MTTR:
- Making fault detection faster and more accurate
- Making recovery faster
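To make the availability formula concrete, here is a minimal Java sketch; the class name and the MTTF/MTTR numbers are made up for illustration:

```java
// Minimal sketch (not from the slides): how MTTF and MTTR determine availability,
// and how shrinking MTTR improves it. The numbers are hypothetical.
public class AvailabilityDemo {
    static double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double mttfHours = 1000.0; // hypothetical mean time to failure
        System.out.printf("MTTR = 1.0 h -> availability = %.5f%n",
                availability(mttfHours, 1.0));  // ~0.99900
        System.out.printf("MTTR = 0.1 h -> availability = %.5f%n",
                availability(mttfHours, 0.1));  // ~0.99990 (10x faster recovery)
    }
}
```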

Building Dependable Distributed Systems, Copyright Wenbing Zhao 3


Fault Detection and Localization
Fault detection: determine whether some component in the system has failed.
Fault localization: pinpoint the particular component that failed.
Low-level fault detection mechanism:
- Based on timeouts: each component is probed periodically with a heartbeat message (see the sketch below)
- Cannot detect many application-level faults
Recovery-oriented computing focuses on application-level fault detection and localization: 75% of the recovery time is spent on application-level fault detection.
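Below is a minimal sketch of the timeout-based, heartbeat-driven detection just described; the class and method names are assumptions, not part of any particular middleware:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (assumed names, not from the slides): a detector declares a
// component suspect if no heartbeat has arrived within the timeout window.
public class HeartbeatDetector {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message from a component is received.
    public void onHeartbeat(String componentId) {
        lastHeartbeat.put(componentId, System.currentTimeMillis());
    }

    // True if the component missed its heartbeat deadline (possible fail-stop fault).
    public boolean isSuspect(String componentId) {
        Long last = lastHeartbeat.get(componentId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}
```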

Building Dependable Distributed Systems, Copyright Wenbing Zhao 4


Microreboot and System-Level Undo/Redo
Microreboot: many problems can be fixed by simply restarting the faulty component.
- Works best with component-based systems (see the sketch below)
For problems that cannot be fixed by a microreboot, perform a system-level undo, fix the problem, then carry out a system-level redo.
- Based on checkpointing and logging
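As a rough illustration only, a microreboot amounts to restarting a single component in place while the rest of the system keeps running; the interface and names below are hypothetical and not tied to any specific framework:

```java
// Hypothetical interface for a microrebootable component (illustration only).
interface Microrebootable {
    void shutdown();   // release the component's (session-only) state and resources
    void restart();    // re-initialize the component without touching other components
}

public class MicrorebootDemo {
    // Restart only the faulty component; other components are unaffected.
    static void microreboot(Microrebootable faultyComponent) {
        faultyComponent.shutdown();
        faultyComponent.restart();
    }

    public static void main(String[] args) {
        Microrebootable cart = new Microrebootable() {
            public void shutdown() { System.out.println("shopping-cart bean shut down"); }
            public void restart()  { System.out.println("shopping-cart bean restarted"); }
        };
        microreboot(cart);
    }
}
```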

Building Dependable Distributed Systems, Copyright Wenbing Zhao 5


System Model for Recovery-Oriented Computing
Three-tier architecture:
- Separates application logic and data management
- Middle tier is stateless or maintains only session state
Component-based middleware: Java Platform, Enterprise Edition (Java EE, formerly referred to as J2EE)
- Key component: Enterprise JavaBean (EJB)

Building Dependable Distributed Systems, Copyright Wenbing Zhao 6


Application-Level Fault Detection
Fail-stop faults can be detected using timeouts; application-level faults can only be detected at the application level.
One plausible fault detection method: acceptance tests
- Developers would have to write effective and efficient acceptance test routines (a hypothetical example is sketched below)
- Not practical for Internet applications due to their scale, complexity, and rapid rate of change
ROC-based approach: measure and monitor the structural behaviors of an application
- May detect application-level faults without a priori knowledge of the application's details
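For illustration, an acceptance test is simply a developer-written predicate on a component's output; the shopping-cart check below is a hypothetical example, not taken from the slides:

```java
import java.math.BigDecimal;

// Hypothetical acceptance test (illustration only): verify that a computed
// order total is plausible before the reply is sent back to the user.
public class OrderAcceptanceTest {
    static boolean acceptOrderTotal(BigDecimal total, int itemCount) {
        // Reject obviously faulty outputs: negative totals, or a non-zero
        // total for an empty cart.
        if (total.signum() < 0) return false;
        if (itemCount == 0 && total.signum() != 0) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(acceptOrderTotal(new BigDecimal("19.99"), 2)); // true
        System.out.println(acceptOrderTotal(new BigDecimal("-5.00"), 2)); // false
    }
}
```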

Building Dependable Distributed Systems, Copyright Wenbing Zhao 7


Structural Behavior Monitoring
Interaction patterns between different components reflect the application-level functionality. Each component implements a specific application function, e.g.:
- A stateful session bean to manage a user's shopping cart
- A set of singleton session beans to keep track of inventory
The internal structural behavior can therefore be monitored to infer whether or not the application is functioning normally.
To monitor: log the runtime path for each end-user request, including all incoming messages, outgoing messages, method invocations, etc. (a sketch of such an event record follows below).
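A minimal sketch of what one logged runtime-path event might look like; the field names are assumptions rather than the actual instrumentation format:

```java
// Hypothetical record of one event on a runtime path (illustration only).
public class PathEvent {
    public enum Kind { INCOMING_MSG, OUTGOING_MSG, METHOD_INVOCATION }

    final String requestId;        // ties the event to one end-user request
    final String componentClass;   // e.g., "ShoppingCartBean"
    final String componentInstance;
    final Kind kind;
    final long timestampMillis;

    PathEvent(String requestId, String componentClass, String componentInstance,
              Kind kind, long timestampMillis) {
        this.requestId = requestId;
        this.componentClass = componentClass;
        this.componentInstance = componentInstance;
        this.kind = kind;
        this.timestampMillis = timestampMillis;
    }
}
```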

Building Dependable Distributed Systems, Copyright Wenbing Zhao 8


Structural Behavior: Runtime Path Example
Runtime path for a single end-user request:
- Spans 5 components
- Consists of 10 events

Building Dependable Distributed Systems, Copyright Wenbing Zhao 9


Structural Behavior: Machine Learning
Train reference models using machine learning.
Historical reference model: trained with aggregated runtime path data
- Objective: anomaly detection based on historical behavior
- May use real workload as well as synthetic workload that resembles the real workload
Peer reference model: trained with the most recent runtime path data
- Objective: anomaly detection with respect to the peer components
- Must be trained with real workload
Fault (anomaly) detection: compare observed patterns with those in the reference models.

Building Dependable Distributed Systems, Copyright Wenbing Zhao 10


Component Interactions Modeling
Focus on the interactions between a component instance and all other component classes:
- More scalable: can cope with cases where there are many instances of each class
- Suitable for using the chi-square test for anomaly detection

Building Dependable Distributed Systems, Copyright Wenbing Zhao 11


Component Interactions Modeling
Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and the other n-1 component classes.
- We assume instances of the same class do not interact with each other
- We assume that interactions are symmetric (i.e., request and reply)
- The weight assigned to each link is the probability of the component instance interacting with the linked component class
- The weights on all links sum to 1, i.e., the component instance interacts with the other component classes with probability 1 (see the sketch below)
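A minimal sketch of this model, with assumed class and method names: interaction counts between one instance and each peer component class are normalized into link weights that sum to 1:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a component interaction model for a single component instance.
// Link weights are the fractions of that instance's interactions that go to
// each peer component class (they sum to 1 over all observed links).
public class InteractionModel {
    private final Map<String, Long> interactionCounts = new HashMap<>();

    // Record one interaction between this instance and a peer component class.
    public void recordInteraction(String peerClass) {
        interactionCounts.merge(peerClass, 1L, Long::sum);
    }

    // Normalize the counts into link weights (probabilities).
    public Map<String, Double> linkWeights() {
        long total = interactionCounts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Long> e : interactionCounts.entrySet()) {
            weights.put(e.getKey(), e.getValue().doubleValue() / total);
        }
        return weights;
    }
}
```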

Building Dependable Distributed Systems, Copyright Wenbing Zhao 12


Component Interaction Model: Example
- Class A: web component, handles end-user requests
- Class B: application logic, handles conversations with end-users, 3 instances
- Classes C and D: also application logic, representing shared state
- Class E: database server, persistent state

Building Dependable Distributed Systems, Copyright Wenbing Zhao 13


Component Interaction Model: Example
Machine learning: determine the link weights based on training data.
Training data:
- A issued 400 remote invocations on b1
- b1 issued 300 local method invocations on C, and 300 invocations on D
- What happened between C & E and D & E is not important for b1's model
Link weight calculation (also shown in code below):
- Total number of interactions observed at instance b1: 1000
- P(b1-A) = 400/1000 = 0.4
- P(b1-C) = 300/1000 = 0.3
- P(b1-D) = 300/1000 = 0.3
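The same link-weight calculation expressed as a small, self-contained Java example using the training counts above:

```java
public class LinkWeightExample {
    public static void main(String[] args) {
        long withA = 400, withC = 300, withD = 300; // training counts at instance b1
        long total = withA + withC + withD;         // 1000 interactions in total
        System.out.printf("P(b1-A) = %.1f%n", (double) withA / total); // 0.4
        System.out.printf("P(b1-C) = %.1f%n", (double) withC / total); // 0.3
        System.out.printf("P(b1-D) = %.1f%n", (double) withD / total); // 0.3
    }
}
```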

Building Dependable Distributed Systems, Copyright Wenbing Zhao 14


Anomaly Detection
Compare the current behavior with the trained behavior using the chi-square test.
- Prepare the observed data as a histogram
- Compare the distributions using the formula: D = Σ (oi - ei)^2 / ei, summed over the n cells
  - n: number of cells in the histogram
  - ei: expected frequency in cell i
  - oi: observed frequency in cell i
  - If ei is 0, the cell should be pruned off
- Each link is regarded as a cell
- For an observation period of m requests, the expected frequency for link i is ei = m * pi
- No anomaly: ideally D = 0. In practice, D is not 0 due to randomness; it follows a chi-square distribution.

Building Dependable Distributed Systems, Copyright Wenbing Zhao 15


Anomaly Detection: Chi-Square Test
Anomaly detected if:
- D > the 1-α quantile of the chi-square distribution with k = n-1 degrees of freedom, at a level of significance α
- A higher level of significance α => more sensitive detection => more false positives
Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true
http://www.merriam-webster.com/dictionary/level%20of%20significance

Building Dependable Distributed Systems, Copyright Wenbing Zhao 16


Anomaly Detection: Chi-Square Test Example
Observation period: 100 requests
- A issued 45 requests on b1
- b1 issued 35 invocations on C, and 20 invocations on D
Expected vs. observed frequencies:
- Link (A-b1): expected 100 * 0.4 = 40, observed 45
- Link (C-b1): expected 100 * 0.3 = 30, observed 35
- Link (D-b1): expected 100 * 0.3 = 30, observed 20
D = (45-40)^2/40 + (35-30)^2/30 + (20-30)^2/30 = 4.79
Chi-square test: the degree of freedom is 2 (only 3 cells); for α = 0.1, the 90% quantile is about 4.61, so D > 4.61 => anomaly detected.
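The test statistic for this example can be reproduced with a few lines of Java; the 90% quantile of the chi-square distribution with 2 degrees of freedom (about 4.605) is hardcoded here rather than obtained from a statistics library:

```java
public class ChiSquareExample {
    // D = sum over links of (observed - expected)^2 / expected
    static double chiSquareStatistic(double[] expected, double[] observed) {
        double d = 0.0;
        for (int i = 0; i < expected.length; i++) {
            d += (observed[i] - expected[i]) * (observed[i] - expected[i]) / expected[i];
        }
        return d;
    }

    public static void main(String[] args) {
        // Observation period of m = 100 requests; expected frequency = m * link weight.
        double[] expected = {100 * 0.4, 100 * 0.3, 100 * 0.3}; // links A-b1, C-b1, D-b1
        double[] observed = {45, 35, 20};
        double d = chiSquareStatistic(expected, observed);      // ~4.79
        double critical = 4.605; // 90% quantile, chi-square with 2 degrees of freedom
        System.out.printf("D = %.2f, anomaly = %b%n", d, d > critical);
    }
}
```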

Building Dependable Distributed Systems, Copyright Wenbing Zhao 17


Path Shapes Modeling
The shape of a runtime path is defined to be the ordered set of component classes.
A path shape is represented as a tree in which a node represents a component class.
- A directional edge represents the causal relationship between two adjacent nodes

Building Dependable Distributed Systems, Copyright Wenbing Zhao 18


Path Shapes Modeling
The probabilistic context-free grammar (PCFG), in Chomsky Normal Form (CNF), is used for path shape modeling. A PCFG consists of:
- A list of terminal symbols Tk: the component classes in a path shape form the terminals
- A list of nonterminal symbols Ni: they denote the stages of the production rules
  - N1: the start symbol, often denoted as S
  - $: the end of a rule
  - All other nonterminal symbols are to be replaced by production rules (see below)
- A list of production rules Ni -> ζj, where ζj is a list of terminals and nonterminals
- A list of probabilities Rij = P(Ni -> ζj)

Building Dependable Distributed Systems, Copyright Wenbing Zhao 19


Path Shape Modeling: Example
Path shape for 4 end-user requests.
There is a 100% probability for the call to transit from A to B:
- R1j: S -> A, p = 1.0
- R2j: A -> B, p = 1.0

Building Dependable Distributed Systems, Copyright Wenbing Zhao 20


Path Shape Modeling: Example
For B, there are 3 possible transitions: to C with 25% probability, to D with 25% probability, and to both C and D with 50% probability:
- R3j: B -> C, p = 0.25 | B -> D, p = 0.25 | B -> CD, p = 0.5
Once a call reaches C or D, it must transit to E, hence:
- R4j: C -> E, p = 1.0
- R5j: D -> E, p = 1.0
E is the last stop for all paths:
- R6j: E -> $, p = 1.0
(A sketch of this grammar in code follows below.)
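A rough sketch, with an assumed representation rather than the actual PCFG implementation, of how the example grammar can be encoded and used to accept or reject a path shape:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the example grammar as a map from "LHS->RHS" to probability.
// A path shape conforms if every production it uses has nonzero probability;
// the product of the probabilities is the likelihood of the shape.
public class PathShapeGrammar {
    private final Map<String, Double> ruleProb = new HashMap<>();

    public void addRule(String lhs, String rhs, double p) {
        ruleProb.put(lhs + "->" + rhs, p);
    }

    // Returns 0.0 if any production is not in the grammar (anomaly),
    // otherwise the probability of the whole path shape.
    public double shapeProbability(String[][] productions) {
        double p = 1.0;
        for (String[] rule : productions) {
            p *= ruleProb.getOrDefault(rule[0] + "->" + rule[1], 0.0);
        }
        return p;
    }

    public static void main(String[] args) {
        PathShapeGrammar g = new PathShapeGrammar();
        g.addRule("S", "A", 1.0);
        g.addRule("A", "B", 1.0);
        g.addRule("B", "C", 0.25);
        g.addRule("B", "D", 0.25);
        g.addRule("B", "CD", 0.5);
        g.addRule("C", "E", 1.0);
        g.addRule("D", "E", 1.0);
        g.addRule("E", "$", 1.0);

        // A request that went A -> B -> C -> E: probability 0.25.
        String[][] normal = {{"S","A"},{"A","B"},{"B","C"},{"C","E"},{"E","$"}};
        // A request that skipped B and called C directly from A: not in the grammar.
        String[][] anomalous = {{"S","A"},{"A","C"},{"C","E"},{"E","$"}};

        System.out.println(g.shapeProbability(normal));    // 0.25
        System.out.println(g.shapeProbability(anomalous)); // 0.0 -> anomaly
    }
}
```

A path shape whose probability comes out as 0 uses at least one production outside the grammar and is therefore flagged as an anomaly, which matches the detection rule on the next slide.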

Building Dependable Distributed Systems, Copyright Wenbing Zhao 21


Path Shape Modeling: Anomaly Detection
The path shapes of new requests can be checked to see whether they conform to the grammar.
- An anomaly is detected if a path shape does not conform to the grammar
- PCFG by itself only detects a fault; it cannot pinpoint the root cause (fault localization)
- Another method, such as a decision tree, is needed for localization

Building Dependable Distributed Systems, Copyright Wenbing Zhao 22