Post on 14-Jan-2016
Chapter 3
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org
Building Dependable Distributed Systems
Building Dependable Distributed Systems, Copyright Wenbing Zhao 1

Outline
- Recovery-oriented computing
  - Overview
- Application-level fault detection
  - Structural behavior monitoring
  - Path shape analysis

Recovery-Oriented Computing
- On availability of soft real-time systems
  - Availability = MTTF / (MTTF + MTTR)
  - MTTF: mean time to failure
  - MTTR: mean time to recover
- Availability can be improved by increasing MTTF as well as by reducing MTTR
- Recovery-oriented computing: focusing on reducing MTTR
  - Making fault detection faster and more accurate
  - Making recovery faster
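
The availability formula above can be explored with a small sketch. The MTTF and MTTR figures below are hypothetical, chosen only to show why reducing MTTR is as effective as increasing MTTF by the same factor.

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Hypothetical figures, in hours, chosen only for illustration.
baseline = availability(mttf=1000.0, mttr=1.0)
longer_mttf = availability(mttf=10000.0, mttr=1.0)     # failures 10x rarer
faster_recovery = availability(mttf=1000.0, mttr=0.1)  # recovery 10x faster

print(f"baseline:        {baseline:.6f}")
print(f"10x longer MTTF: {longer_mttf:.6f}")
print(f"10x faster MTTR: {faster_recovery:.6f}")
```

Under these assumed numbers, cutting MTTR tenfold yields the same availability as making failures ten times rarer, which is the motivation for focusing on recovery time.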

Fault Detection and Localization
- Fault detection: determine if some component in the system has failed
- Fault localization: pinpoint the particular component that failed
- Low-level fault detection mechanism
  - Based on timeout, probing each component periodically with a heartbeat message
  - Cannot detect many application-level faults
- Recovery-oriented computing: focusing on application-level fault detection and localization
  - 75% of the recovery time is spent on application-level fault detection
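
A minimal sketch of the timeout-based, low-level detection described above. The class name `HeartbeatMonitor`, its methods, and the component names are hypothetical; timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Suspects a component when no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # component name -> time of the last heartbeat

    def record_heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def suspected(self, now: float) -> list:
        """Return the components whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_heartbeat("web", now=0.0)
monitor.record_heartbeat("db", now=0.0)
monitor.record_heartbeat("web", now=4.0)  # "db" has gone silent
suspects = monitor.suspected(now=5.0)
print(suspects)
```

A component that keeps answering heartbeats while returning wrong results is never suspected by this check, which is why the slide notes that many application-level faults go undetected.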

Microreboot and System-Level Undo/Redo
- Microreboot: many problems can be fixed by simply restarting the faulty component
  - Works best with component-based systems
- For problems that cannot be fixed by a microreboot: perform a system-level undo, fix the problem, then carry out a system-level redo
  - Based on checkpointing and logging

System Model for Recovery-Oriented Computing
- Three-tier architecture
  - Separating application logic and data management
  - Middle tier is stateless or maintains only session state
- Component-based middleware
  - Java Platform, Enterprise Edition (Java EE, often referred to as J2EE)
  - Key component: Enterprise JavaBeans (EJB)

Application-Level Fault Detection
- Fail-stop faults can be detected using timeouts
- Application-level faults can only be detected at the application level
- One plausible fault detection method: acceptance test
  - Developers would have to develop effective and efficient acceptance test routines
  - Not practical for Internet apps due to their scale, complexity, and rapid rate of change
- ROC-based approach: measure and monitor structural behaviors of an app
  - May detect app-level faults without a priori knowledge of the app details

Structural Behavior Monitoring
- Interaction patterns between different components reflect the app-level functionality
  - Each component implements a specific app function, e.g.,
    - Stateful session bean to manage a user’s shopping cart
    - A set of singleton session beans to keep track of inventory
- The internal structural behavior can be monitored to infer whether or not the app is functioning normally
- To monitor: log the runtime path for each end-user request, including all incoming msgs, outgoing msgs, method invocations, etc.
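
One way to picture a logged runtime path is as an ordered list of events per request. The sketch below is an assumption about how such a log could be structured; the names `PathEvent`, `RuntimePath`, the event kinds, and the component names are all hypothetical and are not taken from any particular middleware.

```python
from dataclasses import dataclass, field


@dataclass
class PathEvent:
    kind: str    # assumed labels: "incoming", "outgoing", or "invoke"
    source: str  # component that sent the message or made the call
    target: str  # component that received it


@dataclass
class RuntimePath:
    request_id: str
    events: list = field(default_factory=list)

    def log(self, kind: str, source: str, target: str) -> None:
        """Append one event in the order it was observed."""
        self.events.append(PathEvent(kind, source, target))

    def components(self) -> set:
        """All components touched while serving this request."""
        return {name for e in self.events for name in (e.source, e.target)}


path = RuntimePath("req-1")
path.log("incoming", "client", "web")
path.log("invoke", "web", "cart")
path.log("invoke", "cart", "inventory")
path.log("outgoing", "web", "client")
print(len(path.events), sorted(path.components()))
```

Keeping the events in order preserves the causal structure that later analysis, such as interaction counting or path shape extraction, relies on.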

Structural Behavior: Runtime Path Example
- Runtime path for a single end-user request
  - Spans 5 components
  - Consists of 10 events

Structural Behavior: Machine Learning
- Train reference models using machine learning
  - Historical reference model: train with aggregated runtime path data
    - Objective: anomaly detection based on historical behavior
    - May use real workload as well as synthetic workload that resembles real workload
  - Peer reference model: train with the most recent runtime path data
    - Objective: anomaly detection with respect to the peer components
    - Must train with real workload
- Fault (anomaly) detection: comparing observed patterns with those in the reference models

Component Interactions Modeling
- Focus on interactions between a component instance and all other component classes
  - More scalable: can cope with cases when there are many instances of each class
  - Suitable for using the Chi-square test for anomaly detection

Component Interactions Modeling
- Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and all the other n-1 component classes
  - We assume instances of the same class do not interact with each other
  - We assume that interactions are symmetric (i.e., request and reply)
- The weight assigned to each link is the probability of the component instance interacting with the linked component class
  - The sum of the weights on all links is 1, i.e., the component instance has a probability of 1 of interacting with other component classes

Component Interaction Model: Example
- Class A: web component, handles end-user requests
- Class B: app logic, handles conversations with end-users, 3 instances
- Class C and Class D: also app logic, representing shared state
- Class E: database server, persistent state

Component Interaction Model: Example
- Machine learning: determine link weights based on training data
- Training data
  - A issued 400 remote invocations on b1
  - b1 issued 300 local method invocations on C, and 300 invocations on D
  - Not important what happened between C & E, D & E
- Link weight calculation
  - Total number of interactions that occurred at the b1 instance: 1000
  - P(b1-A) = 400/1000 = 0.4
  - P(b1-C) = 300/1000 = 0.3
  - P(b1-D) = 300/1000 = 0.3
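
The link weight calculation above amounts to normalizing interaction counts. A minimal sketch, using the training counts from the slide; the function name `link_weights` is hypothetical.

```python
def link_weights(interaction_counts: dict) -> dict:
    """Turn per-class interaction counts for one instance into link probabilities."""
    total = sum(interaction_counts.values())
    if total == 0:
        raise ValueError("no interactions observed for this instance")
    return {cls: count / total for cls, count in interaction_counts.items()}


# Training data for instance b1, as given on the slide.
b1_counts = {"A": 400, "C": 300, "D": 300}
b1_weights = link_weights(b1_counts)
print(b1_weights)
```

Because the counts are divided by their own total, the resulting weights always sum to 1, matching the model's requirement.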

Anomaly Detection
- Comparison of current behavior with the trained behavior: use the Chi-square test
  - Prepare the observed data as a histogram
  - Compare the distributions using the formula: $D = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$
    - n: number of cells in the histogram
    - e_i: expected frequency in cell i
    - o_i: observed frequency in cell i
    - If e_i is 0, the cell should be pruned off
  - Each link is regarded as a cell
  - For an observation period of m requests, the expected frequency for link i is e_i = m * p_i, where p_i is the trained weight of link i
  - No anomaly: D = 0 ideally. In practice, D is not 0 due to randomness; it follows a chi-square distribution
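
The statistic D can be computed directly from the trained link weights and the observed counts. This is a sketch under the slide's definitions (e_i = m * p_i, cells with e_i = 0 pruned); the function name and the two-link example data are hypothetical.

```python
def chi_square_statistic(observed: dict, link_weights: dict, m: int) -> float:
    """D = sum over links of (o_i - e_i)^2 / e_i, with e_i = m * p_i."""
    d = 0.0
    for link, p in link_weights.items():
        expected = m * p
        if expected == 0:
            continue  # prune cells whose expected frequency is 0
        d += (observed.get(link, 0) - expected) ** 2 / expected
    return d


# Hypothetical two-link model, observed over m = 10 requests.
weights = {"X": 0.5, "Y": 0.5}
d_match = chi_square_statistic({"X": 5, "Y": 5}, weights, m=10)
d_skewed = chi_square_statistic({"X": 8, "Y": 2}, weights, m=10)
print(d_match, d_skewed)
```

Observations that match the trained weights exactly give D = 0, and D grows as the observed histogram drifts away from the expected one.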

Anomaly Detection: Chi-Square Test
- Anomaly detected: D > the (1-α) quantile of the chi-square distribution with k = n-1 degrees of freedom, at a level of significance α
  - Higher level of α => more sensitive => more false positives
- Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true
  - http://www.merriam-webster.com/dictionary/level%20of%20significance

Anomaly Detection: Chi-Square Test: Example
- Observation period: 100 requests
  - A issued 45 requests on b1
  - b1 issued 35 invocations on C, and 20 invocations on D
- Link(A-b1): expected 100*0.4 = 40, observed 45
- Link(C-b1): expected 100*0.3 = 30, observed 35
- Link(D-b1): expected 100*0.3 = 30, observed 20
- $D = \frac{(45-40)^2}{40} + \frac{(35-30)^2}{30} + \frac{(20-30)^2}{30} \approx 4.79$
- Chi-square test: the degree of freedom is 2 (only 3 cells); for α = 0.1, the 90% quantile is 4.6 => anomaly detected
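
The worked example can be checked numerically. The sketch below relies on one fact specific to 2 degrees of freedom: the chi-square CDF is 1 - exp(-x/2), so the (1-α) quantile has the closed form -2 ln α. For other degrees of freedom a statistics library would be needed; the helper name `chi2_quantile_df2` is hypothetical.

```python
import math

# Expected and observed frequencies for the three links of b1, from the slide.
expected = {"A-b1": 40.0, "C-b1": 30.0, "D-b1": 30.0}
observed = {"A-b1": 45, "C-b1": 35, "D-b1": 20}

d = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)


def chi2_quantile_df2(alpha: float) -> float:
    """(1 - alpha) quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)


for alpha in (0.1, 0.05):
    threshold = chi2_quantile_df2(alpha)
    print(f"alpha={alpha}: D={d:.2f}, threshold={threshold:.2f}, anomaly={d > threshold}")
```

With α = 0.1 the statistic exceeds the threshold and an anomaly is reported, while with the stricter α = 0.05 it does not, which illustrates the sensitivity trade-off behind the choice of α.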

Path Shapes Modeling
- The shape of a runtime path is defined to be the ordered set of component classes
- A path shape is represented as a tree in which a node represents a component class
  - The directional edge represents the causal relationship between two adjacent nodes

Path Shapes Modeling
- The probabilistic context-free grammar (PCFG) is used for path shape modeling (in Chomsky Normal Form, CNF)
  - A list of terminal symbols, Tk
    - Component classes in a path shape form Tk
  - A list of nonterminal symbols, Ni
    - Denote the stages of the production rules
    - N1: start symbol, often denoted as S
    - $: the end of a rule
    - All other nonterminal symbols are to be replaced by production rules (see below)
  - A list of production rules, Ni -> ζj (ζj: a list of terminals and nonterminals)
  - A list of probabilities Rij = P(Ni -> ζj)

Path Shape Modeling: Example
- Path shape for 4 end-user requests
- 100% probability for the call to transit from A to B
  - R1j: S -> A, p=1.0
  - R2j: A -> B, p=1.0

Path Shape Modeling: Example
- For B, 3 possible transitions: to C with 25%, to D with 25%, and to both C & D with 50% probability
  - R3j: B -> C, p=0.25 | B -> D, p=0.25 | B -> C D, p=0.5
- Once a call reaches C or D, it must transit to E, hence:
  - R4j: C -> E, p=1.0
  - R5j: D -> E, p=1.0
- E is the last stop for all:
  - R6j: E -> $, p=1.0
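
The rule probabilities in this example can be estimated by counting productions in the observed path shapes. The sketch below assumes the four requests split as one via C, one via D, and two via both C and D, which is consistent with the 25%/25%/50% figures; encoding each shape as a list of (left-hand side, right-hand side) productions is also an assumption made for illustration.

```python
from collections import Counter


def estimate_pcfg(path_shapes: list) -> dict:
    """P(N -> rhs) = count(N -> rhs) / count(N), over all observed productions."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for shape in path_shapes:
        for lhs, rhs in shape:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


head = [("S", ("A",)), ("A", ("B",))]
via_c = head + [("B", ("C",)), ("C", ("E",)), ("E", ("$",))]
via_d = head + [("B", ("D",)), ("D", ("E",)), ("E", ("$",))]
# In the tree for this shape, both C and D lead to E, so E -> $ appears twice.
via_both = head + [("B", ("C", "D")), ("C", ("E",)), ("D", ("E",)),
                   ("E", ("$",)), ("E", ("$",))]

grammar = estimate_pcfg([via_c, via_d, via_both, via_both])
for (lhs, rhs), p in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}, p={p}")
```

Counting relative frequencies per left-hand side reproduces the probabilities listed for R1j through R6j.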

Path Shape Modeling: Anomaly Detection
- The path shape of each new request can be checked to see if it conforms to the grammar
- An anomaly is detected if a path shape does not conform to the grammar
- The PCFG itself only detects faults; it does not pinpoint the root cause (localization of fault)
  - Need to use other methods, such as decision trees
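
A conformance check can be sketched as follows. The grammar is the one from the example slides; treating a shape as anomalous when any of its productions has zero probability under the grammar is an assumption about how "does not conform" might be operationalized, and the function names are hypothetical.

```python
# Production rules and probabilities from the example slides.
GRAMMAR = {
    ("S", ("A",)): 1.0,
    ("A", ("B",)): 1.0,
    ("B", ("C",)): 0.25,
    ("B", ("D",)): 0.25,
    ("B", ("C", "D")): 0.5,
    ("C", ("E",)): 1.0,
    ("D", ("E",)): 1.0,
    ("E", ("$",)): 1.0,
}


def shape_probability(shape: list, grammar: dict) -> float:
    """Product of the rule probabilities used by the shape; 0.0 if a rule is unknown."""
    prob = 1.0
    for production in shape:
        prob *= grammar.get(production, 0.0)
    return prob


def is_anomalous(shape: list, grammar: dict) -> bool:
    return shape_probability(shape, grammar) == 0.0


normal = [("S", ("A",)), ("A", ("B",)), ("B", ("C",)), ("C", ("E",)), ("E", ("$",))]
# Hypothetical faulty request: B calls E directly, skipping C and D.
skipped = [("S", ("A",)), ("A", ("B",)), ("B", ("E",)), ("E", ("$",))]

print(shape_probability(normal, GRAMMAR), is_anomalous(normal, GRAMMAR))
print(shape_probability(skipped, GRAMMAR), is_anomalous(skipped, GRAMMAR))
```

The check flags the request as anomalous but says nothing about which component is at fault, which is why a separate localization method is still needed.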