Transcript of Chapter 3, Building Dependable Distributed Systems
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org

Building Dependable Distributed Systems, Copyright Wenbing Zhao 1


Outline
Recovery-oriented computing:
- Overview
- Application-level fault detection
- Structural behavior monitoring
- Path shape analysis

Building Dependable Distributed Systems, Copyright Wenbing Zhao 2


Recovery-Oriented Computing
On the availability of soft real-time systems:
- Availability = MTTF / (MTTF + MTTR)
- MTTF: mean time to failure
- MTTR: mean time to recover
- Availability can be improved by increasing MTTF as well as by reducing MTTR (a small numeric example follows below)
Recovery-oriented computing focuses on reducing MTTR:
- Making fault detection faster and more accurate
- Making recovery faster
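To make the availability formula concrete, here is a minimal Java sketch; the class name and the MTTF/MTTR numbers are made up for illustration:

```java
// Minimal sketch (not from the slides): how MTTF and MTTR determine availability,
// and how shrinking MTTR improves it. The numbers are hypothetical.
public class AvailabilityDemo {
    static double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double mttfHours = 1000.0; // hypothetical mean time to failure
        System.out.printf("MTTR = 1.0 h -> availability = %.5f%n",
                availability(mttfHours, 1.0));  // ~0.99900
        System.out.printf("MTTR = 0.1 h -> availability = %.5f%n",
                availability(mttfHours, 0.1));  // ~0.99990 (10x faster recovery)
    }
}
```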

Building Dependable Distributed Systems, Copyright Wenbing Zhao 3


Fault Detection and Localization
Fault detection: determine whether some component in the system has failed.
Fault localization: pinpoint the particular component that failed.
Low-level fault detection mechanism:
- Based on timeouts: each component is probed periodically with a heartbeat message (see the sketch below)
- Cannot detect many application-level faults
Recovery-oriented computing focuses on application-level fault detection and localization: 75% of the recovery time is spent on application-level fault detection.
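Below is a minimal sketch of the timeout-based, heartbeat-driven detection just described; the class and method names are assumptions, not part of any particular middleware:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (assumed names, not from the slides): a detector declares a
// component suspect if no heartbeat has arrived within the timeout window.
public class HeartbeatDetector {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message from a component is received.
    public void onHeartbeat(String componentId) {
        lastHeartbeat.put(componentId, System.currentTimeMillis());
    }

    // True if the component missed its heartbeat deadline (possible fail-stop fault).
    public boolean isSuspect(String componentId) {
        Long last = lastHeartbeat.get(componentId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}
```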

Building Dependable Distributed Systems, Copyright Wenbing Zhao 4


Microreboot and System-Level Undo/Redo
Microreboot: many problems can be fixed by simply restarting the faulty component.
- Works best with component-based systems (see the sketch below)
For problems that cannot be fixed by a microreboot, perform a system-level undo, fix the problem, then carry out a system-level redo.
- Based on checkpointing and logging
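As a rough illustration only, a microreboot amounts to restarting a single component in place while the rest of the system keeps running; the interface and names below are hypothetical and not tied to any specific framework:

```java
// Hypothetical interface for a microrebootable component (illustration only).
interface Microrebootable {
    void shutdown();   // release the component's (session-only) state and resources
    void restart();    // re-initialize the component without touching other components
}

public class MicrorebootDemo {
    // Restart only the faulty component; other components are unaffected.
    static void microreboot(Microrebootable faultyComponent) {
        faultyComponent.shutdown();
        faultyComponent.restart();
    }

    public static void main(String[] args) {
        Microrebootable cart = new Microrebootable() {
            public void shutdown() { System.out.println("shopping-cart bean shut down"); }
            public void restart()  { System.out.println("shopping-cart bean restarted"); }
        };
        microreboot(cart);
    }
}
```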

Building Dependable Distributed Systems, Copyright Wenbing Zhao 5


System Model for Recovery-Oriented Computing
Three-tier architecture:
- Separates application logic and data management
- Middle tier is stateless or maintains only session state
Component-based middleware: Java Platform, Enterprise Edition (Java EE, formerly referred to as J2EE)
- Key component: Enterprise JavaBean (EJB)

Building Dependable Distributed Systems, Copyright Wenbing Zhao 6


Application-Level Fault Detection
Fail-stop faults can be detected using timeouts; application-level faults can only be detected at the application level.
One plausible fault detection method: acceptance tests
- Developers would have to write effective and efficient acceptance test routines (a hypothetical example is sketched below)
- Not practical for Internet applications due to their scale, complexity, and rapid rate of change
ROC-based approach: measure and monitor the structural behaviors of an application
- May detect application-level faults without a priori knowledge of the application's details
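For illustration, an acceptance test is simply a developer-written predicate on a component's output; the shopping-cart check below is a hypothetical example, not taken from the slides:

```java
import java.math.BigDecimal;

// Hypothetical acceptance test (illustration only): verify that a computed
// order total is plausible before the reply is sent back to the user.
public class OrderAcceptanceTest {
    static boolean acceptOrderTotal(BigDecimal total, int itemCount) {
        // Reject obviously faulty outputs: negative totals, or a non-zero
        // total for an empty cart.
        if (total.signum() < 0) return false;
        if (itemCount == 0 && total.signum() != 0) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(acceptOrderTotal(new BigDecimal("19.99"), 2)); // true
        System.out.println(acceptOrderTotal(new BigDecimal("-5.00"), 2)); // false
    }
}
```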

Building Dependable Distributed Systems, Copyright Wenbing Zhao 7


Structural Behavior Monitoring
Interaction patterns between different components reflect the application-level functionality. Each component implements a specific application function, e.g.:
- A stateful session bean to manage a user's shopping cart
- A set of singleton session beans to keep track of inventory
The internal structural behavior can therefore be monitored to infer whether or not the application is functioning normally.
To monitor: log the runtime path for each end-user request, including all incoming messages, outgoing messages, method invocations, etc. (a sketch of such an event record follows below).
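A minimal sketch of what one logged runtime-path event might look like; the field names are assumptions rather than the actual instrumentation format:

```java
// Hypothetical record of one event on a runtime path (illustration only).
public class PathEvent {
    public enum Kind { INCOMING_MSG, OUTGOING_MSG, METHOD_INVOCATION }

    final String requestId;        // ties the event to one end-user request
    final String componentClass;   // e.g., "ShoppingCartBean"
    final String componentInstance;
    final Kind kind;
    final long timestampMillis;

    PathEvent(String requestId, String componentClass, String componentInstance,
              Kind kind, long timestampMillis) {
        this.requestId = requestId;
        this.componentClass = componentClass;
        this.componentInstance = componentInstance;
        this.kind = kind;
        this.timestampMillis = timestampMillis;
    }
}
```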

Building Dependable Distributed Systems, Copyright Wenbing Zhao 8


Structural Behavior: Runtime Path Example
Runtime path for a single end-user request:
- Spans 5 components
- Consists of 10 events

Building Dependable Distributed Systems, Copyright Wenbing Zhao 9


Structural Behavior: Machine Learning
Train reference models using machine learning.
Historical reference model: trained with aggregated runtime path data
- Objective: anomaly detection based on historical behavior
- May use real workload as well as synthetic workload that resembles the real workload
Peer reference model: trained with the most recent runtime path data
- Objective: anomaly detection with respect to the peer components
- Must be trained with real workload
Fault (anomaly) detection: compare observed patterns with those in the reference models.

Building Dependable Distributed Systems, Copyright Wenbing Zhao 10


Component Interactions Modeling
Focus on the interactions between a component instance and all other component classes:
- More scalable: can cope with cases where there are many instances of each class
- Suitable for using the chi-square test for anomaly detection

Building Dependable Distributed Systems, Copyright Wenbing Zhao 11


Component Interactions Modeling
Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and the other n-1 component classes.
- We assume instances of the same class do not interact with each other
- We assume that interactions are symmetric (i.e., request and reply)
- The weight assigned to each link is the probability of the component instance interacting with the linked component class
- The weights on all links sum to 1, i.e., the component instance interacts with the other component classes with probability 1 (see the sketch below)
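A minimal sketch of this model, with assumed class and method names: interaction counts between one instance and each peer component class are normalized into link weights that sum to 1:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a component interaction model for a single component instance.
// Link weights are the fractions of that instance's interactions that go to
// each peer component class (they sum to 1 over all observed links).
public class InteractionModel {
    private final Map<String, Long> interactionCounts = new HashMap<>();

    // Record one interaction between this instance and a peer component class.
    public void recordInteraction(String peerClass) {
        interactionCounts.merge(peerClass, 1L, Long::sum);
    }

    // Normalize the counts into link weights (probabilities).
    public Map<String, Double> linkWeights() {
        long total = interactionCounts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Long> e : interactionCounts.entrySet()) {
            weights.put(e.getKey(), e.getValue().doubleValue() / total);
        }
        return weights;
    }
}
```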

Building Dependable Distributed Systems, Copyright Wenbing Zhao 12


Component Interaction Model: Example
- Class A: web component, handles end-user requests
- Class B: application logic, handles conversations with end-users, 3 instances
- Classes C and D: also application logic, representing shared state
- Class E: database server, persistent state

Building Dependable Distributed Systems, Copyright Wenbing Zhao 13


Component Interaction Model: Example
Machine learning: determine the link weights based on training data.
Training data:
- A issued 400 remote invocations on b1
- b1 issued 300 local method invocations on C, and 300 invocations on D
- What happened between C & E and D & E is not important for b1's model
Link weight calculation (also shown in code below):
- Total number of interactions observed at instance b1: 1000
- P(b1-A) = 400/1000 = 0.4
- P(b1-C) = 300/1000 = 0.3
- P(b1-D) = 300/1000 = 0.3
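The same link-weight calculation expressed as a small, self-contained Java example using the training counts above:

```java
public class LinkWeightExample {
    public static void main(String[] args) {
        long withA = 400, withC = 300, withD = 300; // training counts at instance b1
        long total = withA + withC + withD;         // 1000 interactions in total
        System.out.printf("P(b1-A) = %.1f%n", (double) withA / total); // 0.4
        System.out.printf("P(b1-C) = %.1f%n", (double) withC / total); // 0.3
        System.out.printf("P(b1-D) = %.1f%n", (double) withD / total); // 0.3
    }
}
```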

Building Dependable Distributed Systems, Copyright Wenbing Zhao 14


Anomaly Detection
Compare the current behavior with the trained behavior using the chi-square test.
- Prepare the observed data as a histogram
- Compare the distributions using the formula: D = Σ (oi - ei)^2 / ei, summed over the n cells
  - n: number of cells in the histogram
  - ei: expected frequency in cell i
  - oi: observed frequency in cell i
  - If ei is 0, the cell should be pruned off
- Each link is regarded as a cell
- For an observation period of m requests, the expected frequency for link i is ei = m * pi
- No anomaly: ideally D = 0. In practice, D is not 0 due to randomness; it follows a chi-square distribution.

Building Dependable Distributed Systems, Copyright Wenbing Zhao 15


Anomaly Detection: Chi-Square Test
Anomaly detected if:
- D > the 1-α quantile of the chi-square distribution with k = n-1 degrees of freedom, at a level of significance α
- A higher level of significance α => more sensitive detection => more false positives
Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true
http://www.merriam-webster.com/dictionary/level%20of%20significance

Building Dependable Distributed Systems, Copyright Wenbing Zhao 16


Anomaly Detection: Chi-Square Test Example
Observation period: 100 requests
- A issued 45 requests on b1
- b1 issued 35 invocations on C, and 20 invocations on D
Expected vs. observed frequencies:
- Link (A-b1): expected 100 * 0.4 = 40, observed 45
- Link (C-b1): expected 100 * 0.3 = 30, observed 35
- Link (D-b1): expected 100 * 0.3 = 30, observed 20
D = (45-40)^2/40 + (35-30)^2/30 + (20-30)^2/30 = 4.79
Chi-square test: the degree of freedom is 2 (only 3 cells); for α = 0.1, the 90% quantile is about 4.61, so D > 4.61 => anomaly detected.
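The test statistic for this example can be reproduced with a few lines of Java; the 90% quantile of the chi-square distribution with 2 degrees of freedom (about 4.605) is hardcoded here rather than obtained from a statistics library:

```java
public class ChiSquareExample {
    // D = sum over links of (observed - expected)^2 / expected
    static double chiSquareStatistic(double[] expected, double[] observed) {
        double d = 0.0;
        for (int i = 0; i < expected.length; i++) {
            d += (observed[i] - expected[i]) * (observed[i] - expected[i]) / expected[i];
        }
        return d;
    }

    public static void main(String[] args) {
        // Observation period of m = 100 requests; expected frequency = m * link weight.
        double[] expected = {100 * 0.4, 100 * 0.3, 100 * 0.3}; // links A-b1, C-b1, D-b1
        double[] observed = {45, 35, 20};
        double d = chiSquareStatistic(expected, observed);      // ~4.79
        double critical = 4.605; // 90% quantile, chi-square with 2 degrees of freedom
        System.out.printf("D = %.2f, anomaly = %b%n", d, d > critical);
    }
}
```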

Building Dependable Distributed Systems, Copyright Wenbing Zhao 17


Path Shapes Modeling
The shape of a runtime path is defined to be the ordered set of component classes.
A path shape is represented as a tree in which a node represents a component class.
- A directional edge represents the causal relationship between two adjacent nodes

Building Dependable Distributed Systems, Copyright Wenbing Zhao 18


Path Shapes Modeling
The probabilistic context-free grammar (PCFG), in Chomsky Normal Form (CNF), is used for path shape modeling. A PCFG consists of:
- A list of terminal symbols Tk: the component classes in a path shape form the terminals
- A list of nonterminal symbols Ni: they denote the stages of the production rules
  - N1: the start symbol, often denoted as S
  - $: the end of a rule
  - All other nonterminal symbols are to be replaced by production rules (see below)
- A list of production rules Ni -> ζj, where ζj is a list of terminals and nonterminals
- A list of probabilities Rij = P(Ni -> ζj)

Building Dependable Distributed Systems, Copyright Wenbing Zhao 19


Path Shape Modeling: Example
Path shape for 4 end-user requests.
There is a 100% probability for the call to transit from A to B:
- R1j: S -> A, p = 1.0
- R2j: A -> B, p = 1.0

Building Dependable Distributed Systems, Copyright Wenbing Zhao 20


Path Shape Modeling: Example
For B, there are 3 possible transitions: to C with 25% probability, to D with 25% probability, and to both C and D with 50% probability:
- R3j: B -> C, p = 0.25 | B -> D, p = 0.25 | B -> CD, p = 0.5
Once a call reaches C or D, it must transit to E, hence:
- R4j: C -> E, p = 1.0
- R5j: D -> E, p = 1.0
E is the last stop for all paths:
- R6j: E -> $, p = 1.0
(A sketch of this grammar in code follows below.)
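A rough sketch, with an assumed representation rather than the actual PCFG implementation, of how the example grammar can be encoded and used to accept or reject a path shape:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the example grammar as a map from "LHS->RHS" to probability.
// A path shape conforms if every production it uses has nonzero probability;
// the product of the probabilities is the likelihood of the shape.
public class PathShapeGrammar {
    private final Map<String, Double> ruleProb = new HashMap<>();

    public void addRule(String lhs, String rhs, double p) {
        ruleProb.put(lhs + "->" + rhs, p);
    }

    // Returns 0.0 if any production is not in the grammar (anomaly),
    // otherwise the probability of the whole path shape.
    public double shapeProbability(String[][] productions) {
        double p = 1.0;
        for (String[] rule : productions) {
            p *= ruleProb.getOrDefault(rule[0] + "->" + rule[1], 0.0);
        }
        return p;
    }

    public static void main(String[] args) {
        PathShapeGrammar g = new PathShapeGrammar();
        g.addRule("S", "A", 1.0);
        g.addRule("A", "B", 1.0);
        g.addRule("B", "C", 0.25);
        g.addRule("B", "D", 0.25);
        g.addRule("B", "CD", 0.5);
        g.addRule("C", "E", 1.0);
        g.addRule("D", "E", 1.0);
        g.addRule("E", "$", 1.0);

        // A request that went A -> B -> C -> E: probability 0.25.
        String[][] normal = {{"S","A"},{"A","B"},{"B","C"},{"C","E"},{"E","$"}};
        // A request that skipped B and called C directly from A: not in the grammar.
        String[][] anomalous = {{"S","A"},{"A","C"},{"C","E"},{"E","$"}};

        System.out.println(g.shapeProbability(normal));    // 0.25
        System.out.println(g.shapeProbability(anomalous)); // 0.0 -> anomaly
    }
}
```

A path shape whose probability comes out as 0 uses at least one production outside the grammar and is therefore flagged as an anomaly, which matches the detection rule on the next slide.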

Building Dependable Distributed Systems, Copyright Wenbing Zhao 21


Path Shape Modeling: Anomaly Detection
The path shapes of new requests can be checked to see whether they conform to the grammar.
- An anomaly is detected if a path shape does not conform to the grammar
- PCFG by itself only detects a fault; it cannot pinpoint the root cause (fault localization)
- Another method, such as a decision tree, is needed for localization

Building Dependable Distributed Systems, Copyright Wenbing Zhao 22