EEC 688/788 Secure and Dependable Computing Lecture 11 Wenbing Zhao Department of Electrical and...
Date posted: 22-Dec-2015
EEC 688/788 Secure and Dependable Computing
Lecture 11
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
[email protected]@ieee.org
04/19/23 EEC688/788: Secure & Dependable Computing, Wenbing Zhao
Outline
• Dependability concepts
• Fault, error and failure
• Fault/failure detection in distributed systems
• Consensus in asynchronous distributed systems
Dependability and its Attributes
• Two alternative definitions (each focuses on a different aspect)
• Def#1 of dependability: the ability to deliver service that can justifiably be trusted
– Aimed at generalizing availability, reliability, safety, confidentiality, integrity, and maintainability, which then become attributes of dependability
– Focus on trust, i.e., accepted dependence
– => The dependence of system A on system B is the extent to which system A's dependability is (or would be) affected by that of system B
Dependability and its Attributes
• Def#2 of dependability: the ability to avoid service failures that are more frequent or more severe than is acceptable
– A system can, and usually does, fail. Is it still dependable? When does it become undependable?
– This definition gives the criterion for deciding whether, in spite of service failures, a system is still to be regarded as dependable
– Dependability failure fault(s)
Dependability Related Terminology
• A system is an entity that interacts with other entities, i.e., other systems, including hardware, software, humans, and the physical world with its natural phenomena
• These other systems are the environment of the given system
• The system boundary is the common frontier between the system and its environment
[Figure: a system, its environment, and the system boundary between them]
Dependability Related Terminology
• Service delivered by a system: work done that benefits its users
• User: another system that interacts with the former
• Function of a system: what the system is intended to do
• (Functional) Specification: description of the system function
• Correct service: when the delivered service implements the system function
Attributes of a Dependable System
• For a system to be dependable, it must be
– Available - e.g., ready for use when we need it
– Reliable - e.g., able to provide continuity of service while we are using it
– Safe - e.g., does not have a catastrophic consequence on the environment
– Secure - e.g., able to preserve confidentiality
Quantitative Dependability Measures
• Reliability - a measure of continuous delivery of proper service - or, equivalently, of the time to failure
– It is the probability of surviving (potentially despite failures) over an interval of time
• For example, the reliability requirement might be stated as a 0.999999 reliability for a 10-hour mission. In other words, the probability of failure during the mission may be at most 10^-6
• Hard real-time systems such as flight control and process control demand high reliability, because a failure could mean loss of life
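To make the mission figure concrete, one common modeling assumption (not stated in the slides) is a constant failure rate λ, which gives exponentially decaying reliability. Under that assumption only, the 10-hour requirement translates into a bound on λ:

```latex
R(t) = e^{-\lambda t}, \qquad
R(10\,\mathrm{h}) \ge 1 - 10^{-6}
\;\Longrightarrow\;
\lambda \le \frac{-\ln\left(1 - 10^{-6}\right)}{10\,\mathrm{h}} \approx 10^{-7}\ \text{per hour}
```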
Quantitative Dependability Measures
• Availability - a measure of the delivery of correct service with respect to the alternation of correct service and out-of-service
– It is the probability of being operational at a given instant of time
• A 0.999999 availability means that the system is not operational at most one hour in a million hours
• A system with high availability may in fact fail. However, failure frequency and recovery time should be small enough to achieve the desired availability
• Soft real-time systems such as telephone switching and airline reservation require high availability
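As a rough illustration of how failure frequency and recovery time combine, the sketch below uses the common steady-state formula A = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The formula, the class name AvailabilityDemo, and the sample numbers are assumptions for illustration and do not come from the slides.

```java
// Illustrative sketch only: names and figures are assumptions, not from the lecture.
final class AvailabilityDemo {
    // Steady-state availability: the fraction of time the system is operational.
    static double availability(double mttfHours, double mttrHours) {
        return mttfHours / (mttfHours + mttrHours);
    }

    // Expected downtime over a period for a given availability.
    static double downtimeHours(double availability, double periodHours) {
        return (1.0 - availability) * periodHours;
    }

    public static void main(String[] args) {
        // 0.999999 availability over one million hours: about one hour of downtime.
        System.out.printf("downtime = %.3f h%n", downtimeHours(0.999999, 1_000_000));
        // A system that fails every 999 h on average and needs 1 h to recover.
        System.out.printf("availability = %.4f%n", availability(999, 1));
    }
}
```

Shortening recovery time raises availability just as effectively as making failures rarer, which is why a system that fails fairly often can still be highly available.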
Fault, Error, and Failure
• The adjudged or hypothesized cause of an error is called a fault
• An error is a manifestation of a fault in a system, in which the logical state of an element differs from its intended value
• A service failure occurs if the error propagates to the service interface and causes the service delivered by the system to deviate from correct service
• The failure of a component causes a permanent or transient fault in the system that contains the component
• Service failure of a system causes a permanent or transient external fault for the other system(s) that receive service from the given system
Fault
• Faults can arise during all stages in a computer system's evolution - specification, design, development, manufacturing, assembly, and installation - and throughout its operational life
• Most faults that occur before full system deployment are discovered through testing and eliminated
• Faults that are not removed can reduce a system's dependability when it is in the field
• A fault can be classified by its duration, nature of output, and correlation to other faults
Fault Types - Based on Duration
• Permanent faults are caused by irreversible device/software failures within a component due to damage, fatigue, improper manufacturing, or bad design and implementation
– Permanent software faults are also called Bohrbugs
– Easier to detect
• Transient/intermittent faults are triggered by environmental disturbances or incorrect design
– Transient software faults are also referred to as Heisenbugs
– Studies show that Heisenbugs make up the majority of software faults
– Harder to detect
Fault Types - Based on Nature of Output
• Malicious fault: a fault that causes a unit to behave arbitrarily or maliciously. Also referred to as a Byzantine fault
– A sensor sending conflicting outputs to different processors
– A compromised software system that attempts to cause service failure
• Non-malicious faults: the opposite of malicious faults
– Faults that are not caused with malicious intent
– Faults that exhibit themselves consistently to all observers, e.g., fail-stop
• Malicious faults are much harder to detect than non-malicious faults
Fail-Stop System
• A system is said to be fail-stop if it responds to up to a certain maximum number of faults by simply stopping, rather than producing incorrect output
• A fail-stop system typically has many processors running the same tasks and comparing the outputs. If the outputs do not agree, the whole unit turns itself off
• A system is said to be fail-safe if one or more safe states can be identified, that can be accessed in case of a system failure, in order to avoid catastrophe
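The compare-and-stop behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the lecture's design: the class FailStopUnit is hypothetical, and each IntSupplier stands in for a redundant processor running the same task.

```java
import java.util.List;
import java.util.function.IntSupplier;

// Illustrative sketch only: a unit that stops instead of emitting a wrong answer.
final class FailStopUnit {
    private final List<IntSupplier> replicas; // stand-ins for redundant processors
    private boolean halted = false;

    FailStopUnit(List<IntSupplier> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one replica is required");
        }
        this.replicas = replicas;
    }

    boolean isHalted() {
        return halted;
    }

    // Runs the same task on every replica and compares the outputs.
    // Any disagreement halts the whole unit; a halted unit produces no more output.
    int execute() {
        if (halted) {
            throw new IllegalStateException("unit has stopped");
        }
        int first = replicas.get(0).getAsInt();
        for (int i = 1; i < replicas.size(); i++) {
            if (replicas.get(i).getAsInt() != first) {
                halted = true;
                throw new IllegalStateException("replica outputs disagree; stopping");
            }
        }
        return first;
    }
}
```

Note that comparison only detects faults that make the replicas disagree; if every replica produces the same wrong value, the sketch would still return it.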
Fault Types - Based on Correlation
• Component faults may be independent of one another or correlated
• A fault is said to be independent if it does not directly or indirectly cause another fault
• Faults are said to be correlated if they are related. Faults could be correlated due to physical or electrical coupling of components
• Correlated faults are more difficult to detect than independent faults
Fail Fast to Reduce Heisenbugs
• The bugs that software developers hate most:
– The ones that show up only after hours of successful operation, under unusual circumstances
– The stack trace usually does not provide useful information
• This kind of bug can have many causes, such as
– Not checking the boundary of an array
– Invalid defensive programming <= what fail fast addresses
• Reference
– http://www.martinfowler.com/ieeeSoftware/failFast.pdf
Fail Fast to Reduce Heisenbugs
• Invalid defensive programming
– Making your software robust by working around problems automatically
– This results in the software "failing slowly"
– That is, it facilitates error propagation - the program continues working right after an error but fails in strange ways later on
• Example:
public int maxConnections() {
    // getProperty is assumed to return null when the key is missing
    String property = getProperty("maxConnections");
    if (property == null) {
        return 10;  // silently falls back to a default, hiding the missing setting
    }
    else {
        return Integer.parseInt(property);
    }
}
Fail Fast to Reduce Heisenbugs
• Fail fast programming
– When a problem occurs, it fails immediately & visibly
– It may sound like it would make your software more fragile, but it actually makes it more robust
– Bugs are easier to find and fix, so fewer go into production
• Example:
public int maxConnections() {
    String property = getProperty("maxConnections");
    if (property == null) {
        // fail immediately, naming the missing setting and where it was expected
        throw new IllegalStateException("maxConnections property not found in " + this.configFilePath);
    }
    else {
        return Integer.parseInt(property);
    }
}
Approaches to Achieving Dependability
• Fault Avoidance - how to prevent, by construction, the fault occurrence or introduction
• Fault Removal - how to minimize, by verification, the presence of faults
• Fault Tolerance - how to provide, by redundancy, a service complying with the specification in spite of faults
• Fault Forecasting - how to estimate, by evaluation, the presence, the creation, and the consequence of faults
Graceful Degradation
• If a specified fault scenario develops, the system must still provide a specified level of service. Ideally, the performance of the system degrades gracefully
– The system must not suddenly collapse when a fault occurs, or as the size of the faults increases
– Rather, it should continue to execute part of the workload correctly
Failure Detection in Distributed Systems
• Consider the failure detection problem in an asynchronous distributed system, where there is
– No upper bound on processing time
– No upper bound on clock drift rate
– No upper bound on network delay
• In an asynchronous distributed system, you cannot tell a crashed process from a slow one, even if you can assume that messages are sequenced and retransmitted (an arbitrary number of times), so that they eventually get through
– This led Fischer, Lynch and Paterson to prove that consensus cannot be guaranteed in a fully asynchronous distributed system in which even one process may crash
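The difficulty of telling a crashed process from a slow one shows up in even the simplest timeout-based detector. The sketch below is a hypothetical illustration (the class HeartbeatDetector and its methods are assumptions, not part of the lecture): it can only suspect a process, and a late heartbeat overturns the suspicion.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a timeout-based, and therefore unreliable, failure detector.
final class HeartbeatDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records a heartbeat from a process at the given local time.
    void onHeartbeat(String processId, long nowMillis) {
        lastHeartbeat.put(processId, nowMillis);
    }

    // A process is suspected if it has never been heard from or has been silent
    // for longer than the timeout. This is a suspicion, not proof of a crash.
    boolean isSuspected(String processId, long nowMillis) {
        Long last = lastHeartbeat.get(processId);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

With a 1000 ms timeout, a process whose heartbeats arrive at times 0 and 1500 is suspected at time 1200 and trusted again at time 1600, even though it never crashed. No choice of timeout removes such mistakes when delays are unbounded.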
Consensus Problem
• Safety:
– Only a value that has been proposed may be chosen
– Only a single value is chosen, and
– A process never learns that a value has been chosen unless it actually has been
• Liveness:
– Some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value
Impossibility Results
• FLP impossibility of consensus
– A single faulty process can prevent consensus
– Because a slow process is indistinguishable from a crashed one
• Chandra/Toueg showed that the FLP impossibility applies to many problems, not just consensus
– In particular, they show that FLP applies to group membership and reliable multicast
– So these practical problems are impossible in asynchronous systems
– They also look at the weakest condition under which consensus can be solved
• Ways to bypass the impossibility result
– Use an unreliable failure detector
– Use a randomized consensus algorithm
The Paxos Algorithm – Consensus for Asynchronous Systems
• Contribution: safety and liveness are considered separately. Safety is always guaranteed, and liveness is ensured during periods of synchrony
• Participants of the algorithm are divided into three categories
– Proposers: those who propose values
– Acceptors: those who decide which value to choose
– Learners: those who are interested in learning the value chosen
The Paxos Algorithm
• How to choose a value
– Use a single acceptor: straightforward but not fault tolerant
– Use a number of acceptors: a value is chosen if a majority of the acceptors have accepted it
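The majority rule can be illustrated with a little arithmetic. The helper class Quorum below is a hypothetical sketch, not part of the lecture; it assumes n acceptors and counts how many must accept a value and how many may crash.

```java
// Illustrative sketch only: majority-quorum arithmetic for n acceptors.
final class Quorum {
    // Smallest number of acceptors that forms a majority of n.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // A value counts as chosen once a majority of the acceptors have accepted it.
    static boolean isChosen(int acceptCount, int n) {
        return acceptCount >= majority(n);
    }

    // Number of acceptors that may crash while a majority can still be reached.
    static int toleratedCrashes(int n) {
        return n - majority(n);
    }
}
```

Because any two majorities of the same acceptor set share at least one acceptor, two different values cannot both be accepted by a majority unless some acceptor accepts both. With five acceptors, three acceptances choose a value and two crashes are tolerated; a single acceptor tolerates none.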
The Paxos Algorithm
• Requirements for choosing a value
– P1. An acceptor must accept the first proposal that it receives
– P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v
• Since the proposal numbers are totally ordered, P2 guarantees the safety property
The Paxos Algorithm
• How to guarantee P2?
– P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v
• But what if an acceptor that has never accepted v accepts a proposal with a different value v'?
– P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v
• P2b implies P2a, which implies P2
The Paxos Algorithm
• How to ensure P2b?
• P2c: For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either
– (a) no acceptor in S has accepted any proposal numbered less than n, or
– (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S
The Paxos Algorithm
• To ensure P2c, an acceptor must promise:
– It will not accept any more proposals numbered less than n, once it has responded to a prepare request numbered n
The Paxos Algorithm
• Phase 1.
– (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
– (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
The Paxos Algorithm
• Phase 2.
– (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
– (b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
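The acceptor rules of Phase 1(b) and Phase 2(b), and the proposer's value choice in Phase 2(a), can be sketched as below. This is a minimal single-decree illustration under stated assumptions: the classes Proposal, Promise, Acceptor and Proposer are hypothetical names, values are plain strings, and networking, persistence and learners are left out.

```java
import java.util.List;

// Illustrative sketch only: single-decree Paxos acceptor state and proposer value choice.
final class Proposal {
    final long number;
    final String value;
    Proposal(long number, String value) { this.number = number; this.value = value; }
}

final class Promise {
    final boolean granted;
    final Proposal accepted; // highest-numbered proposal accepted so far, or null
    Promise(boolean granted, Proposal accepted) { this.granted = granted; this.accepted = accepted; }
}

final class Acceptor {
    private long promised = -1; // highest prepare number responded to so far
    private Proposal accepted;  // highest-numbered accepted proposal, if any

    // Phase 1(b): promise only if n exceeds every prepare number already answered,
    // and report the highest-numbered proposal accepted so far (if any).
    Promise onPrepare(long n) {
        if (n > promised) {
            promised = n;
            return new Promise(true, accepted);
        }
        return new Promise(false, null);
    }

    // Phase 2(b): accept unless a prepare with a greater number was already answered.
    boolean onAccept(Proposal p) {
        if (p.number >= promised) {
            accepted = p;
            return true;
        }
        return false;
    }
}

final class Proposer {
    // Phase 2(a): use the value of the highest-numbered proposal reported in the
    // promises, or the proposer's own value if no proposal was reported.
    static String chooseValue(List<Promise> promises, String ownValue) {
        Proposal best = null;
        for (Promise pr : promises) {
            if (pr.accepted != null && (best == null || pr.accepted.number > best.number)) {
                best = pr.accepted;
            }
        }
        return best == null ? ownValue : best.value;
    }
}
```

A real deployment would also have to write promised and accepted to stable storage before replying, so that an acceptor keeps its promises across a restart; the sketch keeps them in memory only.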
The Paxos Algorithm
[Slide figure not recovered]
Paxos Examples
[Three slides of example figures not recovered]