Peter Popov Centre for Software Reliability City University, United Kingdom [email protected]

Peter Popov, IPP-BAS, Sofia, 10 May 2006

Software fault-tolerance with off-the-shelf components: from conceptual models to empirical studies

Peter Popov

Centre for Software Reliability

City University, United Kingdom

[email protected]

www.csr.city.ac.uk


Talk outline

• What is CSR?
• Software design diversity: why?
• Conceptual models of software fault-tolerance
  – Early conceptual models ‘on average’
  – New results
    • effect of testing
    • models ‘in particular’
• Experimental studies with diverse redundancy: off-the-shelf RDBMS
  – fault diversity
  – performance implications
• Conclusions


What is CSR?

• The Centre for Software Reliability (CSR) at City University, London, is an independent research centre in the School of Informatics, founded in 1983 by Bev Littlewood.
• Over the years, CSR has attracted over 8 million in international and UK research funding, and has built an international reputation for research achievements in the areas of:
  – Software dependability modelling (particularly safety and reliability)
  – Software fault tolerance
  – Software metrics and quality assurance
  – Fundamental issues for safety-critical systems


CSR's research achievements

• Developing novel techniques and tools for assessing software reliability (e.g. software reliability growth, Bayesian methods for dependability assessment, etc.)
• Defining the limits for evaluating systems with ultra-high reliability requirements
• Developing models of common-mode failures in diverse and redundant systems
• Developing novel mechanisms for software fault-tolerance
• Producing a comprehensive framework for software data collection
• Developing a widely used method for assessing the efficacy of software standards and methods
• Developing a new understanding of the fundamentals of software testing
• Providing a formal foundation for the software metrics area


Recent Projects on Diversity

• DISCS (Diversity In Safety-Critical Systems), 1997-2000, funded by EPSRC, UK
• DISPO (DISPO2-DISPO5), 1997-2007, funded by British Energy
  – most of the modelling work presented here was completed as part of the DISPO projects
• DOTS (Diversity with Off-the-Shelf Software), 2000-2004, funded by EPSRC, UK
  – the experimental work with off-the-shelf software presented here was completed in the DOTS project
• DIRC (Dependability Interdisciplinary Research Collaboration), 2000-2006, funded by EPSRC
  – diversity in socio-technical systems has been the main concern


Software design diversity: Why

• The idea of redundancy (i.e. multiple software channels) for increased reliability/availability is not new:
  – it has been used in hardware design for a very long time.
• Simple redundancy does not work with software
  – software failures are deterministic: whenever a software fault is triggered, a failure will result
  – software does not wear out
  – software channels work in parallel, but must either:
    • be different by design (design diversity), or
    • work in (slightly) different conditions (data diversity)


Software design diversity (2)

• Surprisingly, various homogeneous fail-over schemes dominate the market of FT ‘enterprise’ applications. These are ineffective!
• U.S.-Canada Power System Outage Task Force, Final Report on the August 14th (2003) Blackout in the United States and Canada
  – https://reports.energy.gov/BlackoutFinal-Web.pdf

EMS Server Failures. FE’s EMS system includes several server nodes that perform the higher functions of the EMS. Although any one of them can host all of the functions, FE’s normal system configuration is to have a number of host subsets of the applications, with one server remaining in a “hot-standby” mode as a backup to the others should any fail. At 14:41 EDT, the primary server hosting the EMS alarm processing application failed, due either to the stalling of the alarm application, “queuing” to the remote EMS terminals, or some combination of the two. Following preprogrammed instructions, the alarm system application and all other EMS software running on the first server automatically transferred (“failedover”) onto the back-up server. However, because the alarm application moved intact onto the backup while still stalled and ineffective, the backup server failed 13 minutes later, at 14:54 EDT. Accordingly, all of the EMS applications on these two servers stopped running. (Part 2, p 32)


Software design diversity (3)

• The main architectures for symmetric software fault-tolerance:
  – NVP (N-version programming), Avizienis, 1977
  – Recovery Blocks, Randell, 1975
  – NSCP (N-self-checking programming), Laprie et al., 1987
• An endless number of architectures with asymmetric fault-tolerance (i.e. a ‘primary’ channel and a ‘checker’)


Examples: diverse, modular redundancy

• ‘Natural’ 1-out-of-2 scheme (e.g. communication, alarm, protection)
  [Diagram: the inputs feed Channel 1 and Channel 2 in a parallel (OR, 1-out-of-2) arrangement.]
• Voted system (e.g. control)
  [Diagram: the inputs feed Channel 1, Channel 2 and Channel 3, whose outputs go to a bespoke adjudicator that produces the system output.]
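The two arrangements can be sketched in a few lines of Python. This is only an illustration under simple assumptions: channel outputs are plain values, the 1-out-of-2 channels each raise a boolean demand for action, and the adjudicator is a strict majority voter. The function names `one_out_of_two` and `majority_vote` are hypothetical, not taken from the talk.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def one_out_of_two(channel_1: bool, channel_2: bool) -> bool:
    """OR arrangement: the system acts if at least one channel demands action."""
    return channel_1 or channel_2


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Adjudicator for a voted system.

    Returns the value produced by a strict majority of channels,
    or None when no such majority exists.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None
```

With three channels, `majority_vote` masks a single wrong output, but it cannot help when two channels fail with the same wrong value, which is why coincident failures are the central concern of the models that follow.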


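Examples: primary/checker systems

[Diagram: the input goes to the primary software; the result of its computation is passed to the checker, which approves or rejects it before it becomes the system output.]

• The checker will usually be bespoke (possibly on an OTS platform)
• If the checker is simpler than the primary, high quality is affordable
• This is the ‘safety kernel’ idea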
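A minimal sketch of the primary/checker idea, assuming the checker can validate a result far more cheaply than the primary computes it. The names `run_checked` and `sqrt_checker` are hypothetical; the square-root example stands in for any computation whose result is easy to verify.

```python
import math
from typing import Callable, Optional


def run_checked(primary: Callable[[float], float],
                checker: Callable[[float, float], bool],
                x: float) -> Optional[float]:
    """Run the primary channel and release its output only if the checker approves."""
    y = primary(x)
    return y if checker(x, y) else None


def sqrt_checker(x: float, y: float, tol: float = 1e-9) -> bool:
    """A simple checker: squaring the candidate result must give back the input."""
    return y >= 0 and abs(y * y - x) <= tol * max(1.0, abs(x))
```

Here `run_checked(math.sqrt, sqrt_checker, 9.0)` releases the result, while a faulty primary such as `lambda v: v / 2` is rejected for the same input.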


Achievement vs. Assessment

• Cost-benefit analysis is always needed:
  – design diversity is more expensive than non-diverse redundancy or solutions without redundancy
    • especially in the 80s, when the area was actively researched
  – what are the benefits of design diversity, i.e. how much does one gain from diverse redundancy?
• Assessing the benefits is a much harder problem for (diverse) software than for hardware
• NVP ‘implicitly’ assumed independence of failures of the channels
  – huge controversy, with an at times very entertaining exchange in the IEEE Transactions on Software Engineering in the mid 80s.


Is failure independence realistic?

• Knight and Leveson experiment (FTCS-15, 1985 and TSE, 1986)
  – 27 software versions developed to the same specification by students in two US universities
  – tested on 1,000,000 test cases and the versions’ reliability ‘measured’
  – coincident failures observed much more frequently than independence would suggest
    • i.e. the experiment convincingly refuted the hypothesis of statistical independence between the failures of the independently developed versions!
• Eckhardt & Lee model (TSE, 1985)
  – a probabilistic model that demonstrates why independently developed versions will not fail independently


Eckhardt and Lee model

• Model of software development
  – population of possible versions $\{\pi_1, \pi_2, \ldots\}$
  – probability measure $S(\cdot)$, i.e. $S(\pi_i)$ is the probability that version $\pi_i$ will be developed
• Demand space modelled probabilistically
  – $D = \{x_1, x_2, \ldots\}$ is the demand space
  – $Q(\cdot)$ is a probability measure: the likelihood of different demands being chosen in operation
• Score function:

$$\omega(\pi, x) = \begin{cases} 1, & \text{if program } \pi \text{ fails on } x; \\ 0, & \text{if program } \pi \text{ does not fail on } x. \end{cases}$$


Eckhardt and Lee model (2)

• The random variable $\omega(\Pi, X)$ represents the performance of a random program on a random demand: this is a model for the uncertainty both in software development and in usage.

$$\theta(x) = \sum_{\pi} \omega(\pi, x)\, S(\pi) = E_S[\omega(\Pi, x)]$$

• $\theta(x)$ is the probability that a randomly chosen program fails for a particular demand $x$ (the ‘difficulty’ function).
• $\theta(X)$ is a random variable
  – upper-case $X$ represents a random demand, i.e. one chosen in operation at random according to $Q(\cdot)$
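The difficulty function is easy to compute once the scores are tabulated. The sketch below assumes a small, made-up population of three programs and three demands; `scores[i][j]` is 1 if program i fails on demand j, `S` holds the probabilities of developing each program and `Q` the operational profile. All names and numbers are illustrative only.

```python
def difficulty(scores, S):
    """theta(x): probability that a randomly developed program fails on demand x."""
    n_demands = len(scores[0])
    return [sum(S[i] * scores[i][x] for i in range(len(scores)))
            for x in range(n_demands)]


def mean_pfd(theta, Q):
    """Probability that a random program fails on a random demand, E[theta(X)]."""
    return sum(t * q for t, q in zip(theta, Q))


# Rows are programs, columns are demands (toy values).
scores = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
S = [0.5, 0.25, 0.25]   # probability of developing each program
Q = [0.2, 0.3, 0.5]     # probability of each demand in operation

theta = difficulty(scores, S)
average_pfd = mean_pfd(theta, Q)
```

In this toy population the first demand is ‘difficult’ (most programs fail on it) and the last is ‘easy’; it is this variation of difficulty across demands that drives the result on the next slide.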


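Eckhardt and Lee model (3)

There is no reason to expect that independently developed software versions will fail independently on a randomly chosen demand, $X$, even though they fail conditionally independently on a given demand, $x$.

$$\begin{aligned}
P(\Pi_1 \text{ and } \Pi_2 \text{ both fail on } X) &= P(\omega(\Pi_1, X)\,\omega(\Pi_2, X) = 1) \\
&= \sum_{x \in D} \sum_{\pi_1} \sum_{\pi_2} \omega(\pi_1, x)\,\omega(\pi_2, x)\, S(\pi_1)\, S(\pi_2)\, Q(x) \\
&= \sum_{x \in D} \theta^2(x)\, Q(x) = [E(\theta(X))]^2 + \mathrm{Var}(\theta(X)) \\
&= [P(\Pi \text{ fails on } X)]^2 + \mathrm{Var}(\theta(X)).
\end{aligned}$$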
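A small worked example may help, using made-up numbers: two equally likely demands, where a randomly developed version fails on the first with probability 0.1 and on the second with probability 0.3.

```latex
\begin{aligned}
Q(x_1) &= Q(x_2) = 0.5, \qquad \theta(x_1) = 0.1, \quad \theta(x_2) = 0.3 \\
E[\theta(X)] &= 0.5 \cdot 0.1 + 0.5 \cdot 0.3 = 0.2 \\
E[\theta^2(X)] &= 0.5 \cdot 0.01 + 0.5 \cdot 0.09 = 0.05 \\
\mathrm{Var}(\theta(X)) &= 0.05 - 0.2^2 = 0.01 \\
P(\text{both versions fail on } X) &= 0.2^2 + 0.01 = 0.05 > 0.04
\end{aligned}
```

Under these assumptions, a pair of independently developed versions fails together with probability 0.05, whereas an independence assumption would predict 0.04. The gap is exactly the variance of the difficulty function, and it vanishes only when all demands are equally difficult.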


Littlewood and Miller model

• A generalisation of the EL model for the case of ‘forced diversity’
  – the development teams are kept apart but forced to use different methodologies, e.g. programming languages, etc.
• Model of forced diversity
  – probability measures $S_A(\cdot)$ and $S_B(\cdot)$ for development methodologies A and B
    • a version (with a specific set of scores $\omega(\pi, x)$) may be very likely with methodology A and very unlikely with methodology B
  – in every other aspect the model is identical to the EL model.


Littlewood and Miller model (2)

$$P(\Pi_A \text{ fails on } X,\ \Pi_B \text{ fails on } X) = P(\Pi_A \text{ fails on } X)\, P(\Pi_B \text{ fails on } X) + \mathrm{Cov}(\theta_A(X), \theta_B(X))$$

• Since the covariance can be negative, with forced diversity one may do even better than the independence that is unattainable under the EL model
• Littlewood & Miller, in their 1989 TSE paper, applied their model to Knight & Leveson’s data and discovered negative covariance.
  – For them the two methodologies were represented by the programs developed by students from different universities.
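The effect of the covariance term can be checked numerically. The sketch below assumes two hypothetical methodologies whose difficulty functions are mirror images of each other over two equally likely demands: what is hard for A is easy for B and vice versa. The function names and numbers are illustrative, not taken from the Knight and Leveson data.

```python
def expectation(values, Q):
    """Expected value of a function of the demand under the operational profile Q."""
    return sum(v * q for v, q in zip(values, Q))


def lm_joint_failure(theta_a, theta_b, Q):
    """Return (P(both fail), product of marginal pfds, covariance of difficulties)."""
    p_a = expectation(theta_a, Q)
    p_b = expectation(theta_b, Q)
    joint = expectation([a * b for a, b in zip(theta_a, theta_b)], Q)
    return joint, p_a * p_b, joint - p_a * p_b


Q = [0.5, 0.5]
theta_a = [0.1, 0.3]   # methodology A finds the second demand harder
theta_b = [0.3, 0.1]   # methodology B finds the first demand harder

joint, independent, covariance = lm_joint_failure(theta_a, theta_b, Q)
```

With these toy values the covariance is negative, so the probability of a coincident failure (about 0.03) is below the 0.04 that independence would give, and well below the 0.05 obtained when both versions share the difficulty function of methodology A.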


Limitations of EL and LM models

• The Eckhardt and Lee (EL) and Littlewood and Miller (LM) models deal with a ‘snapshot’ of the population of versions
  – extended by allowing the versions to evolve as they are tested and the detected faults are fixed
• These are models ‘on average’
  – extended by looking at models of a particular pair of versions (models ‘in particular’).


A new model ‘on average’

[Diagram: version i with no testing becomes ‘version i tested with j’ after testing with suite j, or ‘version i tested with k’ after testing with suite k.]

Testing:
• test suite (a given test generation procedure may be instantiated differently, i.e. different sets of test cases can be generated)
  – independently generated for each channel of the system, or
  – the same test suite used for all channels
• adjudication (oracle: perfect/imperfect; back-to-back)
• fault removal (perfect/imperfect; new faults?)


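Modelling the testing process

• Population of possible test suites $\{t_1, t_2, \ldots\}$ with probability measure $M(\cdot)$, i.e. $M(t) = P(T = t)$
• Extended score function:

$$\omega(\pi, x, t) = \begin{cases} 1, & \text{if } \pi \text{ tested with } t \text{ fails on } x, \\ 0, & \text{if } \pi \text{ tested with } t \text{ does not fail on } x. \end{cases}$$

• $\omega(\pi, x, \varnothing)$, i.e. with no test suite applied, is the score of $\pi$ on $x$ before testing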


Comparison of testing regimes

• Testing with oracles:
  – Detailed analysis with perfect oracles:
    • testing with oracles on independently chosen test suites;
    • testing with oracles on the same test suite;
  – Speculative analysis of oracle imperfection
• ‘Back-to-back’ testing: lower and upper bounds identified under simplifying assumptions


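Marginal probability of system failure

Without forced diversity. Here $\theta(x, t)$ denotes the probability that a randomly chosen version, tested with suite $t$, fails on $x$, and $\theta_T(x) = \sum_t \theta(x, t)\, M(t)$ is its average over test suites.

Independent test suites (EL result applies):

$$P_i(\Pi_1, \Pi_2 \text{ fail on } X) = [P_T(\Pi \text{ fails on } X)]^2 + \mathrm{Var}(\theta_T(X))$$

The same suite (EL result does not apply):

$$P_c(\Pi_1, \Pi_2 \text{ fail on } X) = [P_T(\Pi \text{ fails on } X)]^2 + \mathrm{Var}(\theta_T(X)) + \sum_{x \in D} \mathrm{Var}_T(\theta(x, T))\, Q(x)$$

The additional term is non-negative, so testing both versions with the same suite is never better, and may be significantly worse, than testing with independent suites.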
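A worked example with made-up numbers shows where the penalty for sharing a test suite comes from. Write $\theta(x, t)$ for the probability that a random version tested with suite $t$ fails on demand $x$, and assume two equally likely demands and two equally likely suites, each suite being effective on only one of the demands.

```latex
\begin{aligned}
\theta(x_1, t_1) &= 0, \quad \theta(x_1, t_2) = 0.2, \quad
\theta(x_2, t_1) = 0.2, \quad \theta(x_2, t_2) = 0 \\
\theta_T(x_1) &= \theta_T(x_2) = 0.5 \cdot 0 + 0.5 \cdot 0.2 = 0.1 \\
\text{independent suites:}\quad
& \sum_x \theta_T^2(x)\, Q(x) = 0.5 \cdot 0.01 + 0.5 \cdot 0.01 = 0.01 \\
\text{same suite:}\quad
& \sum_x \sum_t \theta^2(x, t)\, M(t)\, Q(x) = 0.5 \cdot 0.02 + 0.5 \cdot 0.02 = 0.02 \\
\text{difference:}\quad
& \sum_x \mathrm{Var}_T(\theta(x, T))\, Q(x) = 0.5 \cdot 0.01 + 0.5 \cdot 0.01 = 0.01
\end{aligned}
```

In this contrived case sharing the suite doubles the probability of a coincident failure: whichever demand the common suite leaves untested remains difficult for both versions at once.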


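Marginal system pfd (2)

With forced diversity. Here $\theta_A(x, t)$ and $\theta_B(x, t)$ denote the difficulty of $x$ after testing with $t$ under methodologies A and B, and $\theta_{A,T}(x)$, $\theta_{B,T}(x)$ their averages over test suites.

Independent suites (LM result applies):

$$P_i(\Pi_A, \Pi_B \text{ fail on } X) = P_T(\Pi_A \text{ fails on } X)\, P_T(\Pi_B \text{ fails on } X) + \mathrm{Cov}(\theta_{A,T}(X), \theta_{B,T}(X))$$

The same suite (LM result does not apply):

$$P_c(\Pi_A, \Pi_B \text{ fail on } X) = P_i(\Pi_A, \Pi_B \text{ fail on } X) + \sum_{x \in D} \mathrm{Cov}_T(\theta_A(x, T), \theta_B(x, T))\, Q(x)$$

The additional covariance term can have either sign, so the same suite may be better or worse than independent suites! A better result than with independent suites is counterintuitive but possible (a contrived example has been constructed).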
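The counterintuitive case can be reproduced with a toy table. The sketch assumes post-testing difficulty tables `theta_a[x][t]` and `theta_b[x][t]` (probability that a random version from methodology A or B, tested with suite t, fails on demand x), a suite distribution `M` shared by both channels, and a profile `Q`. The function names and all values are hypothetical and not the example from the original work.

```python
def joint_pfd_independent_suites(theta_a, theta_b, M, Q):
    """Each channel is tested with its own, independently drawn suite."""
    total = 0.0
    for x, q in enumerate(Q):
        avg_a = sum(theta_a[x][t] * m for t, m in enumerate(M))
        avg_b = sum(theta_b[x][t] * m for t, m in enumerate(M))
        total += avg_a * avg_b * q
    return total


def joint_pfd_same_suite(theta_a, theta_b, M, Q):
    """Both channels are tested with one common, randomly drawn suite."""
    return sum(q * sum(theta_a[x][t] * theta_b[x][t] * m for t, m in enumerate(M))
               for x, q in enumerate(Q))


M = [0.5, 0.5]          # two equally likely test suites
Q = [1.0]               # a single demand, for simplicity
theta_a = [[0.2, 0.0]]  # suite 2 removes A's faults on this demand
theta_b = [[0.0, 0.2]]  # suite 1 removes B's faults on this demand

independent = joint_pfd_independent_suites(theta_a, theta_b, M, Q)
same = joint_pfd_same_suite(theta_a, theta_b, M, Q)
```

Whichever suite is drawn, it cleans up one of the two channels completely, so with a common suite the pair never fails together, while independent suites leave a joint failure probability of about 0.01. Setting `theta_b = theta_a` (no forced diversity) reverses the ordering.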


Back-to-back testing

• Back-to-back testing is a special case of testing with the same suite. It cannot do better (lower bound on system pfd) than the same suite with perfect oracles.
• In the worst case (an upper bound on system pfd), back-to-back testing only removes the faults which cause non-coincident failures, i.e. it has no impact on system pfd, which stays as in the untested population.
• Bounds on the performance of back-to-back testing have been obtained.
• Detailed modelling of realistic back-to-back testing (partial overlapping of failure regions) does not seem feasible or worthwhile!
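The worst case described above can be illustrated with failure sets. The sketch assumes that each version is represented by the set of demands on which it fails, that coincident failures produce identical (hence undetectable) wrong outputs, and that fixing a detected failure removes only that demand from the failure set. The function name `back_to_back_worst_case` is hypothetical.

```python
def back_to_back_worst_case(fails_a, fails_b, suite):
    """Return the failure sets of A and B after worst-case back-to-back testing.

    A failure is detected only if exactly one version fails on a tested demand,
    because only then do the two outputs differ.
    """
    suite = set(suite)
    detected_a = (fails_a - fails_b) & suite
    detected_b = (fails_b - fails_a) & suite
    return fails_a - detected_a, fails_b - detected_b


fails_a = {1, 2, 3}
fails_b = {3, 4}
after_a, after_b = back_to_back_worst_case(fails_a, fails_b, suite={1, 2, 3, 4})
```

Both versions become individually more reliable, but the set of coincident failures, `fails_a & fails_b`, is unchanged, so the 1-out-of-2 system is no better than before testing.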


In summary

• Performance of testing regimes (no account of the cost):
  – (best in terms of achievable average system reliability) independent testing with oracles;
  – (worse) testing with the same suite and oracles;
  – (worst) back-to-back testing.
• Accounting for the cost may change this ordering! A trade-off can be struck, which depends on the cost of test suite generation and the cost of testing.
• Counterintuitive observation:
  – forced diversity combined with testing with the same suite may lead to a better system pfd than testing with independent suites (i.e. a better result can be achieved more cheaply!)


Models ‘in particular’

[Figure: pfd (vertical axis, 0 to 1) plotted over programs and demands (x1 to x20), showing the pfds of individual programs, the difficulty function theta(x), and the average pfd.]


Model in particular: full knowledge of scores

$$P(A \text{ and } B \text{ fail together on } X) = \sum_{x \in D} \omega(\pi_A, x)\, \omega(\pi_B, x)\, Q(x) = P_A P_B + \mathrm{cov}(\Omega_A, \Omega_B)$$

• Scores and operational profile as with the models ‘on average’.
• Similar to the LM result, except that $P_A$ and $P_B$ are the pfds of the actual channels, not of an ‘average version’, i.e. they are estimable.


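With failure data about sub-domains in demand space...

[Diagram: the demand space partitioned into sub-domains S1, S2, S3 and S4.]

• ‘Coarse-grain’ information: $P(\text{failure} \mid S_i)$ for each sub-domain $S_i$
• Within each sub-domain:

$$P(A, B \text{ fail} \mid X \in S_i) = P(A \text{ fails} \mid X \in S_i)\, P(B \text{ fails} \mid X \in S_i) + \mathrm{cov}(\Omega_A, \Omega_B \mid X \in S_i)$$

• Upper and lower bounds:
  – one should not trust the system pfd to be any lower than $\sum_i P(S_i)\, P(A \text{ fails} \mid S_i)\, P(B \text{ fails} \mid S_i)$
  – one can trust the system pfd to be no higher than $\sum_i P(S_i)\, \min(P(A \text{ fails} \mid S_i),\, P(B \text{ fails} \mid S_i))$
  – the estimate can be improved with observations of the actual behaviour of the A and B channels in the new system.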
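The sub-domain bounds are simple weighted sums and can be sketched as follows. The input is assumed to be a list of triples (P(S_i), P(A fails | S_i), P(B fails | S_i)); the function name `pfd_bounds` and the numbers are illustrative only.

```python
def pfd_bounds(subdomains):
    """Return (independence-within-subdomains value, upper bound) on system pfd.

    Each entry of `subdomains` is (P(S_i), P(A fails | S_i), P(B fails | S_i)).
    """
    lower = sum(p * a * b for p, a, b in subdomains)
    upper = sum(p * min(a, b) for p, a, b in subdomains)
    return lower, upper


# Two equally likely sub-domains; A is weak in the second, B in the first.
subdomains = [
    (0.5, 0.01, 0.02),
    (0.5, 0.1, 0.001),
]
lower, upper = pfd_bounds(subdomains)

# Without the partition only the marginal pfds are available.
p_a = sum(p * a for p, a, _ in subdomains)
p_b = sum(p * b for p, _, b in subdomains)
coarse_upper = min(p_a, p_b)
```

In this toy case the partition tightens the upper bound from about 0.0105 (the smaller of the two channel pfds) to about 0.0055, because each channel is strong precisely where the other is weak.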


Model in particular: summary

• Useful bounds can be obtained without detailed knowledge of the dependence between channel failures
  – if these are tight, then the details (covariance terms) become irrelevant
  – increasing the number of sub-domains allows one to seek tighter bounds
• Models ‘in particular’ can be applied to a concrete FT system, while the models ‘on average’ are only useful as guidelines as to what is believable and what is not, but useless for actual reliability prediction


Fault Diversity among Off-The-Shelf SQL Servers

The work was completed together with Ilir Gashi, a PhD student at CSR, working under my supervision


Motivation for the study

• Fault-tolerance with off-the-shelf software is becoming cheaper than with bespoke development, but what is the dependability gain?
  – Empirical evidence is needed that the effort to build fault-tolerance with OTS software is worthwhile
  – But what software to use?
    • Toy examples? Open to the criticism that the findings are not applicable to complex software:
      – the gains may be very different
      – the difficulties of building FT solutions with diverse OTS software may be too high and the good idea may not be practicable
• We avoid the first criticism by studying complex OTS software such as RDBMS (SQL servers)
  – we plan to address the second criticism in the near future


Motivation for the study (2)

• Reliability gain depends on the operational environment!
  – We tried statistical testing: created a prototype of a test harness (with TU Plovdiv, Dr Valentin Mollov)
    • ‘discovered’ a fault of MSSQL that was fixed in SPs, and could actually measure the gain for different profiles
  – What if the profile is unknown? This is always true prior to deployment of general-purpose software such as an RDBMS
    • evidence of the effectiveness of fault tolerance can be gathered from the known faults, i.e. reported bugs, for the OTS software
  – some of the faults may have been fixed in new releases, but there is no strong reason (unless we have empirical evidence to the contrary) to think that the ‘pattern’ of the faults that lead to coincident failures will change.


Fault-Tolerant database replication with Diverse SQL servers

• The effectiveness of the architecture in the end depends on the assumptions made about the failures
  – with database replication the assumption of ‘fail-silent’ failure is very common
  – the study allows us to validate this assumption on the reported bugs


Size of the study

• Servers included in the study:
  – Open source:
    • PostgreSQL v. 7.0.0 (PG)
    • Interbase v. 6.0 (IB)
  – Closed development:
    • Oracle v. 8.0.5 (Oracle)
    • MSSQL v. 7 (MSSQL)
• 181 known bug reports for all the servers together were collected.


IB faults


PG, Oracle, MSSQL


2 - Version combinations


Detectability of failures

• The percentage of bugs causing crash failures varies between servers, from 13% (MSSQL) to 21% (Oracle and PostgreSQL).
• A non-diverse scheme would only detect the self-evident failures:
  – crash failures,
  – failures reported by the server itself and poor-performance failures.
  For all four servers, less than 50% of bugs cause such failures.
• With diverse pairs detectability is greatly improved:
  – all the possible two-version fault-tolerant configurations detect the failures caused by at least 94% of the bugs used in the study.
  – none of the bugs caused a failure in more than two servers.
• Other issues:
  – diagnosability (if different results are received from the replicas, which, if any, is giving the correct answer?)
  – recovery
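How a diverse pair turns a non-self-evident failure into a detectable one can be sketched as a comparison of result sets. The sketch assumes each replica returns either a list of row tuples or `None` when it crashed or reported an error itself, and that row order is not significant; the function name `detect_failure` and the labels it returns are hypothetical.

```python
from collections import Counter


def detect_failure(rows_1, rows_2):
    """Classify the outcome of running one statement on two diverse replicas."""
    if rows_1 is None or rows_2 is None:
        # Crash or server-reported error: visible even without diversity.
        return "self-evident"
    if Counter(map(tuple, rows_1)) != Counter(map(tuple, rows_2)):
        # The replicas disagree: at least one returned an incorrect result.
        return "discrepancy"
    return "agree"
```

Agreement does not prove correctness: identical wrong results from both replicas would still pass, and a discrepancy alone does not say which replica is wrong, which is the diagnosability issue listed above.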


Future work

• Interpreting the results. What do empirical studies like ours really tell us? Work in progress.
• Later versions of these servers have been installed and used in a new study with newly collected bugs:
  – consistency established between the ‘snapshots’ (i.e. releases)
  – used FB 1.0, FB 1.5 (Interbase’s successor) and PostgreSQL 7.2
• Work on diagnosing the failed server when a discrepancy between the two diverse replicas is detected
  – using data diversity (functional redundancy built into SQL servers)
• A paper has been submitted to EDCC’06 and another one is in preparation for IEEE Transactions on Dependable and Secure Computing


Performance Implications of Diverse redundancy with SQL servers

• Measurements using a ‘standard’ performance benchmark application for online transaction processing, TPC-C.
• The work was completed together with Vladimir Stankovic, a PhD student at CSR, working under my supervision

[Diagram: FT-node]


Regimes of Operation of Middleware

• Slowest response
  – waits for responses to statements from all replicas
  – adjudicates them
  – returns the adjudicated response, if any, to the client
• Fastest response
  – waits for the first response to a statement
  – returns it to the client
  – waits for the rest of the responses and adjudicates them
• Optimistic
  – waits for the first response
  – returns it to the client and ignores all later responses
  – a server can ‘skip’ READ statements if a response has already been received.
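The difference between the first two regimes can be sketched with a thread pool. This is not the middleware used in the study: replicas are modelled as plain callables that take a statement and return rows, adjudication is simple equality of responses, and all function names are hypothetical. The optimistic regime would correspond to using only the first value returned by `fastest_response` and never calling the late check.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def adjudicate(responses):
    """Return the common response if all replicas agree, otherwise None."""
    first = responses[0]
    return first if all(r == first for r in responses) else None


def slowest_response(pool, replicas, statement):
    """Wait for every replica, adjudicate, and return the adjudicated response."""
    futures = [pool.submit(replica, statement) for replica in replicas]
    return adjudicate([f.result() for f in futures])


def fastest_response(pool, replicas, statement):
    """Return the first response at once, plus a callable that adjudicates later."""
    futures = [pool.submit(replica, statement) for replica in replicas]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    first = next(iter(done)).result()

    def late_check():
        # Blocks until all replicas have answered, then compares the responses.
        return adjudicate([f.result() for f in futures]) is not None

    return first, late_check


# Stand-in replicas (hypothetical): a fast one, a slow one, a slow faulty one.
def fast_replica(statement):
    return [("row", 1)]


def slow_replica(statement):
    time.sleep(0.2)
    return [("row", 1)]


def slow_faulty_replica(statement):
    time.sleep(0.2)
    return [("row", 2)]
```

In the slowest-response regime the client pays the latency of the slower replica but never sees an unadjudicated answer; in the fastest-response regime the client gets the quick answer and learns about a discrepancy only afterwards.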


Experimental Harness

• Client machine: 1.5 GHz Intel Pentium 4 processor, 640 MB RAMBUS RAM and 20 GB HDD (Maxtor DiamondM)
  – Operating system: Windows 2000 Professional, SP4
  – Own implementation of TPC-C in Java, with multithreading to allow for simulating concurrent clients
• Server machines: 1.5 GHz Intel Pentium 4 processor, 384 MB RAMBUS RAM and 20 GB HDD (Seagate U Series)
  – Operating system: Linux RedHat 6.0 (Hedwig)
  – Interbase 6.0, PostgreSQL 7.4.0


Experiment parameters

FT-node configurations:
• 1IB1PG, an FT-node with one replica of IB and one of PG;
• 1IB, a single replica of IB;
• 1PG, a single replica of PG;
• 2IB, an FT-node with two replicas of IB; and
• 2PG, an FT-node with two replicas of PG.

Server loads:
• a single TPC-C compliant client;
• a single TPC-C compliant client + 10 additional clients;
• a single TPC-C compliant client + 50 additional clients.

Size of experiments:
• 10,000 transactions executed by the TPC-C client.


Results

[Figure: Mean Response Time (1+0 Clients Experiment). Response time (msec, axis 0 to 900) by transaction type (Delivery, New-Order, Order-Status, Payment, Stock-Level, All 5) for the configurations 1IB, 1PG, 1IB1PG, 2IB and 2PG.]

Results (2)

[Figure: Mean Response Time (1+10 Clients Experiment). Response time (msec, axis 0 to 5000) by transaction type (Delivery, New-Order, Order-Status, Payment, Stock-Level, All 5) for the configurations 1IB, 1PG, 1IB1PG, 2IB and 2PG.]


Results (3)

[Figure: Cumulative Transaction Time under Different Loads. Cumulative time (min, axis 0 to 800) by load (1+0 Clients, 1+10 Clients, 1+50 Clients) for the configurations IB, PG, IBPG, 2IB and 2PG.]


Summary: Performance measurements

• An intriguing possibility to use diversity for a ‘performance boost’
• More interestingly, diversity offers a range of ‘quality of service’ options controlled by the client applications:
  – best dependability assurance
  – best performance
  – ‘in between’
• Using ‘learning algorithms’, e.g. Bayesian inference, to switch between high dependability assurance and high performance.
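One way such a switch could work is sketched below. It is purely illustrative and not the algorithm from the talk: it assumes the middleware records, for each adjudicated statement, whether the replicas disagreed, keeps a Beta posterior over the disagreement rate, and chooses the regime by comparing the posterior mean with a threshold. The class name `RegimeSelector`, the uniform prior and the threshold value are all assumptions.

```python
class RegimeSelector:
    """Beta-Bernoulli belief about how often the replicas disagree."""

    def __init__(self, prior_disagree=1.0, prior_agree=1.0, threshold=0.01):
        self.disagree = prior_disagree   # Beta 'alpha': pseudo-count of disagreements
        self.agree = prior_agree         # Beta 'beta': pseudo-count of agreements
        self.threshold = threshold

    def observe(self, disagreed):
        """Update the posterior with the outcome of one adjudicated statement."""
        if disagreed:
            self.disagree += 1
        else:
            self.agree += 1

    def posterior_mean(self):
        """Posterior expected disagreement rate."""
        return self.disagree / (self.disagree + self.agree)

    def regime(self):
        """High assurance while disagreements look likely, high performance otherwise."""
        if self.posterior_mean() > self.threshold:
            return "slowest-response"
        return "fastest-response"
```

The selector starts cautiously, relaxes to the faster regime after a long run of agreements, and falls back to full adjudication when discrepancies start to accumulate. The fastest-response regime is used rather than the optimistic one so that disagreements can still be observed.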


Conclusions

• Diverse redundancy is a very mature discipline with more than 30 years of active research
  – its main ‘problem’ has been high cost, but this is changing with off-the-shelf software (especially in the context of Service-Oriented Architecture)
  – functionally similar/identical products (SQL servers, application servers, BPEL engines, etc.) are available as affordable commodities (even open-source)
• Industry is reluctant to accept that there is no alternative to diverse redundancy:
  – formal methods cannot cope with the complex software currently deployed
    • diversity and formal methods can be combined (e.g. APIs of diverse off-the-shelf products can be ‘harmonised’ using formal methods)
  – fault-avoidance is ‘hopeless’ (i.e. too expensive to deliver an acceptable level of dependability)


Research referenced in this talk:

1. Littlewood, B., Popov, P., Strigini, L. and Shryane, N., "Modelling the effects of combining diverse software fault removal techniques", IEEE Transactions on Software Engineering, vol. SE-26, no. 12, pp. 1157-1167, 2000.
2. Littlewood, B., Popov, P. and Strigini, L., "Modelling software design diversity - a review", ACM Computing Surveys, vol. 33, no. 2, pp. 177-208, 2001.
3. Popov, P., Strigini, L., May, J. and Kuball, S., "Estimating Bounds on the Reliability of Diverse Systems", IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 345-359, 2003.
4. Littlewood, B., Popov, P. and Strigini, L., "Assessing the reliability of diverse fault-tolerant software-based systems", Safety Science, vol. 40, pp. 781-796, Pergamon, 2002.
5. Littlewood, B., Popov, P. and Strigini, L., "A note on reliability estimation of functionally diverse systems", Reliability Engineering and System Safety, vol. 66, pp. 93-95, 1999.
6. Popov, P. and Strigini, L., "Conceptual models for the reliability of diverse systems - new results", Proc. 28th International Symposium on Fault-Tolerant Computing (FTCS-28), Munich, Germany, IEEE Computer Society Press, pp. 80-89, 1998.
7. Popov, P. and Littlewood, B., "The effect of testing on the reliability of fault-tolerant software", Proc. International Conference on Dependable Systems and Networks (DSN 2004), IEEE Computer Society, pp. 265-274, 2004.


Research referenced in this talk (2):

8. Gashi, I., Popov, P. and Strigini, L., "Fault diversity among off-the-shelf SQL database servers", Proc. International Conference on Dependable Systems and Networks (DSN 2004), Florence, Italy, IEEE Computer Society Press, pp. 389-398, 2004.
9. Gashi, I., Popov, P., Stankovic, V. and Strigini, L., "On Designing Dependable Services with Diverse Off-The-Shelf SQL Servers", in "Architecting Dependable Systems II", Lecture Notes in Computer Science (R. de Lemos, C. Gacek and A. Romanovsky, Eds.), vol. 3069, pp. 191-214, Springer-Verlag, 2004.
10. Popov, P., Strigini, L., Kostov, A., Mollov, V. and Selensky, D., "Software Fault-Tolerance with Off-the-Shelf SQL Servers", Proc. 3rd International Conference on COTS-Based Software Systems (ICCBSS'04), 2-4 Feb. 2004, Redondo Beach, CA, U.S.A., pp. 117-126, Springer, 2004.
11. Littlewood, B., Popov, P. and Strigini, L., "Design Diversity: an Update from Research on Reliability Modelling", Proc. Safety-Critical Systems Symposium 2001, Bristol, U.K., Springer, 2001.
12. Popov, P. and Strigini, L., "The Reliability of Diverse Systems: a Contribution using Modelling of the Fault Creation Process", Proc. International Conference on Dependable Systems and Networks (DSN 2001), Goteborg, Sweden, 2001.
