CS444A: Software for Critical Systems
2
Staff
Prof. David L. Dill
Prof. Armando Fox
3
Topic
The engineering of software for applications where failure is unacceptable... for some value of “failure” and “unacceptable”.
- Costs of failure exceed value of the software
4
Critical software is growing in importance
Computers are getting exponentially smaller, cheaper, faster, and better connected.
Communications are improving at least as fast.
Increased use of critical software is irresistible
- Automation of tasks that were previously manual or infeasible.
- Sophisticated control replacing simple control.
- Replacing mechanical, analog, digital hardware.
5
Software is growing
Software will replace mechanical, analog, and digital hardware
- Cheaper to copy.
- Easier to manufacture.
- Easier to upgrade.
- Provides more functionality.
Software will replace manual processes
- Cheaper and more reliable than human workers
- Relieves them of tedious tasks
- Faster and more predictable
6
Complexity is increasing
COTS is coming to software
- Large projects increasingly use commercial off-the-shelf components
- Commodity hardware, OS’s, tools, other building blocks
- Example: Mars Pathfinder
This is good and bad
- COTS reduces development cost & development time
- Sophisticated “building blocks” allow creation of more complex systems
- But they are often brittle: intra-component and inter-component failure modes are poorly understood
- Composition of pieces that were designed separately sometimes leads to unexpected failure modes
7
Software will be used in safety-critical applications
All of the above reasons (esp. cost)
Software can make systems safer
- TCAS - Aircraft collision avoidance system
Software can enhance system performance
- Fly-by-wire, antilock braking
Software can perform life-saving functions
- Computer-controlled pacemakers
9
Subtopics
Successful engineering of software encompasses many different issues
Relationship of software to the larger system
Software development processes
Software design
Algorithms
Programming practices
10
Goal: Best Of Both Worlds
Traditional safety-engineering perspective
- Formal verification, requirements specification, related formal methods
- Traditional hazard/fault analysis
- Fault tolerance
Systems perspective
- Design techniques and programming practices
- As much “folklore” as formal
- Especially recent experience in Internet-scale mission-critical systems
11
Formal Methodology Outline
Safety engineering of systems
- Hazard identification
- Hazard avoidance
- Standards
Requirements specification and tools
- Specification for reactive systems
- Model checking
- Logical specification (Z, VDM?)
- Theorem proving
Fault tolerance
- Fault models
- Fault tolerant protocols
Etc.
12
The Case for the Systems Perspective
Many visible success stories
- The Internet
- Mars Pathfinder
- Gargantuan-scale 24x7 mission critical systems: Wal-Mart, financial exchanges, Visa, CIRRUS banking network…
Some spectacular failures
- Therac-25 (today)
System design combines engineering judgment and “folklore” with formal methodology
13
The Role of the Internet
The distributed system from hell
- Evolved over >25 years, lots of legacy code layers
- Widely distributed, both geographically and administratively
- Transient failure (hardware & software) is a way of life
- Yet, it mostly works...
- What great ideas can we steal?
The Internet is a good testbed for new approaches to reliability
- “Internet scale” implies large size, exponential growth, and 24x7 operational requirements
- People don’t die (usually) when systems go down
- Strong financial incentive spurs industrial deployment :-)
14
Systems Track Outline
Conceptual vocabulary, research landscape
Fault isolation, fault containment, orthogonal guard mechanisms
Transactions, replication, consistency
State maintenance
Availability vs. consistency tradeoffs, harvest and yield
Application-level vs. OS-level mechanisms
Systems case studies
15
Goals
Identify recurrent design philosophies that work well
Taxonomize the “folklore” in software systems design
Identify fertile crossover areas to the “formal world”
17
Motivation
The "Therac-25" is a classic case study in engineering failure, like the Tacoma Narrows bridge, the Challenger disaster, etc.
Illustrates many problems and issues of software safety.
Shows how not to do it.
Related to assignment.
18
The Machine
The Therac-25 is a linear accelerator used for radiation therapy (e.g. cancer treatment).
Safety issues:
- overdose: Patient is injured or dies from radiation burns.
- underdose: Serious disease is not treated properly; patient may be injured or die because of this.
Therac-25 much more dependent on software for safety than its predecessors (Therac-20, Therac-6)
- "Hardware interlocks" replaced by software.
19
Technical details
Multi-mode machine: electron beam and X-ray (photon) treatment modes.
X-rays generated when electron beam collides with target.
- This is inefficient, so electron beam must be very powerful.
Different modes require turntable to be properly positioned with targets, spreaders, etc. between beam and patient.
20
Accidents
Machine reliably treated thousands of patients, but occasionally weird things would happen.
There were at least 6 accidents.
Kennestone 1985:
- Patient treated for breast cancer is unexpectedly burned.
- Est. 15K-20K rad dose (500 rad to whole body is 50% fatal).
- Patient lost breast; shoulder and arm paralyzed.
- Patient sued, settled out of court.
- FDA not informed until much later.
21
Another accident
Tyler 1986:
- Patient to be treated with electron beam.
- Operator said to treat with X-ray, then corrected.
- Patient felt "electric shock".
- Operator saw "malfunction 54" and under-dose reading, so said "proceed" to zap patient again.
- Patient overdosed a second time (in arm) as he was trying to escape.
- Patient died horribly of radiation overdose 5 months later.
22
Software issues
No locks on shared variables (race conditions).
Control flow bug: some newly entered data can be ignored.
Timing sensitivity in user interface.
Wrap-around on counters.
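The counter wrap-around item can be illustrated with a small sketch. It assumes a one-byte flag where any nonzero value means "run the safety check", and where the code increments the flag on every pass instead of setting it to a fixed value. The function and variable names are hypothetical; this is an illustration of the failure pattern, not the Therac-25 code.

```python
from __future__ import annotations


def passes_skipping_check(num_passes: int, increment: bool = True) -> list[int]:
    """Return the pass numbers on which the safety check would be skipped.

    A nonzero flag means "run the check". With increment=True the flag is a
    one-byte counter bumped on every pass, so it periodically wraps to zero.
    With increment=False the flag is set to a fixed nonzero value instead.
    """
    flag = 0
    skipped = []
    for n in range(1, num_passes + 1):
        if increment:
            flag = (flag + 1) & 0xFF  # one-byte arithmetic: 255 + 1 wraps to 0
        else:
            flag = 1                  # a fixed nonzero value never wraps
        if flag == 0:
            skipped.append(n)         # check silently bypassed on this pass
    return skipped


print(passes_skipping_check(600))                   # [256, 512]
print(passes_skipping_check(600, increment=False))  # []
```

The failure is rare (one pass in 256) and depends on timing, which is why this kind of bug tends to survive testing and show up only occasionally in the field.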
23
User interface issues
“Malfunction 54” (patient might have received overdose or under-dose).
No indication about patient safety with error messages.
“Proceed” button continues after error message
- one patient overdosed twice.
24
System issues
Inadequate mechanical checks on turntable
- 3 microswitches for position sensing.
- 1-bit error in encoding makes position inaccurate.
- potentiometer installed later to sense position.
No independent hardware to suppress beam.
Dosage measurement devices (ion chambers) report inaccurate results for very high doses.
Therac-20 had same bugs, but no accidents because of independent protective systems.
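The 1-bit encoding point can be made concrete with a short sketch. The position names and 3-bit codewords below are hypothetical, since the slides do not give the actual microswitch encoding. The sketch only shows the general principle: if two valid position codes differ in a single bit, one stuck or flipped switch turns one valid position into another, and software reading the switches cannot tell.

```python
from __future__ import annotations

from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


def confusable_positions(codes: dict[str, int]) -> list[tuple[str, str]]:
    """Pairs of positions whose codes are only one bit apart.

    For such a pair, a single-bit sensing error is undetectable: the
    corrupted reading is itself a valid code for the other position.
    """
    return [
        (p, q)
        for (p, cp), (q, cq) in combinations(codes.items(), 2)
        if hamming(cp, cq) == 1
    ]


# Hypothetical encodings for three turntable positions (3 microswitches).
weak = {"electron": 0b001, "xray": 0b011, "light": 0b100}
strong = {"electron": 0b001, "xray": 0b010, "light": 0b100}

print(confusable_positions(weak))    # [('electron', 'xray')]
print(confusable_positions(strong))  # []
```

Even the second encoding only detects single-bit errors; it does not replace an independent sensor or an independent hardware mechanism to suppress the beam.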
25
Management issues
Software complacency
- software errors not modelled in fault trees.
- users told “no possibility of overdose”.
Absurdly low probabilities assigned to SW failure.
Guesswork in analyzing observed failures
- blamed microswitches on turntable.
- no actual failures found in microswitches.
- problem was probably software.
Inadequate software processes
- unclear safety analyses.
- no audit trails.
- inadequate testing.
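A rough sketch of why the probability assigned to software failure matters in a fault tree. It assumes the top event is an OR of independent basic events, and every number below is hypothetical, chosen only for illustration.

```python
import math


def or_gate(probabilities):
    """P(at least one of several independent basic events occurs)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical per-treatment probabilities of two hardware faults.
hardware = [1e-6, 2e-6]

# The same tree with software treated as essentially perfect, and with a
# more modest (still hypothetical) software failure probability.
optimistic = or_gate(hardware + [1e-11])
modest = or_gate(hardware + [1e-3])

print(f"optimistic: {optimistic:.2e}")
print(f"modest:     {modest:.2e}")
```

With the tiny software term the top-event estimate is driven entirely by the hardware faults. With the larger term the software dominates, and the estimate rises by more than two orders of magnitude.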
26
Regulatory and legal issues
FDA, Canadian regulators not heavily involved
- no software regulation in med. devices (at that time).
- not notified of incidents (no requirement to do so).
- inadequate investigation of early incidents.
When FDA got involved, the machine got fixed.
(speculation) Out-of-court settlements impeded dissemination of information about hazards.