1 Black box (Finite State Machine) testing Design for testability Coverage measures Random testing...
1
Topics in Testing We’ve Covered
• Black box (Finite State Machine) testing
• Design for testability
• Coverage measures
• Random testing
• Constraint-based testing
• Debugging and test case minimization
• Using model checkers for testing
• Coverage revisited (“small model property”)
2
Black box (Finite State Machine) testing
• There “are no Turing machines”
• Vasilevskii and Chow algorithm for conformance testing based on spanning trees and distinguishing sets
• Exhaustive testing that cannot miss bugs is often computationally intractable
3
Design for testability
• Controllability and observability
• Simulation and stubbing, assertions, downward scalability, etc.
4
Coverage measures
• Not necessarily correlated with fault detection!
• Still useful!
• Graph coverage: node and edge (statement and branch coverage)
• Logic coverage
• Input space partitioning
• Syntax-based coverage
[Figures: a control-flow graph for an if-then-else on x < y; an input domain partitioned into blocks b1, b2, b3; the predicate ((a <= b) && !G) || (x >= y)]
5
Random testing
• Generate inputs at random
• Explore very large numbers of executions
• Relies on a good automatic test oracle
• Feedback to bias choices away from redundant and irrelevant inputs is useful
• Good baseline for evaluating other methods, and often very effective
6
Constraint-based testing
• Addresses weaknesses of random testing
• E.g., finding needles in haystacks, such as where hash(x) = y
• Combines concrete and symbolic execution to generate inputs
• Concrete execution helps where symbolic solvers choke
7
Debugging and test case minimization
• Automatic minimization of test cases is very valuable for debugging and reducing regression suite size
• Debugging can be considered as an application of the scientific method
• Various techniques exist for using test cases to localize faults
8
Using model checkers for testing
• Testing based on states, rather than on executions or paths
• Use abstractions to reduce state space
• Use automatic instrumentation to handle the engineering difficulties
9
Hang onto your hats
• It’s going to be a fast ride
• Anything in these slides is fair game for the test: anything not even mentioned in these slides is not fair game (so I’ll mention valgrind right now to let you know it might show up…)
• So ask questions as we go if something is unclear (so that you think even re-reading the slides isn’t going to help)
NOW BEGINS THE REVIEW
10
Basic Definitions: Testing
What is software testing?
• Running a program
• In order to find faults
  • a.k.a. defects
  • a.k.a. errors
  • a.k.a. flaws
  • a.k.a. faults
  • a.k.a. BUGS
11
Testing
What isn’t software testing?
• Purely static analysis: examining a program’s source code or binary in order to find bugs, but not executing the program
• Good stuff, and very important, but it’s not testing
• We’ll get back to this in a future class
• Fuzzy borderline: if we only symbolically execute the program
• For this class, we’ll call it testing when the program actually runs (but maybe in a virtual machine)
12
Why Testing?
Ideally: we prove code correct, using formal mathematical techniques (with a computer, not chalk)
• Extremely difficult, even for some trivial (100-line) programs and many small (5K-line) programs
• Simply not practical to prove correctness in most cases – often not even for safety or mission critical code
13
Why Testing?
Nearly ideally: use symbolic or abstract model checking to prove the system correct
• Automatically extracts a mathematical abstraction from a system
• Proves properties over all possible executions
• In practice, can work well for very simple properties (“this program never crashes in this particular way”), but can’t handle complex properties (“this is a working file system”)
• Doesn’t work well for programs with complex data structures (like a file system)
14
Why Does Testing Matter?
NIST report, “The Economic Impacts of Inadequate Infrastructure for Software Testing” (2002)
• Inadequate software testing costs the US alone between $22 and $59 billion annually
• Better approaches could cut this amount in half
Major failures: Ariane 5 explosion, Mars Polar Lander, Intel’s Pentium FDIV bug
Insufficient testing of safety-critical software can cost lives: THERAC-25 radiation machine: 3 dead
We want our programs to be reliable
• Testing is how, in most cases, we find out if they are
[Figure: Mars Polar Lander crash site?]
[Figure: THERAC-25 design]
[Figure: Ariane 5: exception-handling bug forced self-destruct on maiden flight (64-bit to 16-bit conversion: about $370 million lost)]
15
Testing and Monitoring
In this class, we’ll look at which executions of a program to run
• I’ll call this problem “the” testing problem
Second problem: how do we know if an execution reveals a bug?
• Key question when monitoring deployed programs to handle faults or send in bug reports from the field
• I’ll (mostly) take this for granted: we have a reference model or assertions to check
16
Example: File System Testing
How hard would it be to just try “all” the possibilities?
Consider only 7 core operations (mkdir, rmdir, creat, open, close, read, write)
• Most of these take either a file name or a numeric argument, or both
• Even for a “reasonable” (but not provably safe) limitation of the parameters, there are 266^10 executions of length 10 to try
• Not a realistic possibility (unless we have 10^12 years to test)
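The size of this search space is easy to check with a little arithmetic. The sketch below reads the slide's count as 266 choices per step over 10 steps; the function names and the tests-per-second rate are illustrative assumptions, not figures from the slides.

```c
/* Number of distinct executions of a given length when each step
 * has `choices` possibilities: choices^length. Uses double because
 * 266^10 does not fit in a 64-bit integer. */
double count_executions(int choices, int length) {
    double total = 1.0;
    for (int i = 0; i < length; i++) {
        total *= (double)choices;
    }
    return total;
}

/* Years needed to run `executions` tests at a hypothetical rate of
 * `tests_per_second` (the rate is an assumption for illustration). */
double years_to_run(double executions, double tests_per_second) {
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    return executions / tests_per_second / seconds_per_year;
}
```

Under these assumptions, count_executions(266, 10) is roughly 1.8 × 10^24, and even at a million tests per second years_to_run gives tens of billions of years.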
17
The Testing Problem
This is a primary topic of this class: what “questions” do we pose to the software, i.e.,
• How do we select a small set of executions out of a very large set of executions?
• Fundamental problem of software testing research and practice
• An open (and essentially unsolvable, in the general case) problem
18
Terms: Verification and Validation
These two terms appear a lot, often in vague or sloppy ways, in the literature
• Verification is checking that a program matches a specification
• Validation is making sure it meets the original requirements – satisfies customers, operates ok onboard the spacecraft, etc.
Verification: “you built it right”
Validation: “you built the right thing”
(our focus, for the most part)
19
Terms: Unit, Integration, System Testing
Stages of testing
• Unit testing is the first phase, done by developers of modules
• Integration testing combines unit tested modules and tests how they interact
• System testing tests a whole program to make sure it meets requirements
• “Design testing” is testing prototypes or very abstract models before implementation – seldom mentioned, but when possible it can save your bacon
• Exhaustive model checking may be possible at this stage
20
Terms: Functional Testing
Functional testing is a related term
• Tests a program from a “user’s” perspective – does it do what it should?
• Opposed to unit testing, which often proceeds from the perspective of other parts of the program
  • Module spec/interface, not user interaction
  • Sort of a fuzzy line – consider a file system – how different is the use by a program and use of UNIX commands at a prompt by a user?
• Building inspector does “unit testing”; you, walking through the house to see if it’s livable, perform “functional testing”
• Kick the tires vs. take it for a spin?
21
Terms: Regression Testing
Regression testing
• Changes can break code, reintroduce old bugs
• Things that used to work may stop working (e.g., because of another “fix”) – software regresses
• Usually a set of cases that have failed (& then succeeded) in the past
• Finding small regressions is an ongoing research area – analyze dependencies
“. . . as a consequence of the introduction of new bugs, program maintenance requires far more system testing. . . . Theoretically, after each fix one must run the entire bank of test cases previously run against the system, to ensure that it has not been damaged in an obscure way. In practice, such regression testing must indeed approximate this theoretical ideal, and it is very costly.” - Brooks, The Mythical Man-Month
22
Terms: The Oracle Problem
The oracle problem
• How to know if a test fails
• If the oracle says every execution is good, why bother running the program?
• Some obvious, easily automated approaches:
  • The program probably shouldn’t crash
  • Assertions shouldn’t be violated
• Automatable, but more difficult to apply:
  • Differential testing (McKeeman, etc.) – when you have another program, likely correct, that does the same thing, just compare outputs over same inputs
• Last resort, not automatable:
  • Hand inspection of executions
(oracle: a magical source of truth, often cryptic, given by the gods)
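Differential testing can be sketched in a few lines. In the hypothetical example below, max_reference plays the role of the "likely correct" program and max_under_test is the program being tested, with a fault seeded for illustration; both names and the fault are assumptions made for this sketch. Both functions assume n >= 1.

```c
/* Reference implementation: simple and likely correct. */
int max_reference(const int *a, int n) {
    int best = a[0];
    for (int i = 1; i < n; i++) {
        if (a[i] > best) best = a[i];
    }
    return best;
}

/* Implementation under test, with a seeded fault: the loop bound
 * stops one short, so the last element is never examined. */
int max_under_test(const int *a, int n) {
    int best = a[0];
    for (int i = 1; i < n - 1; i++) {
        if (a[i] > best) best = a[i];
    }
    return best;
}

/* Differential oracle: returns 1 if both implementations agree on
 * this input, 0 if the input exposes a disagreement. */
int differential_check(const int *a, int n) {
    return max_reference(a, n) == max_under_test(a, n);
}
```

The oracle never needs to know the correct answer: it only needs the two programs to disagree. The input {1, 5, 3} passes, while {1, 2, 9} is flagged because the maximum sits in the last position.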
23
Terms: Test (Case) vs. Test Suite
Test (case): one execution of the program, that may expose a bug
Test suite: a set of executions of a program, grouped together• A test suite is made of test cases
Tester: a program that generates tests
Line gets blurry when testing functions, not programs – especially with persistent state
24
Terms: Black Box Testing
Black box testing
• Treats a program or system as a black box
• That is, testing that does not look at source code or internal structure of the system
• Send a program a stream of inputs, observe the outputs, decide if the system passed or failed the test
• Abstracts away the internals – a useful perspective for integration and system testing
• Sometimes you don’t have access to source code, and can make little use of object code
• True black box? Access only over a network
25
Terms: White Box Testing
White box testing
• Opens up the box!
• (also known as glass box, clear box, or structural testing)
• Use source code (or other structure beyond the input/output spec.) to design test cases
• Brings us to the idea of coverage
26
Terms: Coverage
Coverage measures or metrics
• Abstraction of “what a test suite tests” in a structural sense
• Best explained by giving examples
• Common measures:
  • Statement coverage
    • A.k.a. line coverage or basic block coverage
    • Which statements execute in a test suite
  • Decision coverage
    • Which boolean expressions in control structures evaluated to both true and false during suite execution
  • Path coverage
    • Which paths through a program’s control flow graph are taken in the test suite
27
Terms: Mutation Testing
A mutation of a program is a version of the program with one or more random changes
Mutation testing is another way to measure the quality of a test suite
• Ammann and Offutt call it syntax-based coverage
Idea: generate a large number of mutants
• Run the test suite on these
• If few mutants are detected, the test suite may not be very good
• Difficulties
  • Cost of testing many versions of a program
  • How to generate mutants (operators)
• In principle, can subsume many other forms of coverage
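A minimal sketch of the idea, assuming three hand-written mutants of an absolute-value function (real tools generate mutants automatically with mutation operators; every name here is hypothetical). One mutant is deliberately equivalent to the original, which shows why a mutant can survive even a good suite.

```c
typedef int (*abs_fn)(int);

/* Original program. */
int abs_original(int x) { return x < 0 ? -x : x; }

/* Hand-written mutants, each a single small syntactic change. */
int abs_mutant_le(int x)  { return x <= 0 ? -x : x; } /* < becomes <= (behaves the same) */
int abs_mutant_gt(int x)  { return x > 0 ? -x : x; }  /* < becomes > */
int abs_mutant_neg(int x) { return x < 0 ? x : x; }   /* -x becomes x */

/* A suite is a list of inputs; the oracle is the original's output.
 * Returns 1 if any input distinguishes the mutant from the original. */
int suite_kills(abs_fn mutant, const int *inputs, int n) {
    for (int i = 0; i < n; i++) {
        if (mutant(inputs[i]) != abs_original(inputs[i])) return 1;
    }
    return 0;
}

/* How many of the three mutants the suite kills. */
int count_killed(const int *inputs, int n) {
    abs_fn mutants[] = { abs_mutant_le, abs_mutant_gt, abs_mutant_neg };
    int killed = 0;
    for (int i = 0; i < 3; i++) {
        killed += suite_kills(mutants[i], inputs, n);
    }
    return killed;
}
```

With the single input 0 no mutant is killed; with inputs 5 and -3 two are killed. The third can never be killed, because x <= 0 and x < 0 give the same result for this function.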
28
Faults, Errors, and Failures
Fault: a static flaw in a program
• What we usually think of as “a bug”
Error: a bad program state that results from a fault
• Not every fault always produces an error
Failure: an observable incorrect behavior of a program as a result of an error
• Not every error ever becomes visible
29
To Expose a Fault with a Test
Reachability: the test must actually reach and execute the location of the fault
Infection: the fault must actually corrupt the program state (produce an error)
Propagation: the error must persist and cause an incorrect output – a failure
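The three conditions can be seen on a tiny function. The example below is a hypothetical sketch with a seeded off-by-one fault; the function name and the inputs are assumptions chosen for illustration.

```c
/* Intended behavior: count the zeros in a[0..n-1].
 * Seeded fault: the loop starts at index 1 instead of 0. */
int count_zeros_faulty(const int *a, int n) {
    int count = 0;
    for (int i = 1; i < n; i++) {   /* fault: should be i = 0 */
        if (a[i] == 0) count++;
    }
    return count;
}
```

Every call reaches the faulty initialization, and the index starts at the wrong value, so the state is in error. With {2, 7, 0} the error does not propagate: the result is 1, which is correct. With {0, 7, 2} the skipped element matters and the function returns 0 instead of 1, a visible failure.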
30
Controllability and Observability
Goals for a test case:
• Reach a fault
• Produce an error
• Make the error visible as a failure
In order to make this easy the program must be controllable and observable
• Controllability:
  • How easy it is to drive the program where we want to go
• Observability:
  • How easy it is to tell what the program is doing
31
Design for Testability
If a program is not designed to be controllable and observable, it generally won’t be
We have to start preparing for testing before we write any code
• Testing as an after-the-fact, ad hoc, exercise is often limited by earlier design choices
32
Test-Driven Development
One way to design for testability is to write the test cases before the code
• Idea arising from Extreme Programming and agile development
• Write automated test cases first
• Then write the code to satisfy tests
• Helps focus attention on making software well-specified
• Forces observability and controllability: you have to be able to handle the test cases you’ve already written (before deciding they were impractical)
• Reduces temptation to tailor tests to idiosyncratic behaviors of implementation
33
Controllability: Simulation and Stubbing
A key to controllable code is effective simulation and stubbing
• Simulation of low-level hardware devices through a clean driver interface
  • Real hardware may be slow
  • May be impossible/expensive to induce some hardware failure modes on real hardware
  • Real hardware may be a limited resource
• Stubbing for other routines and code
  • Other code/modules may not be complete
  • May be slow and irrelevant to test
  • May need to simulate failure of other modules
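One common way to get a "clean driver interface" in C is a function pointer. The sketch below is an assumption-laden illustration: read_sensor_fn, average_reading, and both stubs are hypothetical names, not part of any real driver API.

```c
/* Hypothetical driver interface: the code under test sees only this
 * function pointer type, never the real hardware. Returns 0 on success. */
typedef int (*read_sensor_fn)(int *value_out);

/* Code under test: average `samples` readings (samples > 0), or
 * return -1 if the device reports a failure. */
int average_reading(read_sensor_fn read_sensor, int samples) {
    long sum = 0;
    for (int i = 0; i < samples; i++) {
        int value;
        if (read_sensor(&value) != 0) return -1;
        sum += value;
    }
    return (int)(sum / samples);
}

/* Stub 1: a simulated device that always succeeds with a fixed value. */
int stub_sensor_ok(int *value_out) {
    *value_out = 42;
    return 0;
}

/* Stub 2: a simulated device that fails on its third read, a failure
 * mode that may be hard to induce on real hardware. */
static int stub_reads = 0;
int stub_sensor_fails_third(int *value_out) {
    stub_reads++;
    if (stub_reads == 3) return 1;
    *value_out = 10;
    return 0;
}
```

Passing stub_sensor_fails_third drives the error-handling path on demand, which is exactly the controllability the slide asks for.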
34
Controllability: Downwards Scalability
Another important aspect of controllability is to make code “downwards scalable”
• Many faults cause an error only in a corner case due to a resource limit
• An effective strategy for finding errors is to reduce the resource limits
  • Test a version of the program with very tight bounds
  • Finding corner cases is easier if the corners are close together
• Too many programs hard-code resource limits or make assumptions about resources unconnected to defined limits
  • E.g., not checking the result of malloc
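As a rough illustration of downward scalability, the sketch below makes the capacity of a bounded stack a parameter instead of a hard-coded constant, and checks the result of malloc. The type and function names are hypothetical, and capacity is assumed to be at least 1.

```c
#include <stdlib.h>

/* Bounded stack whose capacity is chosen by the caller, so a test can
 * shrink it to two or three slots and hit the "full" corner quickly. */
typedef struct {
    int *items;
    int capacity;
    int size;
} int_stack;

/* Returns 0 on success, -1 if allocation fails. */
int stack_init(int_stack *s, int capacity) {
    s->items = malloc((size_t)capacity * sizeof(int));
    if (s->items == NULL) return -1;
    s->capacity = capacity;
    s->size = 0;
    return 0;
}

/* Returns 0 on success, -1 if the stack is full. */
int stack_push(int_stack *s, int value) {
    if (s->size == s->capacity) return -1;
    s->items[s->size++] = value;
    return 0;
}

void stack_free(int_stack *s) {
    free(s->items);
    s->items = NULL;
}
```

A test with capacity 2 reaches the overflow case on the third push, where a production-sized stack might need thousands of operations to get there.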
35
Observability: Assertions
Assertions improve observability by making (some) errors into failures
• Even if the effect of a fault doesn’t propagate, it may be visible if an assertion checks the state at the right time
Assertions also improve observability by making the error, rather than failure, visible
• Know how the state was corrupted directly, not just eventual effect
36
Observability: Invariant Checkers
Can extend the idea of assertions to writing “full” invariant checkers
• Do a crawl of code’s basic data structures
• Check various invariants that would be too expensive to check at runtime
• Invariant checker can be written to be easy-to-use: recursion, memory allocation, etc.
  • Won’t run on actual system
• But be careful! If your invariant checker has a bug and changes the system state. . .
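A small invariant checker might look like the following. This is a sketch for a hypothetical sorted doubly linked list, assuming the list is acyclic; the node type and function name are made up for illustration.

```c
#include <stddef.h>

typedef struct node {
    struct node *prev;
    struct node *next;
    int key;
} node;

/* Invariant checker for a sorted doubly linked list. Returns 1 if
 * every invariant holds, 0 otherwise. It walks the list through
 * const pointers, so the compiler rejects accidental writes to the
 * structure being inspected. */
int list_invariants_hold(const node *head) {
    const node *prev = NULL;
    for (const node *cur = head; cur != NULL; cur = cur->next) {
        if (cur->prev != prev) return 0;                    /* back link matches */
        if (prev != NULL && prev->key > cur->key) return 0; /* keys are sorted */
        prev = cur;
    }
    return 1;
}
```

Declaring the parameter const is one cheap guard against the danger the slide mentions: a checker that changes the state it is supposed to observe.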
37
Graph Coverage
Cover all the nodes, edges, or paths of some graph related to the program
Examples:
• Statement coverage
• Branch coverage
• Path coverage
• Data flow (def-use) coverage
• Model-based testing coverage
• Many more – most common kind of coverage, by far
38
Statement/Basic Block Coverage
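if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}
[Figure: control-flow graph with nodes 1 to 4; node 1 branches on x < y (to y = 0; x = x + 1) and on x >= y (to x = y), and both branches rejoin at node 4]
if (x < y)
{
  y = 0;
  x = x + 1;
}
[Figure: control-flow graph with nodes 1 to 3; node 1 goes to node 2 (y = 0; x = x + 1) when x < y, and directly to node 3 when x >= y]
Statement coverage: cover every node of these graphs
Treat y = 0; x = x + 1 as one node because if one statement executes the other must also execute (code is a basic block)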
39
Branch Coverage
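if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}
[Figure: control-flow graph with nodes 1 to 4; node 1 branches on x < y (to y = 0; x = x + 1) and on x >= y (to x = y), and both branches rejoin at node 4]
if (x < y)
{
  y = 0;
  x = x + 1;
}
[Figure: control-flow graph with nodes 1 to 3; node 1 goes to node 2 (y = 0; x = x + 1) when x < y, and directly to node 3 when x >= y]
Branch coverage vs. statement coverage: same for if-then-else
But consider this if-then structure. For branch coverage we can’t just cover all nodes, but must cover all edges – get to node 3 both after 2 and without executing 2!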
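The gap between the two criteria can be made concrete with manual instrumentation. In this sketch the edge counters and the empty else branch are assumptions added only to record which edge ran; a coverage tool would normally collect this for you. The return value is arbitrary and exists only so the function has an observable result.

```c
/* Edge counters for the if-then structure: hypothetical manual
 * instrumentation standing in for a coverage tool. */
int edge_taken = 0;    /* edge from node 1 to node 2 (x < y) */
int edge_skipped = 0;  /* edge from node 1 to node 3 (x >= y) */

int if_then(int x, int y) {
    if (x < y) {
        edge_taken++;
        y = 0;
        x = x + 1;
    } else {
        edge_skipped++;   /* instrumentation only: the original has no else */
    }
    return x + y;
}

/* Returns 1 once both edges have been exercised at least once. */
int branch_coverage_complete(void) {
    return edge_taken > 0 && edge_skipped > 0;
}
```

The single test if_then(1, 2) executes every statement of the original code, so statement coverage is complete, yet branch_coverage_complete() still reports 0 until a test such as if_then(2, 1) skips the body.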
40
Path Coverage
if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}
if (x < y)
{
  y = 0;
  x = x + 1;
}
[Figure: control-flow graph with nodes 1 to 6; the if-then-else occupies nodes 1 to 4 and the if-then that follows occupies nodes 4 to 6]
How many paths through this code are there? Need one test case for each to get path coverage
To get statement and branch coverage, we only need two test cases: 1 2 4 5 6 and 1 3 4 6
Path coverage needs two more:
1 2 4 5 6
1 3 4 6
1 2 4 6
1 3 4 5 6
In general: exponential in the number of conditional branches!
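The exponential growth is easy to observe. The sketch below uses three sequential if statements whose conditions are independent by construction (an assumption; in real code some combinations may be infeasible), and all names are hypothetical.

```c
/* Three sequential, independent if statements. The return value
 * encodes which branches were taken, so each distinct value
 * corresponds to a distinct path through the control-flow graph. */
unsigned path_signature(int a, int b, int c) {
    unsigned path = 0;
    if (a) path |= 1u;
    if (b) path |= 2u;
    if (c) path |= 4u;
    return path;
}

/* Runs every combination of the three conditions and counts how many
 * distinct paths were observed. */
int count_distinct_paths(void) {
    int seen[8] = {0};
    int distinct = 0;
    for (int a = 0; a <= 1; a++) {
        for (int b = 0; b <= 1; b++) {
            for (int c = 0; c <= 1; c++) {
                unsigned p = path_signature(a, b, c);
                if (!seen[p]) {
                    seen[p] = 1;
                    distinct++;
                }
            }
        }
    }
    return distinct;
}
```

Two tests (all conditions true, all false) are enough for branch coverage here, while path coverage needs all 2^3 = 8 combinations; each additional independent if doubles the count.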
41
Data Flow Coverage
x = 3;
y = 3;
if (w) {
  x = y + 2;
}
if (z) {
  y = x - 2;
}
n = x + y;
[Figure: control-flow graph with nodes 1 to 7, annotated with definitions and uses: x = 3 is Def(x); y = 3 is Def(y); x = y + 2 (edge w) is Def(x), Use(y); y = x - 2 (edge z) is Use(x), Def(y); n = x + y is Use(x), Use(y); edges !w and !z skip the assignments]
Annotate program with locations where variables are defined and used (very basic static analysis)
Def-use pair coverage requires executing all possible pairs of nodes where a variable is first defined and then used, without any intervening re-definitions
E.g., this path covers the pair where x is defined at 1 and used at 7: 1 2 3 5 6 7
But this path does NOT: 1 2 3 4 5 6 7
May be many pairs, some not actually executable
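Which def-use pairs a test covers can be recorded by tracking the last definition of each variable. The sketch below instruments the slide's code by hand; the struct and function names are hypothetical, and the node numbers are partly inferred from the slide's example paths (1 for x = 3, 2 for y = 3, 4 for x = y + 2, 6 for y = x - 2, 7 for n = x + y).

```c
/* Records which definition of each variable reaches the use at node 7. */
typedef struct {
    int n;       /* value computed at node 7 */
    int x_def;   /* node whose definition of x reaches node 7 */
    int y_def;   /* node whose definition of y reaches node 7 */
} du_result;

du_result run_with_def_tracking(int w, int z) {
    du_result r;
    int x = 3;
    r.x_def = 1;                         /* node 1: Def(x) */
    int y = 3;
    r.y_def = 2;                         /* node 2: Def(y) */
    if (w) { x = y + 2; r.x_def = 4; }   /* node 4: Def(x), Use(y) */
    if (z) { y = x - 2; r.y_def = 6; }   /* node 6: Use(x), Def(y) */
    r.n = x + y;                         /* node 7: Use(x), Use(y) */
    return r;
}
```

Running with w = 0 covers the pair (x defined at 1, used at 7); running with w = 1 covers (x defined at 4, used at 7) instead, because the redefinition at node 4 kills the earlier one.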
42
Logic Coverage
What if, instead of:
if (x < y)
{
  y = 0;
  x = x + 1;
}
we have:
if (((a > b) || G) && (x < y))
{
  y = 0;
  x = x + 1;
}
[Figure: control-flow graph with nodes 1 to 3; the edge guarded by ((a > b) || G) && (x < y) leads to y = 0; x = x + 1, and the edge guarded by ((a <= b) && !G) || (x >= y) skips it]
Now, branch coverage will guarantee that we cover all the edges, but does not guarantee we will do so for all the different logical reasons
We want to test the logic of the guard of the if statement
43
Active Clause Coverage
Predicate: ((a > b) or G) and (x < y)

Row  (a > b)  G  (x < y)  Predicate
1    T        F  T        T
2    F        F  T        F
3    F        T  T        T
4    F        F  T        F   (duplicate of row 2)
5    T        T  T        T
6    T        T  F        F

Rows 1 and 2: with these values for G and (x < y), (a > b) determines the value of the predicate
Rows 3 and 4: with these values for (a > b) and (x < y), G determines the value of the predicate
Rows 5 and 6: with these values for (a > b) and G, (x < y) determines the value of the predicate
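Whether a clause "determines" the predicate can be checked mechanically: flip that clause while holding the others fixed and see whether the predicate changes. The sketch below does this for the slide's predicate; the function names are hypothetical, and each clause is passed as 0 or 1.

```c
/* The slide's predicate, with each clause passed as 0 (F) or 1 (T). */
int predicate(int a_gt_b, int g, int x_lt_y) {
    return (a_gt_b || g) && x_lt_y;
}

/* Each function returns 1 if flipping the named clause changes the
 * predicate while the other two clauses keep the given values. */
int a_gt_b_determines(int g, int x_lt_y) {
    return predicate(0, g, x_lt_y) != predicate(1, g, x_lt_y);
}

int g_determines(int a_gt_b, int x_lt_y) {
    return predicate(a_gt_b, 0, x_lt_y) != predicate(a_gt_b, 1, x_lt_y);
}

int x_lt_y_determines(int a_gt_b, int g) {
    return predicate(a_gt_b, g, 0) != predicate(a_gt_b, g, 1);
}
```

For example, a_gt_b_determines(0, 1) is 1, matching the values of G and (x < y) in the first pair of rows, while a_gt_b_determines(1, 1) is 0 because G alone already makes the left side true.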
44
Input Domain Partitioning
Partition scheme q of domain D
The partition q defines a set of blocks, $B_q = \{b_1, b_2, \ldots, b_Q\}$
The partition must satisfy two properties:
1. Blocks must be pairwise disjoint (no overlap): $b_i \cap b_j = \emptyset,\ \forall i \neq j,\ b_i, b_j \in B_q$
2. Together the blocks cover the domain D (complete): $\bigcup_{b \in B_q} b = D$
[Figure: domain D divided into blocks b1, b2, b3]
Coverage then means using at least one input from each of b1, b2, b3, . . .
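As a concrete instance, the sketch below partitions the int domain by sign. The partition itself and all names are assumptions chosen for illustration; the point is that mapping every input to exactly one block index gives disjointness and completeness by construction.

```c
/* Hypothetical partition of the int domain by sign:
 * block 0 = negative, block 1 = zero, block 2 = positive. */
enum { NUM_BLOCKS = 3 };

/* Every int maps to exactly one block, so the blocks are pairwise
 * disjoint and together cover the whole domain. */
int block_of(int x) {
    if (x < 0) return 0;
    if (x == 0) return 1;
    return 2;
}

/* Returns 1 if the inputs include at least one value from every block. */
int covers_all_blocks(const int *inputs, int n) {
    int hit[NUM_BLOCKS] = {0};
    for (int i = 0; i < n; i++) {
        hit[block_of(inputs[i])] = 1;
    }
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (!hit[b]) return 0;
    }
    return 1;
}
```

The suite {-4, 7} misses the zero block, while {-4, 0, 7} satisfies the coverage criterion with one input per block.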
45
Syntax-Based Coverage
Based on mutation testing (a pet topic of Ammann and Offutt, who are heavily into this research area)
A somewhat different kind of creature than the other coverages we’ve looked at
Idea: generate many syntactic mutants of the original program
Coverage: how many mutants does a test suite kill (detect)?
46
Generation vs. Recognition
Generation of tests based on coverage means producing a test suite to achieve a certain level of coverage
• As you can imagine, generally very hard
• Consider: generating a suite for 100% statement coverage easily reaches “solving the halting problem” level
• Obviously hard for, say, mutant-killing
Recognition means seeing what level of coverage an existing test suite reaches
47
Coverage and Subsumption
Sometimes one coverage approach subsumes another
• If you achieve 100% coverage of criterion A, you are guaranteed to satisfy B as well
• For example, consider node and edge coverage
  • (there’s a subtlety here, actually – can you spot it?)
What does this mean?
• Unfortunately, not a great deal
• If test suite X satisfies “stronger” criterion A and test suite Y satisfies “weaker” criterion B
  • Y may still reveal bugs that X does not!
  • For example, consider our running example and statement vs. branch coverage
• It means we should take coverage with a grain of salt, for one thing
48
Levels of Testing
Adapted from Beizer, by Ammann and Offutt
• Level 0: Testing is debugging
• Level 1: Testing is to show the program works
• Level 2: Testing is to show the program doesn’t work
• Level 3: Testing is not to prove anything specific, but to reduce risk of using program
• Level 4: Testing is a mental discipline that helps develop higher quality software
49
What’s So Good About Coverage?
Consider a fault that causes failure every time the code is executed
Don’t execute the code: cannot possibly find the fault!
That’s a pretty good argument for statement coverage
int findLast (int a[], int n, int x) {
  // Returns index of last element
  // in a equal to x, or -1 if no
  // such. n is length of a
  int i;
  for (i = n-1; i >= 0; i--) {
    if (a[i] == x)
      return i;
  }
  return 0;  // fault: the comment above specifies -1
}
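The argument can be tried out directly. The sketch below repeats the slide's function, assuming the comparison was meant to be == and treating the final return 0 as the fault (the comment promises -1). The count_failures helper and the test data are hypothetical.

```c
/* The slide's function; the final return 0 is kept as the fault. */
int findLast(int a[], int n, int x) {
    int i;
    for (i = n - 1; i >= 0; i--) {
        if (a[i] == x)
            return i;
    }
    return 0;   /* faulty statement: should be -1 */
}

/* Runs a suite of (x, expected) pairs against a fixed array and
 * returns the number of failing tests. */
int count_failures(int a[], int n, const int *xs, const int *expected, int tests) {
    int failures = 0;
    for (int t = 0; t < tests; t++) {
        if (findLast(a, n, xs[t]) != expected[t]) failures++;
    }
    return failures;
}
```

On the array {2, 3, 5, 3}, a suite that only searches for values present in the array (3 and 5) never executes the faulty statement and reports no failures. Adding a search for 7, with expected result -1, executes it and fails immediately.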