Prioritizing Test Cases for Regression Testing

Article By: Rothermel, et al.

Presentation by: Martin, Otto, and Prashanth

•Test case prioritization techniques - schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal.

• One goal is the rate of fault detection - a measure of how quickly faults are detected within the testing process

An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible.

•One application of prioritization techniques involves regression testing

This paper describes several techniques for using test execution information to prioritize test cases for regression testing, including:

1) techniques that order test cases based on their total coverage of code components,

2) techniques that order test cases based on their coverage of code components not previously covered, and

3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover.

When the time required to re-execute an entire test suite is short, test case prioritization may not be cost-effective; it may be sufficient simply to schedule test cases in any order.

When the time required to execute an entire test suite is sufficiently long, however, test case prioritization may be beneficial because, in this case, meeting testing goals earlier can yield meaningful benefits.

In general test case prioritization, given program P and test suite T, we prioritize the test cases in T with the intent of finding an ordering of test cases that will be useful over a succession of subsequent modified versions of P.

In the case of regression testing, prioritization techniques can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs.

This paper considers 9 different test case prioritization techniques.

The first three techniques serve as experimental controls.

The last six techniques represent heuristics that could be implemented using software tools.

A source of motivation for these approaches is the conjecture that the availability of test execution data can be an asset.

This assumes that past test execution data can be used to predict, with sufficient accuracy, subsequent execution behavior.

Definition 1. The Test Case Prioritization Problem:

Given: T, a test suite, PT, the set of permutations of T, and f, a function from PT to the real numbers.

PT represents the set of all possible prioritizations (orderings) of T.

f is a function that, applied to any such ordering, yields an award value for that ordering.
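Read this way, the problem is to find an ordering in PT whose award value is at least as high as that of every other ordering. The sketch below illustrates this by brute force on a tiny suite. The data in `faults_exposed` and the particular `award` function are hypothetical and chosen only for illustration; they are not taken from the paper. Enumerating PT is feasible only for very small suites, since it contains n! orderings.

```python
from itertools import permutations

# Hypothetical data: which faults each test case exposes.
faults_exposed = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": {"f3"}}


def award(ordering):
    """Toy award function f: faults revealed earlier are worth more."""
    seen = set()
    score = 0
    n = len(ordering)
    for position, test in enumerate(ordering):
        new_faults = faults_exposed[test] - seen
        score += len(new_faults) * (n - position)
        seen |= new_faults
    return score


def best_ordering(tests, f):
    """Return the ordering in PT (all permutations of tests) that maximizes f."""
    return max(permutations(tests), key=f)


best = best_ordering(list(faults_exposed), award)
print(best, award(best))  # ('t2', 't3', 't1') 8
```

The heuristics described in the following slides avoid this enumeration entirely.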

A challenge: care must be taken to keep the cost of performing the prioritization from excessively delaying the very regression testing activities it is intended to facilitate.

M3: Optimal prioritization.

Given program P and a set of known faults for P, if we can determine, for test suite T, which test cases in T expose which faults in P, then we can determine an optimal ordering of the test cases in T for maximizing T's rate of fault detection for that set of faults.

This is not a practical technique, as it requires a priori knowledge of the existence of faults and of which test cases expose which faults.

However, by using this technique in the empirical studies, we can gain insight into the success of other practical heuristics, by comparing their solutions to optimal solutions.

M4: Total statement coverage prioritization.

By instrumenting a program, we can determine, for any test case, which statements in that program were exercised (covered) by that test case.

We can then prioritize test cases in terms of the total number of statements they cover by counting the number of statements covered by each test case and then sorting the test cases in descending order of that number.
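The count-and-sort step can be sketched in a few lines. The coverage data here is hypothetical (statement numbers per test case), and the function name `total_coverage_order` is an illustrative choice rather than anything from the paper or its tools.

```python
def total_coverage_order(coverage):
    """Order test cases by the number of statements each covers, descending.

    coverage maps a test case name to the set of statements it covers.
    Python's sort is stable, so ties keep their original relative order.
    """
    return sorted(coverage, key=lambda test: len(coverage[test]), reverse=True)


# Hypothetical coverage data gathered by instrumentation.
coverage = {
    "t1": {1, 2},
    "t2": {1, 2, 3, 4},
    "t3": {2, 3, 5},
}

order = total_coverage_order(coverage)
print(order)  # ['t2', 't3', 't1']
```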

M5: Additional statement coverage prioritization.

Total statement coverage prioritization schedules test cases in the order of total coverage achieved; however, having executed a test case and covered certain statements, more may be gained in subsequent testing by executing statements that have not yet been covered.

Additional statement coverage prioritization iteratively selects a test case that yields the greatest statement coverage, then adjusts the coverage information on all remaining test cases to indicate their coverage of statements not yet covered, and repeats this process until all statements covered by at least one test case have been covered.

We may reach a point where each statement has been covered by at least one test case, and the remaining unprioritized test cases cannot add additional statement coverage. We could order these remaining test cases using any prioritization technique.
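A minimal greedy sketch of this selection loop follows, assuming coverage is available as a set of statement numbers per test case. The data and the function name are hypothetical. The function returns the prioritized prefix plus the leftover test cases that add no further coverage, which could then be ordered by any other technique.

```python
def additional_coverage_order(coverage):
    """Greedy 'additional' prioritization.

    Repeatedly pick the test case that covers the most statements not yet
    covered. Stop once no remaining test case adds new coverage.
    Returns (ordered, leftover).
    """
    remaining = dict(coverage)
    covered = set()
    ordered = []
    while remaining:
        best = max(remaining, key=lambda test: len(remaining[test] - covered))
        if not remaining[best] - covered:
            break  # nothing left adds coverage
        ordered.append(best)
        covered |= remaining.pop(best)
    return ordered, list(remaining)


# Hypothetical coverage data.
coverage = {
    "t1": {1, 2, 3},
    "t2": {1, 2, 3, 4},
    "t3": {5, 6},
    "t4": {4},
}

ordered, leftover = additional_coverage_order(coverage)
print(ordered, leftover)  # ['t2', 't3'] ['t1', 't4']
```

Note that total coverage prioritization would have placed t1 second here, even though it adds nothing after t2.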

M6: Total branch coverage prioritization.

Total branch coverage prioritization is the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.

In this context, we define branch coverage as coverage of each possible overall outcome of a (possibly compound) condition in a predicate. Thus, for example, each if or while statement must be exercised such that it evaluates at least once to true and at least once to false.

M7: Additional branch coverage prioritization.

Additional branch coverage prioritization is the same as additional statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.

After complete coverage has been achieved, the remaining test cases are prioritized by resetting coverage vectors to their initial values and reapplying additional branch coverage prioritization to the remaining test cases.
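The reset step can be sketched as a variation of the greedy loop. This is an illustrative sketch only: branch outcomes are represented as hypothetical strings such as "b1T" (branch 1 taken true), and the handling of test cases that cover nothing at all is an assumption the slides do not address.

```python
def additional_with_reset(coverage):
    """Greedy 'additional' prioritization that resets coverage when exhausted.

    coverage maps a test case name to the set of branch outcomes it covers.
    When no remaining test case adds new coverage, the covered set is cleared
    and the greedy selection is reapplied to the remaining test cases.
    """
    remaining = dict(coverage)
    covered = set()
    ordered = []
    while remaining:
        best = max(remaining, key=lambda test: len(remaining[test] - covered))
        if not remaining[best] - covered:
            if not covered:
                # Assumption: tests covering nothing go last, in given order.
                ordered.extend(remaining)
                break
            covered = set()  # reset coverage information and try again
            continue
        ordered.append(best)
        covered |= remaining.pop(best)
    return ordered


# Hypothetical branch coverage data.
coverage = {
    "t1": {"b1T", "b2T"},
    "t2": {"b1T", "b1F", "b2T"},
    "t3": {"b2F"},
    "t4": {"b1F"},
}

order = additional_with_reset(coverage)
print(order)  # ['t2', 't3', 't1', 't4']
```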

M8: Total fault-exposing-potential (FEP) prioritization.

Some faults are more easily exposed than other faults, and some test cases are more adept at revealing particular faults than other test cases.

The ability of a test case to expose a fault (that test case's fault-exposing potential, or FEP) depends not only on whether the test case covers (executes) a faulty statement, but also on the probability that a fault in that statement will cause a failure for that test case.

Three probabilities that could be used in determining FEP:

1) the probability that a statement s is executed (execution probability),

2) the probability that a change in s can cause a change in program state (infection probability), and

3) the probability that a change in state propagates to output (propagation probability).

This paper adopts an approach that uses mutation analysis to produce a combined estimate of propagation-and-infection that does not incorporate independent execution probabilities.

Mutation analysis creates a large number of faulty versions (mutants) of a program by altering program statements, and uses these to assess the quality of test suites by measuring whether those test suites can detect those faults (‘kill’ those mutants).

Given program P and test suite T, we first create a set of mutants $N = \{n_1, n_2, \ldots, n_m\}$ for P, noting which statement $s_j$ in P contains each mutant. Next, for each test case $t_i$ in T, we execute each mutant version $n_k$ of P on $t_i$, noting whether $t_i$ kills that mutant.

Having collected this information for every test case and mutant, we consider each test case $t_i$ and each statement $s_j$ in P, and calculate the fault-exposing potential $FEP(s_j, t_i)$ of $t_i$ on $s_j$ as the ratio of mutants of $s_j$ killed by $t_i$ to the total number of mutants of $s_j$.
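The ratio described above can be computed directly from the kill records. The sketch below assumes two hypothetical inputs: a map from each mutant to the statement containing it, and a map from each test case to the set of mutants it kills. In the study itself this information came from a mutation tool; the data structures here are illustrative only.

```python
from collections import defaultdict


def fep_values(mutant_statement, kills):
    """Compute FEP(s, t) for every statement s and test case t.

    mutant_statement: mutant id -> id of the statement containing the mutant
    kills: test case id -> set of mutant ids killed by that test case
    Returns a dict keyed by (statement, test) with the kill ratio.
    """
    mutants_per_statement = defaultdict(set)
    for mutant, statement in mutant_statement.items():
        mutants_per_statement[statement].add(mutant)

    fep = {}
    for test, killed in kills.items():
        for statement, mutants in mutants_per_statement.items():
            fep[(statement, test)] = len(killed & mutants) / len(mutants)
    return fep


# Hypothetical mutation data: two mutants in each of two statements.
mutant_statement = {"n1": "s1", "n2": "s1", "n3": "s2", "n4": "s2"}
kills = {"t1": {"n1", "n3", "n4"}, "t2": {"n2"}}

fep = fep_values(mutant_statement, kills)
print(fep[("s1", "t1")], fep[("s2", "t1")])  # 0.5 1.0
```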

To perform total FEP prioritization, given these $FEP(s, t)$ values, we next calculate, for each test case $t_i$ in T, an award value, by summing the $FEP(s_j, t_i)$ values for all statements $s_j$ in P.

Given these award values, we then prioritize test cases by sorting them in order of descending award value.
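As a small sketch of the sum-and-sort step, assume the FEP values are already available as a dict keyed by (statement, test case). The values and names below are hypothetical.

```python
def total_fep_order(fep, tests):
    """Sum FEP(s, t) over all statements for each test, then sort descending.

    fep: dict keyed by (statement, test) -> FEP value in [0, 1]
    Returns (ordered tests, award values).
    """
    award = {test: 0.0 for test in tests}
    for (_statement, test), value in fep.items():
        award[test] += value
    ordered = sorted(tests, key=lambda test: award[test], reverse=True)
    return ordered, award


# Hypothetical FEP values for two statements and three test cases.
fep = {
    ("s1", "t1"): 0.5, ("s2", "t1"): 1.0,
    ("s1", "t2"): 0.5, ("s2", "t2"): 0.0,
    ("s1", "t3"): 1.0, ("s2", "t3"): 1.0,
}

ordered, award = total_fep_order(fep, ["t1", "t2", "t3"])
print(ordered)  # ['t3', 't1', 't2']
```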

M9: Additional fault-exposing-potential (FEP) prioritization.

Additional FEP prioritization lets us account for the fact that additional executions of a statement may be less valuable than initial executions.

We require a mechanism for measuring the value of an execution of a statement that can be related to FEP values.

For this, we use the term confidence. We say that the confidence in statement s, C(s), is an estimate of the probability that s is correct.

If we execute a test case t that exercises s and does not reveal a fault in s, C(s) should increase.
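The slides do not give the update rule for C(s). The sketch below assumes one plausible rule, C'(s) = C(s) + (1 - C(s)) * FEP(s, t), so that the gain from running t is the sum over statements of (1 - C(s)) * FEP(s, t). Both the rule and the data are assumptions made for illustration and should be checked against the paper before being relied on.

```python
def additional_fep_order(fep, tests, statements):
    """Greedy additional-FEP prioritization under an assumed confidence update.

    Assumed rule (not stated in the slides):
        C'(s) = C(s) + (1 - C(s)) * FEP(s, t)
    Each step picks the test with the largest total confidence gain.
    """
    confidence = {s: 0.0 for s in statements}
    remaining = list(tests)
    ordered = []

    def gain(test):
        return sum(
            (1.0 - confidence[s]) * fep.get((s, test), 0.0) for s in statements
        )

    while remaining:
        best = max(remaining, key=gain)
        ordered.append(best)
        remaining.remove(best)
        for s in statements:
            confidence[s] += (1.0 - confidence[s]) * fep.get((s, best), 0.0)
    return ordered


# Hypothetical FEP values: t1 and t2 both target s1, t3 targets s2.
fep = {("s1", "t1"): 0.8, ("s1", "t2"): 0.7, ("s2", "t3"): 0.5}

order = additional_fep_order(fep, ["t1", "t2", "t3"], ["s1", "s2"])
print(order)  # ['t1', 't3', 't2']
```

Total FEP prioritization would order these as t1, t2, t3. Here t2 drops to last because, after t1 runs, confidence in s1 is already high and a second execution of s1 is worth less.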

Research Questions

• Can test case prioritization improve the rate of fault detection in test suites?

• How do the various test case prioritization techniques discussed earlier compare to one another in terms of effects on rate of fault detection?

Effectiveness Measures

• Use a weighted Average of the Percentage of Faults Detected (APFD)

• Ranges from 0 to 100

• Higher numbers mean faster detection
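The slides do not give a formula for APFD. A closed form commonly used in the prioritization literature is APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of test cases, m the number of faults, and TF_i the position of the first test case that exposes fault i. The sketch below uses that form, scaled to 0..100, on hypothetical data; it assumes every fault is exposed by at least one test case in the ordering.

```python
def apfd(ordering, faults_exposed, faults):
    """APFD (0..100) of an ordering, using the commonly cited closed form.

    ordering: list of test case names in execution order
    faults_exposed: test case name -> set of faults it exposes
    faults: list of all faults (each must be exposed by some test case)
    """
    n = len(ordering)
    m = len(faults)
    first_positions = []
    for fault in faults:
        # 1-based position of the first test case that exposes this fault.
        position = next(
            i for i, test in enumerate(ordering, start=1)
            if fault in faults_exposed[test]
        )
        first_positions.append(position)
    return 100.0 * (1.0 - sum(first_positions) / (n * m) + 1.0 / (2 * n))


# Hypothetical data: three test cases, two faults.
faults_exposed = {"A": {"f1"}, "B": set(), "C": {"f1", "f2"}}
faults = ["f1", "f2"]

untreated = apfd(["A", "B", "C"], faults_exposed, faults)    # about 50
prioritized = apfd(["C", "A", "B"], faults_exposed, faults)  # about 83.3
print(round(untreated, 1), round(prioritized, 1))
```

Running the test case that exposes both faults first raises the score, which matches the intent that higher values mean faster detection.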

Problems with APFD

• Doesn’t measure the cost of prioritization

• Cost is normally amortized because test suites are created after the release of a version of the software

Effectiveness Example

Programs used

• Aristotle program analysis system for test coverage and control-flow graph information

• Proteum mutation system to obtain mutation scores

• Used 8 C programs as subjects

• First 7 were created at Siemens, the eighth is a European Space Agency program

Siemens Programs - Description

• 7 programs used by Siemens in a study that observed the “fault detecting effectiveness of coverage criteria”

• Created faulty versions of these programs by manually seeding them with single errors, giving the “number of versions” column

• Using only single-line faults allows researchers to determine whether a test case discovers the error or not

• For each of the seven programs, a test case suite was created by Siemens. It was first built via a black-box method, then completed using white-box testing, so that each “executable statement, edge, and definition use pair … was exercised by at least 30 test cases.”

• Kept faulty programs whose errors were detectable by between 3 and 350 test cases

• Test suites were created by the researchers by random selection until a branch-coverage-adequate test suite was created

• Proteum was used to create mutants of the seven programs

Space Program – Description

• 33 versions of space with only one fault in each were created by the ESA, 2 more were created by the research team

• Initial pool of 10 000 test cases was obtained from Vokolos and Frankl

• Used these as a base and added cases until each statement and edge was exercised by at least 30 test cases

• Created a branch-coverage-adequate test suite in the same way as for the Siemens programs

• Also created mutants via Proteum

Empirical Studies and Results

4 different studies using the 8 programs:

• Siemens programs with APFD measured relative to Siemens faults

• Siemens programs with APFD measured relative to mutants

• Space with APFD measured relative to actual faults

• Space with APFD measured relative to mutants

Siemens programs with APFD measured relative to Siemens faults – Study Format

• M2 to M9 were applied to each of the 1000 test suites, resulting in 8000 prioritized test suites

• The original 1000 were used as M1

• Calculated the APFD relative to the faults provided by the program

Example boxplot

Study 1 - Overall observations

• M3 is markedly better than all of the others (as expected)

• The test case prioritization techniques appear to offer some improvement, but more statistical analysis was needed to confirm this

• Upon completion of that analysis, more results were revealed:

• Branch-based coverage did as well as or better than statement coverage

• In all but one case, total branch coverage did as well as or better than additional branch coverage

• In all cases, total statement coverage did as well as or better than additional statement coverage

• In 5 of 7 programs, even randomly prioritized test suites did better than untreated test suites

Example Groupings

Siemens programs with APFD measured relative to mutants – Study Format

• Same format as the first study, 9000 test suites used, 1000 for each prioritization technique

• But rather than run those test cases on the small subset of known errors, they were applied to mutated programs that were created to form a larger bed of programs to test against

Results

• Additional and Total FEP prioritization outperformed all others (except optimal)

• Branch almost always outperformed statement

• Total statement outperformed additional

• But additional branch coverage outperformed total branch coverage

• However, in this study random did not outperform the control

Space with APFD measured relative to Actual Faults

• M2 – M9 were applied to each of the 50 test suites, resulting in 400 test suites, plus the original 50, resulting in 450 total test suites

• Additional FEP outperformed all others, but there was no significant difference among the rest

• Also, random is no better than the control

Study 3 Groupings

Space with APFD measured relative to mutants

• Same technique as the other space study, but using 132,163 mutant versions of the software

• Additional FEP outperformed all others

• Branch and statement are indistinguishable

• But additional coverage always outperforms its total counterpart

Study 4 Groupings

Threats to Validity

• Construct Validity – You are measuring what you say you are measuring (and not something else)

• Internal Validity – Ability to say that the causal relationship is true

• External Validity – Ability to generalize results across the field

Construct Validity

• APFD is highly accurate, but it is not the only method of measuring fault detection; one could also measure the percentage of the test suite that must be run before all errors are found

• No value is given to later tests that detect the same error

• FEP-based calculations – Other estimates may more accurately capture the probability of a test case finding a fault

• Effectiveness is measured without cost

Internal Validity

• Instrumentation bias can bias results, especially in the APFD and prioritization measurement tools

• Performed code revision

• Also limited problems by running the prioritization algorithm on each test suite and each subject program

External Validity

• The Siemens programs are non-trivial but not representative of real-world programs. The space program is, but it is only one program

• Faults in the Siemens programs were seeded (not like those in the real world)

• Faults in space were found during development, but these may differ from those found later in the development process. Plus, they are only one set of faults found by one set of programmers

• Single-fault versions of programs are also not representative of the real world

• The test suites were created with only a single method; other real-world methods exist

• These threats can only be answered by more studies with different test suites, programs, and errors

Additional Discussion and Practical Implications

• Test case prioritization can substantially improve the rate of fault detection of test suites.

• Additional FEP prioritization techniques do not always justify the additional expense incurred. This is seen in cases where specific coverage-based techniques outperformed them, and in cases where additional FEP did perform best but the total gain in APFD was not large enough.

• Branch-coverage-based techniques almost always performed as well as, if not better than, statement-coverage-based techniques. Thus, if the two techniques incur similar costs, branch-coverage-based techniques are advocated.

• Total statement and branch coverage techniques perform almost on par with the additional statement and branch coverage techniques, which justifies using the total techniques given their lower complexity. However, this does not apply to the space program (Study 4), where the additional statement and branch coverage techniques outperformed the total statement and branch coverage techniques by a huge margin.

• Randomly prioritized test suites typically outperform untreated test suites.

Conclusion

• Any one of the prioritization techniques offers some amount of improved fault detection capability.

• These studies are of interest only to research groups, due to the high expense that they incur. However, code-coverage-based techniques have immediate practical implications.

Future Work

• Additional studies to be performed using a wider range of programs, faults, and test suites.

• The gap between optimal prioritization and FEP prioritization techniques is yet to be bridged.

• Determining which prioritization technique is warranted by particular types of programs and test suites.

• Other prioritization objectives have to be investigated.

• Version-specific techniques

• Techniques may be applied not only to regression testing but also during the initial testing of the software.