On the Evaluation of Defect Prediction Modelscrest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf ·...

On the Evaluation of Defect PredictionModels

Thilo Mende

Software Engineering Group, University of Bremen, Germany (Alumni)

15th COW24.10.2011

1 / 15

Many different eperimental setupsare used in the literature

A review of 107 papers shows:I A large range of data set sizesI More than 11 different evaluation measuresI 7 different resampling schemesI Only few comparisons against simple baseline models

Question: What influence does this have on the stability ofresults and the practical predictive benefits?

2 / 15

Many different eperimental setupsare used in the literature

A review of 107 papers shows:I A large range of data set sizesI More than 11 different evaluation measuresI 7 different resampling schemesI Only few comparisons against simple baseline models

Question: What influence does this have on the stability ofresults and the practical predictive benefits?

2 / 15

Experimental Setup

Data SetsI NASA MDPI AR (Tosun et al., 2009)I Eclipse (Zimmermann

et al., 2007)# of instances 36–17000

AlgorithmsI RPartI RandomForestI GLM/Logistic

RegressionI Naive Bayes

3 / 15

One important aspect is conclusionstability

I Stability: Consistent results for repeated executionsI Ensure reproducabilityI Protect against cherry picking

I Randomization (due to resampling or the learningalgorithm) may lead to unstable results

In the following, we use 200 runs for each algorithm on eachdata set

4 / 15

Three resampling schemes are oftenused

I 10-fold Cross Validation (10-CV)I 50-times repeated random split (50-RSS)I 10-times 10-fold Cross Validation (10×10-CV)

5 / 15

Resampling schemes differ in termsof variance

Density

0.5 0.6 0.7 0.8

0.82 0.84 0.86 0.88 0.90

0.840 0.845 0.850 0.855

10−CV 50−RSS 10x10−CV

6 / 15

Resampling schemes differ in termsof variance (contd.)

PC5JM1v3.0MC1v2.1v2.0PC2KC1PC3PC4PC1CM1KC3

MW1MC2KC4AR1AR4AR6AR3AR5

1.0 1.5 2.0 2.5 3.0

balance

1.0 1.5 2.0 2.5 3.0

10−CV 50−RSS 10x10−CV● ● ●

Ranking according to variance

7 / 15

Does higher variance matter?

consistency

PC5JM1v3.0MC1v2.1v2.0PC2KC1PC3PC4PC1CM1KC3

MW1MC2KC4AR1AR4AR6AR3AR5

0.6 0.7 0.8 0.9 1.0

shuffled

10−CV 50−RSS 10x10−CV● ● ●

Consistency when comparing Logistic Regression and LDA

8 / 15

The variance has an influence, e.g.when Demsar’s test is used

Nr. of runswith significant differences

RF vs. LDALDA vs. Rpart RF vs. Log.

10-10-CV

50-RSS

9 / 15

There are more sources of variance

I Evaluation measuresI Class ImbalanceI . . .

. . . so one has to be careful to get reproducable results

10 / 15

There are more sources of variance

I Evaluation measuresI Class ImbalanceI . . .

. . . so one has to be careful to get reproducable results

10 / 15

Traditional Performance Measuresare often optimistic

11 / 15

Files Defects

11 / 15

Files Defects LoC

11 / 15

The largest 20% of the files containmost of the defects

0.2 0.4 0.6 0.8

12 / 15

. . . only RandomForst performssignificantly better

LoC−MoM

2 3 4 5 6 7

General pattern is consistent across data sets (Mende et al.,2009; Mende, 2010; Mende et al., 2011)

13 / 15

Random features can perform wellfor regression models

# of Features

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

1These results are based on data sets provided by D’Ambros et al. (2010)14 / 15

# of Features

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

# of Features

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

Recommendations

In generalI Use 10×10-CV (with stratification)I Use simple models as benchmarksI Consider the treatment effort

For SBSEI Use simple models as benchmarksI Consider the variance→ to avoid cherry picking

Opportunity for SBSE: Identified defects vs. treatment effort?

15 / 15

Recommendations

15 / 15

Recommendations

15 / 15

On the Evaluation of Defect Prediction Models

Thilo MendeSoftware Engineering Group, University of Bremen, Germany (Alumni)tmende@informatik.uni-bremen.de

Poll: How do you calculate F1?

. . . when there are invalid partitions

1 Average over all partitions, ignoring invalid ones2 Average over all partitions, using 0 for invalid ones

3 Calculate TP/FP/FN per partition, and calculate F1 acrossall partitions? (Forman and Scholz, 2010)

Poll: How do you calculate F1?

. . . when there are invalid partitions

1 Average over all partitions, ignoring invalid ones2 Average over all partitions, using 0 for invalid ones3 Calculate TP/FP/FN per partition, and calculate F1 across

all partitions? (Forman and Scholz, 2010)

Ooops. . .

0.2 0.4 0.6 0.8

rpart Log. NB LDA RF● ● ● ● ●

D’Ambros, M., M. Lanza, and R. Robbes (2010). An extensivecomparison of bug prediction approaches. In MSR. IEEEComputer Society.

Forman, G. and M. Scholz (2010, November). Apples-to-applesin cross-validation studies: pitfalls in classifier performancemeasurement. ACM SIGKDD Explorations Newsletter 12,49–57.

Mende, T. (2010). Replication of defect prediction studies:Problems, pitfalls and recommendations. New York, NY,USA, pp. 1–10.

Mende, T., R. Koschke, and M. Leszak (2009, March).Evaluating defect prediction models for a large, evolvingsoftware system. pp. 247–250.

Mende, T., R. Koschke, and J. Peleska (2011). On the utility ofa defect prediction model during HW/SW integration testing:A retrospective case study. pp. 259–268.

Tosun, A., B. Turhan, and A. Bener (2009). Validation ofnetwork measures as indicators of defective modules insoftware systems. New York, NY, USA, pp. 1–9. ACM.

Zimmermann, T., R. Premraj, and A. Zeller (2007). Predictingdefects for Eclipse. IEEE Computer Society.

On the Evaluation of Defect Prediction Modelscrest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf ·...

Documents

Transcript of On the Evaluation of Defect Prediction Modelscrest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf ·...

Software Defect Prediction Models for Quality …ijcsi.org/papers/IJCSI-9-5-2-288-296.pdf · Software Defect Prediction Models for ... model of defect prediction, reliability based

A Framework for Software Defect Prediction Using Neural ... · Software Defect, Software Defect Prediction Model, Neural Network, Quality Management 1. Introduction Software quality

A Critique of Software Defect Prediction Models

Which process metrics can significantly improve defect ...Keywords Software metrics Product metrics Process metrics Defect prediction models Software defect prediction 1 Introduction

Multi-Objective Cross-Project Defect Prediction

Defect prediction from static code features: current ...

Towards Building a Universal Defect Prediction Model

Data collection for software defect prediction

Software Defect Prediction & Release Readiness Assessment

Subgroup Discovery in Defect Prediction

Software Defect Prediction Models for Quality Improvement ... · Software Defect Prediction Models for Quality Improvement: A Literature Study . ... Software metrics has been ...

Survey on Software Defect Prediction Using Machine ... · Survey on Software Defect Prediction Using ... Software defect prediction is seen as a highly important ability when planning

An Empirical Study on Software Defect Prediction with a ... · An Empirical Study on Software Defect Prediction with a Simpliﬁed Metric Set ... defect prediction, software metrics,

Evidence-based defect assessment and prediction for ...

Introduction to Defect Prediction

Nearest Neighbor Sampling for Better Defect Prediction

Personalized Defect Prediction

Manifested flatness defect prediction

Improving Testing with Defect Taxonomiescrest.cs.ucl.ac.uk/cow/25/slides/COW25_Felderer.pdf · • Process for system testing with defect taxonomies, called Defect Taxonomy‐Supported

Survey on Software Defect Prediction