Post on 18-Aug-2020
On the Evaluation of Defect Prediction Models
Thilo Mende
Software Engineering Group, University of Bremen, Germany (Alumni)
15th COW, 24.10.2011
1 / 15
Many different experimental setups are used in the literature
A review of 107 papers shows:
- A large range of data set sizes
- More than 11 different evaluation measures
- 7 different resampling schemes
- Only few comparisons against simple baseline models
Question: What influence does this have on the stability of results and the practical predictive benefits?
2 / 15
Experimental Setup
Data Sets
- NASA MDP
- AR (Tosun et al., 2009)
- Eclipse (Zimmermann et al., 2007)
# of instances: 36–17000
Algorithms
- RPart
- RandomForest
- GLM/Logistic Regression
- Naive Bayes
3 / 15
One important aspect is conclusion stability
- Stability: Consistent results for repeated executions
- Ensure reproducibility
- Protect against cherry picking
- Randomization (due to resampling or the learning algorithm) may lead to unstable results
In the following, we use 200 runs for each algorithm on each data set
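A minimal sketch of how such a stability check could look: the same evaluation is repeated with a different seed per run, and the spread of the results is summarized. The function `evaluate_once` is a hypothetical stand-in that only simulates a performance value; the slides do not show the actual evaluation code.

```python
import random
import statistics


def evaluate_once(seed):
    """Hypothetical stand-in for one evaluation run (resampling + learning).

    A real study would train and evaluate a model here and use `seed`
    for every source of randomness. This version only simulates a score.
    """
    rng = random.Random(seed)
    return 0.75 + rng.gauss(0.0, 0.02)


def repeated_runs(n_runs=200, base_seed=0):
    """Run the evaluation n_runs times, each run with its own seed."""
    return [evaluate_once(base_seed + i) for i in range(n_runs)]


def summarize(scores):
    """Summarize the spread of the results over all runs."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


scores = repeated_runs()
summary = summarize(scores)
print(summary)
```

Fixing the seeds makes the whole series repeatable, and reporting the spread instead of a single run leaves less room for cherry picking.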
4 / 15
Three resampling schemes are often used
- 10-fold Cross Validation (10-CV)
- 50-times repeated random split (50-RSS)
- 10-times 10-fold Cross Validation (10×10-CV)
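The three schemes can be sketched as index generators. This is an illustration only: the function names are made up, no stratification is applied, and the test fraction of 1/3 for the random split is an assumption, since the slides do not state the split ratio.

```python
import random


def k_fold_splits(n, k, rng):
    """One k-fold cross validation: k (train, test) index pairs."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def repeated_k_fold_splits(n, k, repeats, rng):
    """Repeat k-fold cross validation with a fresh shuffle each time."""
    splits = []
    for _ in range(repeats):
        splits.extend(k_fold_splits(n, k, rng))
    return splits


def repeated_random_splits(n, repeats, test_fraction, rng):
    """Repeated random split; test_fraction is an assumed parameter."""
    n_test = max(1, round(n * test_fraction))
    splits = []
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits


rng = random.Random(42)
n = 100  # number of instances in a made-up data set
cv10 = k_fold_splits(n, 10, rng)                    # 10-CV
rss50 = repeated_random_splits(n, 50, 1 / 3, rng)   # 50-RSS
cv10x10 = repeated_k_fold_splits(n, 10, 10, rng)    # 10x10-CV
print(len(cv10), len(rss50), len(cv10x10))  # 10 50 100
```

The schemes differ in how many models are trained and how the test sets overlap: 10-CV evaluates 10 models, 50-RSS evaluates 50, and 10×10-CV evaluates 100.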
5 / 15
Resampling schemes differ in terms of variance
[Figure: density plots of AULC for the data sets AR3, PC1 and v2.0, one curve per resampling scheme (10-CV, 50-RSS, 10×10-CV)]
6 / 15
Resampling schemes differ in terms of variance (contd.)
[Figure: dot plot with one row per data set (PC5, JM1, v3.0, MC1, v2.1, v2.0, PC2, KC1, PC3, PC4, PC1, CM1, KC3, MW1, MC2, KC4, AR1, AR4, AR6, AR3, AR5), x-axis from 1.0 to 3.0, panels "balance" and "AULC", one point each for 10-CV, 50-RSS and 10×10-CV]
Ranking according to variance
7 / 15
Does higher variance matter?
[Figure: dot plot of consistency (0.6 to 1.0) with one row per data set (PC5, JM1, v3.0, MC1, v2.1, v2.0, PC2, KC1, PC3, PC4, PC1, CM1, KC3, MW1, MC2, KC4, AR1, AR4, AR6, AR3, AR5), panels "same" and "shuffled", one point each for 10-CV, 50-RSS and 10×10-CV]
Consistency when comparing Logistic Regression and LDA
8 / 15
The variance has an influence, e.g. when Demsar’s test is used
[Figure: number of runs with significant differences (0 to 200) for the comparisons RF vs. LDA, LDA vs. Rpart and RF vs. Log., shown for 10×10-CV, 50-RSS and 10-CV]
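A rough sketch of such a rank-based comparison, assuming that "Demsar's test" refers to the procedure commonly attributed to Demšar: a Friedman test on the average ranks of the models, followed by a Nemenyi post-hoc comparison with a critical difference. The scores are made up, and the value `q_alpha = 2.343` (three models, alpha = 0.05) is an assumption that should be checked against a published table.

```python
import math


def ranks_with_ties(values):
    """Rank values so that the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def average_ranks(scores):
    """scores[d][m] is the performance of model m on data set d."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        for m, r in enumerate(ranks_with_ties(row)):
            totals[m] += r
    return [t / n for t in totals]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic for k models on n data sets."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(k, n, q_alpha):
    """Two models differ significantly if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Made-up scores: 4 data sets (rows), 3 models (columns).
scores = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.80, 0.65],
    [0.85, 0.70, 0.72],
]
avg = average_ranks(scores)
chi2 = friedman_statistic(avg, n=len(scores))
cd = nemenyi_critical_difference(k=3, n=len(scores), q_alpha=2.343)
print(avg, chi2, cd)
```

Because the ranks are computed from the per-data-set results, a resampling scheme with high variance can change the ranks, and with them the test outcome, from one run to the next.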
9 / 15
There are more sources of variance
- Evaluation measures
- Class Imbalance
- ...
... so one has to be careful to get reproducible results
10 / 15
Traditional Performance Measures are often optimistic
[Figure: percentage axis (0 to 100) with the categories Files, Defects and LoC, added one per build step]
11 / 15
The largest 20% of the files contain most of the defects
[Figure: dot plot of ddr (0.2 to 0.8) with one row per data set: AR6, KC4, CM1, MC2, PC3, v2.1, JM1, KC1, AR4, PC1, PC4, v3.0, v2.0, MW1, KC3, MC1, PC2, PC5]
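A small sketch of this size-based baseline: order the files by lines of code, take the largest 20%, and measure which share of all defects and of all code they contain. The file data below is made up for illustration and the function names are hypothetical.

```python
def largest_files(files, fraction=0.2):
    """Return the largest `fraction` of files, ordered by lines of code.

    files is a list of (loc, defects) pairs.
    """
    ordered = sorted(files, key=lambda f: f[0], reverse=True)
    n_selected = max(1, round(len(ordered) * fraction))
    return ordered[:n_selected]


def defect_share(files, fraction=0.2):
    """Share of all defects contained in the largest files."""
    total = sum(defects for _, defects in files)
    if total == 0:
        raise ValueError("data set contains no defects")
    return sum(defects for _, defects in largest_files(files, fraction)) / total


def loc_share(files, fraction=0.2):
    """Share of all lines of code contained in the largest files."""
    total = sum(loc for loc, _ in files)
    return sum(loc for loc, _ in largest_files(files, fraction)) / total


# Made-up data set: (lines of code, number of defects) per file.
files = [
    (2000, 6), (1500, 3), (300, 1), (250, 0), (200, 1),
    (150, 0), (120, 0), (100, 1), (80, 0), (50, 0),
]
print(defect_share(files))  # 0.75
print(loc_share(files))
```

In this toy example the two largest files hold 75% of the defects, but also roughly 74% of the code, so finding those defects still means reviewing most of the code base.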
12 / 15
... only RandomForest performs significantly better
[Figure: panels AUC and AULC, axis from 2 to 7, labels: AR, rpart, LoC-MoM, NB, Log., LDA, RF]
General pattern is consistent across data sets (Mende et al., 2009; Mende, 2010; Mende et al., 2011)
13 / 15
Random features can perform well for regression models
[Figure: adjusted R² (0.0 to 0.6) against the number of features (0 to 50), panels "Random" and "Lines of Code", data sets: Eclipse, Equinox, Lucene, Mylyn, PDE ¹]
¹ These results are based on data sets provided by D’Ambros et al. (2010)
14 / 15
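For reference, a sketch of the adjusted R² under its standard textbook definition, which corrects R² for the number of features in the model. The slides do not state which exact variant was used, so this is an assumption, and the numbers in the example are made up.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Same raw R^2 of 0.5 on 101 samples, with 10 and with 50 features.
few = adjusted_r2(0.5, n_samples=101, n_features=10)
many = adjusted_r2(0.5, n_samples=101, n_features=50)
print(few, many)
```

The correction only accounts for the number of features, not for whether they carry any meaning, so a benchmark against random features remains a useful sanity check.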
Recommendations
In general
- Use 10×10-CV (with stratification)
- Use simple models as benchmarks
- Consider the treatment effort
For SBSE
- Use simple models as benchmarks
- Consider the variance → to avoid cherry picking
Opportunity for SBSE: Identified defects vs. treatment effort?
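The first recommendation can be sketched as follows: a stratified 10×10-CV deals the instances of each class evenly across the folds, so every fold keeps roughly the class distribution of the whole data set. The function names and the label vector are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, rng):
    """Split instance indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of this class round-robin over the folds,
        # continuing where the previous class stopped.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds


def repeated_stratified_k_fold(labels, k=10, repeats=10, seed=0):
    """10x10-CV by default: `repeats` independent stratified k-fold splits."""
    rng = random.Random(seed)
    return [stratified_k_fold(labels, k, rng) for _ in range(repeats)]


# Made-up data set: 100 files, 20 of them defective (label 1).
labels = [1] * 20 + [0] * 80
repetitions = repeated_stratified_k_fold(labels)
first_fold = repetitions[0][0]
print(len(first_fold), sum(labels[i] for i in first_fold))  # 10 2
```

Without stratification, a fold of a small, imbalanced data set can end up with no defective instances at all, which makes some evaluation measures undefined for that fold.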
15 / 15
On the Evaluation of Defect Prediction Models
Thilo Mende
Software Engineering Group, University of Bremen, Germany (Alumni)
tmende@informatik.uni-bremen.de
Poll: How do you calculate F1?
... when there are invalid partitions
1. Average over all partitions, ignoring invalid ones
2. Average over all partitions, using 0 for invalid ones
3. Calculate TP/FP/FN per partition, and calculate F1 across all partitions? (Forman and Scholz, 2010)
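The three options can be compared directly in a short sketch. It assumes that a partition counts as invalid when precision or recall is undefined, i.e. when it has no predicted or no actual positives; the slides do not define this precisely. The counts are made up.

```python
def f1_per_partition(tp, fp, fn):
    """F1 for one partition, or None if precision or recall is undefined."""
    if tp + fp == 0 or tp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def f1_ignore_invalid(partitions):
    """Option 1: average over the valid partitions only."""
    values = [f1_per_partition(*p) for p in partitions]
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


def f1_zero_for_invalid(partitions):
    """Option 2: average over all partitions, counting invalid ones as 0."""
    values = [f1_per_partition(*p) for p in partitions]
    return sum(v if v is not None else 0.0 for v in values) / len(values)


def f1_pooled(partitions):
    """Option 3: sum TP/FP/FN over all partitions, then compute F1 once."""
    tp = sum(p[0] for p in partitions)
    fp = sum(p[1] for p in partitions)
    fn = sum(p[2] for p in partitions)
    return f1_per_partition(tp, fp, fn)


# Made-up (TP, FP, FN) counts for four test partitions; two are invalid.
partitions = [(2, 1, 1), (0, 0, 2), (1, 1, 0), (0, 0, 1)]
print(f1_ignore_invalid(partitions))    # about 0.667
print(f1_zero_for_invalid(partitions))  # about 0.333
print(f1_pooled(partitions))            # 0.5
```

The same predictions yield three different F1 values, so a study has to state which option it uses before its results can be compared with others.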
Ooops...
[Figure: scatter plots comparing F1 computed with option 3 and with option 1, both axes from 0.2 to 0.8, panels AR3, v2.0, MC1 and PC2, one point type each for rpart, Log., NB, LDA and RF]
D’Ambros, M., M. Lanza, and R. Robbes (2010). An extensive comparison of bug prediction approaches. In MSR. IEEE Computer Society.
Forman, G. and M. Scholz (2010, November). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter 12, 49–57.
Mende, T. (2010). Replication of defect prediction studies: Problems, pitfalls and recommendations. New York, NY, USA, pp. 1–10.
Mende, T., R. Koschke, and M. Leszak (2009, March). Evaluating defect prediction models for a large, evolving software system. pp. 247–250.
Mende, T., R. Koschke, and J. Peleska (2011). On the utility of a defect prediction model during HW/SW integration testing: A retrospective case study. pp. 259–268.
Tosun, A., B. Turhan, and A. Bener (2009). Validation of network measures as indicators of defective modules in software systems. New York, NY, USA, pp. 1–9. ACM.
Zimmermann, T., R. Premraj, and A. Zeller (2007). Predicting defects for Eclipse. IEEE Computer Society.