Transcript of: On the Evaluation of Defect Prediction Models
Source: crest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf
On the Evaluation of Defect Prediction Models
Thilo Mende
Software Engineering Group, University of Bremen, Germany (Alumni)
15th COW, 24.10.2011
Many different experimental setups are used in the literature

A review of 107 papers shows:
- A large range of data set sizes
- More than 11 different evaluation measures
- 7 different resampling schemes
- Only few comparisons against simple baseline models

Question: What influence does this have on the stability of results and the practical predictive benefits?
Experimental Setup

Data Sets
- NASA MDP
- AR (Tosun et al., 2009)
- Eclipse (Zimmermann et al., 2007)

Number of instances: 36–17,000

Algorithms
- RPart
- RandomForest
- GLM/Logistic Regression
- Naive Bayes
One important aspect is conclusion stability

- Stability: consistent results for repeated executions
  - Ensure reproducibility
  - Protect against cherry picking
- Randomization (due to resampling or the learning algorithm) may lead to unstable results

In the following, we use 200 runs for each algorithm on each data set
Three resampling schemes are often used

- 10-fold Cross Validation (10-CV)
- 50-times repeated random split (50-RSS)
- 10-times 10-fold Cross Validation (10×10-CV)
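The 10×10-CV scheme can be sketched in a few lines. This is a minimal, stdlib-only illustration of repeated stratified k-fold splitting (in practice one would use a library implementation, e.g. scikit-learn's `RepeatedStratifiedKFold` or R's caret); the toy labels below are invented for illustration.

```python
import random
from collections import Counter

def repeated_stratified_kfold(labels, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeats x k-fold CV,
    keeping the defective/non-defective ratio similar in each fold."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        # Distribute each class round-robin after shuffling, so every
        # fold gets roughly the same class proportions (stratification).
        for cls in set(labels):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for j, i in enumerate(idx):
                folds[j % k].append(i)
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

# Toy data: 80 clean modules, 20 defective
labels = [0] * 80 + [1] * 20
splits = list(repeated_stratified_kfold(labels, k=10, repeats=10))
assert len(splits) == 100                         # 10 repeats x 10 folds
train, test = splits[0]
assert Counter(labels[i] for i in test)[1] == 2   # ~20% defective per fold
```

Each of the 100 test folds holds 10 instances with the original 20% defect ratio, which is the stratification the recommendations later call for.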
Resampling schemes differ in terms of variance

[Figure: density of the AULC performance measure under 10-CV, 50-RSS, and 10×10-CV for the AR3, PC1, and v2.0 data sets]
Resampling schemes differ in terms of variance (contd.)

[Figure: data sets (AR5, AR3, AR6, AR4, AR1, KC4, MC2, MW1, KC3, CM1, PC1, PC4, PC3, KC1, PC2, v2.0, v2.1, MC1, v3.0, JM1, PC5) ranked according to the variance of the balance and AULC measures, shown for 10-CV, 50-RSS, and 10×10-CV]

Ranking according to variance
Does higher variance matter?

[Figure: per-data-set consistency (0.6–1.0) for the "same" and "shuffled" settings under 10-CV, 50-RSS, and 10×10-CV]

Consistency when comparing Logistic Regression and LDA
The variance has an influence, e.g. when Demšar's test is used

[Figure: number of runs (out of 200) with significant differences for RF vs. LDA, LDA vs. Rpart, and RF vs. Log., shown for 10-CV, 50-RSS, and 10×10-CV]
There are more sources of variance

- Evaluation measures
- Class imbalance
- ...

...so one has to be careful to get reproducible results
Traditional Performance Measures are often optimistic

[Figure: bar chart of percentages (0–100%) for Files, Defects, and LoC]
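Measures that count every file equally ignore that inspecting a 50-line file costs far less than a 5,000-line one. An effort-aware view plots the fraction of defects found against the fraction of LoC inspected when files are reviewed in decreasing order of predicted risk; a lift-style curve in this spirit underlies effort-aware measures such as AULC. A sketch with hypothetical modules (the risk scores, sizes, and defect counts below are invented):

```python
def defects_vs_loc_curve(modules):
    """modules: list of (predicted_risk, loc, defects).
    Returns points (fraction of LoC inspected, fraction of defects found)
    when inspecting modules in decreasing order of predicted risk."""
    ranked = sorted(modules, key=lambda m: m[0], reverse=True)
    total_loc = sum(m[1] for m in ranked)
    total_def = sum(m[2] for m in ranked)
    pts, loc_seen, def_seen = [(0.0, 0.0)], 0, 0
    for _, loc, defects in ranked:
        loc_seen += loc
        def_seen += defects
        pts.append((loc_seen / total_loc, def_seen / total_def))
    return pts

# Hypothetical modules: (risk score, LoC, known defects)
mods = [(0.9, 100, 3), (0.8, 900, 1), (0.3, 50, 2), (0.1, 950, 0)]
curve = defects_vs_loc_curve(mods)
# The first (small) module already yields half of all defects for
# only 5% of the inspection effort.
```

A classification measure would credit the second module (0.8 risk, 1 defect) as much as the first, even though it costs nine times the inspection effort for a third of the defects; the curve makes that difference visible.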
The largest 20% of the files contain most of the defects

[Figure: ddr per data set (PC5, PC2, MC1, KC3, MW1, v2.0, v3.0, PC4, PC1, AR4, KC1, JM1, v2.1, PC3, MC2, CM1, KC4, AR6), x-axis 0.2–0.8]
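The statistic behind this slide can be computed directly: sort files by size and ask what share of all defects falls into the largest 20%. A sketch on hypothetical data (the file sizes and defect counts below are invented):

```python
def defects_in_largest(files, fraction=0.2):
    """files: list of (loc, defects).
    Returns the share of all defects contained in the `fraction`
    largest files by LoC."""
    ranked = sorted(files, key=lambda f: f[0], reverse=True)
    n = max(1, round(len(ranked) * fraction))
    total = sum(d for _, d in ranked)
    return sum(d for _, d in ranked[:n]) / total

# Hypothetical system: a few large files hold most of the defects
files = [(1000, 6), (800, 3), (120, 1), (100, 0), (90, 0),
         (80, 1), (60, 0), (50, 0), (40, 1), (30, 0)]
share = defects_in_largest(files)   # largest 2 of 10 files: 9 of 12 defects
```

If this share is high, a trivial "predict the biggest files" model is already hard to beat, which is why the next slide compares learners against such size-based baselines.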
...only RandomForest performs significantly better

[Figure: AUC and AULC ranks (2–7) for AR, rpart, LoC-MoM, NB, Log., LDA, and RF]

General pattern is consistent across data sets (Mende et al., 2009; Mende, 2010; Mende et al., 2011)
Random features can perform well for regression models

[Figure: adjusted R² (0.0–0.6) vs. number of features (0–50), for a random feature and for Lines of Code, on Eclipse, Equinox, Lucene, Mylyn, and PDE]

These results are based on data sets provided by D'Ambros et al. (2010)
Recommendations

In general
- Use 10×10-CV (with stratification)
- Use simple models as benchmarks
- Consider the treatment effort

For SBSE
- Use simple models as benchmarks
- Consider the variance → to avoid cherry picking

Opportunity for SBSE: identified defects vs. treatment effort?
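"Use simple models as benchmarks" can be as literal as scoring every module by its size. A hypothetical precision-at-k check of a LoC-only baseline that any proposed learner should beat (module names, sizes, and defect labels below are invented):

```python
def precision_at_k(scores, defective, k):
    """Fraction of the k highest-scored modules that are defective."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for m in top if m in defective) / k

# The trivial baseline: predicted risk == lines of code, nothing else.
loc = {"a.c": 1200, "b.c": 900, "c.c": 400, "d.c": 300, "e.c": 50}
defective = {"a.c", "c.c"}

baseline = precision_at_k(loc, defective, k=2)
```

Reporting a learner's score without this number hides whether the model adds anything beyond "big files are buggy", which is the comparison the review of 107 papers found to be rare.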
On the Evaluation of Defect Prediction Models
Thilo Mende
Software Engineering Group, University of Bremen, Germany (Alumni)
[email protected]
Poll: How do you calculate F1?

...when there are invalid partitions

1. Average over all partitions, ignoring invalid ones
2. Average over all partitions, using 0 for invalid ones
3. Calculate TP/FP/FN per partition, and calculate F1 across all partitions (Forman and Scholz, 2010)
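The three poll options differ in code as well as in outcome. A minimal sketch (the per-fold counts are invented, and as a simplification a fold counts as invalid only when TP = FP = FN = 0):

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); None marks an invalid fold
    (simplified: invalid only when TP = FP = FN = 0)."""
    denom = 2 * tp + fp + fn
    return None if denom == 0 else 2 * tp / denom

def f1_per_fold(folds, invalid_as_zero=False):
    """Options 1 and 2: average per-fold F1, either skipping
    invalid folds or counting them as 0."""
    scores = [f1(*f) for f in folds]
    if invalid_as_zero:
        scores = [0.0 if s is None else s for s in scores]
    else:
        scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)

def f1_pooled(folds):
    """Option 3 (Forman and Scholz, 2010): sum TP/FP/FN over all
    folds first, then compute a single F1."""
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    return f1(tp, fp, fn)

# (tp, fp, fn) per fold; the last fold is invalid.
folds = [(8, 2, 2), (2, 8, 0), (0, 0, 0)]
```

On this toy data the three options give roughly 0.57, 0.38, and 0.63: the same classifier on the same folds, with materially different reported F1, which is exactly the "Ooops" the next slide illustrates.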
Ooops...

[Figure: F1 variant 3 vs. F1 variant 1 (both 0.2–0.8) for AR3, v2.0, MC1, and PC2, with classifiers rpart, Log., NB, LDA, and RF]
References

D'Ambros, M., M. Lanza, and R. Robbes (2010). An extensive comparison of bug prediction approaches. In MSR. IEEE Computer Society.

Forman, G. and M. Scholz (2010, November). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter 12, 49–57.

Mende, T. (2010). Replication of defect prediction studies: Problems, pitfalls and recommendations. New York, NY, USA, pp. 1–10.

Mende, T., R. Koschke, and M. Leszak (2009, March). Evaluating defect prediction models for a large, evolving software system. pp. 247–250.

Mende, T., R. Koschke, and J. Peleska (2011). On the utility of a defect prediction model during HW/SW integration testing: A retrospective case study. pp. 259–268.
Tosun, A., B. Turhan, and A. Bener (2009). Validation of network measures as indicators of defective modules in software systems. New York, NY, USA, pp. 1–9. ACM.

Zimmermann, T., R. Premraj, and A. Zeller (2007). Predicting defects for Eclipse. IEEE Computer Society.