Transcript of "On the Evaluation of Defect Prediction Models"
crest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf

Page 1:

On the Evaluation of Defect Prediction Models

Thilo Mende

Software Engineering Group, University of Bremen, Germany (Alumni)

15th COW, 24.10.2011


Page 2:

Many different experimental setups are used in the literature

A review of 107 papers shows:
- A large range of data set sizes
- More than 11 different evaluation measures
- 7 different resampling schemes
- Only a few comparisons against simple baseline models

Question: What influence does this have on the stability of results and the practical predictive benefits?


Page 4:

Experimental Setup

Data Sets
- NASA MDP
- AR (Tosun et al., 2009)
- Eclipse (Zimmermann et al., 2007)
# of instances: 36–17000

Algorithms
- RPart
- RandomForest
- GLM/Logistic Regression
- Naive Bayes


Page 5:

One important aspect is conclusion stability

- Stability: consistent results for repeated executions
- Ensure reproducibility
- Protect against cherry picking

- Randomization (due to resampling or the learning algorithm) may lead to unstable results

In the following, we use 200 runs for each algorithm on each data set
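
A minimal sketch of such a stability check, assuming a scikit-learn setup: the learner, the synthetic data set, and the reduced number of repetitions (20 instead of 200) are illustrative assumptions, not the talk's experiment.

```python
# Repeat the same cross-validated evaluation with different seeds and look at
# the run-to-run spread of the result. All choices here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.85], random_state=0)

aucs = []
for seed in range(20):                      # 200 runs in the talk
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    aucs.append(cross_val_score(clf, X, y, scoring="roc_auc", cv=cv).mean())

print(f"mean AUC = {np.mean(aucs):.3f}, run-to-run sd = {np.std(aucs):.4f}")
```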


Page 6:

Three resampling schemes are often used

- 10-fold cross-validation (10-CV)
- 50-times repeated random split (50-RSS)
- 10-times 10-fold cross-validation (10×10-CV)
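
The three schemes map directly onto scikit-learn splitters; the sketch below is only an illustration with an assumed synthetic data set, learner, and split size, not the talk's setup.

```python
# Compare the three resampling schemes on one learner and one data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, RepeatedStratifiedKFold,
                                     ShuffleSplit, cross_val_score)

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)
clf = LogisticRegression(max_iter=1000)

schemes = {
    "10-CV":    KFold(n_splits=10, shuffle=True, random_state=0),
    "50-RSS":   ShuffleSplit(n_splits=50, test_size=1/3, random_state=0),
    "10x10-CV": RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0),
}

for name, cv in schemes.items():
    scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
    print(f"{name:9s} mean AUC = {scores.mean():.3f}, sd = {scores.std():.3f}")
```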


Page 7:

Resampling schemes differ in terms of variance

[Figure: density of AULC over repeated runs for the AR3, PC1, and v2.0 data sets, comparing 10-CV, 50-RSS, and 10×10-CV.]


Page 8:

Resampling schemes differ in terms of variance (contd.)

[Figure: data sets ranked by the variance of the balance and AULC measures under 10-CV, 50-RSS, and 10×10-CV.]

Ranking according to variance


Page 9:

Does higher variance matter?

[Figure: per-data-set consistency for the "same" and "shuffled" settings under 10-CV, 50-RSS, and 10×10-CV.]

Consistency when comparing Logistic Regression and LDA
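
Consistency is read here as the share of repeated runs in which the comparison of the two learners comes out the same way; that reading, and the toy setup below (20 runs instead of 200, synthetic data), are assumptions.

```python
# How often does the same model win when the whole comparison is repeated?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.85], random_state=1)

n_runs, wins_logreg = 20, 0
for seed in range(n_runs):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    auc_log = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                              scoring="roc_auc", cv=cv).mean()
    auc_lda = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                              scoring="roc_auc", cv=cv).mean()
    wins_logreg += auc_log > auc_lda

consistency = max(wins_logreg, n_runs - wins_logreg) / n_runs
print(f"Log. wins {wins_logreg}/{n_runs} runs; consistency = {consistency:.2f}")
```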


Page 10:

The variance has an influence, e.g. when Demsar’s test is used

[Figure: number of the 200 runs with significant differences for RF vs. LDA, LDA vs. Rpart, and RF vs. Log., under 10-CV, 50-RSS, and 10×10-CV.]
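
Demsar's procedure is commonly taken to be a Friedman test over per-data-set scores followed by a post-hoc comparison (e.g. Nemenyi). A sketch of the Friedman step on a made-up score matrix; the scores and the 0.05 threshold are illustrative assumptions, and in the talk each of the 200 runs produces one such matrix, which is why a high-variance resampling scheme can flip the outcome between runs.

```python
# Friedman test over per-data-set scores of several classifiers.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_datasets = 20
# Columns: RF, LDA, Rpart, Log. -- AUC-like scores per data set (made up).
scores = np.clip(rng.normal(loc=[0.78, 0.74, 0.72, 0.75], scale=0.05,
                            size=(n_datasets, 4)), 0.5, 1.0)

stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")
# If p < 0.05, a post-hoc test (e.g. Nemenyi) compares the classifiers pairwise.
```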


Page 11:

There are more sources of variance

- Evaluation measures
- Class imbalance
- ...

... so one has to be careful to get reproducible results

Page 15:

Traditional Performance Measures are often optimistic

[Figure: bar chart of percentages (0–100%) for Files, Defects, and LoC.]


Page 16:

The largest 20% of the files contain most of the defects

[Figure: ddr per data set (AR6, KC4, CM1, MC2, PC3, v2.1, JM1, KC1, AR4, PC1, PC4, v3.0, v2.0, MW1, KC3, MC1, PC2, PC5), with values between roughly 0.2 and 0.8.]
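
A sketch of this kind of check on synthetic data; reading "ddr" as the share of defects that sits in the largest 20% of files (measured by LoC) is an assumption, not necessarily the talk's exact definition.

```python
# What fraction of all defects is in the largest 20% of files?
import numpy as np

rng = np.random.default_rng(0)
loc = rng.lognormal(mean=5.0, sigma=1.0, size=1000).astype(int)   # file sizes (synthetic)
defects = rng.poisson(lam=loc / loc.mean() * 0.5)                 # defects roughly follow size

order = np.argsort(loc)[::-1]                 # largest files first
top20 = order[: len(loc) // 5]                # largest 20% of the files
share = defects[top20].sum() / defects.sum()
print(f"Largest 20% of files contain {share:.0%} of the defects")
```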


Page 17:

... only RandomForest performs significantly better

[Figure: ranks (2–7) of rpart, LoC-MoM, NB, Log., LDA, and RF under AUC and AULC (AR).]

The general pattern is consistent across data sets (Mende et al., 2009; Mende, 2010; Mende et al., 2011)
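
AULC is read here as the area under a lift chart over cumulative LoC, and LoC-MoM as a module-order model that simply ranks modules by size; both readings, and the synthetic data and scores below, are assumptions.

```python
# Effort-aware measure in the spirit of AULC: area under the curve of cumulative
# defects found vs. cumulative LoC inspected, for modules sorted by predicted risk.
import numpy as np

def aulc(scores, loc, defects):
    """Area under the lift chart: x = cumulative LoC fraction, y = cumulative defect fraction."""
    order = np.argsort(scores)[::-1]                              # most risky modules first
    x = np.concatenate([[0.0], np.cumsum(loc[order]) / loc.sum()])
    y = np.concatenate([[0.0], np.cumsum(defects[order]) / defects.sum()])
    return np.trapz(y, x)

rng = np.random.default_rng(0)
loc = rng.lognormal(5.0, 1.0, size=500)                           # module sizes (synthetic)
defects = rng.poisson(loc / loc.mean() * 0.4)

model_scores = defects + rng.normal(0, 0.5, size=500)             # deliberately informative scores
print("model AULC   :", round(aulc(model_scores, loc, defects), 3))
print("LoC baseline :", round(aulc(loc, loc, defects), 3))        # rank by size only (LoC-MoM-like)
```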


Page 18:

Random features can perform well for regression models

[Figure: adjusted R² vs. number of features (0–50), for random features and for Lines of Code, on the Eclipse, Equinox, Lucene, Mylyn, and PDE data sets.]

These results are based on data sets provided by D’Ambros et al. (2010)
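
A sketch of how such a check can be set up: adjusted R² for a regression on purely random features, compared against Lines of Code alone. The synthetic data below is an assumption and need not reproduce the curves on the slide; on one's own data the same function can be applied directly.

```python
# Adjusted R^2 of an OLS fit, for random features vs. a single LoC feature.
import numpy as np

def fit_r2(X, y):
    """OLS fit of y on X (with intercept); returns (R^2, adjusted R^2)."""
    n, k = X.shape
    Xb = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    ss_res = ((y - Xb @ beta) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(0)
n_modules = 200
loc = rng.lognormal(5.0, 1.0, size=n_modules)
defects = rng.poisson(loc / loc.mean() * 0.5).astype(float)

for k in (1, 10, 30, 50):
    X_rand = rng.normal(size=(n_modules, k))                  # purely random features
    r2, adj = fit_r2(X_rand, defects)
    print(f"{k:2d} random features: R^2 = {r2:.3f}, adj. R^2 = {adj:.3f}")
r2, adj = fit_r2(loc.reshape(-1, 1), defects)
print(f"LoC alone        : R^2 = {r2:.3f}, adj. R^2 = {adj:.3f}")
```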

Page 21:

Recommendations

In general
- Use 10×10-CV (with stratification)
- Use simple models as benchmarks
- Consider the treatment effort

For SBSE
- Use simple models as benchmarks
- Consider the variance → to avoid cherry picking

Opportunity for SBSE: Identified defects vs. treatment effort?


Page 24:

On the Evaluation of Defect Prediction Models

Thilo Mende
Software Engineering Group, University of Bremen, Germany (Alumni)
[email protected]

Page 25:

Poll: How do you calculate F1?

... when there are invalid partitions

1. Average over all partitions, ignoring invalid ones
2. Average over all partitions, using 0 for invalid ones
3. Calculate TP/FP/FN per partition, and calculate F1 across all partitions (Forman and Scholz, 2010)
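
The three options, spelled out on a toy set of per-partition counts; treating a partition as "invalid" when its F1 is undefined (no predicted or no actual positives) is an assumption about the poll's wording.

```python
# Three ways to aggregate F1 across cross-validation partitions.
import math

# Per-partition confusion counts (tp, fp, fn) -- illustrative, imbalanced data.
folds = [(3, 1, 2), (0, 0, 4), (2, 2, 1), (0, 3, 0), (4, 1, 1)]

def f1(tp, fp, fn):
    """F1 for one partition; NaN when precision or recall is undefined."""
    if tp + fp == 0 or tp + fn == 0:
        return math.nan                       # no predicted or no actual positives
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

per_fold = [f1(*f) for f in folds]
valid = [v for v in per_fold if not math.isnan(v)]

avg_ignore = sum(valid) / len(valid)                                            # option 1
avg_zero = sum(0.0 if math.isnan(v) else v for v in per_fold) / len(per_fold)   # option 2
tp, fp, fn = (sum(c) for c in zip(*folds))
pooled = f1(tp, fp, fn)                                 # option 3 (Forman and Scholz, 2010)

print(f"ignore invalid: {avg_ignore:.3f}  zero for invalid: {avg_zero:.3f}  pooled: {pooled:.3f}")
```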

Page 27:

Ooops...

[Figure: F1 computed with strategy 3 vs. F1 computed with strategy 1 for rpart, Log., NB, LDA, and RF on the AR3, v2.0, MC1, and PC2 data sets.]

Page 28:

D’Ambros, M., M. Lanza, and R. Robbes (2010). An extensive comparison of bug prediction approaches. In MSR. IEEE Computer Society.

Forman, G. and M. Scholz (2010, November). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter 12, 49–57.

Mende, T. (2010). Replication of defect prediction studies: Problems, pitfalls and recommendations. New York, NY, USA, pp. 1–10.

Mende, T., R. Koschke, and M. Leszak (2009, March). Evaluating defect prediction models for a large, evolving software system. pp. 247–250.

Mende, T., R. Koschke, and J. Peleska (2011). On the utility of a defect prediction model during HW/SW integration testing: A retrospective case study. pp. 259–268.

Page 29:

Tosun, A., B. Turhan, and A. Bener (2009). Validation of network measures as indicators of defective modules in software systems. New York, NY, USA, pp. 1–9. ACM.

Zimmermann, T., R. Premraj, and A. Zeller (2007). Predicting defects for Eclipse. IEEE Computer Society.