On the Evaluation of Defect Prediction Modelscrest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf ·...

Post on 18-Aug-2020

4 views 0 download

Transcript of On the Evaluation of Defect Prediction Modelscrest.cs.ucl.ac.uk/cow/15/slides/ThiloMende.pdf ·...

On the Evaluation of Defect PredictionModels

Thilo Mende

Software Engineering Group, University of Bremen, Germany (Alumni)

15th COW24.10.2011

1 / 15

Many different eperimental setupsare used in the literature

A review of 107 papers shows:I A large range of data set sizesI More than 11 different evaluation measuresI 7 different resampling schemesI Only few comparisons against simple baseline models

Question: What influence does this have on the stability ofresults and the practical predictive benefits?

2 / 15

Many different eperimental setupsare used in the literature

A review of 107 papers shows:I A large range of data set sizesI More than 11 different evaluation measuresI 7 different resampling schemesI Only few comparisons against simple baseline models

Question: What influence does this have on the stability ofresults and the practical predictive benefits?

2 / 15

Experimental Setup

Data SetsI NASA MDPI AR (Tosun et al., 2009)I Eclipse (Zimmermann

et al., 2007)# of instances 36–17000

AlgorithmsI RPartI RandomForestI GLM/Logistic

RegressionI Naive Bayes

3 / 15

One important aspect is conclusionstability

I Stability: Consistent results for repeated executionsI Ensure reproducabilityI Protect against cherry picking

I Randomization (due to resampling or the learningalgorithm) may lead to unstable results

In the following, we use 200 runs for each algorithm on eachdata set

4 / 15

Three resampling schemes are oftenused

I 10-fold Cross Validation (10-CV)I 50-times repeated random split (50-RSS)I 10-times 10-fold Cross Validation (10×10-CV)

5 / 15

Resampling schemes differ in termsof variance

AULC

Density

05

1015

2025

0.5 0.6 0.7 0.8

AR3

020

4060

80100

0.82 0.84 0.86 0.88 0.90

PC1

0200

400

600

800

0.840 0.845 0.850 0.855

v2.0

10−CV 50−RSS 10x10−CV

6 / 15

Resampling schemes differ in termsof variance (contd.)

PC5JM1v3.0MC1v2.1v2.0PC2KC1PC3PC4PC1CM1KC3

MW1MC2KC4AR1AR4AR6AR3AR5

1.0 1.5 2.0 2.5 3.0

balance

1.0 1.5 2.0 2.5 3.0

AULC

10−CV 50−RSS 10x10−CV● ● ●

Ranking according to variance

7 / 15

Does higher variance matter?

consistency

PC5JM1v3.0MC1v2.1v2.0PC2KC1PC3PC4PC1CM1KC3

MW1MC2KC4AR1AR4AR6AR3AR5

0.6 0.7 0.8 0.9 1.0

same

0.6 0.7 0.8 0.9 1.0

shuffled

10−CV 50−RSS 10x10−CV● ● ●

Consistency when comparing Logistic Regression and LDA

8 / 15

The variance has an influence, e.g.when Demsar’s test is used

Nr. of runswith significant differences

0

100

50

RF vs. LDALDA vs. Rpart RF vs. Log.

150

200

10-10-CV

50-RSS

10-CV

9 / 15

There are more sources of variance

I Evaluation measuresI Class ImbalanceI . . .

. . . so one has to be careful to get reproducable results

10 / 15

There are more sources of variance

I Evaluation measuresI Class ImbalanceI . . .

. . . so one has to be careful to get reproducable results

10 / 15

Traditional Performance Measuresare often optimistic

Files

%

020

4060

80100

11 / 15

Traditional Performance Measuresare often optimistic

Files Defects

%

020

4060

80100

11 / 15

Traditional Performance Measuresare often optimistic

Files Defects LoC

%

020

4060

80100

11 / 15

The largest 20% of the files containmost of the defects

ddr

AR6

KC4

CM1

MC2

PC3

v2.1

JM1

KC1

AR4

PC1

PC4

v3.0

v2.0

MW1

KC3

MC1

PC2

PC5

0.2 0.4 0.6 0.8

12 / 15

. . . only RandomForst performssignificantly better

AR

rpart

LoC−MoM

NB

Log.

LDA

RF

2 3 4 5 6 7

AUC

2 3 4 5 6 7

AULC

General pattern is consistent across data sets (Mende et al.,2009; Mende, 2010; Mende et al., 2011)

13 / 15

Random features can perform wellfor regression models

# of Features

Adj

uste

d R

2

0.0

0.2

0.4

0.6

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

PDE

1

1These results are based on data sets provided by D’Ambros et al. (2010)14 / 15

Random features can perform wellfor regression models

# of Features

Adj

uste

d R

2

0.0

0.2

0.4

0.6

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

PDE

1

1These results are based on data sets provided by D’Ambros et al. (2010)14 / 15

Random features can perform wellfor regression models

# of Features

Adj

uste

d R

2

0.0

0.2

0.4

0.6

0 10 20 30 40 50

Random

0 10 20 30 40 50

Lines of Code

EclipseEquinox

LuceneMylyn

PDE

1

1These results are based on data sets provided by D’Ambros et al. (2010)14 / 15

Recommendations

In generalI Use 10×10-CV (with stratification)I Use simple models as benchmarksI Consider the treatment effort

For SBSEI Use simple models as benchmarksI Consider the variance→ to avoid cherry picking

Opportunity for SBSE: Identified defects vs. treatment effort?

15 / 15

Recommendations

In generalI Use 10×10-CV (with stratification)I Use simple models as benchmarksI Consider the treatment effort

For SBSEI Use simple models as benchmarksI Consider the variance→ to avoid cherry picking

Opportunity for SBSE: Identified defects vs. treatment effort?

15 / 15

Recommendations

In generalI Use 10×10-CV (with stratification)I Use simple models as benchmarksI Consider the treatment effort

For SBSEI Use simple models as benchmarksI Consider the variance→ to avoid cherry picking

Opportunity for SBSE: Identified defects vs. treatment effort?

15 / 15

On the Evaluation of Defect Prediction Models

Thilo MendeSoftware Engineering Group, University of Bremen, Germany (Alumni)tmende@informatik.uni-bremen.de

Poll: How do you calculate F1?

. . . when there are invalid partitions

1 Average over all partitions, ignoring invalid ones2 Average over all partitions, using 0 for invalid ones

3 Calculate TP/FP/FN per partition, and calculate F1 acrossall partitions? (Forman and Scholz, 2010)

Poll: How do you calculate F1?

. . . when there are invalid partitions

1 Average over all partitions, ignoring invalid ones2 Average over all partitions, using 0 for invalid ones3 Calculate TP/FP/FN per partition, and calculate F1 across

all partitions? (Forman and Scholz, 2010)

Ooops. . .

F13

F1 1

0.2

0.4

0.6

0.8

0.2 0.4 0.6 0.8

AR3

0.2 0.4 0.6 0.8

v2.0

0.2 0.4 0.6 0.8

MC1

0.2 0.4 0.6 0.8

PC2

rpart Log. NB LDA RF● ● ● ● ●

D’Ambros, M., M. Lanza, and R. Robbes (2010). An extensivecomparison of bug prediction approaches. In MSR. IEEEComputer Society.

Forman, G. and M. Scholz (2010, November). Apples-to-applesin cross-validation studies: pitfalls in classifier performancemeasurement. ACM SIGKDD Explorations Newsletter 12,49–57.

Mende, T. (2010). Replication of defect prediction studies:Problems, pitfalls and recommendations. New York, NY,USA, pp. 1–10.

Mende, T., R. Koschke, and M. Leszak (2009, March).Evaluating defect prediction models for a large, evolvingsoftware system. pp. 247–250.

Mende, T., R. Koschke, and J. Peleska (2011). On the utility ofa defect prediction model during HW/SW integration testing:A retrospective case study. pp. 259–268.

Tosun, A., B. Turhan, and A. Bener (2009). Validation ofnetwork measures as indicators of defective modules insoftware systems. New York, NY, USA, pp. 1–9. ACM.

Zimmermann, T., R. Premraj, and A. Zeller (2007). Predictingdefects for Eclipse. IEEE Computer Society.