Revisiting the Experimental Design Choices for Approaches for the Automated Retrieval
of Duplicate Issue Reports
Ph.D. Candidate: Mohamed Sami Rakha
Supervisor: Prof. Ahmed E. Hassan
Ph.D. Thesis Defense
Software issues are common
There are many issue tracking systems
Issue tracking systems receive a large number of issue reports
New Issue Reports
Over 1.2 million issue reports
Different people may report the same issue
User #2
User #1
Duplicate
Master
The identification of duplicate issue reports is an annoying and time-wasting task
User #2
User #1
Duplicate
20–30% Duplicates!
Master
Bettenburg et al. “Duplicate bug reports considered harmful… really?” [ICSM 2008]
Information retrieval is leveraged by prior research
Duplicate
Automated Approach
Search
Most similar previously reported issues
Ranked list
Master
The developer chooses the correct candidate for the newly reported duplicate, if there is any
Duplicate
Automated Approach
Search
Most similar previously reported issues
Ranked list
Developer
Master
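To make the retrieval step concrete, here is a minimal sketch of the kind of information retrieval pipeline these approaches build on: rank previously reported issues by textual similarity to the new report. The actual approaches in prior research (e.g., BM25F and REP) use richer similarity models and multiple report fields; the report texts below are invented for illustration.

```python
# Minimal sketch of IR-based duplicate retrieval: rank previously reported
# issues by textual similarity to a new report (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

previous_reports = [
    "Crash when opening large attachment",
    "Toolbar icons render blurry on HiDPI screens",
    "Application crashes while opening big attachments",
]
new_report = "App crash on opening a large attachment"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(previous_reports + [new_report])

# Similarity of the new report (last row) to every previous report,
# producing the ranked list shown to the developer.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for score, report in sorted(zip(scores, previous_reports), reverse=True):
    print(f"{score:.2f}  {report}")
```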
Several automated approaches for duplicate retrieval have been proposed by prior research
Prior Research
2007: Runeson et al.
2008: Wang et al.
2010: Sun et al.
2010: Sureka et al.
2011: Sun et al.
2012: Zhou et al.
2012: Nguyen et al.
2015: Aggarwal et al.
2016: Hindle et al.
2016: Jie et al.
Prior research focuses on improving the performance
Automated Approach
Performance
Recall Rate
Duplicate
The goal of a newly proposed approach is to achieve higher performance than prior ones
In my thesis, I revisit the experimental design choices from four perspectives
Example: Approach A (Recall Rate = 0.6) > Approach B (Recall Rate = 0.5)
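The recall rate referred to here is the fraction of duplicate reports whose master appears among the top-k suggested candidates. A minimal sketch with toy report IDs:

```python
# Recall rate@k: the fraction of duplicates whose master report appears
# among the top-k suggested candidates (toy data, illustrative only).
def recall_rate(suggestions, true_masters, k=5):
    hits = sum(
        1 for dup, master in true_masters.items()
        if master in suggestions.get(dup, [])[:k]
    )
    return hits / len(true_masters)

suggestions = {"D1": ["M1", "M7", "M3"], "D2": ["M9", "M2"]}  # ranked lists
true_masters = {"D1": "M1", "D2": "M4"}                       # ground truth
print(recall_rate(suggestions, true_masters, k=5))            # 0.5
```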
Revisiting the experimental design choices can help us better understand the benefits and limitations of the evaluation of prior approaches.
Needed Effort
Data Filtration
Data Changes
Evaluation Process
Prior research treats all duplicate reports equally
Duplicate reports differ based on the effort needed for their manual identification
Real Example: Duplicate (B) needs much more effort to be identified
(A) was identified 15 minutes after being reported
(B) was identified 44 days after being reported
A duplicate can be hard or easy based on the effort needed for identification
In the studied systems, 50% of duplicates are EASY (identified within 1 day, without discussions, with one person) and 50% are HARD
A Random Forest model (“Binary Classification”: EASY vs. HARD) achieves AUCs of 0.68 – 0.8
This model is used to explain the most important factors influencing the effort needed for identification
Factors such as description similarity are among the most important ones
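As a sketch of how such a binary EASY/HARD classifier can be built and assessed with AUC, here is a hedged example using scikit-learn; the features and synthetic labels are illustrative assumptions, not the thesis's exact factor set.

```python
# Sketch: Random Forest classifying duplicates as EASY vs. HARD to identify.
# Features and labels are synthetic stand-ins for the thesis's factors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Toy factors: description similarity, #comments, #people involved.
X = rng.random((n, 3))
# Toy label loosely tied to description similarity (low similarity -> HARD).
y = (X[:, 0] + 0.3 * rng.standard_normal(n) < 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("Importances:", model.feature_importances_)  # which factors matter most
```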
Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches
[Published in EMSE 2015]
Many prior studies evaluate their approaches on a short testing period (e.g., 1 year)
Issue reports outside the testing period are not considered in the evaluation
Duplicate (B)’s match is outside the testing period, so (B) is ignored
I refer to this evaluation as the “Classical Evaluation”
Instead, I propose the “Realistic Evaluation”
Duplicates reported within the testing period (1 year) are matched against all previously reported issues, even those reported before the testing period
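The difference between the two evaluations comes down to which duplicates are counted. A minimal sketch of the two policies, assuming each duplicate is represented by its own report time and its master's report time (field names are illustrative):

```python
# Sketch of the two evaluation policies; times are days on one timeline.
def classical_duplicates(duplicates, start, end):
    # Only duplicates whose report AND master fall inside the testing period.
    return [d for d in duplicates
            if start <= d["reported"] <= end and start <= d["master"] <= end]

def realistic_duplicates(duplicates, start, end):
    # Every duplicate reported in the testing period counts; its master may
    # have been reported at any earlier time.
    return [d for d in duplicates if start <= d["reported"] <= end]

dups = [
    {"id": "A", "reported": 400, "master": 380},  # master inside the period
    {"id": "B", "reported": 410, "master": 120},  # master before the period
]
print(len(classical_duplicates(dups, 365, 730)))  # 1 -> (B) is ignored
print(len(realistic_duplicates(dups, 365, 730)))  # 2 -> (B) is kept
```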
I compare the two evaluations on testing data frequently used in prior research: Mozilla 2010, Eclipse 2008, and OpenOffice 2008 – 2010
Problem #1: Number of considered duplicates
Problem #2: Performance overestimation
Problem #1 (Mozilla 2010): the classical evaluation misses 23.5% of the duplicates (7,551 vs. 10,043 duplicates)
Problem #2 (Mozilla 2010): the recall rate of the REP approach is overestimated by 39% (classical evaluation: 0.58 vs. realistic evaluation: 0.43)
Does the difference in performance between the two evaluations generalize to other testing periods within the studied ITSs?
Randomly select 100 chunks (chunk = 1 year) as “Testing Periods”
Study the statistical difference between the classical and realistic evaluations
Each chunk is fed to the automated approach under both evaluations:
Chunk 1 (Dec 2004 – Dec 2005) → Results 1
Chunk 2 (Nov 2003 – Nov 2004) → Results 2
…
Chunk 100 (Sep 2007 – Sep 2008) → Results 100
The classical evaluation significantly overestimates the performance of the automated approaches (p-value < 0.001)
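A sketch of how such a paired comparison over the 100 chunks can be run; the choice of the Wilcoxon signed-rank test and the synthetic recall values are illustrative assumptions:

```python
# Sketch: paired comparison of per-chunk recall rates under the classical
# vs. realistic evaluations (synthetic values, illustrative test choice).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
realistic = rng.uniform(0.35, 0.55, size=100)          # one value per chunk
classical = realistic + rng.uniform(0.05, 0.20, 100)   # consistently higher

stat, p_value = wilcoxon(classical, realistic)
print(f"p-value = {p_value:.2e}")  # tiny p-value -> significant difference
```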
Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such realistic evaluation
[Published in TSE 2017]
Under the realistic evaluation, no duplicates are ignored, so the search space is large. Do we need to search all of these issue reports?
Filtering out old issue reports should reduce the search time
A possible solution is to filter out old issue reports using a time window based on the report date
Example (1): a 2-year window; Example (2): a 4-month window
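A minimal sketch of such report-date-based windowing (field names and toy dates are illustrative):

```python
# Sketch: keep only reports whose report date falls within `window_days`
# before the new report's date (toy data, illustrative only).
def within_window(candidates, new_report_day, window_days):
    return [c for c in candidates
            if new_report_day - window_days <= c["reported"] <= new_report_day]

reports = [{"id": i, "reported": day} for i, day in enumerate([10, 300, 700])]
print(within_window(reports, new_report_day=720, window_days=120))  # ~4 months
print(within_window(reports, new_report_day=720, window_days=730))  # ~2 years
```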
Results of limiting the searched reports to the last (n) months: it is not possible to limit the issue reports that are searched without decreasing the performance of the automated approaches
Example: issue (A) was resolved 1 month ago, while issue (B) was resolved 2 years ago
A new duplicate reported during the testing period is more likely to be a duplicate of (A)
I propose a filtration approach based on a time threshold from the resolution date for each resolution value and rank
The thresholds are optimized using a genetic algorithm (NSGA-II)
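A self-contained sketch of the filtration idea and the two competing objectives that such an optimization trades off; for brevity it samples random thresholds where the thesis applies NSGA-II, and all data, field names, and the exact objective formulation are illustrative assumptions:

```python
# Sketch: per-resolution-value time thresholds decide which resolved reports
# stay searchable; an optimizer (NSGA-II in the thesis, random sampling here)
# explores the trade-off between recall and the number of searched reports.
import random

random.seed(0)
CANDIDATES = [
    {"id": i, "resolution": random.choice(["FIXED", "WONTFIX"]),
     "resolved_days_ago": random.randint(1, 1500)}
    for i in range(200)
]
TRUE_MASTERS = {c["id"] for c in random.sample(CANDIDATES, 30)}  # toy truth

def filter_candidates(thresholds):
    return [c for c in CANDIDATES
            if c["resolved_days_ago"] <= thresholds[c["resolution"]]]

def objectives(thresholds):
    kept_ids = {c["id"] for c in filter_candidates(thresholds)}
    recall = len(TRUE_MASTERS & kept_ids) / len(TRUE_MASTERS)
    return recall, len(kept_ids)  # maximize recall, minimize search cost

# NSGA-II would evolve a Pareto front over the threshold space; here we
# just sample a few threshold settings to show the trade-off.
for _ in range(3):
    thr = {"FIXED": random.randint(30, 1500), "WONTFIX": random.randint(30, 1500)}
    print(thr, objectives(thr))
```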
The proposed filtration approach improves the performance by up to 60%
Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches
[Published in TSE 2017]
In 2011, the just-in-time (JIT) duplicate retrieval feature was introduced
After the JIT feature: significantly fewer duplicates are reported, but more identification effort is needed
How does the JIT duplicate retrieval feature impact the performance evaluation of the automated approaches?
Approach (1): BM25F, Approach (2): REP
The performance of both approaches is significantly lower for after-JIT duplicates (Firefox)
The relative improvement in the performance of REP over BM25F is significantly larger after the activation of the JIT feature (Firefox)
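Here, the relative improvement can be read as REP's recall gain normalized by BM25F's recall; a toy illustration (the values are invented, not the thesis's results):

```python
# Relative improvement of REP over BM25F (toy values, illustrative only).
def relative_improvement(recall_rep, recall_bm25f):
    return (recall_rep - recall_bm25f) / recall_bm25f

print(relative_improvement(0.50, 0.40))  # 0.25 -> 25% relative improvement
```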
Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches
[Submitted to EMSE] – under review
Summary
In my thesis, I revisit the experimental design choices for automated duplicate retrieval from four perspectives: Needed Effort, Evaluation Process, Data Filtration, and Data Changes.
Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches
[Published in EMSE 2015]
Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such realistic evaluation
[Published in TSE 2017]
Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches
[Published in TSE 2017]
Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches
[Submitted to EMSE] – under review