
Revisiting the Experimental Design Choices for Approaches for the Automated Retrieval of Duplicate Issue Reports

Ph.D. Candidate: Mohamed Sami Rakha

Supervisor: Prof. Ahmed E. Hassan

Ph.D. Thesis Defense

Software issues are common

There are many issue tracking systems

Issue tracking systems receive a large number of issue reports

New Issue Reports

Over 1.2 million issue reports

Different people may report the same issue: one report is kept as the master, while the other reports of the same issue are marked as duplicates.

The identification of duplicate issue reports is an annoying and time-wasting task: 20-30% of issue reports are duplicates!

Bettenburg et al. Duplicate bug reports considered harmful… really? [ICSM 2008]

Prior research leverages information retrieval: given a newly reported duplicate, an automated approach searches the previously reported issues and returns a ranked list of the most similar ones (candidate masters).

The developer then chooses the correct candidate (the master) for the newly reported duplicate… if there is any.

Prior research has proposed several automated approaches for duplicate retrieval

Prior Research

2007: Runeson et al.

2008: Wang et al.

2010: Sun et al.

2010: Sureka et al.

2011: Sun et al.

2012: Zhou et al.

2012: Nguyen et al.

2015: Aggarwal et al.

2016: Hindle et al.

2016: Jie et al.

Prior research focuses on improving the performance

Duplicate → Automated Approach → Performance (Recall Rate)

The goal of each newly proposed approach is to achieve higher performance than the prior ones


In my thesis, I revisit the experimental design choices from four perspectives

Example: approach A (recall rate = 0.6) outperforms approach B (recall rate = 0.5).
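For illustration, a minimal sketch of how such a recall rate (recall rate@k) is typically computed for duplicate retrieval; the data structures below are hypothetical.

```python
# Minimal sketch (hypothetical data structures): the recall rate@k is the
# fraction of duplicate reports whose master appears in the top-k candidates
# suggested by the automated approach.

def recall_rate_at_k(suggestions, true_masters, k=10):
    """suggestions: duplicate id -> ranked list of candidate master ids.
    true_masters: duplicate id -> id of the actual master report."""
    hits = sum(1 for dup, master in true_masters.items()
               if master in suggestions.get(dup, [])[:k])
    return hits / len(true_masters)

# If approach A retrieves the correct master for 6 of 10 duplicates and
# approach B for 5 of 10, their recall rates are 0.6 and 0.5, so A wins.
```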


Revisiting the experimental design choices can help us better understand the benefits and limitations of how prior approaches were evaluated.

Needed Effort

Data Filtration

Data Changes

Evaluation Process

Perspective #1: Needed Effort

Prior research treats all duplicate reports equally

Duplicate reports differ based on the effort needed for their manual identification


Real Example: Duplicate (B) needs much more effort to be identified. Duplicate (A) was identified within 15 minutes of being reported, while Duplicate (B) took 44 days.


A duplicate can be hard or easy based on the effort needed for identification.

EASY (50% of duplicates): identified within 1 day, without discussions, by one person. HARD (50%): the remaining duplicates.

Studied Systems:


A random forest model (“binary classification” of EASY vs. HARD duplicates) achieves AUCs of 0.68 – 0.80.

This model is used to explain the most important factors influencing the effort needed for identification.

Description similarity is one of the most important factors.
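A minimal sketch of such a classifier on hypothetical data (the actual factors, data, and model settings in the thesis may differ):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data: one row per duplicate report, with factor columns and
# an EASY/HARD effort label. Column names are assumptions for illustration.
duplicates = pd.DataFrame({
    "description_similarity": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4],
    "title_similarity":       [0.8, 0.1, 0.9, 0.2, 0.6, 0.4, 0.7, 0.3],
    "effort":                 ["EASY", "HARD", "EASY", "HARD",
                               "EASY", "HARD", "EASY", "HARD"],
})

X = duplicates[["description_similarity", "title_similarity"]]
y = (duplicates["effort"] == "HARD").astype(int)  # 1 = HARD, 0 = EASY

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# AUC of the binary classifier (the thesis reports AUCs of 0.68 - 0.80).
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Feature importances indicate the factors that drive identification effort.
print(dict(zip(X.columns, model.feature_importances_)))
```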


Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches.

[Published in EMSE 2015]

Perspective #2: Evaluation Process

Many prior studies evaluate their approaches on a short testing period (e.g., one year); issue reports outside the testing period are not considered in the evaluation.


When duplicate (B)'s match (its master) is outside the testing period, duplicate (B) is ignored.

I refer to this evaluation as the “Classical Evaluation”.


Instead, I propose the “Realistic Evaluation”: every duplicate reported during the testing period is considered, even when its master was reported before the testing period.
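A minimal sketch of the difference between the two evaluations (the field names are hypothetical):

```python
# Minimal sketch (hypothetical fields): which duplicates are evaluated
# under each evaluation for a given testing period [start, end].

def classical_duplicates(duplicates, reports_by_id, start, end):
    # Classical evaluation: a duplicate is considered only if its master
    # was also reported inside the testing period; otherwise it is ignored.
    return [d for d in duplicates
            if start <= d["created"] <= end
            and start <= reports_by_id[d["master_id"]]["created"] <= end]

def realistic_duplicates(duplicates, start, end):
    # Realistic evaluation: every duplicate reported during the testing
    # period is considered, even if its master predates the period.
    return [d for d in duplicates if start <= d["created"] <= end]
```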


I compare the two evaluations on testing data frequently used in prior research: Mozilla 2010, Eclipse 2008, and OpenOffice 2008 – 2010.

Problem #1: the number of considered duplicates.

Problem #2: performance overestimation.


Problem #1 (Mozilla 2010): the classical evaluation is missing 23.5% of the duplicates (10,043 vs. 7,551 duplicates).


Problem #2 (Mozilla 2010): the recall rate of the REP approach is overestimated by 39% (recall of 0.58 under the classical evaluation vs. 0.43 under the realistic evaluation).



Does the difference in performance between the two evaluations generalize to other testing periods within the studied ITSs?


I randomly select 100 one-year chunks as testing periods and study the statistical difference between the classical and realistic evaluations.

Chunk 1 (Dec 2004 – Dec 2005), Chunk 2 (Nov 2003 – Nov 2004), … , Chunk 100 (Sep 2007 – Sep 2008): the automated approach is evaluated on each chunk under both evaluations, yielding Results 1 through Results 100.

The classical evaluation significantly overestimates the performance of the automated approaches (p-value < 0.001).
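One way to run such a paired comparison over the 100 chunks is sketched below; the recall-rate values are placeholders, and the thesis may use a different statistical test.

```python
# Minimal sketch: paired comparison of recall rates over the randomly
# selected one-year testing periods (values below are placeholders).
from scipy.stats import wilcoxon

# One recall rate per chunk under each evaluation (hypothetical values;
# in the thesis these come from re-running the approach on each chunk).
classical_recall = [0.58, 0.61, 0.55, 0.60, 0.57]
realistic_recall = [0.43, 0.47, 0.41, 0.45, 0.42]

stat, p_value = wilcoxon(classical_recall, realistic_recall)
print(f"Wilcoxon signed-rank p-value: {p_value:.3g}")  # thesis reports p < 0.001
```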


Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such a realistic evaluation.

[Published in TSE 2017]

Perspective #3: Data Filtration

Do we need to search all of these issue reports? Filtering out old issue reports should reduce the search time.

A possible solution is to filter out the old issue reports using a time window based on the report date (while keeping the realistic evaluation, in which no duplicates are ignored).

Example (1): a 2-year window. Example (2): a 4-month window.
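A minimal sketch of such a report-date time window (the field names are hypothetical):

```python
# Minimal sketch (hypothetical fields): keep only candidate reports created
# within the last `months` months before the newly reported issue.
from datetime import timedelta

def candidates_within_window(new_report, previous_reports, months):
    cutoff = new_report["created"] - timedelta(days=30 * months)
    return [r for r in previous_reports if r["created"] >= cutoff]

# Example (1) in the slides corresponds to months=24 (a 2-year window);
# Example (2) corresponds to months=4 (a 4-month window).
```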


Results of limiting the searched issue reports to the last n months: it is not possible to limit the issue reports that are searched without decreasing the performance of the automated approaches.


Example: issue (A) was resolved 1 month ago, while issue (B) was resolved 2 years ago. A newly reported duplicate is more likely to be a duplicate of (A).


I propose a filtration approach based on a time threshold from the resolution date for each resolution value and rank. The thresholds are optimized using a genetic algorithm (NSGA-II).
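A minimal sketch of that filtration, with hypothetical field names and threshold values; in the thesis the per-resolution thresholds come out of the NSGA-II optimization rather than being hand-picked.

```python
# Minimal sketch (hypothetical fields and thresholds): keep a resolved
# candidate report only if its resolution date is recent enough for its
# resolution value. In the thesis, the thresholds are optimized with NSGA-II.
from datetime import timedelta

thresholds_days = {"FIXED": 365, "DUPLICATE": 730, "WONTFIX": 90}  # hypothetical

def filter_by_resolution_age(new_report, previous_reports, thresholds_days):
    kept = []
    for r in previous_reports:
        max_age = thresholds_days.get(r.get("resolution"))
        if max_age is None:
            kept.append(r)  # unresolved or unknown resolution: keep (assumption)
        elif new_report["created"] - r["resolved"] <= timedelta(days=max_age):
            kept.append(r)
    return kept
```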


The proposed filtration approach improves the performance by up to 60%.


Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches.

[Published in TSE 2017]

Perspective #4: Data Changes

In 2011, the just-in-time (JIT) duplicate retrieval feature was introduced.


After 2011, there are significantly fewer duplicates, but more identification effort is needed.


How does the JIT duplicate retrieval feature impact the performance evaluation of the automated approaches?


Two approaches are studied: (1) BM25F and (2) REP.
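As a rough illustration of how such approaches rank candidate reports by textual similarity, here is a minimal sketch using plain BM25 via the rank_bm25 package; this is a simplification and an assumption, not the thesis's BM25F or REP implementation (BM25F additionally weighs fields such as the summary and description).

```python
# Rough illustration only: rank candidate reports for a new report with
# plain BM25 (the thesis evaluates BM25F and REP, not this simplification).
from rank_bm25 import BM25Okapi

candidates = [
    "firefox crashes when opening a new tab",      # toy report texts
    "browser window freezes on startup",
    "crash after opening a new tab quickly",
]
bm25 = BM25Okapi([text.split() for text in candidates])

new_report = "firefox crash when a new tab is opened".split()
scores = bm25.get_scores(new_report)

# Top-ranked candidates would be shown to the developer as likely masters.
for text, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {text}")
```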

The performance of both approaches is significantly lower for after-JIT duplicates (Firefox).


The relative improvement in the performance of REP over BM25F is significantly larger after the activation of the JIT feature (Firefox).


Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches.

[Submitted to EMSE] – under review

Summary


In my thesis, I revisit the experimental design choices from four perspectives


Needed Effort

Data Filtration

Data Changes

Evaluation Process


Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches.

[Published in EMSE 2015]


Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such a realistic evaluation.

[Published in TSE 2017]


Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches.

[Published in TSE 2017]


Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches.

[Submitted to EMSE] – under review
