Revisiting the Experimental Design Choices for Approaches for the Automated Retrieval
of Duplicate Issue Reports
Ph.D. Candidate: Mohamed Sami Rakha
Supervisor: Prof. Ahmed E. Hassan
Ph.D. Thesis Defense
Software issues are common
There are many issue tracking systems
Issue tracking systems receive a large number of issue reports
New Issue Reports
Over 1.2 million issue reports
Different people may report the same issue
User #2
User #1
Duplicate
Master
The identification of duplicate issue reports is an annoying and time-wasting task
User #2
User #1
Duplicate
20–30% Duplicates!
Master
Bettenburg et al. “Duplicate bug reports considered harmful… really?” [ICSM 2008]
Information retrieval is leveraged by prior research
Duplicate
Automated Approach
Search
Most similar previously reported issues
Ranked list
Master
The developer chooses the correct candidate for the newly reported duplicate, if there is any
Duplicate
Automated Approach
Search
Most similar previously reported issues
Ranked list
Developer
Master
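To make the retrieval step concrete, here is a minimal sketch of the kind of information retrieval pipeline these approaches build on: rank previously reported issues by textual similarity to the new report. The actual approaches in prior research (e.g., BM25F and REP) use richer similarity models and multiple report fields; the report texts below are invented for illustration.

```python
# Minimal sketch of IR-based duplicate retrieval: rank previously reported
# issues by textual similarity to a new report (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

previous_reports = [
    "Crash when opening large attachment",
    "Toolbar icons render blurry on HiDPI screens",
    "Application crashes while opening big attachments",
]
new_report = "App crash on opening a large attachment"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(previous_reports + [new_report])

# Similarity of the new report (last row) to every previous report,
# producing the ranked list shown to the developer.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for score, report in sorted(zip(scores, previous_reports), reverse=True):
    print(f"{score:.2f}  {report}")
```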
Several automated approaches for duplicate retrieval have been proposed by prior research
Prior Research
2007: Runeson et al.
2008: Wang et al.
2010: Sun et al.
2010: Sureka et al.
2011: Sun et al.
2012: Zhou et al.
2012: Nguyen et al.
2015: Aggarwal et al.
2016: Hindle et al.
2016: Jie et al.
Prior research focuses on improving the performance
Automated Approach
Performance
Recall Rate
Duplicate
The goal of a newly proposed approach is to achieve higher performance than prior ones
In my thesis, I revisit the experimental design choices from four perspectives
Example: Approach A (Recall Rate = 0.6) > Approach B (Recall Rate = 0.5)
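The recall rate referred to here is the fraction of duplicate reports whose master appears among the top-k suggested candidates. A minimal sketch with toy report IDs:

```python
# Recall rate@k: the fraction of duplicates whose master report appears
# among the top-k suggested candidates (toy data, illustrative only).
def recall_rate(suggestions, true_masters, k=5):
    hits = sum(
        1 for dup, master in true_masters.items()
        if master in suggestions.get(dup, [])[:k]
    )
    return hits / len(true_masters)

suggestions = {"D1": ["M1", "M7", "M3"], "D2": ["M9", "M2"]}  # ranked lists
true_masters = {"D1": "M1", "D2": "M4"}                       # ground truth
print(recall_rate(suggestions, true_masters, k=5))            # 0.5
```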
Revisiting the experimental design choices can help us better understand the benefits and limitations of the evaluation of prior approaches.
Needed Effort
Data Filtration
Data Changes
Evaluation Process
Prior research treats all duplicate reports equally
Duplicate reports differ based on the effort needed for their manual identification
Real Example: Duplicate (B) needs much more effort to be identified
(A) was identified 15 minutes after being reported
(B) was identified 44 days after being reported
A duplicate can be hard or easy based on the effort needed for identification
In the studied systems, 50% of duplicates are EASY (identified within 1 day, without discussions, with one person) and 50% are HARD
A Random Forest model (“Binary Classification”: EASY vs. HARD) achieves AUCs of 0.68 – 0.8
This model is used to explain the most important factors influencing the effort needed for identification
Factors such as description similarity are among the most important ones
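As a sketch of how such a binary EASY/HARD classifier can be built and assessed with AUC, here is a hedged example using scikit-learn; the features and synthetic labels are illustrative assumptions, not the thesis's exact factor set.

```python
# Sketch: Random Forest classifying duplicates as EASY vs. HARD to identify.
# Features and labels are synthetic stand-ins for the thesis's factors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Toy factors: description similarity, #comments, #people involved.
X = rng.random((n, 3))
# Toy label loosely tied to description similarity (low similarity -> HARD).
y = (X[:, 0] + 0.3 * rng.standard_normal(n) < 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("Importances:", model.feature_importances_)  # which factors matter most
```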
Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches
[Published in EMSE 2015]
Many prior studies evaluate their approaches on a short testing period (e.g., 1 year)
Issue reports outside the testing period are not considered in the evaluation
Duplicate (B)’s match is outside the testing period, so (B) is ignored
I refer to this evaluation as the “Classical Evaluation”
Instead, I propose the “Realistic Evaluation”
Duplicates reported within the testing period (1 year) are matched against all previously reported issues, even those reported before the testing period
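The difference between the two evaluations comes down to which duplicates are counted. A minimal sketch of the two policies, assuming each duplicate is represented by its own report time and its master's report time (field names are illustrative):

```python
# Sketch of the two evaluation policies; times are days on one timeline.
def classical_duplicates(duplicates, start, end):
    # Only duplicates whose report AND master fall inside the testing period.
    return [d for d in duplicates
            if start <= d["reported"] <= end and start <= d["master"] <= end]

def realistic_duplicates(duplicates, start, end):
    # Every duplicate reported in the testing period counts; its master may
    # have been reported at any earlier time.
    return [d for d in duplicates if start <= d["reported"] <= end]

dups = [
    {"id": "A", "reported": 400, "master": 380},  # master inside the period
    {"id": "B", "reported": 410, "master": 120},  # master before the period
]
print(len(classical_duplicates(dups, 365, 730)))  # 1 -> (B) is ignored
print(len(realistic_duplicates(dups, 365, 730)))  # 2 -> (B) is kept
```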
I compare the two evaluations on testing data frequently used in prior research: Mozilla 2010, Eclipse 2008, and OpenOffice 2008 – 2010
Problem #1: Number of considered duplicates
Problem #2: Performance overestimation
Problem #1 (Mozilla 2010): the classical evaluation misses 23.5% of the duplicates (7,551 vs. 10,043 duplicates)
Problem #2 (Mozilla 2010): the recall rate of the REP approach is overestimated by 39% (classical evaluation: 0.58 vs. realistic evaluation: 0.43)
Does the difference in performance between the two evaluations generalize to other testing periods within the studied ITSs?
Randomly select 100 chunks (chunk = 1 year) as “Testing Periods”
Study the statistical difference between the classical and realistic evaluations
Each chunk is fed to the automated approach under both evaluations:
Chunk 1 (Dec 2004 – Dec 2005) → Results 1
Chunk 2 (Nov 2003 – Nov 2004) → Results 2
…
Chunk 100 (Sep 2007 – Sep 2008) → Results 100
The classical evaluation significantly overestimates the performance of the automated approaches (p-value < 0.001)
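A sketch of how such a paired comparison over the 100 chunks can be run; the choice of the Wilcoxon signed-rank test and the synthetic recall values are illustrative assumptions:

```python
# Sketch: paired comparison of per-chunk recall rates under the classical
# vs. realistic evaluations (synthetic values, illustrative test choice).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
realistic = rng.uniform(0.35, 0.55, size=100)          # one value per chunk
classical = realistic + rng.uniform(0.05, 0.20, 100)   # consistently higher

stat, p_value = wilcoxon(classical, realistic)
print(f"p-value = {p_value:.2e}")  # tiny p-value -> significant difference
```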
Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such realistic evaluation
[Published in TSE 2017]
Under the realistic evaluation, no duplicates are ignored, so the search space is large. Do we need to search all of these issue reports?
Filtering out old issue reports should reduce the search time
A possible solution is to filter out old issue reports using a time window based on the report date
Example (1): a 2-year window; Example (2): a 4-month window
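A minimal sketch of such report-date-based windowing (field names and toy dates are illustrative):

```python
# Sketch: keep only reports whose report date falls within `window_days`
# before the new report's date (toy data, illustrative only).
def within_window(candidates, new_report_day, window_days):
    return [c for c in candidates
            if new_report_day - window_days <= c["reported"] <= new_report_day]

reports = [{"id": i, "reported": day} for i, day in enumerate([10, 300, 700])]
print(within_window(reports, new_report_day=720, window_days=120))  # ~4 months
print(within_window(reports, new_report_day=720, window_days=730))  # ~2 years
```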
Results of limiting the searched reports to the last (n) months: it is not possible to limit the issue reports that are searched without decreasing the performance of the automated approaches
Example: issue (A) was resolved 1 month ago, while issue (B) was resolved 2 years ago
A new duplicate reported during the testing period is more likely to be a duplicate of (A)
I propose a filtration approach based on a time threshold from the resolution date for each resolution value and rank
The thresholds are optimized using a genetic algorithm (NSGA-II)
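A self-contained sketch of the filtration idea and the two competing objectives that such an optimization trades off; for brevity it samples random thresholds where the thesis applies NSGA-II, and all data, field names, and the exact objective formulation are illustrative assumptions:

```python
# Sketch: per-resolution-value time thresholds decide which resolved reports
# stay searchable; an optimizer (NSGA-II in the thesis, random sampling here)
# explores the trade-off between recall and the number of searched reports.
import random

random.seed(0)
CANDIDATES = [
    {"id": i, "resolution": random.choice(["FIXED", "WONTFIX"]),
     "resolved_days_ago": random.randint(1, 1500)}
    for i in range(200)
]
TRUE_MASTERS = {c["id"] for c in random.sample(CANDIDATES, 30)}  # toy truth

def filter_candidates(thresholds):
    return [c for c in CANDIDATES
            if c["resolved_days_ago"] <= thresholds[c["resolution"]]]

def objectives(thresholds):
    kept_ids = {c["id"] for c in filter_candidates(thresholds)}
    recall = len(TRUE_MASTERS & kept_ids) / len(TRUE_MASTERS)
    return recall, len(kept_ids)  # maximize recall, minimize search cost

# NSGA-II would evolve a Pareto front over the threshold space; here we
# just sample a few threshold settings to show the trade-off.
for _ in range(3):
    thr = {"FIXED": random.randint(30, 1500), "WONTFIX": random.randint(30, 1500)}
    print(thr, objectives(thr))
```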
The proposed filtration approach improves the performance by up to 60%
Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches
[Published in TSE 2017]
In 2011, the just-in-time (JIT) duplicate retrieval feature was introduced
After the JIT feature: significantly fewer duplicates are reported, but more identification effort is needed
How does the JIT duplicate retrieval feature impact the performance evaluation of the automated approaches?
Approach (1): BM25F, Approach (2): REP
The performance of both approaches is significantly lower for after-JIT duplicates (Firefox)
The relative improvement in the performance of REP over BM25F is significantly larger after the activation of the JIT feature (Firefox)
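Here, the relative improvement can be read as REP's recall gain normalized by BM25F's recall; a toy illustration (the values are invented, not the thesis's results):

```python
# Relative improvement of REP over BM25F (toy values, illustrative only).
def relative_improvement(recall_rep, recall_bm25f):
    return (recall_rep - recall_bm25f) / recall_bm25f

print(relative_improvement(0.50, 0.40))  # 0.25 -> 25% relative improvement
```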
Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches
[Submitted to EMSE] – under review
Summary
In my thesis, I revisit the experimental design choices for automated duplicate retrieval from four perspectives: Needed Effort, Evaluation Process, Data Filtration, and Data Changes.
Thesis Contribution #1: Show the importance of considering the needed effort for retrieving duplicate reports in the performance measurement of automated duplicate retrieval approaches
[Published in EMSE 2015]
Thesis Contribution #2: Propose a realistic evaluation for the automated retrieval of duplicate reports, and analyze prior findings using such realistic evaluation
[Published in TSE 2017]
Thesis Contribution #3: Propose a genetic algorithm based approach for filtering out the old issue reports to improve the performance of duplicate retrieval approaches
[Published in TSE 2017]
Thesis Contribution #4: Highlight the impact of the “JIT” feature on the performance evaluation of duplicate retrieval approaches
[Submitted to EMSE] – under review