Post on 31-Dec-2015
Outline
• Clones and Control Structure Variant Clones
• Research Motivation
• Approach for mining control structure variant clones
• Evaluation of precision and recall
• Case study of control structure variant clones
• Refactorability evaluation
2
• Clones are common in software systems. The percentage of clones in systems varies from 6.5% to 59.5%, with an average of 14.6%. (Chen et al. @2014)
Code duplication (Software Clone)
3
• Clones are harmful
• Identified as the worst code smell (Rahman @2010)
• Indication of poor software maintainability (Mondal @2011)
• Cause system design quality to degrade
Why are clones a problem?
Clone refactoring can eliminate bad effects.
4
• Type-1: Identical code fragments except for variations in whitespace, layout and comments. (Clear)
• Type-2: Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments. (Clear)
• Type-3: Copied fragments with further modifications such as changed, added or removed statements, in addition to Type-1 variations.
• Type-4: Two or more code fragments that perform the same computation but are implemented with different syntax.
Clone Categorization
The most widely accepted definition is from Roy @2009
5
• Type-4 clones can be divided into subcategories.
Dispute about Type-4 Clones
• Type-4 clones are syntactically different semantic clones and still undecidable.
• Type-4 clones are behaviorally similar code fragments with regard to their input/output.
6
Definition
• Control structure variant clones (CSVC) are clones that use different control structures to implement the same functionality.
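As an illustration of the definition, the sketch below shows two methods that compute the same sum, one with an enhanced for loop and one with an iterator-based while loop. The class and method names are hypothetical and do not come from the studied systems.

```java
import java.util.Iterator;
import java.util.List;

public class CsvcExample {
    // Variant 1: enhanced for loop.
    static int sumEnhancedFor(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Variant 2: iterator-based while loop with the same functionality.
    static int sumIteratorWhile(List<Integer> values) {
        int sum = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3);
        // Both variants return the same result for the same input.
        System.out.println(sumEnhancedFor(data) == sumIteratorWhile(data));
    }
}
```

The two method bodies share few tokens, so a purely textual or token-based comparison would likely not pair them, even though their behavior is the same.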
Control Structure Variant Clone?
7
From the perspective of clone refactoring, a different strategy is required to refactor control structure variant clones.
• Extract common code fragment
• Analysis of code functionality
Motivation
8
A study by Jürgens et al. [2010] on clones beyond copy-paste revealed:
– The state-of-the-art clone detectors did not achieve a recall of more than 10%.
– In 52 manually checked methods, 32 were behaviorally similar to but syntactically different from other methods.
No existing approach is tailored to finding these clones.
Motivation
9
Propose an approach to mine control structure variant clones accurately.
The mining process should take into account:
1. Control structure matching
2. Functional similarity evaluation
Goal
10
• Loop variants
  • Enhanced for loop
  • Iterator-based for or while loop
  • Index-based for or while loop
  • Do-while loop
• Conditional variants
  • If-else statement
  • Conditional expression (ternary operator ?:)
  • Switch statement
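The following sketch illustrates some of the variants listed above: an index-based for loop against a do-while loop, and an if-else statement against a conditional expression and a switch statement. All names are hypothetical and chosen only for illustration.

```java
public class ControlStructureVariants {
    // Loop variant: index-based for loop.
    static int countPositivesIndexFor(int[] a) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0) count++;
        }
        return count;
    }

    // Loop variant: do-while loop. The body runs at least once,
    // so the empty array needs an explicit guard.
    static int countPositivesDoWhile(int[] a) {
        if (a.length == 0) return 0;
        int count = 0;
        int i = 0;
        do {
            if (a[i] > 0) count++;
            i++;
        } while (i < a.length);
        return count;
    }

    // Conditional variant: if-else statement.
    static String signIfElse(int x) {
        if (x < 0) {
            return "negative";
        } else {
            return "non-negative";
        }
    }

    // Conditional variant: conditional expression (ternary operator).
    static String signTernary(int x) {
        return x < 0 ? "negative" : "non-negative";
    }

    // Conditional variant: switch statement over the sign of x.
    static String signSwitch(int x) {
        switch (Integer.signum(x)) {
            case -1:
                return "negative";
            default:
                return "non-negative";
        }
    }
}
```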
Common Control Structures in Java
13
Loop Variable:
• Start index
• End index
• Step
We consider two loops L1 and L2 as functionally equivalent if they have the same loop variable values (start index, end index, and step).
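A minimal sketch of such a unified representation is given below. It assumes that a loop can be summarized by integer start, end, and step values; the class `LoopSignature` and its method are hypothetical and are not taken from the presented tool.

```java
public class LoopSignature {
    final int start; // first value of the loop variable
    final int end;   // bound at which the loop stops (assumed exclusive)
    final int step;  // increment applied after each iteration

    LoopSignature(int start, int end, int step) {
        this.start = start;
        this.end = end;
        this.step = step;
    }

    // Two loops are treated as equivalent here when all three
    // components of their loop variable match.
    boolean sameLoopVariable(LoopSignature other) {
        return start == other.start && end == other.end && step == other.step;
    }

    public static void main(String[] args) {
        // for (int i = 0; i < 10; i++) { ... }
        LoopSignature indexFor = new LoopSignature(0, 10, 1);
        // int i = 0; while (i < 10) { ...; i++; }
        LoopSignature indexWhile = new LoopSignature(0, 10, 1);
        System.out.println(indexFor.sameLoopVariable(indexWhile));
    }
}
```

Under this assumption, an index-based for loop and an index-based while loop over the same range map to the same signature even though their syntax differs.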
Unified Representation of Loops
14
Java Binding: unique string representing a variable, object type, or method invocation.
IBinding:
• IMethodBinding
• ITypeBinding
• IVariableBinding (Excluded)
Phase 2: Function Similarity Evaluation
20
IMethodBinding represents method signatures.
ITypeBinding represents the Java types.
Binding Information
21
2. Ignore the binding keys of the methods which access the next element.
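The post-processing step can be sketched as a filter over sets of binding keys. In the sketch below, the key strings are simplified placeholders rather than real binding key formats, the set of "next element" accessors is a guess, and the Jaccard measure is an assumption: the slides do not state which similarity metric is used.

```java
import java.util.HashSet;
import java.util.Set;

public class BindingSimilarity {
    // Hypothetical keys of methods that only access the next element.
    static final Set<String> NEXT_ELEMENT_KEYS = Set.of(
            "java.util.Iterator.next()",
            "java.util.List.get(int)");

    // Drop the keys of next-element accessors before comparing fragments.
    static Set<String> postProcess(Set<String> keys) {
        Set<String> kept = new HashSet<>(keys);
        kept.removeAll(NEXT_ELEMENT_KEYS);
        return kept;
    }

    // Jaccard similarity of the filtered key sets (assumed metric).
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> x = postProcess(a);
        Set<String> y = postProcess(b);
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) intersection.size() / union.size();
    }
}
```

With this filter, a fragment that calls `Iterator.next()` and one that calls `List.get(int)` are not penalized for that difference alone, which is the intent of ignoring those keys.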
Post-processing of Bindings
23
Study Setup
• Select projects.
• Select clone detection tool.
• Investigation of the results.
Evaluation
25
• Three criteria for tool selection:
  1. Able to detect clones with control structure variations.
  2. Available for download.
  3. Takes a reasonable time to detect clones.
• Tried five different clone detection tools:
  CCFinder – Not able to find semantic clones
  JSCtracker – Not able to finish the detection process
  NiCad – Returns abnormal clone groups
  Deckard – Not able to finish detection
  Sebyte – Works well for our experiment
Selection of Detection tool
27
• Trade-off between precision and recall
• Identify 285 true positives (TP), 475 false positives (FP)
Best Threshold
28
A threshold value of 0.5 achieved a precision of 0.64 and a recall of 0.91.
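For reference, precision and recall at a given threshold can be computed as in the sketch below. It assumes that a candidate pair is reported as a clone when its similarity score is at least the threshold; the class name and the input arrays are hypothetical and do not reproduce the study's data.

```java
public class ThresholdEvaluation {
    // Returns {precision, recall} for candidates with score >= threshold.
    static double[] precisionRecall(double[] scores, boolean[] isTrueClone, double threshold) {
        int tp = 0; // reported and truly a clone
        int fp = 0; // reported but not a clone
        int fn = 0; // a clone that was not reported
        for (int i = 0; i < scores.length; i++) {
            boolean reported = scores[i] >= threshold;
            if (reported && isTrueClone[i]) {
                tp++;
            } else if (reported) {
                fp++;
            } else if (isTrueClone[i]) {
                fn++;
            }
        }
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        return new double[] {precision, recall};
    }
}
```

Raising the threshold typically removes false positives (higher precision) but also drops some true clones (lower recall), which is the trade-off the threshold selection has to balance.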
Best Threshold
29
Q1: Which variation occurs most frequently?
Q2: Does the evolution of a programming language affect the introduction of control structure variant clones?
Case Study
31
Fact: The largest category is enhanced for loop vs. iterator-based while loop, which has 109 instances.
Answer to Q1: Enhanced for loop and Iterator-based while loop appear most often
Case Study
33
Fact: The enhanced for loop is involved in all of the top 3 categories, which together contain 209 clone pairs, accounting for 73%.
Answer to Q2: The enhanced for loop, introduced in Java 5, significantly affects the introduction of control structure variant clones.
Case Study
34
Exchange of method invocation expressions
Variations Hindering Refactoring
38
Clone 1: A B
Clone 2: B A
Conclusion
• Control structure variant clones do exist in systems
• They are introduced because the language evolves, e.g., the new enhanced for loop feature
• 42% of the clones we found are refactorable
40
• Improve the approach to convert one data structure to another to refactor an additional 19% of the control structure variant clones.
Future Work
41
• Develop code to unify different control structures and perform the refactoring.
Thanks!
42
Visit our benchmark of control structure variant clones at http://users.encs.concordia.ca/~nikolaos/IWSC_2015/