
International Journal of Computer Applications (0975 - 8887) Volume 137 - No.10, March 2016

A Survey of Software Clone Detection Techniques

Abdullah Sheneamer
Department of Computer Science
University of Colorado at Colorado Springs
Colorado Springs, USA

Jugal Kalita
Department of Computer Science
University of Colorado at Colorado Springs
Colorado Springs, USA

ABSTRACT
If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and cause bug propagation. Software clones occur due to several reasons such as code reuse by copying pre-existing fragments, coding style, and repeated computation using duplicated functions with slight changes in the variables or data structures used. If a code fragment is edited, it will have to be checked against all related code clones to see if they need to be modified as well. Removal, avoidance or refactoring of cloned code are other important issues in software maintenance. However, several research studies have demonstrated that removal or refactoring of cloned code is sometimes harmful. In this study, code clones, common types of clones, phases of clone detection, the state-of-the-art in code clone detection techniques and tools, and challenges faced by clone detection techniques are discussed.

Keywords
Software Clone, Code Clone, Duplicated Code Detection, Clone Detection

1. INTRODUCTION
When a programmer copies and pastes a fragment of code, possibly with minor or even extensive edits, it is called code cloning. Code clones introduce difficulties in software maintenance and lead to bug propagation. For example, if there are many identical or nearly duplicated or copy-pasted code fragments in a software system and a bug is found in one code clone, it has to be detected everywhere and fixed. The presence of duplicated but identical bugs in many locations within a piece of software increases the difficulty of software maintenance. Recent research [5, 8, 10, 11] with large software systems [8, 35] has found that 22.3% of Linux code consists of clones. Kamiya et al. [4] have reported 29% cloned code in JDK. Baker [8] has detected clones in 13% - 20% of the source code of large systems. Baxter et al. [13] have found that 12.7% of the code is cloned in a large software system. Mayrand et al. [20] have also reported that 5% - 20% of code is cloned. Code clone detection can be useful for code simplification, code maintainability, plagiarism detection [41, 42], copyright infringement detection, malicious software detection and detection of bug reports. Many code clone detection techniques have been proposed [1]. The focus of this paper is to present a review of such clone detection techniques.

The rest of the paper is organized as follows. Section 2 discusses background material. Prior survey papers on the topic of code clone detection are introduced in Section 3. In Section 4, the seven phases in clone detection are described. Several clone detection approaches, techniques and tools are discussed in great detail in Section 5. Evaluation of clone detection techniques is discussed in Section 6. How detected clones can be removed automatically, and areas related to clone detection, are discussed in Section 7. Open problems related to code clone detection research are covered in Section 8. Finally, the paper is concluded in Section 9.

2. BACKGROUND
Code clones are used frequently because they can be created fast and inserted easily with little expense [2]. However, code clones affect software maintenance, may introduce poor design, lead to wasted time in repeatedly understanding a fragment of poorly written code, increase system size, and reincarnate bugs that are present in the original code segment. All of these make it a difficult job to maintain a large system [1, 2].

Additional reasons that make code clone detection essential are the following. 1) Detecting cloned code may help detect malicious software [3]. 2) Code clone detection may find similar code and help detect plagiarism and copyright infringement [2, 4, 5]. 3) Code clone detection helps reduce the source code size by performing code compaction [6]. 4) Code clone detection also helps detect crosscutting concerns, aspects of a program that affect many other parts of it, which arise when code is duplicated all over the system [6].

2.1 Basic Definitions
Each paper in the literature defines clones in its own way [1]. Here, common definitions which are used throughout this paper are provided.

Definition 1: Code Fragment. A code fragment (CF) is a part of the source code needed to run a program. It usually contains more than five statements that are considered interesting, but it may contain fewer than five statements. It can contain a function or a method, begin-end blocks or a sequence of statements.

Definition 2: Software Clone/Code Clone/Clone Pair. If a code fragment CF1 is similar to another code fragment CF2 syntactically or semantically, one is called a clone of the other. If there is a relation between two code fragments such that they are analogous or similar to each other, the two are called a clone pair


(CF1, CF2).

Definition 3: Clone Class. A clone class is a set of clone pairs where each pair is related by the same relation between the two code fragments. A relation between two code fragments is an equivalence relation, which is reflexive, symmetric and transitive, and holds between two code fragments if and only if they are the same sequence.

2.2 Types of Clones
There are two groups of clones. The first group refers to two code fragments which are similar based on their text [16, 25]. There are three types within the first group, as shown in Table 1.

Type-1 (Exact clones): Two code fragments are exact copies of each other except for whitespaces, blanks and comments.

Type-2 (Renamed/Parameterized): Two code fragments are similar except for the names of variables, types, literals and functions.

Type-3 (Near miss clones/Gapped clones): Two copied code fragments are similar, but with modifications such as added or removed statements, and the use of different identifiers, literals, types, whitespaces, layouts and comments.

The second group refers to two code fragments which are similar based on their functionality [23]. Such clones are also called Type-4 clones, as shown in Table 1.

Type-4 (Semantic clones): Two code fragments are semantically similar, without being syntactically similar.

3. PREVIOUS SURVEYS
Several reviews related to cloned code have been published, as shown in Table 2. Most reviews are related to cloned code detection techniques and tools. Burd and Bailey [53] evaluate and compare three state-of-the-art clone detection tools (CCFinder [4], CloneDr [14], and Covet [55]) and two plagiarism detection tools (JPlag [33] and Moss [56]) on large software applications. This work focuses on the benefits of clone identification for preventative maintenance. The paper shows that there is no single winner. Their work is not comprehensive in its coverage of clone detection techniques.

Koschke [72] focuses on clone detection tools and techniques up to 2007. Koschke discusses subareas within the study of clones, summarizes important results and presents open research questions. Koschke compares six clone detectors in terms of recall, precision, clone type, number of candidates, and running time.

Roy and Cordy [74] also survey the state-of-the-art in clone detection techniques up to 2007. They provide a taxonomy of techniques, review detection approaches and evaluate them. In addition, this paper discusses how clone detection can assist in other areas of software engineering and how other areas can help clone detection research.

Bellon et al. [64] compare six clone detectors in terms of recall, precision, and space and time requirements. They perform experiments on six detectors using eight large C and Java programs. Their approach to evaluation has become the standard for every new clone detector introduced since 2007.

Roy et al. [27] classify, compare and evaluate clone detection techniques up to 2009. They classify the techniques based on a number of facets. In addition, they evaluate the classified techniques based on a taxonomy of scenarios.

Rysselberghe and Demeyer [54] compare three representative detection techniques: simple line matching, parameterized matching and metric fingerprints. They provide a comparison in terms of portability, scalability and the number of false positives. However, they evaluate only small systems, under 10 KLOC (thousands of lines of code) in length.

Rattan et al. [1] perform a systematic review of existing code clone approaches. This review summarizes existing code clone detection techniques, tools and management of clones up to 2012.

Fontana et al. [73] discuss refactoring of detected code clones and find that certain code quality metrics are improved after the refactoring. They use three clone detection tools and analyze five versions of two open-source systems. The differences in the evaluation results using the three tools are outlined, and the impact of clone refactoring on different quality metrics is analyzed.

Many methods have been published recently in the literature for detecting code clones. The goal of this survey is to discuss, compare and analyze the state-of-the-art tools, including tools that have not been discussed in previous survey papers. The common clone detectors covered by previous surveys are selected, and clone detectors that are not covered by previous work are added. Several tools have been excluded from this survey because they are not widely used, or because they are similar to papers already selected based on the texts of their titles and abstracts or on the use of the same or similar computational techniques. Relevant papers were found by reading other survey papers and by starting from highly cited papers obtained by searching Google Scholar with keywords such as "software clones", "code clones" and "duplicated code", as shown in Table 4; recent papers that may have lower numbers of citations were then added. We compare and classify techniques and tools considering the types of clones they can detect, the granularity of code fragments they analyze, the transformations they perform on code fragments before comparison, the manner in which they store code fragments for comparison, the method used for comparison, the complexity of the comparison process, the outputs they produce, the results they are able to obtain (in terms of precision and recall), and general advantages and disadvantages. We compare this survey with prior surveys in Table 2.

This study particularly differs from previous surveys in that it identifies the essential strengths and weaknesses of each clone detection technique and tool, in contrast to previous studies, which focus on empirically evaluating tools. The goal is not only to compare the current status of the tools and techniques, but also to observe that a promising future direction is the development of new hybrid techniques. Clone detection techniques are evaluated based on recall, precision and F-measure, as well as scalability, portability and clone relation, so that the right technique can be chosen for a specific task.


Table 1. Types of code clones. The table presents an original code fragment and four clones, one of each type.

Original Code Fragment:
    if (a==b){
        c=a*b; //C1
    }
    else
        c=a/b; //C2

Code Fragment 1 (Type-1):
    if (a==b){
        //comment1
        c=a*b;
    }
    else
        //comment2
        c=a/b;

Code Fragment 2 (Type-2):
    if (g==f){
        //comment1
        h=g*f;
    }
    else
        //comment2
        h=g/f;

Code Fragment 3 (Type-3):
    if (a==b){
        //comment1
        c=a*b;
        // New Stat.
        b=a-c;
    }
    else
        //comment2
        c=a/b;

Code Fragment 4 (Type-4):
    switch(true){
        //comment1
        case a==b:
            c=a*b;
        //comment2
        case a!=b:
            c=a/b;
    }

Table 2. Summary of Previous Surveys.

Reference | Year | Surveyed up to | No. of Detectors | How Detectors Are Compared | Refactoring Discussed? | Plagiarism Covered? | Open Questions?
[53] | 2002 | 2002 | 5 | Supported languages, precision, and recall. | No | No | No
[54] | 2004 | 2002 | 5 | Portability, scalability, precision, and recall. | Yes | No | No
[72] | 2007 | 2007 | 6 | Precision, recall, running time, and clone types. | Yes | No | Yes
[74] | 2007 | 2007 | 23 | Comparison technique, complexity, comparison granularity, and clone refactoring. | Yes | Yes | Yes
[64] | 2007 | 2007 | 6 | Precision, recall, RAM, speed, and clone types. | No | No | No
[27] | 2009 | 2009 | >30 | Clone information, technical aspects, adjustment, evaluation, and classification and attributes. | Yes | Yes | Yes
[1] | 2013 | 2012 | >40 | Clone matching technique, advantages, disadvantages, application area, model clone granularity, and tools. | Yes | Yes | Yes
[73] | 2013 | 2012 | 3 | Number of methods, cyclomatic complexity, coupling factor, distance from the main sequence, weighted method complexity, lack of cohesion, and response for a class. | Yes | No | No
Our survey | 2015 | 2015 | 25 | Comparison technique, complexity, comparison granularity, language independence, advantages, and disadvantages [74]. | Yes | Yes | Yes

4. CLONE DETECTION PHASES
A clone detector is a tool that reads one or more source files and finds similarities among fragments of code or text in the files. Since a clone detector does not know in advance where the repeated code fragments occur, it must compare all fragments to find them. Many previously proposed techniques perform the necessary computation while attempting to reduce the number of comparisons.

We first discuss the phases of clone detection in general. A clone detection technique may focus on one or more of the phases. The first four phases are shown in Figure 1.

4.1 Code Preprocessing
This phase removes uninteresting pieces of code, converts source code into units, and determines comparison units. The three major purposes of this phase are given below.

(1) Remove uninteresting pieces of code. All elements in the source code that have no bearing on the comparison process are removed or filtered out in this phase.

(2) Identify units of source code. The rest of the source code is divided into separate fragments, which are used to check for the existence of direct clone relations to each other. Fragments may be files, classes, functions, begin-end blocks or statements.

(3) Identify comparison units. Source units can be divided into smaller units depending upon the comparison algorithm. For example, source units can be divided into tokens.
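As an illustration of step (3), splitting a source line into token comparison units can be sketched with a small scanner. This is a minimal, hypothetical sketch, not code from any surveyed tool: identifiers and numbers form one token each, and every other non-whitespace character becomes a single-character token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Split one source line into token comparison units. Identifiers and
   numbers are grouped into one token each; any other non-whitespace
   character becomes a single-character token. A real detector would
   use a full lexer for the target language. */
int tokenize(const char *line, char tokens[][32], int max_tokens) {
    int n = 0;
    for (int i = 0; line[i] != '\0' && n < max_tokens; ) {
        if (isspace((unsigned char)line[i])) { i++; continue; }
        int w = 0;
        if (isalnum((unsigned char)line[i]) || line[i] == '_') {
            while ((isalnum((unsigned char)line[i]) || line[i] == '_') && w < 31)
                tokens[n][w++] = line[i++];
        } else {
            tokens[n][w++] = line[i++];  /* punctuation: one char per token */
        }
        tokens[n][w] = '\0';
        n++;
    }
    return n;  /* number of tokens produced */
}
```

For instance, the line "c = a * b;" yields the six comparison units c, =, a, *, b and ;.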

4.2 Transformation
This phase is used by all approaches except text-based techniques for clone detection. This phase transforms the source code into a corresponding intermediate representation for comparison.

There are various types of representations depending on the technique. The usual steps in transformation are given below.

(1) Extract Tokens. Tokenization is performed during lexical analysis by compiler front ends in programming languages [5, 8, 9, 10]. Each line of source code is converted into a sequence of tokens.

(2) Extract Abstract Syntax Tree. All of the source code is parsed and converted into an abstract syntax tree or parse tree for subtree comparisons [15, 44].

(3) Extract PDG. A Program Dependency Graph (PDG) represents control and data dependencies. The nodes of a PDG represent the statements and conditions in a program. Control dependencies represent the flow of control information within the program. Data dependencies represent data flow information in a program. A PDG is generated from the source code by semantics-aware techniques for sub-graph comparison [31].

Example 1. Given the source code below, the corresponding PDG is given in Figure 2.

    int main() {
        int a = 0;
        int i = 0;
        while (i < 10) {
            a = a + i;
            i = i + 1;
        }
        printf("%d ", a);
        printf("%d ", i);
    }

Fig. 1. Four Phases in the CCFinder clone detection tool [4].

Fig. 2. Example of a Program Dependency Graph (PDG).

Table 3. Simple Example of Normalization.

Original Code:
    int a, b, c;
    a = b + c;

Normalized Code:
    int id, id, id;
    id = id + id;

(4) Other Transformations. Some techniques apply transformation rules to the source code elements before proceeding with clone detection. These include the CCFinder tool by Kamiya et al. [4], which has transformation rules for C++, e.g., to remove template parameters. The rule is Name '<' ParameterList '>' → Name. For example, foo<int> is transformed into foo [4].

(5) Normalization. This step, which removes differences, is optional. Some tools perform normalization during transformation. This involves removing comments, whitespaces and differences in spacing, as well as normalizing identifiers as shown in Table 3.
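The identifier normalization of Table 3 can be sketched as follows. This is an illustrative fragment, not code from any of the surveyed tools, and the keyword list is a deliberately tiny assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Replace every identifier that is not a keyword with "id", as in
   Table 3. The keyword list here is a tiny illustrative subset. */
void normalize(const char *src, char *out) {
    const char *keywords[] = {"int", "if", "else", "while", "return"};
    int o = 0;
    for (int i = 0; src[i] != '\0'; ) {
        if (isalpha((unsigned char)src[i]) || src[i] == '_') {
            char word[64];
            int w = 0;
            while ((isalnum((unsigned char)src[i]) || src[i] == '_') && w < 63)
                word[w++] = src[i++];
            word[w] = '\0';
            int is_kw = 0;
            for (int k = 0; k < 5; k++)
                if (strcmp(word, keywords[k]) == 0) is_kw = 1;
            if (is_kw) { memcpy(out + o, word, (size_t)w); o += w; }
            else       { memcpy(out + o, "id", 2); o += 2; }
        } else {
            out[o++] = src[i++];  /* copy punctuation and spaces */
        }
    }
    out[o] = '\0';
}
```

Applied to the original code of Table 3, this produces exactly the normalized form shown there.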

4.3 Match Detection
The result of transformation or normalization is the input to this phase. Every transformed fragment of code is compared to all other fragments using a comparison algorithm to find similar source code fragments. The output is a set of similar code fragments, either as a clone pair list or as a set of clone pairs combined into one class or group, as shown in Figure 1. For example, each clone pair may be represented as a quadruplet (LBegin, LEnd, RBegin, REnd), where LBegin and LEnd are the beginning and ending positions of the left fragment of a clone pair, and RBegin and REnd are the beginning and ending positions of the right fragment [4].
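The quadruplet representation described above can be captured by a small record type. The field names are illustrative, not CCFinder's actual output format.

```c
#include <assert.h>

/* A clone pair as a quadruplet (LBegin, LEnd, RBegin, REnd): the
   position spans of the two fragments that form the pair. */
typedef struct {
    int l_begin, l_end;  /* span of the left fragment  */
    int r_begin, r_end;  /* span of the right fragment */
} ClonePair;

/* A well-formed pair has two properly ordered, non-empty spans. */
int clone_pair_valid(ClonePair p) {
    return p.l_begin <= p.l_end && p.r_begin <= p.r_end;
}
```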


4.4 Formatting
This step converts the clone pair list obtained by the comparison algorithm in the previous step into a new clone pair list expressed in terms of positions in the original source code.

4.5 Filtering
Not all clone detectors perform this step. In this phase, code clones are extracted and a human expert filters out the false-positive clones. This step is called Manual Analysis [27]. The false positives can also be filtered out by automated heuristics based on length, diversity or frequency.

4.6 Aggregation
This phase is optional. It can be done in the Match Detection phase. To reduce the amount of data, clone pairs can be aggregated into clusters, groups, sets or classes. For example, clone pairs (C1, C2), (C1, C3), and (C2, C3) can be combined into the clone group (C1, C2, C3).
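The aggregation of clone pairs into classes can be sketched with a standard union-find (disjoint-set) structure; this is a generic illustration, not the mechanism of any specific tool. Merging the pairs (C1, C2), (C1, C3) and (C2, C3) leaves C1, C2 and C3 in a single class.

```c
#include <assert.h>

#define MAX_FRAGMENTS 16

static int parent[MAX_FRAGMENTS];

/* Start with every code fragment in its own clone class. */
void classes_init(int n) {
    for (int i = 0; i < n; i++)
        parent[i] = i;
}

/* Find the representative of a fragment's class (path compression). */
int classes_find(int x) {
    if (parent[x] != x)
        parent[x] = classes_find(parent[x]);
    return parent[x];
}

/* Record a detected clone pair (a, b): merge their classes. */
void classes_union(int a, int b) {
    parent[classes_find(a)] = classes_find(b);
}
```

After recording the three pairs of the example, fragments 1, 2 and 3 share one representative and any unrelated fragment keeps its own.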

5. CLONE DETECTION TECHNIQUES
Many approaches for detection of code clones have been published in the past years. In this section, the characteristics that can be used to describe clone detection techniques are discussed, followed by a detailed description of a large number of such techniques under appropriate classes.

5.1 Characteristics of Detection Techniques
Several characteristics can be used to describe a clone detection technique. These characteristics depend upon how the technique works. Some characteristics that are used throughout the rest of the paper are given below.

(1) Source Transformations or Normalizations. In most approaches, some transformation or normalization is applied before the comparison phase, while other approaches just remove comments and whitespaces. Some approaches perform detailed transformations or normalizations so that the comparison methods can be applied effectively.

(2) Source Code Representation. A comparison algorithm represents code using its own format or representation scheme. Most comparison algorithms use the code that results from the transformation or normalization phase.

(3) Comparison Granularity. Different comparison algorithms look at code at different levels of granularity, such as lines, tokens, nodes of abstract syntax trees (ASTs) or nodes of a program dependency graph (PDG).

(4) Comparison Algorithms. There are many algorithms that are used to detect clones, such as suffix-tree algorithms [5, 10], sequence matching algorithms [4], dynamic pattern matching algorithms [15] and hash-value comparison algorithms [13].

(5) Computational Complexity. A technique must scale to detect clones in millions of lines of source code in a large system. In other words, it must be computationally efficient. The complexity of a tool depends on the kind of transformations or normalizations performed in addition to the comparison algorithm.

(6) Clone Types. Some techniques detect Type-1 clones while others find Type-1, Type-2 or Type-3 clones, or may even detect all types of clones.

(7) Language Independence. A language-independent tool can work on systems written in any language without extra effort. Thus, one should be aware of any language-dependent issues of the chosen method.

(8) Output of Clones. Some techniques report clones as clone pairs while others return clones as clone classes or both. Clone classes are better than clone pairs for software maintenance. Clone classes that are reported without filtering are better than those returned after filtering, because the use of clone classes reduces the number of comparisons and the amount of clone data that needs to be reported.

5.2 Categories of Detection Techniques
Detection techniques are categorized into four classes: textual, lexical, syntactic and semantic. Syntactic approaches can be divided into tree-based and metric-based techniques, and semantic approaches can be divided into PDG-based and hybrid techniques, as shown in Table 4. In this section, state-of-the-art clone detection techniques are described and compared under these classes and subclasses.

5.2.1 Textual Approaches. Text-based techniques compare two code fragments and declare them to be clones if the two code fragments are literally identical in terms of textual content. Text-based clone detection techniques generate fewer false positives, are easy to implement and are independent of language. Text-based clone detection techniques perform almost no transformation on the lines of source code before comparison. These techniques detect clones based on similarity in code strings and can find only Type-1 clones. In this section, several well-known textual approaches or text-based techniques, as shown in Table 5, are discussed. These include Dup [8] by Baker, the Duploc tool [15] of Ducasse et al. [26], Koschke et al. [14], NICAD by Roy and Cordy [11] and SDD by Seunghak and Jeong [12].

5.2.1.1 Dup by Baker. Dup [8] reads source code line by line in the lexical analysis phase. Dup uses normalization, which removes comments and whitespaces and also handles identifier renaming. It hashes each line for comparison and extracts matches with a suffix-tree algorithm. The purpose of this tool is to find maximal sections of code that are either exact copies or near-miss clones of each other. The Dup tool can also be classified as a token-based technique since it tokenizes each line for line-by-line matching.

To explain Dup's technique, it is necessary to introduce the term parameterized string (p-string), which is a string over the union of two alphabets, say Σ and Π, where Σ contains ordinary (non-parameter) characters and Π contains parameter characters. It is also necessary to introduce the notion of a parameterized match (p-match), which occurs when a p-string can be transformed into another p-string by applying a renaming function from the symbols of the first p-string to the symbols of the second p-string. Dup implements a p-match algorithm. The lexical analyzer produces a string containing non-parameter symbols and zero or more parameter symbols. When sections of code match except for the renaming of parameters, such as variables and constants, a p-match occurs. An exact match can be detected using a plain suffix tree.

The Dup algorithm encodes a p-string in the following way. The first appearance of each parameter symbol is replaced by


Table 4. Techniques for Clone Detection.

Approach | Technique | Tool/Author | Year | Reference | Citations
Textual | Text | Dup | 1995 | [8] | 631
Textual | Text | Duploc | 1999 | [15] | 533
Textual | Text | NICAD | 2008 | [11] | 183
Textual | Text | SDD | 2005 | [12] | 26
Lexical | Token | CCFinder | 2002 | [4] | 1126
Lexical | Token | CP-Miner | 2006 | [7] | 395
Lexical | Token | Boreas | 2012 | [9] | 9
Lexical | Token | FRISC | 2012 | [65] | 10
Lexical | Token | CDSW | 2013 | [66] | 9
Syntactic | Tree | CloneDr | 1998 | [13] | 1003
Syntactic | Tree | Wahler et al. | 2004 | [34] | 118
Syntactic | Tree | Koschke et al. | 2006 | [14] | 189
Syntactic | Tree | Jiang et al. | 2007 | [28] | 385
Syntactic | Tree | Hotta et al. | 2014 | [67] | 2
Syntactic | Metric | Mayrand et al. | 1996 | [20] | 469
Syntactic | Metric | Kontogiannis et al. | 1996 | [37] | 254
Syntactic | Metric | Kodhai et al. | 2010 | [19] | 8
Syntactic | Metric | Abdul-El-Hafiz et al. | 2012 | [48] | 8
Syntactic | Metric | Kanika et al. | 2013 | [29] | 3
Semantic | Graph | Duplix | 2001 | [31] | 474
Semantic | Graph | GPLAG | 2006 | [32] | 239
Semantic | Graph | Higo and Kusumoto | 2009 | [49] | 21
Semantic | Hybrid | ConQAT | 2011 | [39] | 78
Semantic | Hybrid | Agrawal et al. | 2013 | [18] | -
Semantic | Hybrid | Funaro et al. | 2010 | [17] | 12

zero and each subsequent appearance of a parameter symbol is substituted by the distance from the previous appearance of the same symbol. Non-parameter symbols are not changed, as shown in Example 2 below. The Dup algorithm uses Definition 4 and Proposition 1, given below from [31, 32], to represent parameter strings in a p-suffix tree. In Definition 4, the f function, called transform, computes the j-th symbol value of p-suffix(S, i) in constant time from j and (j+i-1).

A parameterized suffix tree (p-suffix tree) is a data structure that generalizes suffix trees from strings to p-strings. The p-suffix encoding guarantees that a p-string P and a p-string P' are a p-match of each other if and only if prev(P) = prev(P'), where prev(P) is the encoding of P described above. For example, if a p-string T has the same encoding as the p-string P, then T and P are a p-match. Therefore, prev is used to test for p-matches. If P is a p-string pattern and T is a p-string text, P has a p-match starting at position i of T if and only if prev(P) is a prefix of p-suffix(T, i).

Definition 4. If b belongs to Σ ∪ Π, f(b, j) = 0 if b is a nonnegative integer larger than j - 1, and otherwise f(b, j) = b [60].

Proposition 1. Two p-strings P and T p-match when prev(P) = prev(T). Also, P < T when prev(P) < prev(T), and P > T when prev(P) > prev(T) [60].

Example 2. Let P = yxbxxby$, where x and y are parameter symbols and b and $ are non-parameter symbols. The p-suffixes are 00b21b6$, 0b21b0$, b01b0$, 01b0$, 0b0$, b0$, 0$, and $, as shown in Figure 3. The Dup algorithm finds parameterized duplication by constructing a p-suffix tree and recursing over it.

Fig. 3. A p-suffix tree for the p-string P = yxbxxby$, where Σ = {b, $} and Π = {x, y}.

In the example in Figure 3, 0 is detected as duplicated in 0b0$, 01b0$, 0b21b0$, and 00b21b6$. b0 is also detected as duplicated in b01b0$.
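The prev() encoding and the p-suffixes of Example 2 can be reproduced with a short sketch (an illustrative implementation, not Dup's actual code; single-character symbols are assumed):

```python
# Sketch of the prev() encoding for parameterized strings (p-strings):
# the first occurrence of a parameter symbol becomes 0, each later
# occurrence becomes the distance to its previous occurrence, and
# non-parameter symbols are left unchanged.
def prev_encode(s, params=frozenset('xy')):
    last = {}                      # last position seen per parameter symbol
    out = []
    for i, c in enumerate(s):
        if c in params:
            out.append(str(i - last[c]) if c in last else '0')
            last[c] = i
        else:
            out.append(c)
    return ''.join(out)

def p_suffixes(s, params=frozenset('xy')):
    # p-suffix(S, i) is the prev() encoding of the suffix starting at i
    return [prev_encode(s[i:], params) for i in range(len(s))]

print(prev_encode('yxbxxby$'))   # -> 00b21b6$
print(p_suffixes('yxbxxby$'))    # the eight p-suffixes listed in Example 2
```

Because renamed parameters encode to the same distances, two consistently renamed fragments produce identical encodings, which is exactly what the p-suffix tree exploits.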

5.2.1.2 Duploc by Ducasse et al. Duploc [15] is also a text-based clone detection technique. Duploc uses an algorithm that has two steps. The first step transforms source files into normalized files after eliminating noise, including all whitespace and comments,


and converts all characters to lower case. Noise elimination reduces false positives by removing common constructs. It also reduces false negatives by removing insignificant differences between code clones. The second step compares normalized files line by line using a simple string-matching algorithm. The hits and misses that the comparison produces are stored in a matrix and are visualized as a dotplot [17, 29, 30]. The computational complexity is O(n^2) for an input of n lines. Preprocessing transformed lines reduces the search space: each line is hashed into one of a number of buckets, and every occurrence of the same hashed line value is placed in the same bucket. Duploc is able to detect a significant amount of identical code duplicates, but it is not able to handle renaming, deletion and insertion. Duploc does not perform lexical analysis or parsing. The advantage of the character-based technique that Duploc uses is its high adaptability to diverse programming languages.
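The two steps above can be sketched as follows (a toy illustration, not Duploc's implementation; the "//" comment style and the bucket scheme are assumptions):

```python
# Minimal sketch of Duploc's approach: normalize lines, then mark matches
# in a comparison matrix (the "dotplot"). Hashing each normalized line
# into buckets avoids comparing lines that can never match.
from collections import defaultdict

def normalize(lines):
    out = []
    for ln in lines:
        ln = ln.split('//')[0]            # strip comments (assumed style)
        ln = ''.join(ln.split()).lower()  # drop all whitespace, lower-case
        out.append(ln)
    return out

def dotplot(lines_a, lines_b):
    a, b = normalize(lines_a), normalize(lines_b)
    buckets = defaultdict(list)           # hash bucket -> positions in b
    for j, ln in enumerate(b):
        buckets[hash(ln)].append(j)
    matrix = [[0] * len(b) for _ in a]
    for i, ln in enumerate(a):
        for j in buckets[hash(ln)]:       # only candidates in the same bucket
            if ln == b[j]:
                matrix[i][j] = 1
    return matrix

m = dotplot(["int x = 0; // init", "x++;"], ["int x=0;", "y++;"])
print(m)  # first lines match despite spacing/comment differences
```

Diagonal runs of 1s in the matrix correspond to duplicated line sequences.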

Ducasse et al. [26] add one more step to Duploc, a third step of filters. This step extracts interesting patterns in duplicated code such as gap size, which is the length of a non-repeated subsequence between a clone pair. For example, if the line sequences 'abcghdjgi' and 'abcfklngi' are compared, the gap is of length 4 because the lengths of the two non-duplicated subsequences ghdj and fkln are 4. False positives are averted by removing noise and by filtering. The filter step uses two criteria [26]: 1) Minimum length, the smallest length of a sequence to be considered important; and 2) Maximum gap size, the largest gap size for sequences to be considered obtainable by copy-pasting from one another. The algorithm implements filtering in time linear in the number of single matches. Ducasse's tool uses lexical analysis to remove comments and whitespace in code and finds clones using a dynamic pattern matching (DPM) algorithm. The tool's output is the number of lines of code in clone pairs. It partitions lines using a hash function for strings for faster performance. The computational complexity is O(n^2), where n is the input size.

Koschke et al. [14] show that Duploc detects only copy-pasted fragments that are exactly identical. It cannot detect Type-3 clones or deal with modifications and insertions in copy-pasted code, called tolerance to modifications. Duploc cannot detect copy-pasted bugs [14], because detecting bugs requires semantic information and Duploc detects only syntactic clones.

5.2.1.3 NICAD by Roy and Cordy. Roy and Cordy [11] develop a text-based code clone detection technique called Accurate Detection of Near-miss Intentional Clones (NICAD). The NICAD tool [13, 18] uses two clone detection techniques, text-based and abstract syntax tree-based, to detect Type-1, Type-2 and Type-3 cloned code. The structures of the two approaches complement each other, overcoming the limitations of each technique alone. NICAD has three phases. 1) A parser extracts functions and performs pretty-printing that breaks different fragments of a statement into lines. 2) The second phase normalizes fragments of a statement to ignore editing differences using transformation rules. 3) The third phase checks potential clones for renaming, filtering or abstraction using dynamic clustering for simple text comparison of potential clones. The longest common subsequence (LCS) algorithm is used to compare two potential clones at a time. Therefore, each potential clone must be compared with all of the others, which makes the comparison expensive.

NICAD detects near-misses by using flexible pretty-printing. Using agile parsing [50] and Turing eXtender Language (TXL) transformation rules [51] during parsing and pretty-printing, it can easily normalize code. By adding normalization to pretty-printing,

it can detect near-miss clones whose normalized forms are 100% similar. After the potential clones are extracted, the LCS algorithm compares them. The NICAD tool uses the percentage of unique strings (PUS) for evaluation. Equation (1) computes the percentage of unique strings for each potential clone.

If PUS = 0%, the potential clones are exact clones; otherwise, if PUS is more than 0% and below a certain threshold, the potential clones are near-miss clones.

PUS = (Number of Unique Strings / Total Number of Strings) × 100    (1)

NICAD finds exact matches only when the PUS threshold is 0%. If the PUS threshold (PUST) is greater than 0%, clone 1 is matched to clone 2 if and only if the size, in terms of number of lines, of the second potential clone is between size(clone 1) - size(clone 2) × PUST / 100 and size(clone 1) + size(clone 2) × PUST / 100.
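A minimal sketch of the PUS comparison of Equation (1), assuming potential clones are pretty-printed line sequences and using Python's difflib as a stand-in for NICAD's LCS step:

```python
import difflib

# Hypothetical sketch of NICAD's PUS check: lines outside the longest
# common subsequence of the two fragments count as unique strings.
def pus(lines1, lines2):
    sm = difflib.SequenceMatcher(a=lines1, b=lines2, autojunk=False)
    common = sum(size for _, _, size in sm.get_matching_blocks())
    total = len(lines1) + len(lines2)
    unique = total - 2 * common
    return 100.0 * unique / total

c1 = ["x = 0;", "for (i = 0; i < n; i++)", "x += a[i];"]
c2 = ["x = 0;", "for (i = 0; i < n; i++)", "x += b[i];"]
print(pus(c1, c1))  # 0.0 -> exact clone
print(pus(c1, c2))  # ~33.3 -> near-miss clone if below the PUS threshold
```

With normalization applied before the comparison, the third lines would also match and the pair would report PUS = 0%.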

NICAD can detect exact and near-miss clones at the block level of granularity. NICAD has high precision and recall [16]. It can detect even some exact function clones that are not detected by the exact matching function used by a tree-based technique [13, 18]. NICAD exploits the benefits of a tree-based technique while using simple text lines instead of subtree comparison to obtain good space and time complexity.

5.2.1.4 SDD by Lee and Jeong. Lee and Jeong [12] use a text-based code clone detection technique implemented in a tool called Similar Data Detection (SDD), which can be used as an Eclipse plug-in. Eclipse is an integrated development environment (IDE) [59] that allows the developer to extend its functionality via plug-ins. SDD detects repeated code in large software systems with high performance. It detects exact and similar code clones by using an inverted index [57] and an index data structure with an n-neighbor distance algorithm [58]. The mean nearest neighbor distance is:

NearestNeighborDistance = ( Σ_{i=1}^{N} Min(d_ij) ) / N    (2)

where N is the number of points and Min(d_ij) is the distance between each point i and its nearest neighbor. SDD is very powerful for detecting similar fragments of code in large systems because the use of an inverted index decreases SDD's complexity.
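Equation (2) can be written directly as code; the points are plain numbers here for illustration (SDD actually works over index entries of code chunks):

```python
# Mean nearest-neighbor distance, Equation (2): average, over all points,
# of the distance from each point to its closest other point.
def mean_nearest_neighbor_distance(points):
    n = len(points)
    total = 0.0
    for i, p in enumerate(points):
        total += min(abs(p - q) for j, q in enumerate(points) if j != i)
    return total / n

print(mean_nearest_neighbor_distance([0.0, 1.0, 3.0]))  # (1 + 1 + 2) / 3
```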

5.2.2 Summary of Textual Approaches. In this section, several textual approaches for clone detection have been discussed. Dup [8] uses a suffix-tree algorithm to find all similar subsequences using hash values of lines, characters or tokens. The complexity of computation is O(n), where n is the input length of the sequence. Duploc [15] uses a dynamic pattern matching algorithm to find a longest common subsequence between two sequences. NICAD [11] uses the longest common subsequence algorithm to compare the lines of two potential clones and produces the longest sequence. The LCS algorithm compares only two sequences at a time; therefore, the number of comparisons is high, because each sequence must be compared with all of the other sequences. SDD [12] uses the n-neighbor distance algorithm to find near-miss clones, which may lead to detection of false positives.


Text-based techniques have the following limitations [10, 8, 17]. 1) Identifier renaming cannot be handled in a line-by-line method. 2) Code fragments with line breaks are not detected as clones. 3) Adding or removing brackets can create a problem when comparing two fragments of code where one fragment has brackets and the other does not. 4) The source code cannot be transformed in text-based approaches; some normalization can be used to improve recall without sacrificing high precision [4].

5.2.3 Lexical Approaches. Lexical approaches are also called token-based clone detection techniques. In such techniques, all source code lines are divided into sequences of tokens during the lexical analysis phase of a compiler. All tokens of the source code are converted back into token sequence lines, and then the token sequence lines are matched. In this section, several state-of-the-art token-based techniques, as shown in Table 6, are discussed. These include CCFinder [4] by Kamiya et al., CP-Miner [7] by Li et al., Boreas [9] by Yang and Yao, FRISC [65] by Murakami et al., and CDSW [66] by Murakami et al. These techniques are chosen as examples because they are among the best such techniques and can detect various types of clones with higher recall and precision than a text-based technique.

5.2.3.1 CCFinder by Kamiya et al. Kamiya et al. [4] develop a suffix-tree matching algorithm in a tool called CCFinder. CCFinder has four phases. 1) A lexical analyzer removes all whitespace between tokens from the token sequence. 2) Next, the token sequence is sent to the transformation phase, which applies transformation rules. It also performs parameter replacement, where each identifier is replaced with a special token. 3) The match detection phase detects equivalent pairs as clones and also identifies classes of clones using suffix-tree matching. 4) The formatting phase converts the locations of clone pairs to line numbers in the original source files.
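The effect of parameter replacement (phase 2) can be illustrated with a small sketch; the keyword set and token classes here are assumptions, not CCFinder's actual transformation rules:

```python
import re

# Hedged sketch of CCFinder-style parameter replacement: every identifier
# becomes one special token, so Type-2 clones that only rename variables
# normalize to identical token sequences.
KEYWORDS = {'if', 'else', 'while', 'for', 'return', 'int'}  # assumed subset

def normalize_tokens(code):
    tokens = re.findall(r'[A-Za-z_]\w*|\d+|\S', code)
    out = []
    for t in tokens:
        if (t[0].isalpha() or t[0] == '_') and t not in KEYWORDS:
            out.append('$id')      # parameter replacement
        elif t[0].isdigit():
            out.append('$num')
        else:
            out.append(t)
    return out

a = normalize_tokens('if (count > 0) return count;')
b = normalize_tokens('if (total > 0) return total;')
print(a == b)  # True: the renamed copies normalize identically
```

Suffix-tree matching over such normalized sequences then finds the Type-1 and Type-2 clone pairs.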

CCFinder applies several metrics to detect interesting clones. These metrics are given below. 1) Length of a code fragment: measured by the number of tokens or the number of lines of the code fragment. 2) Population size of a clone class: the number of elements in a clone class. 3) Combination of the number of tokens and the number of elements in a clone class: used for estimating which code portion could be refactored. 4) Coverage of code clones: the percentage of lines or files that include any clones. CCFinder also optimizes the sizes of programs to reduce the complexity of the token matching algorithm. It produces high recall, whereas its precision is lower than that of some other techniques [10]. CCFinder accepts source files written in one programming language at a time.

The line-by-line method used by the Duploc tool [15], discussed earlier, cannot recognize or detect clones when line breaks are relocated, i.e., when the layout of the code is changed. CCFinder performs a more suitable transformation than the line-by-line method [4]. CCFinder can also handle name changes, which the line-by-line approach cannot. However, CCFinder, like token-based techniques in general, takes more CPU time and more memory than line-by-line comparison [8, 35, 17, 5]. CCFinder uses a suffix-tree algorithm, and so it cannot handle statement insertions and deletions in code clones [7].

5.2.3.2 CP-Miner by Li et al. Li et al. [8, 35] use a token-based technique to detect code clones and clones related to bugs in large software systems. Their system, CP-Miner, searches for copy-pasted code blocks using frequent subsequence mining [24]. CP-Miner implements two functions.

(i) Detecting copy-pasted code fragments. CP-Miner converts the problem into a frequent subsequence mining problem by parsing source code to build a database containing a collection of sequences. It then implements an enhanced version of the CloSpan algorithm [24], which is used to help satisfy gap constraints in frequent subsequences. Each similar statement is mapped to the same token. Similarly, each function name is mapped onto the same token. All tokens of a statement are hashed using the hashpjw hash function [25]. After parsing the source code, CP-Miner produces a database with each sequence representing a fragment of code. A frequent subsequence algorithm and CloSpan are applied to this database to find frequent subsequences. CP-Miner composes larger copy-pasted segments, both real copy-pastes and false positives, into groups. It then checks neighboring code segments of each duplicate to see if they form a copy-pasted group as well. If so, the two groups are merged. The larger group is checked again against the false positives.

(ii) Finding copy-paste related bugs. Frequently, programmers forget to rename identifiers after copy-pasting. Such unchanged identifiers are often not reported as errors by a compiler; they become unobserved bugs that can be very hard to detect by a detector. Therefore, CP-Miner uses an UnchangedRatio threshold to detect bugs.

UnRenamedIDRate = NumUnchangedID / TotalUnchangedID    (3)

where UnRenamedIDRate is the percentage of unchanged identifiers, NumUnchangedID is the number of unchanged identifiers and TotalUnchangedID is the total number of identifiers in a given copy-pasted fragment. The value of UnRenamedIDRate can be any value in the range 0 to 1. If UnRenamedIDRate is 0, all occurrences of identifiers have been changed; if UnRenamedIDRate is 1, all occurrences of the identifier remain unchanged.
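Equation (3) in code form (a toy sketch; identifier occurrences are given as positional lists):

```python
# UnRenamedIDRate, Equation (3): the fraction of identifier occurrences
# left unchanged between the original and the copy-pasted fragment.
def unrenamed_id_rate(original_ids, pasted_ids):
    unchanged = sum(1 for a, b in zip(original_ids, pasted_ids) if a == b)
    return unchanged / len(original_ids)

# 'src' was renamed to 'dst' in only two of its three occurrences:
print(unrenamed_id_rate(['src', 'len', 'src', 'src'],
                        ['dst', 'len', 'dst', 'src']))  # 0.5
```

A value strictly between 0 and 1 suggests an inconsistent renaming, which is exactly the forgot-to-change pattern CP-Miner flags.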

CP-Miner can only detect forgot-to-change bugs. This means that if the programmer has modified or inserted extraneous statements into the new copy-pasted segment, CP-Miner will not detect the bug, because the changed code fragments are now too different from the others [8, 35]. This approach can detect similar sequences of tokenized statements and avert redundant comparisons, and as a result, it detects code clones efficiently, even in millions of lines of code.

CP-Miner overcomes some limitations of CCFinder and detects more copy-pasted segments than CCFinder does. CCFinder does not detect code clones that are related to bugs as CP-Miner does, because CP-Miner uses an unchanged ratio threshold. CCFinder does not completely filter false positives, and it detects many tiny cloned code blocks which seem to be predominantly false positives. Because CP-Miner handles statement insertions and modifications, it can detect 17-52% more code clones than CCFinder. Unfortunately, the frequent subsequence mining algorithm that CP-Miner uses has two limitations, because it divides a long sequence into subsequences. First, some frequent subsequences of two or more statement blocks may be lost. Second, it is hard to choose the size of short sequences: if the size is too short, information may be lost; if the size is too long, the mining algorithm may be very slow [8, 35].


Table 5. Summary of Textual Approaches.

| Author/Tool | Transformation | Code Representation | Comparison Method | Complexity | Granularity | Types of Clone | Language Independence | Output Type |
|-------------|----------------|---------------------|-------------------|------------|-------------|----------------|-----------------------|-------------|
| Dup [8] | Remove whitespace and comments | Parameterized string matches | Suffix tree based on token matching | O(n + m), where n is the number of input lines and m is the number of matches | Tokens of lines | Type-1, Type-2 | Needs lexer | Text |
| Duploc [15] | Remove comments and all whitespace | Sequence of lines | Dynamic pattern matching | O(n^2), where n is the number of input lines | Line | Type-1, Type-2 | Needs lexer | Text |
| NICAD [11] | Pretty-printing | Methods (segment sequences) | Longest common subsequence (LCS) | O(n^2) worst-case time and space | Text | Type-1, Type-2, Type-3 | Needs parser | Text |
| SDD [12] | No transformation | Inverted index and index | n-neighbor distance | O(n) | Chunks of source code | Type-1, Type-2, Type-3 | No lexer/parser needed | Visualization of similar code |

Fig. 4. CP-Miner Process Steps.

5.2.3.3 Boreas by Yang and Yao. Yang and Yao [9] use a token-based approach called Boreas to detect clones. Boreas uses a novel counting method to obtain characteristic matrices that identify program segments effectively. Boreas matches variables instead of matching sequences or structures. It uses three notions: 1) the count environment, 2) the count vector, and 3) the count matrix. These are discussed below.

1) The Count Environment (CE) is computed in three stages. The first stage is naive counting, which counts the number of times variables are used and defined in the environments. The second stage is in-statement counting, which counts the number of regular variables as well as the variables used as if-predicates and array subscripts, and variables that are defined by expressions with constants. The third stage is inter-statement counting, which counts variables used inside a first-level, second-level or third-level loop. 2) The Count Vector (CV) is produced using m CEs (an m-dimensional count vector). The i-th dimension of the CV is the count of the variable in the i-th CE. The CV is also called a characteristic vector. 3) The Count Matrix (CM) contains the CVs of all n variables in a code fragment and is an n × m matrix. Boreas uses the cosine of the angle between two vectors to compare similarity:

Sim(v1, v2) = cos(α) = ( Σ_{i=1}^{n} v1_i × v2_i ) / ( sqrt(Σ_{i=1}^{n} v1_i²) × sqrt(Σ_{i=1}^{n} v2_i²) )    (4)

where Sim is the cosine similarity between two vectors v1 and v2, and α is the angle between them. The similarity between two fragments is also measured by an improved proportional similarity function, which compares the CVs of keywords and punctuation marks. PropSimilarity is the proportional similarity between C1 and C2, which are two occurrence counts:

PropSimilarity = 1 / (C1 + 1) + C2 / (C1 + 1)    (5)

The function prevents an incorrect zero similarity. Boreas is not able to detect code clones of Type-3. Agrawal et al. [18] extend Boreas to detect clones by using a token-based approach to match clones with one variable or a keyword, easily detecting Type-1 and Type-2 clones; they use a textual approach to detect Type-3 clones. Since Agrawal et al.'s approach combines two approaches, it is a hybrid approach.
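Equations (4) and (5), as reconstructed above, are straightforward to compute; this sketch uses toy count vectors:

```python
import math

# Cosine similarity between two count vectors, Equation (4).
def cosine_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

# Proportional similarity for two occurrence counts, Equation (5);
# the +1 terms prevent an incorrect zero similarity for absent tokens.
def prop_similarity(c1, c2):
    return 1 / (c1 + 1) + c2 / (c1 + 1)

print(cosine_sim([1, 2, 0], [2, 4, 0]))  # close to 1.0: same direction
print(prop_similarity(0, 0))             # 1.0 rather than 0
```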

5.2.3.4 FRISC by Murakami et al. Murakami et al. [65] develop a token-based technique called FRISC, which transforms every repeated instruction into a special form and uses a suffix-array algorithm to detect clones. FRISC has five steps. 1) Performing lexical analysis and normalization, which transforms source files into token sequences and replaces every identifier with a special token. 2) Generating statement hashes, which generates a hash value for every statement delimited by ";", "{", and "}" from every token included in the statement. 3) Folding repeated instructions, which identifies every repeated subsequence and divides it into the first repeated element and its subsequent repeated elements. Then, the repeated subsequences are removed and their numbers of tokens are added to the first element of the repeated subsequence. 4) Detecting identical hash subsequences, which detects identical subsequences from the folded hash sequences. If the sum of the numbers of tokens is smaller than the minimum token length, they are not considered clones. 5) Mapping identical subsequences to the original source code, which converts clone pairs to locations in the original source code.

FRISC supports Java and C. The authors performed experiments with eight target software systems and found that precision with folded repeated instructions is higher than precision without folding by 29.8%, while recall decreases by 2.9%.


FRISC has higher recall and precision than CCFinder, but its precision is lower than that of CloneDr [13] and CLAN [21].

5.2.3.5 CDSW by Murakami et al. Murakami et al. [66] develop another token-based technique, called CDSW, which detects Type-3 clones (gapped clones) using the Smith-Waterman algorithm [68]. It avoids a limitation of AST-based and PDG-based techniques, which require much time to transform source code into ASTs or PDGs and to compare them. CDSW has five steps. 1) Performing lexical analysis and normalization, which is the same as the first step of FRISC. 2) Calculating hash values for every statement, also as in FRISC. 3) Identifying similar hash sequences using the Smith-Waterman algorithm. 4) Identifying gapped tokens using longest common subsequences (LCS) to identify every sequence gap. 5) Mapping identical subsequences to the source code, which converts clone pairs to the original source code; this is the same as the fifth step of FRISC.
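Step 3 relies on local alignment; a compact Smith-Waterman sketch over per-statement hash values is shown below (the match, mismatch and gap scores are assumed values, and CDSW's results depend on such parameters):

```python
# Smith-Waterman local alignment over per-statement hash sequences:
# the best-scoring local alignment marks a similar subsequence that may
# contain gaps (inserted or deleted statements).
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    h = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    best = 0
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            score = match if seq1[i - 1] == seq2[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + score,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best   # score of the best locally aligned (similar) region

# Two fragments that differ by one inserted statement (hash 99):
print(smith_waterman([11, 22, 33, 44], [11, 22, 99, 33, 44]))  # 7
```

The score 7 reflects four matching statements minus one gap penalty, i.e., a gapped (Type-3) clone candidate.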

Since Bellon’s references, which are built by manually con-firming a set of candidates to be clones or clone pairs that arejudged as correct [64], do not contain gapped fragments, Murakamiet al. enhance the clone references by adding information aboutgapped lines. Murakami et al. calculate recall, precision, andf-measure using Bellon’s [64] and their own clone referencesresulting in improved recall, precision, and f-measure. recallincreased by 4.1%, precision increased by 3.7%, and f-measureincreased by 3.8% in the best case. recall increased by 0.49%,precision increased by 0.42% and f-measure increased by 0.43%in the worst case [66]. The results are different because CDSWreplaces all variables and identifiers with special tokens that ignoretheir types. Because CDSW does not normalize all variables andidentifiers, it cannot detect clones that have different variablenames [66].

5.2.4 Summary of Lexical Approaches. The suffix-tree based token matching algorithm used by CCFinder finds all similar subsequences in a transformed token sequence. CCFinder cannot detect statement insertions and deletions in copy-pasted code, and it does not completely eliminate false positives. The frequent subsequence mining technique used by CP-Miner discovers frequent subsequences in a sequence database; it avoids unnecessary comparisons, which makes CP-Miner efficient. CP-Miner detects 17%-52% more code clones than CCFinder. A limitation of frequent subsequence mining is that a sequence database is needed. Boreas works fast by using two functions: cosine similarity and proportional similarity. FRISC detects more false positives than the other tools and misses some clone references [65]. CDSW's accuracy depends on the match, mismatch and gap parameters; if these parameters are changed, the results differ.

Token-based techniques have the following limitations. 1) Token-based techniques depend on the order of program lines; if the statement order is modified in duplicated code, the duplicated code will not be detected. 2) These techniques cannot detect code clones with swapped lines or with added or removed tokens, because detection is focused on tokens. 3) Token-based techniques are more expensive in time and space complexity than text-based techniques, because a source line contains several tokens.

5.2.5 Syntactical Approaches. Syntactical approaches are categorized into two kinds of techniques: tree-based techniques and metric-based techniques. A list of syntactical techniques found in the literature is shown in Table 7. In this section, several common tree-based and metric-based techniques are discussed. For the purpose of this study, we choose CloneDr [13] by Baxter et al., Wahler et al. [34], Koschke et al. [14], Jiang et al. [28], Mayrand et al. [20], Kontogiannis et al. [37], Kodhai et al. [19], Abdul-El-Hafiz et al. [48] and Kanika et al. [29].

5.2.5.1 Tree-based Clone Detection Techniques. In these techniques, the source code is parsed into an abstract syntax tree (AST) using a parser, and the sub-trees are compared to find cloned code using tree-matching algorithms.

CloneDr by Baxter et al. Baxter et al. [13] use a tree-based code clone detection technique implemented in a tool called CloneDr. It can detect exact clones, near-miss clones and refactored code using an AST. After the source code is parsed into an AST, CloneDr finds clones by applying three main algorithms. The first algorithm detects sub-tree clones, the second detects variable-size sequences of sub-tree clones such as sequences of declarations or statements, and the third finds more complex near-miss clones by generalizing other clone combinations. The method partitions sub-trees using a hash function and then compares sub-trees in the same bucket. The first algorithm finds sub-tree clones by comparing each sub-tree with the other sub-trees. Near-miss clones that cannot be detected by comparing sub-trees directly can be found using a similarity computation:

Similarity = 2 × SN / (2 × SN + LEFT + RIGHT)    (6)

where SN is the number of shared nodes, LEFT is the number of nodes in sub-tree 1 and RIGHT is the number of nodes in sub-tree 2. The second algorithm finds clone sequences in ASTs; it compares each pair of sub-trees and looks for maximum-length sequences. The third algorithm finds complex near-miss clones; it abstracts each pair of clones after all clones are detected.
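Equation (6) in code, with node counts supplied directly (the tree traversal that produces them is omitted):

```python
# CloneDr-style sub-tree similarity, Equation (6).
def similarity(shared, left_only, right_only):
    # shared: nodes common to both sub-trees;
    # left_only / right_only: nodes unique to sub-tree 1 / sub-tree 2.
    return 2 * shared / (2 * shared + left_only + right_only)

# Two sub-trees sharing 8 nodes, each with 2 extra nodes of its own:
print(similarity(8, 2, 2))  # 0.8
```

A similarity of 1.0 indicates identical sub-trees; values above a chosen threshold mark near-miss clone candidates.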

CloneDr cannot detect semantic clones. Text-based techniques do not deal with modifications such as renaming of identifiers, since there is no lexical information. Tree-based techniques may produce false positives, since two matching sub-tree fragments may not actually be duplicates, and because a tree-based method hashes subtrees, it cannot detect duplicated code that has modifications.

Wahler et al. Wahler et al. [34] detect clones, represented as an abstract syntax tree (AST) in XML, by applying frequent itemset mining. Frequent itemset mining is a data mining technique that looks for sequences of actions or events that occur frequently. An instance is called a transaction, and each transaction has a number of features called items. This tool uses frequent itemsets to identify features in large amounts of data using the Apriori algorithm [52]. For each itemset, the support count is computed, which is the frequency of occurrence of the itemset, i.e., the number of transactions in which it appears:

σ(I) = |{T ∈ D | I ⊆ T}| / |D| ≥ σ    (7)

where T is a transaction, I is an itemset, which is a subset of the transaction T, and D is a database. If an itemset's relative frequency is at least a given support threshold σ, it is called a frequent itemset.
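Equation (7) as a sketch over a toy transaction database:

```python
# Relative support of an itemset I in a database D of transactions,
# Equation (7), and the frequent-itemset test against a threshold sigma.
def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c'}]
sigma = 2 / len(db)   # "occurs at least twice"
print(support({'a', 'b'}, db))           # 0.5
print(support({'a', 'b'}, db) >= sigma)  # True -> frequent itemset
```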

There are two steps to find frequent itemsets. The first step is the join step, which finds Lk, the set of frequent itemsets


Table 6. Summary of Lexical Approaches.

| Author/Tool | Transformation | Code Representation | Comparison Method | Complexity | Granularity | Types of Clone | Language Independence | Output Type |
|-------------|----------------|---------------------|-------------------|------------|-------------|----------------|-----------------------|-------------|
| CCFinder [4] | Remove whitespace and comments, and perform parameter replacement | Normalized sequences and parameterized tokens | Suffix tree based on token matching | O(n), where n is the size of the source file | Token | Type-1, Type-2 | Needs lexer and transformation rules | Clone pairs / clone classes |
| CP-Miner [7] | Map each statement/identifier to a number, with similar statements/identifiers mapped to the same number | Basic blocks | Frequent subsequence mining | O(n^2), where n is the number of code lines | Sequence of tokens | Type-1, Type-2 | Needs parser | Clone pairs |
| Boreas [9] | Filter useless characters and extract tokens | Variable matching based on counting characteristics | Cosine similarity function | N/A | Vector | Type-1, Type-2, Type-3 | Needs parser | Clustering |
| FRISC [65] | Remove whitespace and comments, map transformed sequence back to original, and replace parameters | Hash sequence of tokens | Suffix array | N/A | Token sequences | Type-1, Type-2, Type-3 | Needs lexer | Clone pairs / clone classes |
| CDSW [66] | Remove whitespace and comments, map transformed sequence back to original, and replace parameters | Hash values for every statement | Smith-Waterman alignment | O(nm), where n and m are the lengths of the two token sequences | Token sequences | Type-1, Type-2, Type-3 | Needs lexer | Clone pairs |

of size k. A set of candidate k-itemsets, Ck, is generated by combining Lk−1 with itself. The second step is the prune step, which finds the frequent k-itemsets from Ck. This process is repeated until no more frequent k-itemsets are found. In this approach, the statements of a program become the items in the database D. Clones are sequences of source code statements that occur more than once; therefore, the support count is σ = 2/|D|. Let there be statements b1...bk in a program. The join step combines two frequent (k-1)-itemsets of the form I1 = b1...bk−1 and I2 = b2...bk.

Koschke et al. Koschke et al. [14] also detect code clones using an abstract syntax tree (AST). Their method finds syntactic clones by pre-order traversal, applies suffix-tree detection to find full subtree copies, and decomposes the resulting Type-1 and Type-2 token sequences. This approach does not allow structural parameters. It can find Type-1 and Type-2 clones in linear time and space. AST-based detection can find syntactic clones, but with more effort than token-based suffix trees and with low precision. AST-based detection also scales worse than token-based detection. Token-based suffix-tree clone detectors can be adapted to a new language in a short time, whereas using an AST requires a full abstract syntax tree and a sub-tree comparison method. Using abstract syntax suffix trees [14] detects clones in less time.

Deckard by Jiang et al. Jiang et al. [28] also use the tree-based technique and compute certain characteristic vectors to capture structural information about ASTs in Euclidean space. Locality Sensitive Hashing (LSH) [61] is a technique for clustering similar items using the Euclidean distance metric. The Jiang et al. tool is called Deckard. Deckard's phases are the following. 1) A parser uses a formal syntactic grammar and transforms source files into parse trees. 2) The parse trees are used to produce a set of vectors that capture structural information about the trees. 3) The vectors are clustered using the LSH algorithm, which helps find a query vector's near-neighbors. Finally, post-processing is used to generate clone reports. Deckard detects re-ordered statements and non-contiguous clones. Deckard is language independent but has lower speed than the Boreas tool [28], discussed in Subsection 5.2.3, which needs less set-up time and less comparison time [11, 8].

|D| . Let there be statements b1...bk ina program. The join step combines two frequent (k-1)-itemsets ofthe form I1 = b1...bk, I2 = b2...bk−1.Koschke et al Koschke et al. [14] also detect code clones using anabstract syntax tree (AST). Their method finds syntactic clones bypre-order traversal, applies suffix tree detection to find full subtreecopies, and decomposes the resulting Type-1 and Type-2 token se-quences. This approach does not allow structural parameters. It canfind Type-1 and Type-2 clones in linear time and space. AST-baseddetection can be used to find syntactic clones with more effort thanToken-based suffix trees and with low precision. AST-based de-tection also scales worse than Token-based detection. Token-basedsuffix tree clone detectors can be adapted to a new language in ashort time whereas using AST needs a full abstract syntax tree andsub-tree comparison method. Using abstract syntax suffix trees [14]detects clones in less time.Deckard by Jiang et al Jiang et al. [28] also use the tree-based tech-nique and compute certain characteristic vectors to capture struc-tural information about ASTs in Euclidean space. Locality Sen-sitive Hashing (LSH) [61] is a technique for clustering similaritems using the Euclidean distance metric. The Jiang et al. toolis called Deckard. Deckard’s phases include the following. 1) Aparser uses a formal syntactic grammar and transforms source filesinto parse trees. 2) The parse trees are used to produce a set ofvectors that capture structure information about the trees. 3) Thevectors are clustered using the Locality Sensitive Hashing algo-rithm (LSH) that helps find a query vector’s near-neighbors. Finally,post-processing is used to generate clone reports. Deckard detectsre-ordered statements and non-contiguous clones.Deckard is language independent with lower speed than the Boreastool [28], discussed in Subsection 5.2.3, because of less set-up timeand less comparison time [11, 8]. 
Deckard also requires constructing ASTs, which requires more time.

Hotta et al. Hotta et al. [67] compare and evaluate methods for detection of coarse-grained and fine-grained unit-level clones. They use a coarse-grained detector that detects block-level clones from given source files. Their approach has four steps. 1) Lexical and syntactic analysis to detect all blocks, such as classes, methods and block statements, in the given source files. 2) Normalization of every block detected in the previous step; this step enables detection of Type-1 and Type-2 clones. 3) Hashing every block using the hashCode() function of java.lang.String. 4) Grouping blocks based on their hash values: if two normalized blocks have the same hash value, they are considered equal to each other, as in Figure 5. The detection approach has high accuracy, but Hotta et al.'s coarse-grained method does not have high recall compared to fine-grained detectors, does not tackle gapped code clones, and detects fewer clones. Their approach is much faster than a fine-grained approach, since the authors use hash values of the texts of blocks. However, using a coarse-grained approach alone is not enough, because it does not provide detailed information about the clones; a fine-grained approach must be used as a second stage after a coarse-grained one.
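The hash-and-group idea can be sketched as follows (the normalization, which rewrites identifiers to '$id', is an assumed simplification of step 2):

```python
import re
from collections import defaultdict

# Sketch of coarse-grained detection: normalize each block's text, hash
# it, and group blocks with equal hashes as clone candidates.
def normalize_block(text):
    text = re.sub(r'[A-Za-z_]\w*', '$id', text)   # identifiers -> $id
    return ''.join(text.split())                  # drop all whitespace

def group_clones(blocks):
    groups = defaultdict(list)   # hash of normalized text -> block indices
    for idx, b in enumerate(blocks):
        groups[hash(normalize_block(b))].append(idx)
    return [g for g in groups.values() if len(g) > 1]

blocks = ['{ int a = 0; a++; }', '{ int b = 0; b++; }', '{ return 0; }']
print(group_clones(blocks))  # [[0, 1]]: the two renamed blocks group together
```

Because each block is reduced to a single hash value, the cost grows linearly with the number of blocks, which is why this coarse-grained pass is so fast.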

5.2.5.2 Metric-based clone detection techniques. In metric-based clone detection techniques, a number of metrics are computed for each fragment of code to find similar fragments by comparing metric vectors instead of comparing code or ASTs directly. Seven software metrics have been used by different authors [22, 23, 24]:

(1) Number of declaration statements,
(2) Number of loop statements,
(3) Number of executable statements,
(4) Number of conditional statements,
(5) Number of return statements,
(6) Number of function calls, and
(7) Number of parameters.

All of these metrics are computed and their values are stored in a database [19]. Pairs of similar methods are then detected by comparing the metric values stored in the same database.
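The idea can be sketched as follows; this is a hand-made illustration (hypothetical helper names, not from any cited tool) that approximates the seven metrics for Python functions and flags two functions as potential clones when their metric vectors match exactly.

```python
import ast

def metric_vector(func):
    """Compute a simplified version of the seven metrics for one function."""
    decls = loops = execs = conds = rets = calls = 0
    for node in ast.walk(func):
        if isinstance(node, ast.Assign):
            decls += 1                          # declaration-like statements
        elif isinstance(node, (ast.For, ast.While)):
            loops += 1
        elif isinstance(node, ast.Expr):
            execs += 1                          # bare executable statements
        elif isinstance(node, ast.If):
            conds += 1
        elif isinstance(node, ast.Return):
            rets += 1
        elif isinstance(node, ast.Call):
            calls += 1
    params = len(func.args.args)
    return (decls, loops, execs, conds, rets, calls, params)

def metric_clones(source):
    """Pair up functions whose metric vectors are identical."""
    funcs = [n for n in ast.walk(ast.parse(source))
             if isinstance(n, ast.FunctionDef)]
    vecs = {f.name: metric_vector(f) for f in funcs}
    names = list(vecs)
    return [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:] if vecs[a] == vecs[b]]

src = """
def sum_list(xs):
    total = 0
    for x in xs:
        total = total + x
    return total

def product(ys):
    p = 1
    for y in ys:
        p = p * y
    return p
"""
```

The two functions compute different things but have identical metric vectors, which also illustrates the known weakness of metric-based detection: equal vectors do not guarantee similar code.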


Fig. 5. Example of Coarse-grained Clone Detection. [67]

Mayrand et al. [20] compute metrics from the names, layouts, expressions and control flows of functions. If two functions' metrics are similar, the two functions are considered to be clones. Their work identifies similar functions but not similar fragments of code. In reality, similar fragments of code occur more frequently than similar functions.

First, source code is parsed into an abstract syntax tree (AST). Next, the AST is translated into an Intermediate Representation Language (IRL) to detect each function. This tool reports as a clone pair two function blocks with similar metric values. Patenaude et al. [35] extend Mayrand's tool to find Java clones using a similar metric-based algorithm.

Kontogiannis et al. [36] propose a way to measure similarity between two pieces of source code using an abstract pattern matching tool. Markov models are used to calculate dissimilarity between an abstract description and a code fragment. Later, they propose two additional methods for detecting clones [37]. The first method performs numerical comparison of the metric values that characterize a code fragment as begin-end blocks. The second approach uses dynamic programming to compute and report begin-end blocks using minimum edit distance. This approach only reports similarity measures, and the user must go through block pairs and decide whether or not they are actual clones.

Kodhai et al. [19] combine a metric-based approach with a text-based approach to detect functional clones in C source code. The process of clone detection has five phases. 1) The input and pre-processing step parses files to remove pre-processor statements, comments, and whitespace. The source code is rebuilt into a standard form for easy detection of similarity of the cloned fragments. 2) Template conversion is used in the textual comparison of potential clones. It renames data types, variables, and function names. 3) Method identification identifies each method and extracts them. 4) Metric computation. 5) Type-1 and Type-2 clone detection. The text-based approach finds clones with high accuracy and reliability, while the metric-based approach reduces the high complexity of the text-based approach by using computed metric values. The limitation of this method is that it detects only Type-1 and Type-2 clones, with high time complexity.

Abdul-El-Hafiz et al. [48] use a metric-based data mining approach built on a fractal clustering algorithm. This technique uses four steps. 1) Pre-processing the input source file. 2) Extracting all fragments for analysis and related metrics. 3) Partitioning the set of fragments into a small number of clusters of three types using fractal clustering: primary clusters cover Type-1 and Type-2 clones, intermediate clusters cover Type-3 clones, and a singleton cluster is not a clone. 4) Post-processing the extracted clone classes from the primary clusters. This technique uses eight metrics to detect each type of function.

MCD Finder by Kanika et al. [29] uses a metric-based approach for Java. Their tool performs metric calculation on Java byte code instead of directly on the source code. The approach consists of three phases. 1) The Java source code is compiled to make it adaptable to the requirements of the tool. 2) The computation phase computes metrics that help detect potential clones. This approach uses 9 metrics [29] for each function:

(1) Number of calls from a function,
(2) Number of statements in a function,
(3) Number of parameters passed to a function,
(4) Number of conditional statements in a function,
(5) Number of non-local variables inside a function,
(6) Total number of variables inside a function,
(7) Number of public variables inside a function,
(8) Number of private variables inside a function, and
(9) Number of protected variables inside a function.

The calculated metrics are stored in a database and mapped onto Excel sheets. 3) The measurement phase performs a comparison of the similarity of the metric values.

5.2.6 Summary of Syntactical Approaches. In Table 7, a summary of syntactical techniques considering several properties is provided. The first kind of syntactical approach is the tree-based technique. One such system, CloneDr, finds sub-tree clones with the following limitations. 1) It has difficulty performing near-miss clone detection, but comparing trees for similarity solves this to some extent. 2) Scaling up becomes hard when the software system is large and the number of comparisons becomes very large; splitting the comparison sub-trees with hash values solves this problem. The parser also parses a full tree. Wahler et al.'s approach detects Type-1 and Type-2 clones only, and the clones are detected with very low recall. Deckard detects significantly more clones and is much more scalable than Wahler et al.'s technique because Deckard uses characteristic vectors and efficient vector clustering techniques. Koschke et al. show that suffix-tree clone detection scales very well, since a suffix tree finds clones in large systems and reduces the number of subtree comparisons.

The second kind of syntactical approach is the metric-based technique. Mayrand et al.'s approach does not detect duplicated code at different granularities. Kontogiannis et al.'s approach works only at block level; it cannot detect clone fragments that are smaller than a block, and it does not effectively deal with renamed variables or work with non-contiguous clone code. Kodhai et al.'s approach only detects Type-1 and Type-2 clones. The limitation of Abdul-El-Hafiz et al.'s technique is that Type-4 clones cannot be detected. The MCD Finder is efficient in detecting semantic clones because byte code is platform independent, whereas CCFinder cannot detect semantic clones. However, even the MCD Finder tool cannot detect all clones.

Syntactical techniques have the following limitations. 1) Tree-based techniques do not handle identifiers and literal values when detecting clones in ASTs. 2) Tree-based techniques ignore identifier information; therefore, they cannot detect reordered statement clones. 3) Metric-based techniques need a parser or PDG generator to obtain metric values. 4) Two code fragments with the same metric values may not be similar code fragments, so similarity cannot be judged on metrics alone.

5.2.7 Semantic Approaches. A semantic approach, which detects two fragments of code that perform the same computation but have differently structured code, uses static program analysis to obtain more accurate similarity information than syntactic similarity. Semantic approaches are categorized into two kinds of techniques: graph-based techniques and hybrid techniques. Several semantic techniques from the literature are shown in Table 8. In this section, we discuss Komondoor and Horwitz [22], Duplix [31] by Krinke, GPLAG [32] by Liu et al., Higo and Kusumoto [49], Hummel et al. [39], Funaro et al. [17], and Agrawal et al. [18].

5.2.7.1 Graph-based Clone Detection Techniques. A graph-based clone detection technique uses a graph to represent the data and control flow of a program. One can build a Program Dependency Graph (PDG) as defined in Definition 1 in Section 2. Because a PDG includes both control flow and data flow information, as given in Definitions 5 and 6 respectively, one can detect semantic clones using PDGs [30]. Clones can be detected as isomorphic subgraphs [22]. In a PDG, edges represent the data and control dependencies between vertices, which represent lines of code.

Definition 5. (Control Dependency Edge). There is a control dependency edge from one program vertex to a second program vertex in a Program Dependency Graph if the truth of the first vertex's condition controls whether the second vertex will be executed [32].

Definition 6. (Data Dependency Edge). There is a data dependency edge from program vertex v1 to program vertex v2 if there is some variable var such that:
- v1 may assign a value to var, either directly or indirectly through pointers;
- v2 may use the value of var, either directly or indirectly through pointers;
- there is an execution path in the program from the code corresponding to v1 to the code corresponding to v2 along which there is no assignment to var [32].

Komondoor and Horwitz [22] use program slicing [38] to find isomorphic PDG subgraphs and code clones. As mentioned earlier, nodes in a PDG represent statements and predicates, and edges represent data and control dependences, as shown in Example 1 and Figure 3. The slicing clone detection algorithm performs three steps. 1) Find pairs of clones by partitioning all PDG nodes into equivalence classes, where any two nodes in the same class are matching nodes [22]. 2) Remove subsumed clones. A clone pair subsumes another clone pair if and only if each element of the first clone pair is a subset of the corresponding element of the other; subsumed clone pairs are removed. 3) Combine pairs of clones into larger groups using transitive closure.

Duplix by Krinke [31] finds maximal similar PDG subgraphs with high precision and recall. Krinke's approach is similar to that of Komondoor and Horwitz [22], although [22] starts from every pair of matching nodes and uses sub-graphs that are not maximal and are just subtrees, unlike the ones in [31]. The PDG used by Krinke is similar to both an AST and the traditional PDG. Thus, the PDG contains vertices and edges that represent components of expressions. It also contains immediate (control) dependency edges. The value dependency edges represent the data flow between expression components. Another edge, the reference dependency edge, represents the assignment of values to variables.

GPLAG by Liu et al. [32] is an approach to detect software plagiarism by mining PDGs. Previous plagiarism detection tools were only partially sufficient for academic use in finding plagiarized programs in programming classes. These tools were based on program token methods, such as JPlag [33], and are unable to detect disguised plagiarized code well. Plagiarism disguises may include the following [32]. 1) Format alteration, such as inserting and removing blanks or comments. 2) Variable renaming, where variable names may be changed without affecting program correctness. 3) Statement reordering, where some statements may be reordered without affecting the results. 4) Control replacement, such as a for loop substituted by a while loop and vice versa; for example, a for (int i=0; i<10; i++) {a=b-c;} block can be replaced by while (i<10) {a=b-c; i++;}. 5) Code insertion, where additional


Table 7. Summary of Syntactical Approaches.

| Author/Tool | Transformation | Code Representation | Comparison Method | Complexity | Granularity | Types of Clone | Language Independence | Output Type |
|---|---|---|---|---|---|---|---|---|
| CloneDr [13] | Parse to AST | AST | Tree matching technique | O(n), n = number of AST nodes | AST node | Type-1, Type-2 | Needs parser | Clone pairs |
| Wahler [34] | Parse to AST | AST | Frequent itemset | N/A | Line | Type-1, Type-2 | Needs parser | AST nodes |
| Koschke [14] | Parse to AST | AST | Simple string suffix tree algorithm | O(n), n = number of input nodes | Tokens | Type-1, Type-2, Type-3 | Needs parser | Text |
| Jiang et al. [28] | Parse to parse tree, then to a set of vectors | Vectors | Locality-sensitive hashing (LSH) tree-matching algorithm | O(s1 s2 d1 d2), where si is the size of tree Ti and di is the minimum of the depth of Ti and its number of leaves | Vectors | Type-1, Type-2, Type-3 | Needs parser | Text |
| Hotta et al. [67] | Parse source code to extract blocks using JDT | Hashed blocks | Group blocks based on hash values | N/A | Blocks | Type-1, Type-2 | Needs parser | Clone pairs, clone classes |
| Mayrand et al. [20] | Parse to AST, then to IRL | Metrics | 21 function metrics | Polynomial | Metrics for each function | Type-1, Type-2, Type-3 | Needs Datrix tool | Clone pairs, clone classes |
| Kontogiannis et al. [37] | Transform to feature vectors | Feature vectors | Numerical comparison of metric values and dynamic programming (DP) using minimum edit distance | O(n^2) for naive approach; O(nm) for DP model | Metrics of a begin-end block | Type-1, Type-2, Type-3 | Needs parser and an additional tool | Clone pairs |
| Kodhai et al. [19] | Remove whitespace and comments; map and pre-process statements | Metrics | String matching / textual comparison | N/A | Functional | Type-1, Type-2 | Needs parser | Clone pairs and clusters |
| Abdul-El-Hafiz et al. [48] | Preprocess and extract metrics | Metrics | Data mining clustering algorithm (fractal clustering) | O(M^2 log M), M = size of data set | Functional | Type-1, Type-2, some Type-3 | Language independent | Clone classes |
| Kanika et al. [29] | Calculate metrics of Java programs | Metrics | 3-phase comparison algorithm: adaptation, computation and measurement phases | N/A | Metrics of Java byte code | Type-1, Type-2, Type-3 | Needs compiler | Output mapped into Excel sheets |

code may be inserted to disguise plagiarism without affecting the results.

Scorpio by Higo and Kusumoto [49] is a PDG-based incremental two-way slicing approach to detect clones. Scorpio has two processes. 1) Analysis processing: the inputs are the source files given to the analysis module, and the output is a database. PDGs are generated from the source files, and all the edges of the PDGs are extracted and stored in the database. 2) Detection processing: the inputs are source files and the database, and the output is a set of clone pairs. First, a user provides file names and the edges for the given files are retrieved from the database. Finally, the clone pairs are detected by the detection module. This approach detects non-contiguous clones while other existing incremental detection approaches cannot, and it is also faster than other existing PDG-based clone detection approaches.
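To make the PDG representation in Definitions 5 and 6 concrete, here is a toy PDG for a three-statement fragment, built by hand as adjacency lists (illustrative only; real tools derive these edges automatically, e.g. via CodeSurfer):

```python
# PDG for the fragment:
#   1: x = 1
#   2: if x > 0:
#   3:     y = x + 1
pdg = {
    "nodes": {1: "x = 1", 2: "if x > 0", 3: "y = x + 1"},
    # Control dependency (Definition 5): node 3 executes only if
    # node 2's condition is true.
    "control": [(2, 3)],
    # Data dependency (Definition 6): x is assigned at node 1 and used at
    # nodes 2 and 3, with no intervening reassignment of x.
    "data": [(1, 2), (1, 3)],
}

def reaches(pdg, src, dst):
    """Check whether dst transitively depends on src via any edge type,
    the kind of reachability query subgraph-matching detectors rely on."""
    edges = pdg["control"] + pdg["data"]
    frontier, seen = [src], set()
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return dst in seen
```

Two fragments are then semantic clone candidates when their PDGs contain isomorphic subgraphs, regardless of statement order in the source text.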

5.2.7.2 Hybrid Clone Detection Techniques. A hybrid clone detection technique uses a combination of two or more techniques. A hybrid approach can overcome problems encountered by individual tools or techniques.

ConQAT: Hummel et al. [39] use a hybrid and incremental index-based technique to detect clones, implemented in a tool called ConQAT. Code clones are detected in three phases. 1) Preprocessing, which divides the source code into tokens and normalizes the tokens to remove comments or renamed variables. All normalized tokens are collected into statements. 2) Detection, which looks for identical sub-strings. 3) Post-processing, which creates code cloning information by looking up all clones for a single file using a clone index. Statements are hashed using MD5 hashing [62]. Two entries with the same hash sequence are a clone pair. The approach extracts all clones for a single file from the index and reports maximal clones.

Funaro et al. [17] propose a hybrid technique that combines a syntactic approach, using an abstract syntax tree to identify potential clones, with a textual approach to avoid false positives. The algorithm has four phases: 1) building a forest of ASTs; 2) serializing the forest and encoding it into a string representation with an inverse mapping function; 3) seeking serialized clones; 4) reconstructing clones.

Agrawal et al. [18] present a hybrid technique that combines token-based and textual approaches to find code clones, extending Boreas [9], which cannot detect Type-3 code clones. The token approach can easily detect Type-1 and Type-2 code clones. The textual approach can detect Type-3 code clones that are hard to detect with the token approach. The technique has four phases. 1) The pre-processing phase removes comments, whitespace and blank lines. Declarations of variables on a single line are combined to make it easy for the tool to find the number of variables declared in the program. 2) The transformation phase breaks the source code into tokens and detects Type-1 and Type-2 code clones. 3) The match detection phase finds code clones using a matrix representation and then replaces each token with an index value. Then a


textual approach looks for the same string values line by line; in this phase, Type-3 code clones are detected. 4) The filtering phase removes false positives.
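The index-based hashing idea behind ConQAT can be sketched as follows. This is a simplified illustration under stated assumptions: statements are pre-normalized strings, every window of consecutive statements is hashed with MD5, and two entries under the same hash form a clone pair (real index-based detection additionally extends matches to maximal clones).

```python
import hashlib
from collections import defaultdict

def statement_hashes(statements, window=3):
    """MD5-hash every window of consecutive normalized statements."""
    out = []
    for i in range(len(statements) - window + 1):
        chunk = "\n".join(statements[i:i + window])
        out.append((i, hashlib.md5(chunk.encode()).hexdigest()))
    return out

def build_index(files, window=3):
    """Clone index: map each window hash to the (file, position) entries
    that produced it; a hash with two or more entries is a clone."""
    index = defaultdict(list)
    for name, stmts in files.items():
        for pos, digest in statement_hashes(stmts, window):
            index[digest].append((name, pos))
    return {h: entries for h, entries in index.items() if len(entries) > 1}

files = {
    "A.java": ["int a = 0;", "a = a + 1;", "print(a);", "return a;"],
    "B.java": ["int a = 0;", "a = a + 1;", "print(a);"],
}
clones = build_index(files)  # one shared window: A.java@0 and B.java@0
```

Because the index maps hashes to locations, clones for a single file can be retrieved by lookup, which is what makes the approach incremental.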

5.2.8 Summary of Semantic Approaches. In Table 8, a summary of semantic techniques is provided. The first kind of semantic approach includes PDG-based techniques. The approach by Komondoor and Horwitz needs a tool to generate the PDG subgraphs, but its major benefit is that it can detect gapped clones. Komondoor and Horwitz's tool and Duplix detect semantically robust code clones using PDGs for procedure extraction, a program transformation that can be used to make programs easier to understand and maintain. Duplix cannot be applied to large systems and is very slow. Tools that do not use PDGs can be effectively confused by statement reordering, replacement, and code insertion. Since the PDG is robust to the disguises that confuse other tools, GPLAG is more effective and efficient than these tools. GPLAG has the limitation that its computational complexity increases exponentially with the size of the software code. Scorpio by Higo and Kusumoto detects non-contiguous clones while other incremental detection approaches cannot do so. It is also faster than other PDG-based clone detection approaches.

The second kind of semantic approach is represented by hybrid techniques. Hummel et al.'s ConQAT combines token-based indexing with graph-based data-flow models, and these approaches can be combined to speed up clone retrieval. Funaro et al. detect Type-3 clones and also use a textual approach on the source code to remove uninteresting code. Agrawal et al. can detect clones only in code written in C. Semantic techniques have the following limitations. 1) PDG-based techniques are not scalable to large systems. 2) PDG-based techniques need a PDG generator. 3) Graph matching is expensive in PDG-based techniques.

6. EVALUATION OF CLONE DETECTION TECHNIQUES

In order to choose the right technique for a specific task, several evaluation metrics can be used. A good technique should show both high recall and precision. Table 9 compares evaluations of tools, and Table 10 summarizes evaluation approaches. Some evaluation metrics are discussed below.

6.1 Precision and Recall

Precision and recall are the two most common metrics used to measure the quality of a clone finding program. Precision refers to the fraction of candidate clones returned by the detection algorithm that are actual clones, whereas recall refers to the fraction of actual clones that are returned by the detection algorithm. High precision means that the candidate clones are mostly actual code clones; low precision means that many candidate clones are not real code clones. High recall means most clones in the software have been found; low recall means most clones in the software have not been found. Precision and recall are illustrated in Figure 6 and calculated as shown in Equations (8) and (9) respectively:

Precision = (CC / AC) × 100    (8)

Recall = (CC / PC) × 100    (9)

Fig. 6. Precision and Recall. [26]

where CC is the number of correctly detected clones, AC is the number of all candidate clones found, and PC is the number of actual clones that exist in the code. A perfect clone detection algorithm has recall and precision values that are both 100%.
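Equations (8) and (9) can be worked through with a small numeric example (the counts below are invented for illustration):

```python
def evaluate(correct_clones, found_clones, existing_clones):
    """Equations (8) and (9): CC = correctly detected clones,
    AC = all candidate clones reported, PC = clones actually in the code."""
    precision = correct_clones / found_clones * 100      # Eq. (8)
    recall = correct_clones / existing_clones * 100      # Eq. (9)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Suppose a detector reports 40 candidates, 30 of which are real clones,
# and the system actually contains 50 clones:
p, r, f = evaluate(30, 40, 50)   # p = 75.0, r = 60.0, f ≈ 66.67
```

The F-measure combines the two, which is how the ranges in Table 9 were derived from the published precision and recall figures.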

6.1.1 Precision. A good tool detects few false positives, which means high precision. Line-based techniques detect Type-1 clones with high precision: no false positives are returned, and the precision is 100%. In contrast, token-based approaches return many false positives because of transformation and/or normalization. Tree-based techniques detect code clones with high precision because of structural information. Metric-based techniques find duplicated code with medium precision, because two code fragments may have similar metric values without being the same. Finally, PDG-based techniques detect duplicated code with high precision because of both structural and semantic information.

6.1.2 Recall. A good technique should detect most or all of the duplicated code in a system. Line-based techniques find only exact copies, i.e., Type-1 clones; therefore, they have low recall. Token-based techniques can find most clones of Type-1, Type-2 and Type-3, so they have high recall. A tree-based technique by itself does not detect every type of clone, but with the help of other techniques more clones can be detected. Metric-based techniques have low recall, and PDG-based techniques cannot detect all clones either.

6.2 Portability

A portable tool works well for multiple languages and dialects. Line-based techniques have high portability but need a lexical analyzer. Token-based techniques need lexical transformation rules; therefore, they have medium portability. Metric-based techniques need a parser or a PDG generator to generate metric values, so they have low portability. Finally, PDG-based techniques have low portability because they need a PDG generator.

6.3 Scalability

A technique should be able to detect clones in large software systems in a reasonable time using a reasonable amount of memory. The scalability of text-based and tree-based techniques depends on their comparison algorithms. Token-based techniques are highly scalable when they use a suffix-tree algorithm. Metric-based techniques are also highly scalable because only the metric values of begin-end blocks are compared. PDG-based techniques have low scalability because subgraph matching is expensive.


Table 8. Summary of Semantic Approaches.

| Author/Tool | Transformation | Code Representation | Comparison Method | Complexity | Granularity | Types of Clone | Language Independence | Output Type |
|---|---|---|---|---|---|---|---|---|
| Komondoor and Horwitz [22] | PDGs using CodeSurfer | PDGs | Isomorphic PDG subgraph matching using backward slicing | N/A | PDG node | Type-3, Type-4 | Needs tool to convert source code to PDGs | Clone pairs and clone classes |
| Duplix [31] | To PDGs | PDGs | k-length patch algorithm | Non-polynomial | PDG subgraphs | Type-1, Type-4 | Needs tool to convert source code to PDGs | Clone classes |
| GPLAG [32] | PDGs using CodeSurfer | PDGs | Isomorphic PDG subgraph matching algorithm | NP-complete | PDG node | Type-1, Type-2, Type-3 | Needs tool to convert source code to PDGs | Plagiarized pairs of programs |
| Higo and Kusumoto [49] | To PDGs | PDGs | Code clone detection module | N/A | Edges | Type-3 | Needs tool to convert source code to PDGs | Clone pair files |
| ConQAT [39] | Splits source code into tokens, removes comments and variable names, then groups normalized tokens into statements | Tokens | Suffix-tree-based algorithm, then index-based clone detection algorithm | O(f log N), where f is the number of statements and N is the number of stored tuples | Substrings | Type-1, Type-2 | Language independent | Text clones |
| Funaro et al. [17] | Parse to AST, then serialize AST | AST | Textual comparison | N/A | Specific parts of AST | Type-1, Type-2, Type-3 | Needs parser | String clones |
| Agrawal et al. [18] | To tokens | Tokens | Line-by-line textual comparison | N/A | Indexed tokens | Type-1, Type-2, Type-3 | Needs lexer | Text clones |

Table 9. Comparing Evaluations of Tools. It is difficult to compare results of clone detection tools in a fair manner. The results given in this table from various sources are pieced together.

| Author/Tool | Recall | Precision | F-measure | Reference |
|---|---|---|---|---|
| Dup (1) | 56% - 81.5% | 3.1% - 9.3% | 5.95% - 16.04% | [8] |
| Duploc (1) | 4.5% - 81.5% | 3.5% - 12.8% | 5.91% - 22.13% | [15] |
| NICAD (1) | 13.1% - 76.3% | 1.6% - 52.8% | 3.13% - 50.07% | [11] |
| SDD (2) | ≈ 24% | ≈ 22% | ≈ 22.96% | [12] |
| CCFinder (1) | 44.5% - 100% | 0.8% - 6.6% | 1.58% - 12.33% | [4] |
| CP-Miner (3) | ≈ 48% | ≈ 41% | ≈ 44.22% | [7] |
| Boreas (4) | ≈ 95% | ≈ 5% | ≈ 9.5% | [9] |
| FRISC (1) | ≈ 79% - 98% | ≈ 10% - 50% | ≈ 17.75% - 66.22% | [65] |
| CDSW (1) | 9.27% - 70.63% | 4.43% - 49.97% | 8.09% - 28.70% | [66] |
| Baxter et al. (1) | 14.9% - 48.1% | 6% - 40.3% | 8.68% - 32.55% | [13] |
| Wahler et al. (1) | N/A | N/A | N/A | [34] |
| Koschke et al. (5) | ≈ 33% | N/A | N/A | [14] |
| Jiang et al. (1) | 3.4% - 85.4% | 2.2% - 6.6% | 4.29% - 8.91% | [28] |
| Hotta et al. (6) | 23.3% - 50.1% | 9.9% - 22.8% | 14% - 23.21% | [67] |
| Mayrand et al. | N/A | N/A | N/A | [20] |
| Kontogiannis et al. | N/A | N/A | N/A | [37] |
| Kodhai et al. (7) | ≈ 99% | ≈ 100% | ≈ 99.50% | [19] |
| Abdul-El-Hafiz et al. (8) | N/A (difficult to assess) | ≈ 52% - 100% | N/A | [48] |
| Kanika et al. | N/A | N/A | N/A | [29] |
| Duplix (1) | 17.3% - 45.8% | 2.9% - 10.5% | 5.35% - 17.08% | [31] |
| GPLAG | N/A | N/A | N/A | [32] |
| Higo and Kusumoto (9) | ≈ 97% - 100% | ≈ 96% - 97% (Type-3) | ≈ 96.50% - 98.48% | [49] |
| ConQAT | N/A | N/A | N/A | [39] |
| Funaro et al. | N/A | N/A | N/A | [17] |
| Agrawal et al. | N/A | N/A | N/A | [18] |

(1): Results obtained from Murakami et al. [66], who use the netbeans, ant, jtcore, swing, weltab, cook, snns, and postgresql datasets. (2): Results obtained from Shafieian and Zou [75], who use the EIRC, Spule, and Apache Ant datasets. (3): Results obtained from Dang and Wani [76], who use the EIRC dataset. (4): Results quoted from Yuan and Guo [9], who use the Java SE Development Kit 7 and Linux kernel 2.6.38.6 datasets. (5): Results obtained from Falke and Koschke [14], who use the bison 1.32, wget 1.5.3, SNNS 4.2, and postgreSQL 7.2 datasets. (6): Results obtained from Hotta et al. [67], who use the netbeans, ant, jtcore and swing datasets. (7): Results obtained from Kodhai et al. [19], who use the Weltab dataset. (8): Results obtained from Abdul-El-Hafiz et al. [48], who use the Weltab and SNNS datasets. (9): Results obtained from Higo and Kusumoto [49], who use the Ant dataset.


6.4 Clone Relation

Clones reported as clone classes are more useful than clone pairs. Finding clone classes reduces the number of comparisons, as in the case of the NICAD tool [13], and clone classes are more useful for maintenance than clone pairs.

6.5 Comparison Units

There are various levels of comparison units, such as source lines, tokens, subtrees and subgraphs. Text-based techniques compare the source code line by line, but their results may not be syntactically meaningful. Token-based techniques use tokens of the source code; however, they can be less efficient in time and space than text-based techniques because a source line may contain several tokens. Tree-based techniques use tree nodes as comparison units and search for similar trees with expensive comparisons, resulting in low recall. Metric-based techniques use metric values for each code fragment, but the metric values for cloned code may not be the same. PDG-based techniques use PDG nodes and search for isomorphic subgraphs, but graph matching is costly.

6.6 Complexity

Computational complexity is a major concern for code clone detection techniques. A good clone detection technique should be able to detect clones in large systems. The complexity of each technique depends on the comparison algorithm, the comparison unit and the type of transformation used.

6.7 Transformation or Normalization

Transformation and normalization remove uninteresting clones and comments. These steps help filter noise and detect near-miss clones. A good tool should filter noise and detect near-miss clones.

7. RELATED AREAS

Code clone detection techniques can help in areas such as clone refactoring or removal, clone avoidance, plagiarism detection, bug detection, code compacting, copyright infringement detection and clone detection in models.

7.1 CLONE REFACTORING OR REMOVAL

Code clones can be refactored after they are detected. Code clone refactoring restructures existing cloned code without changing its external behavior or functionality; it improves design, flexibility and simplicity, and it is used to improve the maintainability and comprehensibility of software [40]. However, Kim et al. [10] show that clone refactoring is not always a solution for improving software quality, for two reasons. First, clones are frequently short-lived, and refactoring is not worthwhile when a clone diverges from its original within a short time. Second, long-lived clones that have been modified together with other elements in the same class are hard to remove or refactor. In addition, source code that is easy to understand makes bugs easier to fix, which improves malleability and leads to code extensibility.

Fig. 7. Simple Example of Extract Method.

Goto et al. [70] present an approach based on coupling and cohesion metrics to rank extract-method opportunities for merging. They use AST differencing to detect syntactic differences between two similar methods, together with slice-based cohesion metrics. This method helps developers refactor similar methods into cohesive methods.

Meng et al. [71] design and implement a new tool called RASE, an automated refactoring tool that combines four types of clone refactoring approaches: extracting methods, using parameterized types, using form templates, and adding parameters. RASE performs automated refactoring by taking two or more methods and applying systematic edits to produce target code with the clones removed.

Fontana et al. [73] conclude that removing duplicated code leads to an improvement of the system in most cases. They analyse the impact of code refactoring on eight metrics: five metrics at the package level and three at the class level.

At the package level [73]:

(1) LOC: Lines of code,
(2) NOM: Number of methods,
(3) CC: Cyclomatic complexity,
(4) CF: Coupling factor, and
(5) DMS: Distance from the main sequence.

At the class level [73]:

(1) WMC: Weighted method complexity,
(2) LCOM: Lack of cohesion in methods, and
(3) RFC: Response for a class.

They calculate the metric values and observe that the quality metrics improve at the class level. Therefore, refactoring cloned code improves the quality of specific classes.
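As a rough illustration of the kind of metrics tracked in such studies, the sketch below computes two of them, LOC and a keyword-count approximation of cyclomatic complexity, for a fragment before and after a clone is removed. The fragments and the counting rules are invented for illustration and are not taken from [73].

```python
import re

def loc(source: str) -> int:
    """Count non-blank lines of code."""
    return sum(1 for line in source.splitlines() if line.strip())

def cyclomatic(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of branch keywords."""
    return 1 + len(re.findall(r"\b(if|elif|for|while|and|or)\b", source))

# Hypothetical fragment containing two copies of the same validation logic.
before = """
def check_user(u):
    if u is None or u == "":
        return False
    return True

def check_group(g):
    if g is None or g == "":
        return False
    return True
"""

# The same logic after the duplicate is extracted into one helper.
after = """
def check_name(n):
    if n is None or n == "":
        return False
    return True
"""

assert loc(after) < loc(before)                  # LOC drops
assert cyclomatic(after) < cyclomatic(before)    # complexity drops
```

Both metrics fall after the clone is removed, which matches the direction of improvement Fontana et al. report for class-level metrics.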

7.2 Types of code refactoring

There are a number of different types of code refactoring approaches that can be performed.

(1) Extract Class. When two classes are identical or have similar subclasses, they can be refactored by creating a new class and moving the relevant fields and methods from the old class into the new class.




Table 10. Summary of Evaluation Approaches.

Approach      Precision  Recall   Scalability                      Portability
Text-based    High       Low      Depends on comparison algorithm  High
Token-based   Low        High     High                             Medium
Tree-based    High       Low      Depends on comparison algorithm  Low
Metric-based  Medium     Medium   High                             Low
PDG-based     High       Medium   Low                              Low

Fig. 8. Simple Example of Pull Up Method.

Fig. 9. Simple Example of Add parameter in Method.

(2) Extract Method. The easiest way of refactoring a clone is to use the Extract Method refactoring. It extracts a fragment of source code and redefines it as a new method. With this type, code clones can be removed as shown in Figure 7.

(3) Rename Method. When the name of a method does not reveal the method's purpose, change the name of the method.

(4) Push Down Method. Moving a method from a superclass into a subclass.

(5) Pull Up Method. Moving a method from a subclass into a superclass. This type can remove code clones as shown in Figure 8.

(6) Add Parameter. When a method or function needs more information from its caller, the code can be refactored as shown in Figure 9.

(7) Replace Constructor With Factory. This can be used when an object is created with a type code and later needs subclasses. Constructors can only return an instance of the object itself, so it is necessary to replace the constructor with a factory method as shown in Figure 10.

Fig. 10. Example of Replace Constructor With Factory.
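Two of the refactorings above, Extract Method and Pull Up Method, can be sketched briefly. The classes, method names and numeric constants below are invented for illustration; the point is that external behavior is unchanged by either refactoring.

```python
# Before: the discount computation is cloned in both methods.
class OrderBefore:
    def net_price(self, price):
        return price - price * 0.1            # duplicated logic

    def gross_price(self, price):
        return (price - price * 0.1) * 1.2    # same duplicated logic

# After Extract Method: the cloned expression lives in one place.
class Order:
    def _discounted(self, price):             # extracted method
        return price - price * 0.1

    def net_price(self, price):
        return self._discounted(price)

    def gross_price(self, price):
        return self._discounted(price) * 1.2

# Pull Up Method: identical methods in two subclasses move to the superclass.
class Account:
    def interest(self, balance):              # pulled-up method
        return balance * 0.05

class Savings(Account):
    pass

class Checking(Account):
    pass

# Behavior is preserved by both refactorings.
assert Order().net_price(100) == OrderBefore().net_price(100)
assert Savings().interest(200) == Checking().interest(200)
```

In both cases the clone disappears while every caller observes the same results, which is the defining property of a behavior-preserving refactoring.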

Juillerat et al. [42] propose a new algorithm for removing code clones using Extract Method refactoring for Java. Kodhai et al. [40] use Rename Method, Add Parameter, and Replace Constructor With Factory after detecting cloned functions or methods in C or Java applications to remove or refactor code clones. Choi et al. [41] propose a method to combine clone metric values to extract clone sets from CCFinder output for refactoring. They use the metric graph of Gemini [63] for analyzing the output of CCFinder.

7.3 CLONE AVOIDANCE

The two approaches discussed so far deal with cloning by detecting clones and by removing them. A third approach is avoidance, which tries to disallow the creation of code clones in the software right from the beginning. Lague et al. [43] use a code clone detection tool in two ways in software development. The first, preventive control, checks whether any added code fragment is a duplicated version of an existing fragment before adding it to the system. The second, problem mining, searches the system for all code fragments similar to a fragment that is being modified.
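A minimal sketch of the preventive-control idea follows, using a simple whitespace-insensitive fingerprint of each fragment. Real tools, including the one used by Lague et al., are far more tolerant of renaming and small edits; the fingerprinting rule here is an assumption made for brevity.

```python
import hashlib

def fingerprint(fragment: str) -> str:
    """Hash a fragment after stripping blank lines and indentation."""
    normalized = "\n".join(
        line.strip() for line in fragment.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode()).hexdigest()

class CloneGuard:
    """Rejects fragments that duplicate already-registered code."""
    def __init__(self):
        self.index = set()

    def try_add(self, fragment: str) -> bool:
        fp = fingerprint(fragment)
        if fp in self.index:
            return False          # duplicate: flag it instead of adding it
        self.index.add(fp)
        return True

guard = CloneGuard()
assert guard.try_add("x = load()\nprocess(x)")
# The same code with different indentation is caught as a clone.
assert not guard.try_add("  x = load()\n  process(x)")
```

In a development process this check would run at commit time, so duplicates are flagged before they enter the system rather than hunted down later.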

7.4 Plagiarism Detection

Code clone detection approaches can be used in plagiarism detection of software code. Dup [8] is a technique that is used for finding near matches of long sections of software code. JPlag [33] is another tool that finds similarities among programs written in C, C++, Java, and Scheme. JPlag compares bytes of text and program structure. Yuan et al. [44] propose a count-based clone detection technique called CMCD. CMCD has been used to detect plagiarism in student homework.
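The count-based idea behind CMCD can be sketched as follows: characterize a fragment by how often each identifier occurs, and compare fragments by matching their count profiles, which survive wholesale renaming. The keyword list and the overlap measure below are simplified stand-ins for the counting contexts and matching used in [44].

```python
import re
from collections import Counter

def count_vector(source: str) -> Counter:
    """Count identifier occurrences, ignoring a few keywords."""
    keywords = {"if", "else", "for", "while", "return", "def", "in"}
    ids = re.findall(r"[A-Za-z_]\w*", source)
    return Counter(t for t in ids if t not in keywords)

def similarity(a: str, b: str) -> float:
    """Overlap of the two fragments' sorted count profiles (0..1)."""
    va = sorted(count_vector(a).values(), reverse=True)
    vb = sorted(count_vector(b).values(), reverse=True)
    length = max(len(va), len(vb))
    va += [0] * (length - len(va))
    vb += [0] * (length - len(vb))
    shared = sum(min(x, y) for x, y in zip(va, vb))
    total = sum(max(x, y) for x, y in zip(va, vb))
    return shared / total if total else 1.0

original = "total = 0\nfor x in xs:\n    total = total + x"
renamed = "s = 0\nfor v in vs:\n    s = s + v"     # plagiarized copy
unrelated = "print(open(path).read())"

assert similarity(original, renamed) == 1.0        # counts survive renaming
assert similarity(original, unrelated) < 1.0
```

Because only occurrence counts are compared, renaming every variable, the most common student disguise, leaves the profile and hence the similarity score unchanged.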

7.5 Bug Detection

Code clone detection techniques can also help in bug detection. CP-Miner [8, 35] has been used to detect bugs. Higo et al. [45] propose an approach to efficiently detect bugs that are caused by copy-paste programming. Their algorithm uses existing detection tools such as CCFinderX [69].
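One copy-paste bug pattern such tools flag is an inconsistent rename: a fragment is copied and most, but not all, identifiers are updated. A toy check for this pattern, with invented fragments and a deliberately naive position-by-position pairing rule (real tools align tokens far more robustly):

```python
import re

def tokens(fragment: str):
    """Split into identifiers and single-character symbols."""
    return re.findall(r"[A-Za-z_]\w*|\S", fragment)

def inconsistent_renames(original: str, copy: str):
    """Pair tokens position-by-position and report identifiers that were
    renamed in some positions but left unchanged in others."""
    mapping = {}
    suspects = set()
    for a, b in zip(tokens(original), tokens(copy)):
        if a in mapping and mapping[a] != b:
            suspects.add(a)       # 'a' maps to two different names
        mapping.setdefault(a, b)
    return suspects

orig = "if src != NULL: copy(dst, src)"
# 'src' was renamed to 'in' everywhere except the last use: likely a bug.
buggy = "if in != NULL: copy(out, src)"

assert inconsistent_renames(orig, buggy) == {"src"}
```

The forgotten rename is exactly the kind of latent defect CP-Miner reports: the copied code compiles but silently reads the wrong variable.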




7.6 Code Compacting

Code size can be reduced by using code clone detection techniques and replacing common code using code refactoring techniques [5].

7.7 Copyright Infringement

Clone detection tools can easily be adapted to detect copyright infringement [8].

7.8 Clone Detection in Models

Model-based development can also use clone detection techniques to detect duplicated parts of models [46]. Deissenboeck et al. [47] propose a graph-theoretic approach for automatic clone detection in large models.

8. OPEN PROBLEMS IN CODE CLONE DETECTION

There is no clone detection technique which is perfect in terms of precision, recall, scalability, portability and robustness. Therefore, it is necessary to come up with new clone detection approaches that overcome some of the limitations of existing techniques. There is hardly any tool that can detect Type-3 and Type-4 clones; detecting Type-4 clones in general requires solving an undecidable problem [64]. Clone detection techniques can be improved by combining several different types of methods. Systems are also often reimplemented using a different programming language, which presents new challenges for software maintenance, refactoring and clone management.

It is hard to determine which tool is best for detection because every tool has its strengths and weaknesses. Since text-based and token-based techniques have high recall and AST-based techniques have high precision, these techniques may be merged in one tool to obtain both high recall and high precision. A PDG-based technique that detects only Type-3 clones may be extended to detect Type-1 and Type-2 clones as well.
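The token-based half of such a hybrid can be sketched simply: normalize identifiers and literals to placeholder tokens so that Type-2 clones (renamed copies) compare equal; an AST-based pass would then filter the candidates for precision. The tokenizer below is a deliberately crude stand-in for a real lexer.

```python
import re

def normalize(fragment: str) -> tuple:
    """Replace identifiers with ID and numbers with LIT, keep operators."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", fragment):
        if re.fullmatch(r"\d+", tok):
            out.append("LIT")
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")
        else:
            out.append(tok)
    return tuple(out)

# A Type-2 clone pair: same structure, different names and constants.
a = "total = total + price * 2"
b = "sum = sum + cost * 10"

assert normalize(a) == normalize(b)              # reported as clones
assert normalize(a) != normalize("x = y - 1")    # different structure
```

Hashing these normalized tuples gives the cheap, high-recall candidate set; the expensive structural comparison then only runs on fragments whose hashes collide.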

Type-1 and Type-2 clones are easier to detect than Type-3 clones. Sequence alignment algorithms with gaps may potentially be used to detect Type-3 clones. To detect plagiarism in program code, text similarity measures and local alignment may be used for high scalability.
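A gapped local alignment of token sequences, in the spirit of the Smith-Waterman algorithm [68], can score Type-3 candidates in which statements were inserted or deleted. The scoring constants below are arbitrary choices for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score over two token sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + score,  # match/mismatch
                          h[i - 1][j] + gap,        # gap in b
                          h[i][j - 1] + gap)        # gap in a
            best = max(best, h[i][j])
    return best

x = ["open", "read", "close"]
y = ["open", "read", "log", "close"]   # Type-3: one statement inserted

# The gap penalty absorbs the inserted token instead of breaking the match.
assert local_align(x, y) == 5          # 3 matches (6) minus one gap (1)
assert local_align(x, x) == 6          # exact clone scores highest
```

Thresholding the alignment score relative to the exact-match score gives a tunable notion of how much gapped divergence still counts as a clone.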

9. CONCLUSION

Software clones occur due to reasons such as code reuse by copying pre-existing fragments, and repeated computations using duplicated functions with slight changes in the variables, data structures or control structures used. After introducing code clones, common types of clones, and the phases of clone detection, we have discussed code clone detection techniques at length, covering a number of techniques in detail. Recent code clone detectors and tools that have not been discussed in previous survey papers are covered. This survey stands apart from prior surveys in that it organizes the discussion in terms of a classification scheme, categorizes clone detection techniques under classes and subclasses, provides a level of detail not found in other surveys to help a newcomer get started in the field quickly, and provides an extensive comparison of the methods so that one can choose the right method for one's needs as well as develop new methods that overcome drawbacks of existing ones.

10. REFERENCES

[1] Rattan, Dhavleesh, Rajesh Bhatia, and Maninder Singh. Software clone detection: A systematic review. Information and Software Technology 55.7 (2013): 1165-1199.

[2] Yang, Jiachen, et al. Classification model for code clones based on machine learning. Empirical Software Engineering (2014): 1-31.

[3] Walenstein, Andrew, and Arun Lakhotia. The software similarity problem in malware analysis. Internat. Begegnungs- und Forschungszentrum für Informatik, 2007.

[4] Kamiya, Toshihiro, Shinji Kusumoto, and Katsuro Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. Software Engineering, IEEE Transactions on 28.7 (2002): 654-670.

[5] Chen, Wen-Ke, Bengu Li, and Rajiv Gupta. Code compaction of matching single-entry multiple-exit regions. Static Analysis. Springer Berlin Heidelberg, 2003. 401-417.

[6] Bruntink, Magiel, et al. On the use of clone detection for identifying crosscutting concern code. Software Engineering, IEEE Transactions on 31.10 (2005): 804-818.

[7] Li, Zhenmin, et al. CP-Miner: Finding copy-paste and related bugs in large-scale software code. Software Engineering, IEEE Transactions on 32.3 (2006): 176-192.

[8] Baker, Brenda S. On finding duplication and near-duplication in large software systems. Reverse Engineering, 1995., Proceedings of 2nd Working Conference on. IEEE, 1995.

[9] Yuan, Yang, and Yao Guo. Boreas: an accurate and scalable token-based approach to code clone detection. Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. ACM, 2012.

[10] Kim, Miryung, et al. An empirical study of code clone genealogies. ACM SIGSOFT Software Engineering Notes. Vol. 30. No. 5. ACM, 2005.

[11] Roy, Chanchal Kumar, and James R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on. IEEE, 2008.

[12] Lee, Seunghak, and Iryoung Jeong. SDD: high performance code clone detection system for large scale source code. Companion to the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. ACM, 2005.

[13] Baxter, Ira D., et al. Clone detection using abstract syntax trees. Software Maintenance, 1998. Proceedings., International Conference on. IEEE, 1998.

[14] Koschke, Rainer, Raimar Falke, and Pierre Frenzel. Clone detection using abstract syntax suffix trees. Reverse Engineering, 2006. WCRE'06. 13th Working Conference on. IEEE, 2006.

[15] S. Ducasse, M. Rieger and S. Demeyer, A Language Independent Approach for Detecting Duplicated Code, Proc. Int'l Conf. Software Maintenance, pp. 109-118, 1999.

[16] Roy, C. K., and Cordy, J. R., A mutation/injection-based automatic framework for evaluating code clone detection tools, in Proc. The IEEE International Conference on Software Testing, Verification, and Validation Workshops, 2009, pp. 157-166.




[17] Funaro, Marco, et al. A hybrid approach (syntactic and textual) to clone detection. Proceedings of the 4th International Workshop on Software Clones. ACM, 2010.

[18] Agrawal, Akshat, and Sumit Kumar Yadav. A hybrid-token and textual based approach to find similar code segments. 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). IEEE, 2013.

[19] E. Kodhai, S. Kanmani, A. Kamatchi, R. Radhika, and B. Vijaya saranya, Detection of type-1 and type-2 code clones using textual analysis and metrics, Proc. Int. Conf. on Recent Trends in Information, Telecommunication and Computing, 2010, pp. 241-243.

[20] J. Mayrand, C. Leblanc and E. Merlo, Experiment on the automatic detection of function clones in a software system using metrics, Proc. Int. Conf. on Software Maintenance, 1996, pp. 244-253.

[21] E. Merlo, Detection of plagiarism in university projects using metrics-based spectral similarity, Proc. Dagstuhl Seminar 06301: Duplication, Redundancy, and Similarity in Software, 2006.

[22] R. Komondoor and S. Horwitz. Using Slicing to Identify Duplication in Source Code. In SAS, pp. 40-56, 2001.

[23] Gabel, Mark, Lingxiao Jiang, and Zhendong Su. Scalable detection of semantic clones. Software Engineering, 2008. ICSE'08. ACM/IEEE 30th International Conference on. IEEE, 2008.

[24] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large datasets, 2003.

[25] A. V. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

[26] Ducasse, Stéphane, Oscar Nierstrasz, and Matthias Rieger. On the effectiveness of clone detection by string matching. Journal of Software Maintenance and Evolution: Research and Practice 18.1 (2006): 37-58.

[27] Roy, Chanchal K., James R. Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74.7 (2009): 470-495.

[28] Jiang, Lingxiao, et al. Deckard: Scalable and accurate tree-based detection of code clones. Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 2007.

[29] Raheja, Kanika, and Rajkumar Tekchandani. An emerging approach towards code clone detection: metric based approach on byte code. International Journal of Advanced Research in Computer Science and Software Engineering 3.5 (2013).

[30] Sharma, Yogita. Hybrid technique for object oriented software clone detection. Diss. Thapar University, 2011.

[31] Krinke, Jens. Identifying similar code with program dependence graphs. Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001.

[32] Liu, Chao, et al. GPLAG: detection of software plagiarism by program dependence graph analysis. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

[33] Prechelt, Lutz, Guido Malpohl, and Michael Philippsen. Finding plagiarisms among a set of programs with JPlag. J. UCS 8.11 (2002): 1016.

[34] Wahler, Vera, et al. Clone Detection in Source Code by Frequent Itemset Techniques. SCAM. Vol. 4. 2004.

[35] Patenaude, J-F., et al. Extending software quality assessment techniques to java systems. Program Comprehension, 1999. Proceedings. Seventh International Workshop on. IEEE, 1999.

[36] Kontogiannis, K., M. Galler, and R. DeMori. Detecting code similarity using patterns. Working Notes of the Third Workshop on AI and Software Engineering: Breaking the Toy Mold (AISE). 1995.

[37] Kontogiannis, Kostas A., et al. Pattern matching for clone and concept detection. Reverse engineering. Springer US, 1996. 77-108.

[38] Weiser, Mark. Program slicing. Proceedings of the 5th international conference on Software engineering. IEEE Press, 1981.

[39] Hummel, Benjamin, et al. Index-based code clone detection: incremental, distributed, scalable. Software Maintenance (ICSM), 2010 IEEE International Conference on. IEEE, 2010.

[40] Kodhai, Vijayakumar, Balabaskaran, Stalin, and Kanagaraj, et al. Method Level Detection and Removal of Code Clones in C and Java Programs using Refactoring. In IJJCET, pp. 93-95, 2010.

[41] Choi, Eunjong, et al. Extracting code clones for refactoring using combinations of clone metrics. Proceedings of the 5th International Workshop on Software Clones. ACM, 2011.

[42] Juillerat, Nicolas, and Béat Hirsbrunner. An algorithm for detecting and removing clones in java code. Proceedings of the 3rd Workshop on Software Evolution through Transformations: Embracing the Change, SeTra. Vol. 2006. 2006.

[43] Lague, Bruno, et al. Assessing the benefits of incorporating function clone detection in a development process. Software Maintenance, 1997. Proceedings., International Conference on. IEEE, 1997.

[44] Yuan, Yang, and Yao Guo. CMCD: Count matrix based code clone detection. Software Engineering Conference (APSEC), 2011 18th Asia Pacific. IEEE, 2011.

[45] Higo, Yoshiki, K-I. Sawa, and Shinji Kusumoto. Problematic code clones identification using multiple detection results. Software Engineering Conference, 2009. APSEC'09. Asia-Pacific. IEEE, 2009.

[46] Deissenboeck, Florian, et al. Model clone detection in practice. Proceedings of the 4th International Workshop on Software Clones. ACM, 2010.

[47] Deissenboeck, Florian, et al. Clone detection in automotive model-based development. Proceedings of the 30th international conference on Software engineering. ACM, 2008.

[48] Abd-El-Hafiz, Salwa K. A metrics-based data mining approach for software clone detection. Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual. IEEE, 2012.

[49] Higo, Yoshiki, et al. Incremental code clone detection: A PDG-based approach. Reverse Engineering (WCRE), 2011 18th Working Conference on. IEEE, 2011.

[50] Dean, Thomas R., et al. Agile parsing in TXL. Automated Software Engineering 10.4 (2003): 311-336.

[51] Cordy, James R. The TXL source transformation language. Science of Computer Programming 61.3 (2006): 190-210.




[52] Han, Jiawei. Data Mining: Concepts and Techniques. (2006).

[53] Burd, Elizabeth, and John Bailey. Evaluating clone detection tools for use during preventative maintenance. Source Code Analysis and Manipulation, 2002. Proceedings. Second IEEE International Workshop on. IEEE, 2002.

[54] Rysselberghe, Filip Van, and Serge Demeyer. Evaluating clone detection techniques from a refactoring perspective. Proceedings of the 19th IEEE international conference on Automated software engineering. IEEE Computer Society, 2004.

[55] Mayrand, Jean, Claude Leblanc, and Ettore M. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. Software Maintenance 1996, Proceedings., International Conference on. IEEE, 1996.

[56] Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 2003.

[57] Cutting, Doug, and Jan Pedersen. Optimization for dynamic inverted index maintenance. Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1989.

[58] Arya, Sunil, et al. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1994.

[59] http://pages.cs.wisc.edu/ cs302/labs/EclipseTutorial/

[60] Baker, Brenda S. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences 52.1 (1996): 28-42.

[61] Datar, Mayur, et al. Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the twentieth annual symposium on Computational geometry. ACM, 2004.

[62] Rivest, Ronald. The MD5 message-digest algorithm. (1992).

[63] Higo, Yoshiki, et al. On software maintenance process improvement based on code clone analysis. Product Focused Software Process Improvement. Springer Berlin Heidelberg, 2002. 185-197.

[64] Bellon, Stefan, et al. Comparison and evaluation of clone detection tools. Software Engineering, IEEE Transactions on 33.9 (2007): 577-591.

[65] Murakami, Hiroaki, et al. Folding repeated instructions for improving token-based code clone detection. Source Code Analysis and Manipulation (SCAM), 2012 IEEE 12th International Working Conference on. IEEE, 2012.

[66] Murakami, Hiroaki, et al. Gapped code clone detection with lightweight source code analysis. Program Comprehension (ICPC), 2013 IEEE 21st International Conference on. IEEE, 2013.

[67] Hotta, Keisuke, et al. How Accurate Is Coarse-grained Clone Detection?: Comparison with Fine-grained Detectors. Electronic Communications of the EASST 63 (2014).

[68] Smith, Temple F., and Michael S. Waterman. Identification of common molecular subsequences. Journal of molecular biology 147.1 (1981): 195-197.

[69] CCFinderX, http://www.ccfinder.net/.

[70] Akira Goto, Norihiro Yoshida, Masakazu Ioka, Eunjong Choi, and Katsuro Inoue. How to extract differences from similar programs? A cohesion metric approach. In Proceedings of the 7th International Workshop on Software Clones, 2013.

[71] Meng, N., Hua, L., Kim, M., and McKinley, K. S. Does Automated Refactoring Obviate Systematic Editing? UPDATE, 6, 7.

[72] Koschke, Rainer. Survey of research on software clones. Internat. Begegnungs- und Forschungszentrum für Informatik, 2007.

[73] Arcelli Fontana, Francesca, et al. Software clone detection and refactoring. ISRN Software Engineering (2013).

[74] Roy, Chanchal Kumar, and James R. Cordy. A survey on software clone detection research. Technical Report 541, Queen's University at Kingston, 2007.

[75] Shafieian, Saeed, and Ying Zou. Comparison of Clone Detection Techniques. Technical report, Queen's University, Kingston, Canada, 2012.

[76] Dang, S. and Wani, S.A., Performance Evaluation of Clone Detection Tools.
