
Research Article

WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

Deqiang Fu,1,2 Yanyan Xu,1 Haoran Yu,2 and Boyang Yang2

1School of Information Science and Technology, Beijing Forestry University, No. 35 Qinghuadong Road, Haidian District, Beijing 100083, China
2Jisuan Institute of Technology, Beijing Judao Youda Network Technology Co. Ltd., No. 18 Suzhoujie St., Room 1204, Haidian District, Beijing 100080, China

Correspondence should be addressed to Yanyan Xu; xuyanyan@bjfu.edu.cn

Received 28 September 2016; Revised 23 November 2016; Accepted 21 December 2016; Published 13 February 2017

Academic Editor: Michele Risi

Copyright © 2017 Deqiang Fu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Different from other plagiarism detection methods, WASTK takes some aspects other than the similarity between programs into account. WASTK first transforms the source code of a program into an abstract syntax tree and then obtains the similarity by calculating the tree kernel of two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) in the field of information retrieval is applied: each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and, as a result, performs much better than other popular methods such as Sim and JPlag.

1. Introduction

Due to the advancement of the Internet, source code plagiarism has become a big issue in the field of computer science education [1]. Some students copy source code from their classmates, or search for similar source code on the Internet, and submit it as their assignments without modification. Plagiarism seriously diminishes the quality of education. Students are deprived of the ability to innovate and think independently, and plagiarism may also lead to academic dishonesty.

As in traditional offline computer science education, online education suffers from plagiarism too. Moreover, it is even harder for online education platforms to find a method against source code plagiarism, because the number of submissions from students is much greater than in the traditional offline case. Therefore, detecting source code plagiarism becomes a heavy load in the daily work of responsible instructors [2].

In order to detect source code plagiarism automatically, three problems have already been considered by researchers in this field [3–5].

Problem 1. Computer programs are structured and hierarchical. A reasonable method to measure the similarity between a pair of programs needs to be chosen carefully.

Problem 2. The modifications made to programs for the purpose of plagiarism vary little. Common distortions, for example, comment alteration, whitespace padding, identifier renaming, code reordering, and changes to algebraic expressions, can be found with high confidence.

Problem 3. The comparison between each pair of programs takes a long time and leads to inefficiency.

Additionally, there exist some other facts which need to be considered.

Problem 4. Instructors may provide frameworks of programming assignments for students to start with. The provided frameworks contribute a lot to the similarity between each pair of programs.

Hindawi, Scientific Programming, Volume 2017, Article ID 7809047, 8 pages, https://doi.org/10.1155/2017/7809047

Problem 5. Solutions for simple problems may be alike. For example, a hundred students without any communication may produce very similar programs for the task of calculating the sum of two variables.

To address these problems, two methods for source code plagiarism detection are presented in this paper, named ASTK (Abstract Syntax Tree Kernel) and WASTK (Weighted ASTK). Since computer programs are structured, in this work they are represented as abstract syntax trees [6]. In ASTK, a method called tree kernel [7] is used for measuring the similarity between a pair of programs. Additionally, different from traditional tree kernel methods, WASTK gives weights to every tree node in order to highlight the significance of nontrivial parts and to reduce the impact of short sample source code and of source code provided by instructors. Inspired by the idea of TF-IDF [8], lower weights are given to common code and to code provided by the framework of a coding assignment, and higher weights are assigned to the distinguishing parts of the code.

The rest of this paper is organized as follows. Section 2 introduces related previous work. Section 3 illustrates the methods of ASTK and WASTK. We show the experimental results and the corresponding analyses in Section 4. Finally, in Section 5, we conclude this work and discuss possible future work.

2. Related Work

Several plagiarism detection measures exist. According to the categories identified by Mozgovoy in 2006 [9], the approaches are mainly fingerprint based or content comparison based. Content comparison techniques have three subcategories: string-matching algorithms, parameterized matching algorithms, and parse tree comparison algorithms.

MOSS (Measure of Software Similarity), proposed at the University of California, Berkeley in 2003 [10], is a fingerprint based measure that divides each program into $k$-grams, where each gram is a substring of length $k$. The possibility of plagiarism is determined according to the number of overlapping fingerprints generated by hashing each gram.
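The fingerprint idea can be illustrated with a minimal Python sketch. It is not MOSS itself, which selects a subset of the hashes by winnowing [10]; the function names, the whitespace stripping, the default $k = 5$, the CRC32 hash, and the use of set overlap as the score are all assumptions made for illustration.

```python
import zlib


def kgram_fingerprints(text, k=5):
    """Hash every substring of length k (whitespace removed) into a set."""
    stripped = "".join(text.split())
    return {
        zlib.crc32(stripped[i:i + k].encode("utf-8"))
        for i in range(len(stripped) - k + 1)
    }


def fingerprint_overlap(a, b, k=5):
    """Share of fingerprints that two texts have in common (0.0 to 1.0)."""
    fa, fb = kgram_fingerprints(a, k), kgram_fingerprints(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because whitespace is removed before hashing, whitespace padding alone does not change the score in this sketch; identifier renaming still does, which is one motivation for the token and tree based methods discussed next.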

There are some well-known string-matching algorithms, including Plague, YAP3 (the third version of Yet Another Plague), JPlag, and FDPS (Fast Plagiarism Detection System) [11]. These methods all first convert programs into corresponding token sequences. They then compare the token sequences generated from different programs and use the similarity as the evidence of plagiarism. Plague, proposed by Whale in 1988, is the first one that converts a file into a token sequence and uses a string-matching technique for comparison [12]. YAP3, proposed by Wise in 1996, performs better than Plague due to a faster converting method using Running Karp-Rabin Greedy String Tiling (RKR-GST) [13]. JPlag, proposed by Malpohl in 2006, runs even faster than YAP3 by defining a minimum-matching length [14]. FDPS, proposed by Mozgovoy et al. in 2007, is another algorithm similar to YAP3 that pays more attention to efficiency [15]. It uses an indexed data structure to store matches, which supports faster searching.

The Dup tool, proposed by Baker in 1995, is a parameterized matching algorithm. It first uses a lexical analyzer to scan a program. Then it replaces all identifiers and constants with the same symbols and outputs the transformed program together with a list of parameter candidates. The similarity is determined according to $p$-matches between two transformed files, where $p$ means a parameter [16].

Sim and Brass are based on parse tree comparison algorithms. Sim, proposed by Gitchell and Tran in 1999, obtains a token sequence from a lexical analyzer and calculates the sequence alignment as the similarity [17]. In order not to be influenced by identifiers or statement order, in 2014 Kikuchi et al. proposed a modified method which uses syntactic elements together with lexical elements as tokens and does not include identifier names or literal values in the tokens [4]. Brass, proposed by Belkhouche et al. in 2004, is an application of parse tree comparison algorithms. It represents each file as a binary tree and a symbol table (the data dictionary) containing the variables and data structures used in the file [18].

3. Proposed Approach

Definition 1 (abstract syntax tree). An abstract syntax tree is an abstract syntactic structure of a piece of source code in the form of a tree.

Figure 1 shows an example of an abstract syntax tree created from short sample code. The left part is the source code in two lines, and the right part is its corresponding abstract syntax tree after lexical and syntax analysis. It is easy to see that a short piece of source code has a large abstract syntax tree and that only leaf nodes show the symbols of the source code. An in-order traversal of all the leaf nodes yields the source code.

Definition 2 (tree kernel). Tree kernel is an algorithm for computing over tree structures and measuring the similarity between two contents in Natural Language Processing (NLP), first proposed by Collins and Duffy in 2001 [7].

Between two trees $T_1$ and $T_2$, a kernel $K(T_1, T_2)$ can be represented as an inner product between two vectors:

$$K(T_1, T_2) = \mathbf{h}(T_1) \cdot \mathbf{h}(T_2). \tag{1}$$

Each tree $T$ can be represented as a vector $\mathbf{h}(T) = (h_1(T), h_2(T), \ldots, h_n(T))$, where $h_i(T)$ is the number of occurrences of the $i$th subtree in $T$.

Definition 3 (TF-IDF). In information retrieval, TF-IDF is a numerical statistic model that is intended to reflect how important a word is to a document in a collection of documents [8].

TF-IDF includes two parts: TF is the abbreviation for "Term Frequency" and IDF is the abbreviation for "Inverse Document Frequency." $\mathrm{TF}(t, d)$ denotes the frequency of word $t$ in document $d$. Similarly, $\mathrm{DF}(t, D)$ is the frequency of documents containing word $t$ in the set of all documents $D$, and $\mathrm{IDF}(t, D)$ is the reciprocal of $\mathrm{DF}(t, D)$. TF focuses on the importance of words in a document, while IDF concerns the universality of words in all documents. TF-IDF can help to compute the importance of words for solving Problems 4 and 5.

Figure 1: A piece of source code (int a; a = a + 2;) and its corresponding abstract syntax tree.

Figure 2: Weighted Abstract Syntax Tree Kernel. Source code 1 and source code 2 pass through the lexical analyzer and syntax analyzer to give AST 1 and AST 2; TF-IDF weights are computed for each tree, and the parse tree kernel produces a normalized score.

The process of WASTK is shown in Figure 2. The proposed method first transforms programs into abstract syntax trees. Inspired by the idea of TF-IDF, weights are then calculated for every subtree: expressions with high frequency in one document but low frequency in other documents receive higher weights, and expressions that appear everywhere (in all documents) receive lower weights. Finally, the tree kernel is applied with the calculated weights on nodes to determine whether plagiarism has happened. The details of the approach are given below.

3.1. Adjust the Structure of Abstract Syntax Tree. To solve Problem 1, programs are parsed into abstract syntax trees for further understanding of the semantics.

Then some adjustments are applied to each abstract syntax tree. As mentioned in Problem 2, replacing variable names and changing the size declarations of arrays are common ways to plagiarize. To solve this problem, all variable tokens are replaced with a unified token. The tokens for size declarations of arrays and the indices of array elements are unified as well.

Because rephrasing expressions into different expressions is not trivial, plagiarism by modifying expressions is rarely a problem. The structure information of the subtrees that correspond to expressions is therefore abandoned: each such subtree is replaced by the string that results from the in-order traversal of its leaf nodes. Besides, a simplified abstract syntax tree reduces the running time of ASTK and WASTK, which helps to solve Problem 3.

After adjustment, the abstract syntax tree in Figure 1 is transformed into the new tree shown in Figure 3. All variable names are replaced with "temp." Because the node "exp" is a type of expression, it is adjusted to be a leaf node that represents the in-order traversal of the leaf nodes of the previous subtree rooted at "exp"; that is to say, the dotted portion in Figure 3 is cut out from the original abstract syntax tree.
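The two adjustments described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the helper names, and the flattened example tree are assumptions, and only the node labels `V_ID` and `exp` and the unified token "temp" are taken from the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # grammar symbol, e.g. "exp", "V_ID"
    children: list = field(default_factory=list)
    text: str = ""                               # source symbol carried by a leaf


def leaf_text(node):
    """Concatenate the leaf symbols of a subtree from left to right."""
    if not node.children:
        return node.text
    return "".join(leaf_text(child) for child in node.children)


def adjust(node):
    """Return a simplified copy of the tree.

    Every identifier leaf (V_ID) is renamed to "temp", and every subtree
    rooted at "exp" is collapsed into a single leaf holding its leaf string.
    """
    if not node.children:
        text = "temp" if node.label == "V_ID" else node.text
        return Node(node.label, [], text)
    kids = [adjust(child) for child in node.children]
    if node.label == "exp":
        return Node("exp", [], "".join(leaf_text(kid) for kid in kids))
    return Node(node.label, kids)


# Flattened stand-in for the "a = a + 2" subtree of Figure 1.
expression = Node("exp", [
    Node("V_ID", text="a"),
    Node("ASSIGN", text="="),
    Node("V_ID", text="a"),
    Node("O_ADD", text="+"),
    Node("V_INTEGER", text="2"),
])
print(adjust(expression).text)  # temp=temp+2
```

The collapsed leaf carries the string "temp=temp+2", which matches the label shown on the adjusted "exp" node in Figure 3.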

3.2. Determine Node Weights. In ASTK, the weights on all nodes are equal to 1. In WASTK, however, the weights of nodes depend on TF-IDF. This is the only difference between ASTK and WASTK. In WASTK, the weights on abstract syntax tree nodes are considered to reflect the significance of the corresponding subtree fragments of code. Taking Problems 4 and 5 into consideration, lower weights are given to the nodes that represent common expressions appearing in many other programs.

Different types of symbols and expressions appear frequently in programs. They do not have abundant semantic meaning and can be treated as stop words. A list of stop words is shown in Table 1.

Figure 3: A new abstract syntax tree after adjustments.

Table 1: A list of stop words.

| Type | Function | Example(s) |
| --- | --- | --- |
| Round brackets | | () |
| Square brackets | | [] |
| Curly brackets | | {} |
| Comma | | , |
| Semicolon | | ; |
| Equality sign | | = |
| Pointer star | The star denoting the pointer | * |
| Variable type | The type of a variable | int |
| Variable name | The name of a variable | a, b, temp |
| Pointer name | The name of a pointer | *a, *b, *temp |
| Size of array | The declaration of an array's size | 100, N, N+10 |
| Definition of array | The definition of an array | a[100], a[N] |

Let $T$ denote the abstract syntax tree. For each subtree $s$ of $T$, there is a weight $w_{s,T}$. The in-order traversal on $s$ is denoted as $\mathrm{word}_s$. If $\mathrm{word}_s$ is treated as a stop word, the weight of $s$ is set to 0, that is, $w_{s,T} = 0$. For example, the nodes in Figure 3 whose type is "T_INT," "V_ID," or "SEMICOLON" have weight 0. On the contrary, if $\mathrm{word}_s$ is not a stop word, the weight $w_{s,T}$ is computed as follows:

$$w_{s,T} = \mathrm{TF}(s, T) \cdot \mathrm{IDF}(s, T). \tag{2}$$

By Definition 3, $\mathrm{TF}(s, T)$ and $\mathrm{IDF}(s, T)$ can be calculated as follows:

$$\mathrm{TF}(s, T) = \frac{\mathrm{cnt}(s, T)}{n(T)}, \qquad \mathrm{IDF}(s, T) = \log_2\left(1 + \frac{N}{c(s)}\right), \tag{3}$$

where $\mathrm{cnt}(s, T)$ is the number of appearances of subtree $s \in S_T$, $n(T)$ is the number of subtrees in $T$, and $c(s)$ is the number of abstract syntax trees that contain $s$. $N$ is the number of abstract syntax trees generated from the programs related to a specific assignment.
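Equations (2) and (3) can be written down directly. The sketch below assumes, for illustration only, that each abstract syntax tree is given as a flat list of subtree strings (the $\mathrm{word}_s$ of each subtree); the function name `tfidf_weights` and the `stop_words` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(trees, stop_words=frozenset()):
    """Compute w_{s,T} for every subtree string s of every tree T.

    trees: list of lists; each inner list holds the subtree strings of one AST.
    Returns one dict per tree, mapping subtree string -> weight.
    """
    n_trees = len(trees)                       # N in Equation (3)
    containing = Counter()                     # c(s): number of trees containing s
    for subtrees in trees:
        containing.update(set(subtrees))

    weights = []
    for subtrees in trees:
        counts = Counter(subtrees)             # cnt(s, T)
        total = len(subtrees)                  # n(T)
        tree_weights = {}
        for s, cnt in counts.items():
            if s in stop_words:
                tree_weights[s] = 0.0          # stop words get weight 0
            else:
                tf = cnt / total
                idf = math.log2(1 + n_trees / containing[s])
                tree_weights[s] = tf * idf     # Equation (2)
        weights.append(tree_weights)
    return weights
```

With two trees that share the subtree "return0" but differ in one expression, the shared subtree gets $\frac{1}{3}\log_2 2$ and the distinguishing expression gets $\frac{1}{3}\log_2 3$, so the part that appears everywhere is weighted lower.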

3.3. Calculate Tree Kernel and Similarity. By Definition 2, the similarity between two abstract syntax trees $T_1$ and $T_2$ can be measured by a tree kernel and denoted by $K(T_1, T_2)$:

$$K(T_1, T_2) = \mathbf{h}(T_1) \odot \mathbf{h}(T_2) = \sum_{s_i \in (S_{T_1} \cup S_{T_2})} \mathrm{cnt}(s_i, T_1) \cdot \mathrm{cnt}(s_i, T_2), \tag{4}$$

where $S_T$ denotes the set of all subtrees in $T$. Calculating $K(T_1, T_2)$ directly requires enumerating each subtree in both $T_1$ and $T_2$ and then calculating $\mathrm{cnt}(s_i, T_1)$ and $\mathrm{cnt}(s_i, T_2)$, respectively. This is inefficient, so a recursive method is applied:

$$K(T_1, T_2) = \sum_{s_1 \in S_{T_1}} \sum_{s_2 \in S_{T_2}} C(s_1, s_2). \tag{5}$$

$C(s_1, s_2)$ is calculated as follows.

(1) If the roots of $s_1$ and $s_2$ are both leaf nodes of the "exp" type, then

$$C(s_1, s_2) = \lambda_{\mathrm{tree}} \cdot \mathrm{dist}(\mathrm{word}_{s_1}, \mathrm{word}_{s_2}) \cdot w_{s_1,T_1} \cdot w_{s_2,T_2}, \tag{6}$$

where $\lambda_{\mathrm{tree}}$ is a decay factor that prevents changes near the root from producing too much influence [6]. As the height increases, the kernel value of the subtree is penalized by $(\lambda_{\mathrm{tree}})^{\mathrm{size}}$, where size is the height of the subtree. $\mathrm{word}_s$ is the string denoting the in-order traversal on the subtree $s$, as defined in Section 3.2. $\mathrm{dist}(a, b)$ is defined as follows:

$$\mathrm{dist}(a, b) = \frac{\mathrm{lev}_{a,b}}{\max(|a|, |b|)}, \tag{7}$$

where $\mathrm{lev}_{a,b}$ is the edit distance between strings $a$ and $b$ and $|a|$ is the length of string $a$. Different from the traditional tree kernel, the similarity between expressions is not equal to 0. It is denoted by the edit distance between two expressions. By using the above definition of $\mathrm{dist}(a, b)$, the expression-level modifications mentioned in Problem 2 will be discovered by ASTK and WASTK.

(2) If the root of $s_1$ is different from the root of $s_2$, then

$$C(s_1, s_2) = 0. \tag{8}$$

(3) If the roots of $s_1$ and $s_2$ are both leaf nodes, then

$$C(s_1, s_2) = \lambda_{\mathrm{tree}} \cdot w_{s_1,T_1} \cdot w_{s_2,T_2}. \tag{9}$$

(4) Otherwise,

$$C(s_1, s_2) = \lambda_{\mathrm{tree}} \cdot \prod_{i}^{\mathrm{ns}(s_1)} \left(1 + \max_{j}^{\mathrm{ns}(s_2)} C(\mathrm{st}(s_1, i), \mathrm{st}(s_2, j))\right) \cdot w_{s_1,T_1} \cdot w_{s_2,T_2}, \tag{10}$$

where $\mathrm{ns}(s)$ is the number of subtrees rooted at the child nodes of $\mathrm{root}_s$ and $\mathrm{st}(s, i)$ is the subtree rooted at the $i$th child node of $\mathrm{root}_s$. Because the root of $s_1$ is the same as the root of $s_2$, $\mathrm{ns}(s_1) = \mathrm{ns}(s_2)$.
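The four cases and the double sum of Equation (5) can be sketched in Python as below. This is an illustrative reading of the equations, not the authors' code: the `Node` class and all function names are assumptions, weights default to 1 (which corresponds to ASTK), and $\mathrm{dist}$ follows Equation (7) literally, so two identical expression strings give $\mathrm{dist} = 0$. When one subtree is a leaf and the other is not, the sketch takes the maximum over an empty set to be 0, which is also an assumption.

```python
from dataclasses import dataclass, field

LAMBDA_TREE = 0.1  # decay factor; 0.1 is the value reported in Section 4.3


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word: str = ""        # leaf string of a collapsed "exp" node
    weight: float = 1.0   # w_{s,T}; 1.0 everywhere corresponds to ASTK


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dist(a, b):
    """Equation (7): edit distance divided by the longer string length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def C(s1, s2):
    if s1.label != s2.label:                       # case (2), Equation (8)
        return 0.0
    w = s1.weight * s2.weight
    if not s1.children and not s2.children:
        if s1.label == "exp":                      # case (1), Equation (6)
            return LAMBDA_TREE * dist(s1.word, s2.word) * w
        return LAMBDA_TREE * w                     # case (3), Equation (9)
    product = 1.0                                  # case (4), Equation (10)
    for c1 in s1.children:
        product *= 1.0 + max((C(c1, c2) for c2 in s2.children), default=0.0)
    return LAMBDA_TREE * product * w


def subtrees(tree):
    """Yield the subtree rooted at every node of the tree."""
    yield tree
    for child in tree.children:
        yield from subtrees(child)


def K(t1, t2):
    """Equation (5): sum C over all pairs of subtrees."""
    return sum(C(a, b) for a in subtrees(t1) for b in subtrees(t2))
```

The double loop in `K` makes the cost of one comparison proportional to the product of the two tree sizes, which is why the simplification of Section 3.1 matters for running time.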

After computing the tree kernel $K(T_1, T_2)$ between $T_1$ and $T_2$, a normalization is needed. The method of normalization is as follows:

$$K'(T_1, T_2) = \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}}, \tag{11}$$

where $K'(T_1, T_2)$ is the cosine similarity of $\mathbf{h}(T_1)$ and $\mathbf{h}(T_2)$. In ASTK and WASTK, $K'(T_1, T_2)$ is the similarity of two pieces of source code.
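The normalization can be checked on the unweighted inner-product form of the kernel from Equations (1) and (4). In this sketch each tree is assumed to be summarized as a `Counter` of subtree strings; the function names are hypothetical, and the weighted recursive kernel would be normalized in the same way.

```python
import math
from collections import Counter


def kernel(h1, h2):
    """Inner product of two subtree-count vectors given as Counters."""
    return sum(count * h2[s] for s, count in h1.items())


def normalized_kernel(h1, h2):
    """Equation (11): K(T1, T2) / sqrt(K(T1, T1) * K(T2, T2))."""
    denominator = math.sqrt(kernel(h1, h1) * kernel(h2, h2))
    return kernel(h1, h2) / denominator if denominator else 0.0


h1 = Counter(["a", "a", "b"])
h2 = Counter(["a", "c"])
print(kernel(h1, h2))             # 2
print(normalized_kernel(h1, h1))  # 1.0
```

Normalization keeps the score in a fixed range regardless of tree size, so a single threshold can be applied to pairs of programs of different lengths.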

Table 2: The sizes of datasets.

| Fold | Type of set | Number of code pairs |
| --- | --- | --- |
| 1 | Training set | 14810 |
| 1 | Testing set | 7404 |
| 2 | Training set | 14809 |
| 2 | Testing set | 7405 |
| 3 | Training set | 14809 |
| 3 | Testing set | 7405 |

4. Experimental Results

4.1. Datasets. There are two groups of data. The programs in each group are generated from 10 independent submissions of programs for the same assignment by applying 40 different generators. Table 2 shows the sizes of the datasets.

Threefold cross-validation has been adopted for the experiment. The dataset has been randomly split into 3 parts. Each time, one part is used as the testing set and the other two parts are used as the training set.
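A threefold split of this kind can be sketched as follows. The function name, the fixed seed, and the round-robin assignment after shuffling are assumptions; the total of 22214 code pairs used in the example is the sum of the training and testing sizes in Table 2.

```python
import random


def three_fold(pairs, seed=0):
    """Shuffle the pairs and return three (training, testing) splits."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    parts = [items[i::3] for i in range(3)]      # three nearly equal parts
    folds = []
    for i in range(3):
        testing = parts[i]
        training = [p for j in range(3) if j != i for p in parts[j]]
        folds.append((training, testing))
    return folds


folds = three_fold(range(22214))
print(sorted(len(testing) for _, testing in folds))  # [7404, 7405, 7405]
```

The testing sizes agree with Table 2 up to the order of the folds.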

The 10 independent submissions of programs for each assignment are randomly picked from the solutions to "Problem I" and "Problem J" submitted by students who attended the final exam of the C programming course at Harbin University of Science and Technology. These pieces of code are very short: the average number of lines is no more than 20. In particular, a programming framework is provided in "Problem J."

The 40 generators were designed by 40 players who attended the Jisuan-zhidao programming contest (https://nanti.jisuanke.com/contest). A generator reads original programs as input and returns plagiarized programs as results. Only the generated programs that are functionally equivalent to the original ones are used.

All valid generated programs are used in the datasets and labeled. The programs generated from the same program are labeled as having the same origin. Any pair of programs labeled as having the same origin is treated as plagiarism.

4.2. Evaluation of Proposed Methods. After determining the similarity score of a pair of programs, it still needs to be decided whether these two programs are plagiarized or not. Therefore, a threshold $\theta$ on the similarity score is used. The threshold $\theta$ is determined by setting it to different values in $T = \{0.01t \mid 0 < t < 100\}$ and using a training set to find which value generates the best outcome. A pair with a similarity score higher than $\theta$ is treated as plagiarism.
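The threshold search can be sketched as a scan over the 99 candidate values. The sketch assumes that $t$ is an integer and that the "best outcome" is the highest Jaccard score on the training set, in line with the evaluation measure used for Table 3; the function names and the tie-breaking rule (keep the smallest threshold) are assumptions.

```python
def jaccard(tp, fp, fn):
    """TP / (TP + FP + FN), or 0.0 when nothing is counted."""
    total = tp + fp + fn
    return tp / total if total else 0.0


def best_threshold(scores, labels):
    """Return (theta, score) maximizing the Jaccard score over theta = 0.01 t.

    scores: similarity score of each training pair.
    labels: True if the pair is labeled as plagiarism.
    """
    best_theta, best_score = None, -1.0
    for t in range(1, 100):
        theta = t / 100
        tp = sum(1 for s, y in zip(scores, labels) if s > theta and y)
        fp = sum(1 for s, y in zip(scores, labels) if s > theta and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s <= theta and y)
        score = jaccard(tp, fp, fn)
        if score > best_score:                   # keep the first best value
            best_theta, best_score = theta, score
    return best_theta, best_score
```

For example, with scores `[0.95, 0.90, 0.40, 0.30, 0.10]` where only the first two pairs are plagiarism, every threshold from 0.40 up to 0.89 separates the classes perfectly, and the scan returns the first of them.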

In this study, precision is the number of truly plagiarized programs among those marked as plagiarism divided by the number of all programs that are marked as plagiarism. Recall is the number of plagiarized programs that are found divided by the number of all plagiarized programs.

Since we assume that the positive case of plagiarism is rare and the negative case is common, the Jaccard score is picked, in addition to precision and recall, as a reasonable method for evaluation.


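Table 3: The thresholds $\theta$ of WASTK, ASTK, Sim, and JPlag.

| Tool | 1st-fold | 2nd-fold | 3rd-fold | Average |
| --- | --- | --- | --- | --- |
| WASTK | 0.77 | 0.77 | 0.77 | 0.77 |
| ASTK | 0.96 | 0.96 | 0.96 | 0.96 |
| Sim | 0.12 | 0.15 | 0.15 | 0.14 |
| JPlag | 0.11 | 0.10 | 0.08 | 0.10 |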

The Jaccard score is calculated as follows:

$$\text{Jaccard score} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{12}$$

where TP is the number of plagiarized programs that are marked as plagiarism, FP is the number of nonplagiarized programs that are marked as plagiarism, and FN is the number of plagiarized programs that are not marked as plagiarism.
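The three measures follow directly from the counts. The short Python sketch below uses hypothetical function names and is checked against the counts reported for WASTK on the first fold in Table 4 (TP = 414, FP = 57, FN = 364).

```python
def precision(tp, fp):
    """Share of pairs marked as plagiarism that are truly plagiarized."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of truly plagiarized pairs that are marked as plagiarism."""
    return tp / (tp + fn) if tp + fn else 0.0


def jaccard_score(tp, fp, fn):
    """Equation (12): TP / (TP + FP + FN)."""
    total = tp + fp + fn
    return tp / total if total else 0.0


print(round(precision(414, 57), 6))           # 0.878981
print(round(recall(414, 364), 6))             # 0.532134
print(round(jaccard_score(414, 57, 364), 6))  # 0.495808
```

Unlike accuracy, none of these measures counts true negatives, which suits a setting where nonplagiarized pairs dominate.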

4.3. Results. Our experiments select Sim and JPlag for comparison with ASTK and WASTK. In a 2016 survey, Deokate and Hanchate mentioned four popular plagiarism detection tools: JPlag, Sim, MOSS, and Plaggie [19]. However, MOSS runs on a web server, and its results contain no more than 250 pairs of code with high similarity. Plaggie is similar to JPlag, but it only supports Java [20]. As a result, we select Sim and JPlag as comparison tools, where Sim represents the parse tree algorithm and JPlag represents the string-matching algorithm.

In all experiments, the decay factor $\lambda_{\mathrm{tree}}$ is set to 0.1 empirically. Table 3 shows the thresholds $\theta$ for WASTK, ASTK, Sim, and JPlag, which are determined by using the training data. The performance results are judged according to the Jaccard score. The average value is used as the threshold for each model in the experiments.

The results of applying the different models to the testing data of each fold are shown in Table 4, and the corresponding comparison of these four tools is illustrated in Figure 4.

According to the results, the precision values of ASTK and WASTK are much higher than those of Sim and JPlag. The recall values of ASTK and WASTK are much higher than that of JPlag but slightly lower than that of Sim. The Jaccard score, as the overall evaluation measure, shows the advantage of ASTK and WASTK: their Jaccard scores are much greater than those of Sim and JPlag.

The added weights in WASTK reduce precision but increase recall. The Jaccard score of WASTK is higher than that of ASTK, showing that the TF-IDF weighting is worthwhile.

4.4. An Online Example. WASTK has been applied to an online exam held on Jisuanke (https://www.jisuanke.com) for code plagiarism detection, and 290 accepted pieces of source code were collected for one problem in this exam. This problem takes only two numbers $a$ and $b$ as input and outputs whether number $a$ can be divided by number $b$. The average number of lines of these pieces of code is 11.

The following two pieces of code were selected for analysis.

Left:

    #include "stdio.h"
    int main()
    {
        int M, N;
        scanf("%d", &M);
        scanf("%d", &N);
        if (M % N == 0)
            printf("YES\n");
        else
            printf("NO\n");
        return 0;
    }

Right:

    #include <stdio.h>
    int main() {
        int a, b;
        scanf("%d", &a);
        scanf("%d", &b);
        if (a % b == 0)
            printf("YES");
        else
            printf("NO");
    }

Their similarity is 0.694 as detected by WASTK, while the similarity detected by Sim is 0.93. It is easy to see that the structures of these two pieces of code are almost the same. The variable names in them are different. The ways of including the header file are also different. There is a line "return 0" at the end of the left one, but it is not found in the right one. Moreover, these two pieces of code have different line feeds at the beginning and the end of the function "main."

For such short pieces of code for simple problems, an ordinary detection tool like Sim obtains a higher similarity, which may lead to a wrong judgment. WASTK, however, acquires a more accurate result by catching the unique features of each piece of code.

5. Conclusions

In this paper, a method for detecting source code plagiarism and an enhancement of it are proposed. The method not only considers the similarity between two pieces of code but also takes the context of programming into consideration. WASTK transforms string based programs into abstract syntax trees. The nodes of the trees are weighted according to both the structural similarity between a pair of programs and the common structures of all programs in the dataset.

According to the results of the experiments, ASTK and WASTK perform much better than other popular methods on the same datasets. ASTK and WASTK are both based on the structure of programs instead of text, as in JPlag, or token sequences, as in Sim. Besides, the improved tree kernel considers the similarity between corresponding expressions of two subtrees, which is helpful for detecting plagiarism with minor changes. Additionally, WASTK adds weights to ASTK by applying TF-IDF, which increases the recall and the Jaccard score. When pieces of code share a common framework or are based on similar solutions, TF-IDF sets lower weights on their nodes to avoid misjudgments.

WASTK can help instructors to detect whether plagiarism exists in the assignments of students in both online and offline computer science education. It can also be applied to online programming contests to detect plagiarism. When programs contain a common framework or the solutions for problems are simple, WASTK will show a better performance.

Table 4: The results of WASTK, ASTK, Sim, and JPlag on the testing data of each fold.

| Fold | Tool | Precision | Recall | Jaccard | TP | FP | FN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | WASTK | 0.878981 | 0.532134 | 0.495808 | 414 | 57 | 364 |
| 1 | ASTK | 1.0 | 0.48329 | 0.48329 | 376 | 0 | 402 |
| 1 | Sim | 0.271504 | 0.596401 | 0.229362 | 464 | 1245 | 314 |
| 1 | JPlag | 0.311388 | 0.224936 | 0.150215 | 175 | 387 | 603 |
| 2 | WASTK | 0.872527 | 0.518954 | 0.482382 | 397 | 58 | 368 |
| 2 | ASTK | 1.0 | 0.461438 | 0.461438 | 353 | 0 | 412 |
| 2 | Sim | 0.272888 | 0.603922 | 0.231463 | 462 | 1231 | 303 |
| 2 | JPlag | 0.280702 | 0.20915 | 0.13617 | 160 | 410 | 605 |
| 3 | WASTK | 0.871965 | 0.50641 | 0.47136 | 395 | 58 | 385 |
| 3 | ASTK | 1.0 | 0.442308 | 0.442308 | 345 | 0 | 435 |
| 3 | Sim | 0.275656 | 0.592308 | 0.231695 | 462 | 1214 | 318 |
| 3 | JPlag | 0.289286 | 0.207692 | 0.137521 | 162 | 398 | 618 |

Figure 4: Comparison of four tools (WASTK, ASTK, Sim, and JPlag) on folds 1 to 3: (a) precision, (b) recall, (c) Jaccard.

However, there is still room for improvement. The current method employs the tree kernel, a symmetric similarity measurement, which may lead the judgment in a wrong direction. Moreover, efficiency is still a problem, since the comparison has to be made between each pair of programs.

Disclosure

This work was performed while Deqiang Fu was visiting Jisuan Institute of Technology at Beijing Judao Youda Network Technology Co. Ltd. as a research intern.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities (no. 2016JX06), the National Natural Science Foundation of China (no. 61472369), and the Computer Science Education Foundation of Jisuan Institute of Technology at Beijing Judao Youda Network Technology Co. Ltd.


References

[1] D. Kermek and M. Novak, "Process model improvement for source code plagiarism detection in student programming assignments," Informatics in Education, vol. 15, no. 1, pp. 103–126, 2016.

[2] E. Eret and A. Ok, "Internet plagiarism in higher education: tendencies, triggering factors and reasons among teacher candidates," Assessment and Evaluation in Higher Education, vol. 39, no. 8, pp. 1002–1016, 2014.

[3] D. Sraka and B. Kaucic, "Source code plagiarism," in Proceedings of the ITI 31st International Conference on Information Technology Interfaces (ITI '09), pp. 461–466, Cavtat/Dubrovnik, Croatia, June 2009.

[4] H. Kikuchi, T. Goto, M. Wakatsuki, and T. Nishino, "A source code plagiarism detecting method using alignment with abstract syntax tree elements," in Proceedings of the 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD '14), pp. 1–6, July 2014.

[5] B. Beth, A comparison of similarity techniques for detecting source code plagiarism, Department of Computer Science, The University of Texas at Austin, 2014, https://www.cs.utexas.edu/~bbeth/files/AComparisonOfSimilarityTechniquesForDetectingSourceCodePlagiarism.pdf.

[6] H.-J. Song, S.-B. Park, and S. Y. Park, "Computation of program source code similarity by composition of parse tree and call graph," Mathematical Problems in Engineering, vol. 2015, Article ID 429807, 12 pages, 2015.

[7] M. Collins and N. Duffy, "Convolution kernels for natural language," in Proceedings of the 15th Annual Neural Information Processing Systems Conference (NIPS '01), pp. 625–632, Vancouver, Canada, December 2001.

[8] K. S. Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.

[9] M. Mozgovoy, "Desktop tools for offline plagiarism detection in computer programs," Informatics in Education, vol. 5, no. 1, pp. 97–112, 2006.

[10] S. Schleimer, D. S. Wilkerson, and A. Aiken, "Winnowing: local algorithms for document fingerprinting," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76–85, San Diego, Calif, USA, June 2003.

[11] G. Cosma and M. Joy, "An approach to source-code plagiarism detection and investigation using latent semantic analysis," IEEE Transactions on Computers, vol. 61, no. 3, pp. 379–394, 2012.

[12] G. Whale, Plague: Plagiarism Detection Using Program Structure, School of Electrical Engineering and Computer Science, University of New South Wales, 1988.

[13] M. J. Wise, "Yap3: improved detection of similarities in computer program and other texts," ACM SIGCSE Bulletin, vol. 28, no. 1, pp. 130–134, 1996.

[14] G. Malpohl, Jplag: detecting software plagiarism, 2006, http://www.ipd.uka.de.

[15] M. Mozgovoy, S. Karakovskiy, and V. Klyuev, "Fast and reliable plagiarism detection system," in Proceedings of the 37th Annual Frontiers in Education Conference-Global Engineering: Knowledge Without Borders, Opportunities Without Passports, pp. S4H-11–S4H-14, IEEE, Milwaukee, Wis, USA, October 2007.

[16] B. S. Baker, "On finding duplication and near-duplication in large software systems," in Proceedings of the 2nd Working Conference on Reverse Engineering, pp. 86–95, July 1995.

[17] D. Gitchell and N. Tran, "Sim: a utility for detecting similarity in computer programs," ACM SIGCSE Bulletin, vol. 31, pp. 266–270, 1999.

[18] B. Belkhouche, A. Nix, and J. Hassell, "Plagiarism detection in software designs," in Proceedings of the 42nd Annual Southeast Regional Conference (ACM-SE 42 '04), pp. 207–211, Huntsville, Ala, USA, April 2004.

[19] B. V. Deokate and D. B. Hanchate, "Software source code plagiarism detection: a survey," Journal of Multidisciplinary Engineering Science and Technology, vol. 3, no. 1, pp. 3747–3750, 2016.

[20] J. Hage, P. Rademaker, and N. van Vugt, A Comparison of Plagiarism Detection Tools, Utrecht University, Utrecht, The Netherlands, 2010.


Problem 5. Solutions for simple problems may be alike. For example, a hundred students without any communication may produce very similar programs for the task of calculating the sum of two variables.

To solve these two problems, this paper presents two accurate methods for source code plagiarism detection, named ASTK (Abstract Syntax Tree Kernel) and WASTK (Weighted ASTK). Since computer programs are structured, in this work they are represented as abstract syntax trees [6]. In ASTK, a method called tree kernel [7] is used to measure the similarity between a pair of programs. Additionally, unlike traditional tree kernel methods, WASTK assigns a weight to every tree node in order to highlight the significance of nontrivial parts and to reduce the impact of short sample source code and of source code provided by instructors. Inspired by the idea of TF-IDF [8], lower weights are given to common code and to code provided by the framework of coding assignments, and higher weights are assigned to the distinctive parts of the code.

The rest of this paper is organized as follows. Section 2 introduces related previous work. Section 3 describes the ASTK and WASTK methods. Section 4 presents the experimental results and the corresponding analyses. Finally, Section 5 concludes this work and discusses possible future work.

2 Related Work

Several plagiarism detection measures already exist. According to the categories identified by Mozgovoy in 2006 [9], there are mainly fingerprint-based and content-comparison-based approaches. Content comparison techniques have three subcategories: string-matching algorithms, parameterized matching algorithms, and parse tree comparison algorithms.

MOSS (Measure of Software Similarity), proposed by the University of California, Berkeley in 2003 [10], is a fingerprint-based measure that divides each program into k-grams, where each gram is a contiguous substring of length $k$. The possibility of plagiarism is determined according to the number of overlapping fingerprints generated by hashing each gram.

Several well-known string-matching algorithms exist, including Plague, YAP3 (the third version of Yet Another Plague), JPlag, and FDPS (Fast Plagiarism Detection System) [11]. These methods all first convert programs into corresponding token sequences. They then compare the token sequences generated from different programs and use the resulting similarity as evidence of plagiarism. Plague, proposed by Whale in 1988, was the first to convert a file into a token sequence and use a string-matching technique for comparison [12]. YAP3, proposed by Wise in 1996, performs better than Plague owing to a faster method that uses Running Karp-Rabin Greedy String Tiling (RKR-GST) [13]. JPlag, proposed by Malpohl in 2006, runs even faster than YAP3 by defining a minimum match length [14]. FDPS, another algorithm similar to YAP3, was proposed by Mozgovoy et al. in 2007 and pays more attention to efficiency [15]. It uses an indexed data structure to store matches, which supports faster searching.
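To make the token-sequence comparison concrete, the following is a minimal sketch of plain Greedy String Tiling with a minimum match length. It omits the Karp-Rabin hashing that makes RKR-GST fast, and it is not the code of YAP3 or JPlag. The function names, the default `min_match = 2`, and the coverage-based similarity score are assumptions made for illustration.

```python
def greedy_string_tiling(a, b, min_match=2):
    """Return tiles (start_a, start_b, length) of non-overlapping common runs."""
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best, matches = min_match - 1, []
        # Find all longest common runs made of tokens not yet covered by a tile.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best:
                    best, matches = k, [(i, j)]
                elif k == best and k >= min_match:
                    matches.append((i, j))
        if not matches:
            break
        # Turn each maximal match into a tile unless an earlier tile overlaps it.
        for i, j in matches:
            if any(marked_a[i:i + best]) or any(marked_b[j:j + best]):
                continue
            marked_a[i:i + best] = [True] * best
            marked_b[j:j + best] = [True] * best
            tiles.append((i, j, best))
    return tiles

def gst_similarity(a, b, min_match=2):
    """Fraction of tokens in both sequences that are covered by tiles."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b)) if a or b else 0.0
```

Because tiles may match in any order, swapping two blocks of tokens leaves the score unchanged in this sketch, which is the property that makes tiling attractive against statement reordering.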

The Dup tool, proposed by Baker in 1995, is a parameterized matching algorithm. It first uses a lexical analyzer to scan a program. It then replaces all identifiers and constants with the same symbols and outputs the transformed program together with a list of parameter candidates. The similarity is determined according to the $p$-matches between two transformed files, where $p$ stands for parameter [16].
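The transformation step described above can be sketched as follows. This is only an illustration of replacing identifiers and constants with placeholder symbols, not Baker's Dup: the regular-expression tokenizer, the small `KEYWORDS` set, and the placeholder names `ID` and `NUM` are assumptions, and the sketch does not check that parameters are renamed consistently, which a full p-match would require.

```python
import re

# Illustrative subset of C keywords; a real lexer would use the full list.
KEYWORDS = {"int", "float", "char", "if", "else", "for", "while", "return", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")
IDENT = re.compile(r"[A-Za-z_]\w*")

def parameterize(code: str) -> tuple[list[str], list[str]]:
    """Replace identifiers with ID and numeric constants with NUM.

    Returns the transformed token list and the replaced lexemes
    (the parameter candidates) in order of appearance.
    """
    out, params = [], []
    for tok in TOKEN.findall(code):
        if IDENT.fullmatch(tok) and tok not in KEYWORDS:
            out.append("ID")
            params.append(tok)
        elif tok[0].isdigit():
            out.append("NUM")
            params.append(tok)
        else:
            out.append(tok)
    return out, params
```

Under this sketch, `int a; a = a + 2;` and `int temp; temp = temp + 7;` produce the same transformed token list and differ only in their parameter lists.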

Sim and Brass are based on parse tree comparison algorithms. Sim, proposed by Gitchell and Tran in 1999, obtains a token sequence with a lexical analyzer and uses the sequence alignment score as the similarity [17]. So as not to be influenced by identifiers or statement order, Kikuchi et al. proposed a modified method in 2014 that uses syntactic elements together with lexical elements as tokens and excludes identifier names and literal values from the tokens [4]. Brass, proposed by Belkhouche et al. in 2004, is an application of parse tree comparison algorithms. It represents each file as a binary tree and a symbol table (the data dictionary) containing the variables and data structures used in the file [18].

3 Proposed Approach

Definition 1 (abstract syntax tree). An abstract syntax tree is a tree representation of the abstract syntactic structure of a piece of source code.

Figure 1 shows an example of an abstract syntax tree created from short sample code. The left part is the two-line source code, and the right part is its corresponding abstract syntax tree after lexical and syntax analysis. Even a short piece of source code yields a large abstract syntax tree, and only the leaf nodes carry the symbols of the source code. Traversing the leaf nodes from left to right reproduces the source code.
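The relationship between the tree and the source text can be checked on a small hand-built example. The tree below is a simplified stand-in for the one in Figure 1: the node labels are taken from the figure, but the nesting is abbreviated and the tuple representation and both helper functions are assumptions made for illustration.

```python
# An inner node is (label, [children]); a leaf is (label, "token text").
tree = ("block_item_list", [
    ("block_item", [("declaration", [
        ("type_spec", [("T_INT", "int")]),
        ("init_declarator", [("V_ID", "a")]),
        ("SEMICOLON", ";"),
    ])]),
    ("block_item", [("exp_stat", [
        ("assignment_exp", [
            ("unary_exp", [("V_ID", "a")]),
            ("assignment_operator", [("ASSIGN", "=")]),
            ("additive_exp", [
                ("mult_exp", [("V_ID", "a")]),
                ("O_ADD", "+"),
                ("mult_exp", [("V_INTEGER", "2")]),
            ]),
        ]),
        ("SEMICOLON", ";"),
    ])]),
])

def leaves(node):
    """Yield the leaf tokens from left to right."""
    _, body = node
    if isinstance(body, str):
        yield body
    else:
        for child in body:
            yield from leaves(child)

def count_nodes(node):
    """Count every node in the tree, inner nodes and leaves alike."""
    _, body = node
    return 1 if isinstance(body, str) else 1 + sum(count_nodes(c) for c in body)

print(" ".join(leaves(tree)))   # the token stream of the two source lines
print(count_nodes(tree))        # far more nodes than tokens
```

Even in this abbreviated form the tree has 22 nodes for 9 tokens, which matches the observation that a short piece of code yields a comparatively large tree.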

Definition 2 (tree kernel). A tree kernel is a method for comparing tree structures and measuring the similarity between two pieces of content in Natural Language Processing (NLP); it was first proposed by Collins and Duffy in 2001 [7].

Between two trees $T_1$ and $T_2$, a kernel $K(T_1, T_2)$ can be represented as an inner product between two vectors:

$$K(T_1, T_2) = \mathbf{h}(T_1) \cdot \mathbf{h}(T_2) \tag{1}$$

Each tree $T$ can be represented as a vector $\mathbf{h}(T) = (h_1(T), h_2(T), \ldots, h_n(T))$, where $h_i(T)$ is the number of occurrences of the $i$th subtree in $T$.

Definition 3 (TF-IDF). In information retrieval, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection of documents [8].

TF-IDF includes two parts: TF is the abbreviation for "Term Frequency" and IDF is the abbreviation for "Inverse Document Frequency". TF($t$, $d$) denotes the frequency of word $t$ in document $d$. Similarly, DF($t$, $D$) is the frequency


Figure 1: The source code "int a; a = a + 2;" (left) and its abstract syntax tree (right).

[Figure: workflow diagram with the elements AST 1, AST 2, TF-IDF, parse tree kernel, and normalized score.]

[Figure: abstract syntax tree whose leaves include <int>, <temp>, <=>, <+>, and <2>, with the subtree (<temp=temp+2>) highlighted.]






$$w_{s,T} = \mathrm{TF}(s, T) \cdot \mathrm{IDF}(s, T) \tag{2}$$

$$\mathrm{TF}(s, T) = \frac{\mathrm{cnt}(s, T)}{n(T)}, \qquad \mathrm{IDF}(s, T) = \log_2\left(1 + \frac{N}{c(s)}\right) \tag{3}$$
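Equations (2) and (3) can be evaluated with a short sketch. The surrounding definitions are not available here, so the reading of the symbols is an assumption: $n(T)$ is taken as the total number of subtree occurrences in $T$, $N$ as the number of trees in the collection, and $c(s)$ as the number of trees that contain subtree $s$. The function name and the use of string keys for subtrees are likewise illustrative.

```python
import math
from collections import Counter

def tfidf_weights(tree_counts: list[Counter]) -> list[dict[str, float]]:
    """Compute w_{s,T} = TF(s,T) * IDF(s,T) for every subtree key in every tree.

    `tree_counts[i]` maps a subtree key s to cnt(s, T_i).
    """
    n_trees = len(tree_counts)                 # N (assumed meaning)
    containing = Counter()                     # c(s) (assumed meaning)
    for counts in tree_counts:
        containing.update(counts.keys())
    weights = []
    for counts in tree_counts:
        total = sum(counts.values())           # n(T) (assumed meaning)
        weights.append({
            s: (cnt / total) * math.log2(1 + n_trees / containing[s])
            for s, cnt in counts.items()
        })
    return weights
```

With three trees that all contain a `main` subtree and one tree-specific subtree each, the shared subtree receives weight 0.5 and each distinctive subtree receives weight 1.0, so common code counts for less.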



$$K(T_1, T_2) = \mathbf{h}(T_1) \odot \mathbf{h}(T_2) = \sum_{s_i \in (S_{T_1} \cup S_{T_2})} \mathrm{cnt}(s_i, T_1) \cdot \mathrm{cnt}(s_i, T_2) \tag{4}$$

$$K(T_1, T_2) = \sum_{s_1 \in S_{T_1}} \sum_{s_2 \in S_{T_2}} C(s_1, s_2) \tag{5}$$
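The count-based form in (4) can be sketched directly. Here a "subtree" means the complete subtree rooted at a node, which is a simplifying assumption; tree kernels in the style of Collins and Duffy count a larger family of tree fragments. The tuple representation of nodes and both function names are illustrative.

```python
from collections import Counter

def subtree_counts(tree) -> Counter:
    """Count complete subtrees, keyed by a canonical string.

    A node is (label, [children]); a leaf has an empty child list.
    """
    counts = Counter()

    def walk(node):
        label, children = node
        key = label + "(" + ",".join(walk(c) for c in children) + ")"
        counts[key] += 1
        return key

    walk(tree)
    return counts

def count_kernel(t1, t2) -> int:
    """Sum of cnt(s, T1) * cnt(s, T2) over subtrees s.

    Subtrees missing from either tree contribute zero, so iterating over
    the shared keys gives the same value as iterating over the union.
    """
    h1, h2 = subtree_counts(t1), subtree_counts(t2)
    return sum(h1[s] * h2[s] for s in h1.keys() & h2.keys())
```

For example, a tree `add(id, num)` and a tree `add(id, id)` share only the `id` leaf, which occurs once in the first and twice in the second, giving a kernel value of 2.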






$$\mathrm{dist}(a, b) = \frac{\mathrm{lev}_{a,b}}{\max(|a|, |b|)} \tag{7}$$
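Equation (7) normalizes the Levenshtein distance by the length of the longer string. A minimal sketch follows, using the standard dynamic-programming recurrence; returning 0.0 for two empty strings is an assumption, since (7) is undefined in that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def dist(a: str, b: str) -> float:
    """Levenshtein distance divided by max(|a|, |b|), a value in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Identical strings give 0, and strings with no character in common give 1, so the value can be read as a relative difference.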




$$C(s_1, s_2) = \lambda_{\mathrm{tree}} \cdot w_{s_1, T_1} \cdot w_{s_2, T_2} \tag{9}$$

(4) Otherwise,

$$C(s_1, s_2) = \lambda_{\mathrm{tree}} \prod_{i=1}^{ns(s_1)} \left(1 + \max_{1 \le j \le ns(s_2)} C(\mathrm{st}(s_1, i), \mathrm{st}(s_2, j))\right) \cdot w_{s_1, T_1} \cdot w_{s_2, T_2} \tag{10}$$
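The recursion in (9) and (10), together with the double sum in (5), can be sketched as below. Several parts are assumptions rather than the paper's definition: the conditions that select each case are not available here, so the sketch returns 0 when the root labels differ and applies (9) when either node is a leaf; (10) is read as a product over the children of $s_1$; and the value 0.5 for $\lambda_{\mathrm{tree}}$ and the weight callbacks `w1` and `w2` (defaulting to 1) are hypothetical.

```python
LAMBDA_TREE = 0.5  # hypothetical decay value, chosen only for illustration

def C(s1, s2, w1=lambda s: 1.0, w2=lambda s: 1.0, lam=LAMBDA_TREE):
    """Weighted match score of two subtrees; a node is (label, [children])."""
    label1, kids1 = s1
    label2, kids2 = s2
    if label1 != label2:          # assumed case: different roots do not match
        return 0.0
    weight = w1(s1) * w2(s2)
    if not kids1 or not kids2:    # assumed condition for (9): a leaf is reached
        return lam * weight
    prod = 1.0
    for c1 in kids1:              # product over the children of s1, as in (10)
        prod *= 1.0 + max(C(c1, c2, w1, w2, lam) for c2 in kids2)
    return lam * prod * weight

def nodes(tree):
    """Yield every node of the tree, each being the root of one subtree."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def K(t1, t2, **kw):
    """Sum C over all pairs of subtrees, as in (5)."""
    return sum(C(a, b, **kw) for a in nodes(t1) for b in nodes(t2))
```

Taking the maximum over the children of $s_2$, rather than pairing children by position, means that reordering sibling subtrees does not reduce the score in this sketch.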



$$K'(T_1, T_2) = \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}} \tag{11}$$
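The normalization in (11) works for any kernel function, so it can be sketched independently of how $K$ is computed. The helper name, the toy dot-product kernel used to exercise it, and the choice to return 0.0 when a self-kernel is zero are assumptions made for illustration.

```python
import math

def normalized_kernel(kernel, t1, t2) -> float:
    """K'(T1, T2) = K(T1, T2) / sqrt(K(T1, T1) * K(T2, T2))."""
    denom = math.sqrt(kernel(t1, t1) * kernel(t2, t2))
    return kernel(t1, t2) / denom if denom else 0.0

def dot(h1: dict, h2: dict) -> float:
    """Toy kernel: inner product of two sparse count vectors."""
    return sum(v * h2.get(k, 0) for k, v in h1.items())

a = {"x": 1, "y": 1}
b = {"x": 2}
print(normalized_kernel(dot, a, a))  # a tree compared with itself
print(normalized_kernel(dot, a, b))  # partial overlap
```

Dividing by the two self-kernels keeps the score at 1 for identical inputs regardless of tree size, which makes scores comparable across program pairs of different lengths.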






























5 Conclusions







[Figure: (a) Precision, (b) Recall, and (c) Jaccard for WASTK, ASTK, Sim, and JPlag; x-axis ticks 1, 2, 3.]




Disclosure


Competing Interests


Acknowledgments



References































