Scaling classical clone detection tools for ultra large datasets

Scaling Classical Clone Detection Tools for Ultra-Large Datasets. Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy. IWSC 2013.

Transcript of Scaling classical clone detection tools for ultra large datasets

Page 1: Scaling classical clone detection tools for ultra large datasets

Scaling Classical Clone Detection Tools for Ultra-Large Datasets

Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy, IWSC 2013

Page 2: Scaling classical clone detection tools for ultra large datasets

Inter-Project Clone Detection

• Active research topic in the community.
• Goal: Construct inter-project clone corpus.
• Applications
  • Study Global Developer Behavior
  • Discover Potential APIs and Libraries
  • Internet-Scale Clone Search
  • API Recommendation
  • API Usage Support
  • …

Page 3: Scaling classical clone detection tools for ultra large datasets

Problem: Inter-Project Detection

• Many state-of-the-art tools do not scale to large datasets (classical tools).
  • Memory requirements
  • Computational complexity
  • Execution time
  • Underlying limitations in their algorithms or data structures.
• Instead, novel scalable techniques are used.
  • Challenging to develop.
• Wish to use tools from a variety of domains when building an inter-project clone corpus.

Page 4: Scaling classical clone detection tools for ultra large datasets

Goal and Motivation

GOAL: To scale classical clone detection tools to ultra-large datasets.

MOTIVATION: To allow classical clone detection tools to contribute to inter-project clone corpuses.

Page 5: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework

• Scales classical tools to ultra-large datasets.
• Using standard hardware.
• Without modifying the original tool.
• Incurs a loss of recall.
• Method: Non-Deterministic Dataset Partitioning

Page 6: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework - Procedure

1. The source files of the dataset are randomly partitioned into n equally sized subsets.

[Figure: an ultra-large dataset split into subsets numbered 1 to 16.]

Subset size is dictated by the clone detection tool's scalability limits.

Page 7: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework - Procedure

2. Each subset is searched independently by the clone detection tool.

[Figure: subsets 1, 2, ..., 16 are each passed to the clone detection tool, giving one result per subset.]

Page 8: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework - Procedure

3. The detected clone pairs are added to a clone repository.

[Figure: results from subsets 1 to 16 are merged into the set of detected clones.]

Page 9: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework - Procedure

4. Steps (1) through (3) are repeated for r rounds.

[Figure: the dataset feeds the clone repository over r rounds.]

In total, n*r detection experiments are run.
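The four steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `partition`, `shuffle_detect`, and the `toy_detect` stand-in are hypothetical names, and a real run would invoke an unmodified clone detection tool on each subset instead of the toy function.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence, Set, Tuple

ClonePair = Tuple[str, str]  # assumed representation: a pair of fragment ids


def partition(files: Sequence[str], n: int, rng: random.Random) -> List[List[str]]:
    """Step 1: shuffle the files and split them into n near-equal subsets."""
    shuffled = list(files)
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def shuffle_detect(
    files: Sequence[str],
    n: int,
    r: int,
    detect: Callable[[List[str]], Iterable[ClonePair]],
    seed: int = 0,
) -> Set[ClonePair]:
    """Run r rounds of n independent detections and merge the results."""
    rng = random.Random(seed)
    repository: Set[ClonePair] = set()
    for _ in range(r):                           # step 4: repeat for r rounds
        for subset in partition(files, n, rng):  # step 1: new random partition
            for a, b in detect(subset):          # step 2: tool sees one subset only
                # step 3: store each pair once, regardless of order
                repository.add((a, b) if a <= b else (b, a))
    return repository


def toy_detect(subset: List[str]) -> Iterator[ClonePair]:
    """Stand-in detector: files sharing a name prefix count as clones."""
    groups = {}
    for name in subset:
        groups.setdefault(name.split("_")[0], []).append(name)
    for members in groups.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                yield (a, b)


files = [f"{group}_{i}" for group in "abc" for i in range(4)]
gold = shuffle_detect(files, n=1, r=1, detect=toy_detect)    # native run, no shuffling
found = shuffle_detect(files, n=3, r=10, detect=toy_detect)  # 3 * 10 detection runs
```

Because each detection run only sees one subset, `found` can never contain a pair that the native run misses, but it may lack pairs whose files never shared a subset.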

Page 10: Scaling classical clone detection tools for ultra large datasets

Shuffling Framework - Evaluation

Gold Standard
• Clone detection report of a tool executed natively (without shuffling).

Total Recall
• % of gold standard found after r shuffling rounds of n partitions.
• Measured for unique clone pairs or unique cloned fragments.
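As a sketch of how the two recall measures differ, the snippet below computes total recall for clone pairs and for cloned fragments. The function names and the tiny example data are assumptions made for illustration only.

```python
from typing import Iterable, Set, Tuple

ClonePair = Tuple[str, str]


def normalize(pairs: Iterable[ClonePair]) -> Set[ClonePair]:
    """Treat (a, b) and (b, a) as the same clone pair."""
    return {(a, b) if a <= b else (b, a) for a, b in pairs}


def fragments(pairs: Iterable[ClonePair]) -> Set[str]:
    """Unique cloned fragments: every fragment that appears in some pair."""
    return {frag for pair in pairs for frag in pair}


def total_recall(found: Set, gold: Set) -> float:
    """Fraction of the gold standard that was found."""
    if not gold:
        raise ValueError("gold standard is empty")
    return len(found & gold) / len(gold)


# Hypothetical data: the gold standard has 4 pairs over 5 fragments.
gold = normalize([("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")])
found = normalize([("B", "A"), ("C", "B")])

pair_recall = total_recall(found, gold)                            # 2 of 4 pairs
fragment_recall = total_recall(fragments(found), fragments(gold))  # 3 of 5 fragments
```

In this toy case fragment recall (0.6) is higher than pair recall (0.5), because finding a fragment once is enough for it to count, while each pair must be observed directly.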

Page 11: Scaling classical clone detection tools for ultra large datasets

Preliminary Study

• Test with "regular size" systems:
  • JHotDraw (20 KLOC, 285 files)
  • ArgoUML (190 KLOC, 1845 files)
  • JDK1.7 (900 KLOC, 6916 files)
• Tools:
  • CCFinder, Deckard, iClones, NiCad, SimCad, Simian
• Shuffling: 15 subsets, 30 shuffling rounds
• Measured: total recall after each round
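One rough way to reason about these parameters, offered as an assumption rather than something stated in the slides: if files are assigned to subsets uniformly at random, two given files share a subset in any one round with probability of about 1/n, so they share a subset at least once in r rounds with probability 1 - (1 - 1/n)^r. The hypothetical helper below evaluates that expression.

```python
def co_location_probability(n: int, r: int) -> float:
    """Chance that two given files share a subset at least once in r rounds.

    Idealized model: each round is an independent uniform partition into n
    subsets, so the per-round chance is approximated as 1/n.
    """
    if n < 1 or r < 0:
        raise ValueError("need n >= 1 and r >= 0")
    return 1.0 - (1.0 - 1.0 / n) ** r


p = co_location_probability(n=15, r=30)
print(f"{p:.3f}")  # about 0.874
```

This only bounds what shuffling can expose to the tool for clones spanning two files; whether the tool then reports the pair depends on the tool itself.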

Page 12: Scaling classical clone detection tools for ultra large datasets

Preliminary Study – JDK1.7

[Figure: recall (y-axis, 0 to 1) against round (x-axis, 1 to 30), with n = 15 subsets and r = 30 rounds. Legend: Deckard (1834042), iClones (49716), NiCad (8105), SimCad (549923), Simian (217409).]

Page 13: Scaling classical clone detection tools for ultra large datasets

Preliminary Study

• ~60-90% total recall achievable
• Shuffling performance varies by detection tool.
• Generally, a larger gold standard requires more rounds to get the same total recall.

Page 14: Scaling classical clone detection tools for ultra large datasets

Main Experiment: Dataset

IJaDataset 2.0: An Inter-Project Java Corpus
• Keivanloo et al., 2012 (Proc. MSR)
• Crawled 25,000 open-source Java projects
• 3 million Java source files, 356 MLOC
• Outliers (>2000 lines)
  • 6238 removed

Page 15: Scaling classical clone detection tools for ultra large datasets

Experiment - Hardware

Clone detection (shuffling):
• Workstation-class hardware
  • Quad-core CPU
  • 12-16 GB of RAM
  • Above-average disk I/O
  • ~$1000 PC
• Allocated on shared cloud resources.
  • Western Canada Research Grid (Bugaboo Cluster)
  • Amazon EC2 instances

Page 16: Scaling classical clone detection tools for ultra large datasets

Experiment - Tools

• Simian
• NiCad
• Deckard
• CCFinderX
  • Terminated without explanation.
• SimCad
  • Execution aborts on troublesome file.
• iClones
  • Compatibility issue.

Page 17: Scaling classical clone detection tools for ultra large datasets

Simian

• IJaDataset2
  • Scalability limit: RAM
  • 50,000-file subsets (58 partitions), 30 rounds
  • 8-12 hr to partition, 4-10 hr for detection (per round)
• Settings
  • Minimum clone size: 6 lines
  • No source normalization (execution time)
• Gold Standard
  • Amazon EC2 instance with 68 GB of RAM
  • 300 billion clone pairs, 11 million cloned fragments
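To put the Simian figures in perspective, the arithmetic below derives the number of detection experiments (n*r) and a time range for the whole run. It is a sketch under two assumptions not stated in the slides: that both the partitioning and detection times apply to every round, and that rounds run one after another. `ShufflingRun` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ShufflingRun:
    partitions: int                   # n
    rounds: int                       # r
    partition_hours: Tuple[int, int]  # (low, high) per round
    detection_hours: Tuple[int, int]  # (low, high) per round

    def experiments(self) -> int:
        """Number of individual tool executions: n * r."""
        return self.partitions * self.rounds

    def sequential_hours(self) -> Tuple[int, int]:
        """(low, high) total hours if rounds run one after another."""
        low = self.rounds * (self.partition_hours[0] + self.detection_hours[0])
        high = self.rounds * (self.partition_hours[1] + self.detection_hours[1])
        return low, high


simian = ShufflingRun(partitions=58, rounds=30,
                      partition_hours=(8, 12), detection_hours=(4, 10))
print(simian.experiments())       # 1740
print(simian.sequential_hours())  # (360, 660)
```

Under those assumptions the 30 rounds amount to 1740 tool executions and roughly 15 to 27.5 days of sequential work.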

Page 18: Scaling classical clone detection tools for ultra large datasets

Simian: Cloned Fragment Recall

[Figure: cloned fragment recall (y-axis, 0 to 1) against round (x-axis, 1 to 30). Data labels on the curve: 0.166903883, 0.476927684, 0.626533533, 0.715431474.]

Considering only clone classes with <= 100 fragments.

Page 19: Scaling classical clone detection tools for ultra large datasets

Simian: Clone Recall (Trim)

[Figure: total recall (y-axis, 0 to 1) against rounds (x-axis, 1 to 30) for two series, Clone Pairs and Cloned Fragments. Data labels: 0.24792718 and 0.619514665.]

Trend lines:
• Linear (Clone Pairs): y = 0.0067x + 0.0533, R² = 0.99585
• Log. (Cloned Fragments): y = 0.1364 ln(x) + 0.1199, R² = 0.95064
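The two trend lines reported on this slide can be evaluated directly. The sketch below does so and, purely as an illustration, inverts the linear fit to estimate how many rounds a given clone pair recall would need if the linear trend continued past round 30. The slides do not claim that it does, so the extrapolated figure is an assumption. Function names are hypothetical.

```python
import math


def clone_pair_fit(rounds: float) -> float:
    """Linear trend reported for clone pair recall."""
    return 0.0067 * rounds + 0.0533


def cloned_fragment_fit(rounds: float) -> float:
    """Logarithmic trend reported for cloned fragment recall."""
    return 0.1364 * math.log(rounds) + 0.1199


def rounds_for_pair_recall(target: float) -> int:
    """Rounds needed for a target pair recall, assuming the line holds."""
    return math.ceil((target - 0.0533) / 0.0067)


print(round(clone_pair_fit(30), 4))       # 0.2543
print(round(cloned_fragment_fit(30), 4))  # 0.5838
print(rounds_for_pair_recall(0.5))        # 67
```

The fitted values at round 30 are close to the two data labels on the chart, and the extrapolation is consistent with the later conclusion that many rounds may be needed to reach a sizable share of the clone pairs.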

Page 20: Scaling classical clone detection tools for ultra large datasets

NiCad

• IJaDataset2
  • Scalability: limited data-structure size.
  • 10,000-file subsets, 289 partitions, 20 rounds
  • 7-15 hr partitioning, 23-31 hr detection (per round)
• Settings:
  • Clone size: 10-2500 lines.
  • Minimum clone similarity: 70%
• Gold Standard
  • Not possible.

Page 21: Scaling classical clone detection tools for ultra large datasets

NiCad – Detection vs. Rounds

[Figure: unique clones found and unique cloned fragments found against round (x-axis, 1 to 20), plotted on two y-axes (0 to 1.00E+06 and 0 to 6.00E+06). Linear trend line: y = 245387x + 767852, R² = 0.99993.]

Page 22: Scaling classical clone detection tools for ultra large datasets

Deckard

• IJaDataset
  • Scalability limit: execution time.
  • 10,000-file subsets, 289 partitions, 20 rounds
  • 7-15 hr partitioning, 5-7 days detection (per round)
• Settings:
  • Minimum fragment size: 50 tokens
  • Sliding window: 5 tokens
  • Minimum clone similarity: 90% (tree)
• Gold Standard
  • Execution time too long.

Page 23: Scaling classical clone detection tools for ultra large datasets

Deckard: Detection vs. Rounds

[Figure: unique reported clone fragments (y-axis, 1.00E+07 to 1.90E+07) against round (x-axis, 1 to 10).]

Page 24: Scaling classical clone detection tools for ultra large datasets

Deckard – Detection vs. Rounds (Trim)

Considering only clone classes with <= 10 fragments.

[Figure: clones and fragments found against round (x-axis, 1 to 10), plotted on two y-axes (0 to 1.80E+07 and 0 to 1.20E+08). Axis label: Unique Clone Fragments Found.]

Page 25: Scaling classical clone detection tools for ultra large datasets

Main Experiment Conclusions

• The shuffling framework finds cloned fragments faster than the clone pair relationships between them.
• A large number of rounds may be needed to detect a sizable number of the clone pairs.
• Appropriate when loss of recall is acceptable.
  • Ex: contributing towards a multi-tool clone corpus.
• Processing the clones found in an inter-project clone corpus can itself become a scalability issue.

Page 26: Scaling classical clone detection tools for ultra large datasets

Clone Recovery

How can we improve clone pair discovery?
• Without a significant increase in rounds?

IDEA: Leverage Cloned Fragment Detection Ability
• Apply transitive property on clone repository.
  • If (A,B) and (B,C) then (A,C)
• Perform clone search amongst cloned fragments.
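The transitive rule can be sketched with a union-find structure: fragments connected through any chain of detected pairs are grouped into one class, and every pair inside a class is emitted. This is an illustrative sketch, not the recovery procedure used in the study, and `transitive_recovery` is a hypothetical name.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

ClonePair = Tuple[str, str]


def transitive_recovery(pairs: Iterable[ClonePair]) -> Set[ClonePair]:
    """Return all pairs implied by transitivity, including the input pairs."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b        # merge the two clone classes

    classes: Dict[str, List[str]] = {}
    for fragment in list(parent):
        classes.setdefault(find(fragment), []).append(fragment)

    recovered: Set[ClonePair] = set()
    for members in classes.values():
        recovered.update(combinations(sorted(members), 2))
    return recovered


repository = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(transitive_recovery(repository)))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'E')]
```

Two cautions apply. Similarity-based clone relations are not necessarily transitive, so recovered pairs are best treated as candidates to verify. A class of k fragments also yields k(k-1)/2 pairs, which can itself become a scalability concern for very large classes.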

Page 27: Scaling classical clone detection tools for ultra large datasets

Transitive Clone Recovery Test

[Figure: recall (y-axis, 0 to 1) against round (x-axis, 1 to 30) for NiCad on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Page 28: Scaling classical clone detection tools for ultra large datasets

Transitive Clone Recovery Test

[Figure: recall (y-axis, 0 to 1) against round (x-axis, 1 to 30) for Simian on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Page 29: Scaling classical clone detection tools for ultra large datasets

Future Work

1. Investigate additional tools.
2. Investigate efficient clone recovery methods.
3. Directly compare with deterministic approach.
4. Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).

Page 30: Scaling classical clone detection tools for ultra large datasets

Thank  You!