Scaling classical clone detection tools for ultra large datasets

Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy
IWSC 2013

Inter-Project Clone Detection

• Active research topic in the community.
• Goal: Construct inter-project clone corpus.
• Applications
  • Study Global Developer Behavior
  • Discover Potential APIs and Libraries
  • Internet-Scale Clone Search
  • API Recommendation
  • API Usage Support
  • …

Problem: Inter-Project Detection

• Many state-of-the-art tools (classical tools) do not scale to large datasets.
  • Memory Requirements
  • Computational Complexity
  • Execution Time
  • Underlying limitations in their algorithms or data structures.
• Instead, novel scalable techniques are used.
  • Challenging to develop.
• Wish to use tools from a variety of domains when building an inter-project clone corpus.

Goal and Motivation

GOAL: To scale classical clone detection tools to ultra-large datasets.

MOTIVATION: To allow classical clone detection tools to contribute to inter-project clone corpuses.

Shuffling Framework

• Scales classical tools to ultra-large datasets.
  • Using standard hardware.
  • Without modifying the original tool.
• Incurs a loss of recall.
• Method: Non-Deterministic Dataset Partitioning

Shuffling Framework - Procedure

1. The source files of the dataset are randomly partitioned into n equally sized subsets.

[Figure: the ultra-large dataset shown as a grid of 16 numbered subsets.]

Subset size is dictated by the clone detection tool's scalability limits.


2. Each subset is searched independently by the clone detection tool.

[Figure: subsets 1 through 16, each passed to its own run of the clone detection tool.]


3. The detected clone pairs are added to a clone repository.

[Figure: the clones detected in subsets 1 through 16 collected as "Detected Clones".]

4. Steps (1) through (3) are repeated for r rounds.

[Figure: the dataset feeding the clone repository over r rounds.]

In total: n*r detection experiments.
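
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `shuffle_detect`, `toy_detect`, and the file names are hypothetical, and `toy_detect` merely stands in for an external clone detection tool that would normally be invoked on each subset.

```python
import random
from typing import Callable, Iterable, List, Optional, Set, Tuple


def shuffle_detect(
    files: List[str],
    subset_size: int,
    rounds: int,
    detect: Callable[[List[str]], Iterable[Tuple[str, str]]],
    seed: Optional[int] = None,
) -> Set[frozenset]:
    """Collect the clone pairs `detect` reports over `rounds` random partitionings."""
    rng = random.Random(seed)
    repository: Set[frozenset] = set()
    for _ in range(rounds):
        # Step 1: randomly partition the files (the last subset may be smaller).
        order = list(files)
        rng.shuffle(order)
        subsets = [order[i:i + subset_size] for i in range(0, len(order), subset_size)]
        # Step 2: run the unmodified tool on each subset independently.
        for subset in subsets:
            # Step 3: add the reported pairs to the repository; duplicates collapse.
            for a, b in detect(subset):
                repository.add(frozenset((a, b)))
    # Step 4 is the outer loop: the repository accumulates across rounds.
    return repository


def toy_detect(subset: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a real tool: files sharing a first letter are 'clones'."""
    ordered = sorted(subset)
    return [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:] if a[0] == b[0]]


files = ["a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2"]
# A single subset holding every file plays the role of a native (unshuffled) run.
gold = shuffle_detect(files, subset_size=len(files), rounds=1, detect=toy_detect)
found = shuffle_detect(files, subset_size=4, rounds=3, detect=toy_detect, seed=0)
print(len(found), "of", len(gold), "clone pairs found after 3 rounds")
```

A pair can only be reported in a round where both of its files land in the same subset, which is where the loss of recall comes from.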

Shuffling Framework - Evaluation

Gold Standard
• Clone detection report of a tool executed natively (without shuffling).

Total Recall
• % of gold standard found after r shuffling rounds of n partitions.
• Measure for unique clone pairs or unique cloned fragments.
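
As a small worked example of the two measures, the sketch below computes total recall over clone pairs and over cloned fragments. The function names and the fragment labels A to E are illustrative assumptions; clone pairs are modeled as unordered pairs of fragment identifiers.

```python
from typing import Iterable, Set


def fragments(pairs: Iterable[frozenset]) -> Set[str]:
    """Unique cloned fragments that appear in at least one clone pair."""
    return {fragment for pair in pairs for fragment in pair}


def total_recall(gold: Set, found: Set) -> float:
    """Fraction of the gold standard that the shuffled runs also reported."""
    if not gold:
        raise ValueError("gold standard is empty")
    return len(gold & found) / len(gold)


def pair(a: str, b: str) -> frozenset:
    return frozenset((a, b))


# Hypothetical gold standard: one clone class {A, B, C} plus the pair (D, E).
gold_pairs = {pair("A", "B"), pair("B", "C"), pair("A", "C"), pair("D", "E")}
# Hypothetical shuffling result: only two of the four pairs were seen together.
found_pairs = {pair("A", "B"), pair("B", "C")}

pair_recall = total_recall(gold_pairs, found_pairs)
fragment_recall = total_recall(fragments(gold_pairs), fragments(found_pairs))
print(pair_recall, fragment_recall)
```

In this toy case fragment recall is higher than pair recall, because a fragment counts as found as soon as any one of its pairs is reported.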

Preliminary Study

• Test with "regular size" systems:
  • JHotDraw (20 KLOC, 285 files)
  • ArgoUML (190 KLOC, 1845 files)
  • JDK1.7 (900 KLOC, 6916 files)
• Tools:
  • CCFinder, Deckard, iClones, NiCad, SimCad, Simian
• Shuffling: 15 subsets, 30 shuffling rounds
• Measured: total recall after each round

Preliminary Study – JDK1.7

[Figure: recall (y-axis, 0 to 1) versus round (x-axis, 1 to 30) on JDK1.7, with n = 15 subsets and r = 30 rounds. Series, with the values given in the legend: Deckard (1834042), iClones (49716), NiCad (8105), SimCad (549923), Simian (217409).]

Preliminary Study

• ~60-90% total recall achievable
• Shuffling performance varies by detection tool.
• Generally, a larger gold standard requires more rounds to get the same total recall.
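
A back-of-the-envelope model helps put these recall figures in context. The sketch below is not from the slides and rests on stated assumptions: files are assigned to subsets uniformly at random, a clone pair spans two different files, the tool always reports a pair whose files share a subset, and the dataset is large enough that two given files share a subset with probability of about 1/n per round. The function name is hypothetical.

```python
def expected_pair_recall(n: int, r: int) -> float:
    """Chance that two given files share a subset at least once in r rounds.

    Assumes each round is an independent, uniform partition into n subsets,
    so the per-round chance of co-location is approximated by 1/n.
    """
    if n < 1 or r < 0:
        raise ValueError("need n >= 1 and r >= 0")
    return 1.0 - (1.0 - 1.0 / n) ** r


# The preliminary-study configuration: n = 15 subsets, r = 30 rounds.
estimate = expected_pair_recall(15, 30)
print(round(estimate, 2))
```

Under these assumptions the estimate for n = 15 and r = 30 is roughly 0.87. Measured recall can differ in either direction: clones within a single file are never split by partitioning, while some tools report different results on a subset than on the whole dataset.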

Main Experiment: Dataset

IJaDataset 2.0: An Inter-Project Java Corpus
• Keivanloo et al., 2012 (Proc. MSR)
• Crawled 25,000 Open-Source Java Projects
• 3 million Java source files, 356 MLOC
• Outliers (>2000 lines)
  • 6238 removed

Experiment - Hardware

Clone detection (shuffling):

• Workstation-Class Hardware
  • Quad Core CPU
  • 12-16GB of RAM
  • Above Average Disk IO
• ~$1000 PC
• Allocated on shared cloud resources.
  • Western Canada Research Grid (Bugaboo Cluster)
  • Amazon EC2 Instances

Experiment - Tools

• Simian
• NiCad
• Deckard
• CCFinderX
  • Terminated without explanation.
• SimCad
  • Execution aborts on troublesome file.
• iClones
  • Compatibility issue.

Simian

• IJaDataset2
  • Scalability limit: RAM
  • 50,000 file subsets (58 partitions), 30 rounds
  • 8-12hr to partition, 4-10hr for detection (per round)
• Settings
  • Minimum Clone Size: 6 lines
  • No source normalization (execution time)
• Gold Standard
  • Amazon EC2 instance with 68GB of RAM
  • 300 billion clone pairs, 11 million cloned fragments

Simian: Cloned Fragment Recall

[Figure: clone fragment recall (y-axis, 0 to 1) versus round (x-axis, 1 to 30). Labeled data points: 0.166903883, 0.476927684, 0.626533533, 0.715431474.]

Considering only clone classes with <= 100 fragments.

Simian: Clone Recall (Trim)

[Figure: total recall (y-axis, 0 to 1) versus rounds (x-axis, 1 to 30) for clone pairs and cloned fragments. Labeled data points: 0.24792718, 0.619514665. Trend lines: Linear (Clone Pairs), y = 0.0067x + 0.0533, R² = 0.99585; Log. (Cloned Fragments), y = 0.1364ln(x) + 0.1199, R² = 0.95064.]
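
The two trend-line equations reported on this slide can be evaluated directly. The sketch below does so, and also inverts the linear fit to estimate how many rounds a given clone pair recall would need. The function names are hypothetical, and extrapolating either fit beyond the 30 measured rounds is an assumption rather than a result from the slides.

```python
import math


def pair_recall_fit(rounds: float) -> float:
    """Linear trend line for clone pairs: y = 0.0067x + 0.0533."""
    return 0.0067 * rounds + 0.0533


def fragment_recall_fit(rounds: float) -> float:
    """Logarithmic trend line for cloned fragments: y = 0.1364 ln(x) + 0.1199."""
    return 0.1364 * math.log(rounds) + 0.1199


def rounds_for_pair_recall(target: float) -> int:
    """Rounds the linear fit would need to reach `target` (an extrapolation)."""
    return math.ceil((target - 0.0533) / 0.0067)


print(pair_recall_fit(30), fragment_recall_fit(30))
print(rounds_for_pair_recall(0.5))
```

If the linear trend held unchanged, reaching 50% clone pair recall would take 67 rounds, which illustrates why clone pairs need many more rounds than cloned fragments.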

NiCad

• IJaDataset2
  • Scalability: Limited data-structure size.
  • 10,000 file subsets, 289 partitions, 20 rounds
  • 7-15hr partitioning, 23-31hr detection (per round)
• Settings:
  • Clone Size: 10-2500 lines.
  • Minimum clone similarity: 70%
• Gold Standard
  • Not possible.

NiCad – Detection vs. Rounds

[Figure: unique clones found and unique clone fragments found versus round (x-axis, 1 to 20), plotted on two y-axes (0 to 1.00E+06 and 0 to 6.00E+06). Linear trend line: y = 245387x + 767852, R² = 0.99993.]

Deckard

• IJaDataset
  • Scalability Limit: Execution time.
  • 10,000 file subsets, 289 partitions, 20 rounds
  • 7-15hr partitioning, 5-7 days detection (per round)
• Settings:
  • Minimum Fragment Size: 50 tokens
  • Sliding Window: 5 tokens
  • Minimum Clone Similarity: 90% (tree)
• Gold Standard
  • Execution time too long.
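
To compare the cost of the three configurations, the sketch below tallies the n*r tool runs and the hours per round from the figures on the Simian, NiCad, and Deckard slides. Two assumptions are mine: that "(per round)" applies to both the partitioning and the detection time, and that the two phases run one after the other. Deckard's 5-7 days are converted to 120-168 hours. The `Setup` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Setup:
    tool: str
    partitions: int                    # n: subsets per round
    rounds: int                        # r: shuffling rounds
    partition_hours: Tuple[int, int]   # (min, max) per round
    detection_hours: Tuple[int, int]   # (min, max) per round

    def tool_runs(self) -> int:
        """Total detection experiments: n * r."""
        return self.partitions * self.rounds

    def round_hours(self) -> Tuple[int, int]:
        """(min, max) hours per round, assuming partitioning then detection."""
        return (
            self.partition_hours[0] + self.detection_hours[0],
            self.partition_hours[1] + self.detection_hours[1],
        )


setups = [
    Setup("Simian", 58, 30, (8, 12), (4, 10)),
    Setup("NiCad", 289, 20, (7, 15), (23, 31)),
    Setup("Deckard", 289, 20, (7, 15), (5 * 24, 7 * 24)),
]

for s in setups:
    print(s.tool, s.tool_runs(), "runs,", s.round_hours(), "hours per round")
```

Under these assumptions Simian needs 1,740 tool runs and NiCad and Deckard 5,780 each, with Deckard's rounds taking several times longer than the others.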

Deckard: Detection vs. Rounds

[Figure: unique reported clone fragments (y-axis, 1.00E+07 to 1.90E+07) versus round (x-axis, 1 to 10).]

Deckard – Detection vs. Rounds (Trim)

Considering only clone classes with <= 10 fragments.

[Figure: unique clones and unique clone fragments found versus round (x-axis, 1 to 10), plotted on two y-axes (0 to 1.80E+07 and 0 to 1.20E+08).]

Main Experiment Conclusions

• Shuffling framework finds cloned fragments faster than the clone pair relationships between them.
• A large number of rounds may be needed to detect a sizable number of the clone pairs.
• Appropriate when loss of recall is acceptable.
  • Ex: contributing towards a multi-tool clone corpus.
• Processing the clones found in an inter-project clone corpus can itself become a scalability issue.

Clone Recovery

How can we improve clone pair discovery?
• Without a significant increase in rounds?

IDEA: Leverage Cloned Fragment Detection Ability
• Apply Transitive Property on Clone Repository.
  • If (A,B) and (B,C) then (A,C)
• Perform clone search amongst cloned fragments.
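
The transitive rule above can be sketched with a union-find structure: fragments connected by any chain of detected pairs are grouped into one class, and every pair within a class is emitted. This is one possible realization, not necessarily the authors' method, and `recover_transitive` is a hypothetical name. Treating the clone relation as transitive is itself an assumption; for near-miss clones with a similarity threshold, the recovered pairs are better read as candidates.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def recover_transitive(pairs: Iterable[Tuple[str, str]]) -> Set[frozenset]:
    """Return all pairs implied by applying (A,B), (B,C) => (A,C) to `pairs`."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    # Merge the classes of every detected pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group fragments by class representative.
    classes: Dict[str, Set[str]] = {}
    for fragment in list(parent):
        classes.setdefault(find(fragment), set()).add(fragment)

    # Emit every pair inside each class (quadratic in the class size).
    return {
        frozenset(p)
        for members in classes.values()
        for p in combinations(sorted(members), 2)
    }


detected = [("A", "B"), ("B", "C"), ("D", "E")]
recovered = recover_transitive(detected)
print(sorted(tuple(sorted(p)) for p in recovered))
```

Because the number of pairs grows quadratically with class size, very large clone classes would need to be capped or handled lazily in practice.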

Transitive Clone Recovery Test

[Figure: recall (y-axis, 0 to 1) versus round (x-axis, 1 to 30) for NiCad on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Transitive Clone Recovery Test

[Figure: recall (y-axis, 0 to 1) versus round (x-axis, 1 to 30) for Simian on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Future Work

1. Investigate additional tools.
2. Investigate efficient clone recovery methods.
3. Directly compare with deterministic approach.
4. Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).

Thank You!
