Source Code Clone Search (Iman Keivanloo PhD seminar)
![Page 1: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/1.jpg)
Internet-scale Source Code Search and Analysis Framework
Iman Keivanloo
Advisor: Dr. Juergen Rilling
PhD Seminar
Computer Science and Software Engineering Department
November 17, 2011
![Page 2: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/2.jpg)
Agenda
• Research Context
• Major questions & answers
• Next step
• Conclusion
• Time Table
![Page 3: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/3.jpg)
Research Context
Internet-Scale Code Search
“is searching the Internet for source code to help solve a software development problem”
[Gallardo, SUITE’09]
![Page 4: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/4.jpg)
How to search for Source Code?
• Free-form Query:
– “how to write into file in Java”
• Structural Query:
– “select col1 from table1 where col1='%write'”
[Keivanloo, ICSM’10][Keivanloo, SUITE’11]
![Page 5: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/5.jpg)
Research Focus
Step 1: Input [the simplified structural query]
Suggested simplified query: select each line which has
(1) a method call statement on the trigger method.
Internet-Scale Structural Code Search Engine

...
CSVReadFile csvData=new CSVReadFile("input.csv");
myWindow.trigger(csvData);
OutputStream o=new OutputStream();
…
This line looks like a match; however, it uses .CSV instead of .XML. We can now use our clone search engine to find other code fragments similar to this one.

...
Event e=new Event(50);
e.trigger();
e.update();
...

...
Listener res=new Listener();
res.trigger("warm-up");
res.close();
...

Step 2: Input [the fragment selected in the first step and its target line (red)]
Real-time Clone Search Engine
...
Window myWindow=new Window();
CSVReadFile csvData=new CSVReadFile("...
myWindow.trigger(csvData);
OutputStream o=new OutputStream();
myWindow.flush(o);
myWindow.close();
...

Similar Fragment Search
Gapped clone
...
Window r=new Window();
long timestamp=System.Now();
System.out.println("Start reasoning...");
XMLStream xmldata=new XMLStream(io);
r.trigger(xmldata);
OutputStream o=new OutputStream();
r.flush(o);
…
The pattern is similar, but it uses XMLStream instead of XMLFile as the input.

Unordered core
…
Window var=new Window();
XMLReadFile r=new XMLReadFile ("k.xml");
OutputStream o=new OutputStream();
var.trigger(r);
var.flush(o);
…
This match is acceptable, even if the order is different from the 1:1 match.

The ideal expected answer
XMLReadFile inFile=new XMLReadFile("kb.xml");
Window myWindow=new Window();
myWindow.trigger(inFile);
OutputStream result=new OutputStream();
myWindow.flush(result);
![Page 6: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/6.jpg)
Research Challenge
![Page 7: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/7.jpg)
The Web Search Challenge
![Page 8: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/8.jpg)
But Often Still Fail to Deliver the Expected Results After 10 Years of Research
![Page 9: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/9.jpg)
No Ambiguity!
![Page 10: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/10.jpg)
Early Conclusion
Source Code Search is similar to Web Search
![Page 11: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/11.jpg)
Early Conclusion
Source Code Search is similar to Web Search
1. Search techniques = ?
2. Ambiguity resolution techniques = Code Analysis
Analysis (Ambiguity resolution)
Search
![Page 12: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/12.jpg)
Research Approach Overview
Internet-scale Source Code Search and Analysis Framework
• Analysis: Semantic Web-based Code Analysis
• Search: Code Clone Search
![Page 13: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/13.jpg)
Definitions & Requirements
Search
![Page 14: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/14.jpg)
Clone (Source Code Clone)
• Similar code fragments
• Type 1: Identical except whitespace …
• Type 2: Identical except variable names ...
• Type 3: Identical except a few missing…
• Type 4: Similar functionality
[Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming.]
for (AttributeEntity theAttributeEntity:aTableEntity.ge…System.out.println(“Hello!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge…System.out.println(“Hello!");
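The first two clone types above can be illustrated with a small normalizer. This is a minimal sketch, not the SeClone implementation: the class name `CloneNormalizer`, the short keyword list, and the `ID` placeholder are all assumptions made for illustration. Two fragments are treated as Type-1 clones when they match after whitespace is removed, and as Type-2 clones when they also match after string literals are blanked and identifiers are replaced by a placeholder.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative normalizer for Type-1 and Type-2 comparison (hypothetical, not SeClone's code). */
final class CloneNormalizer {
    // Deliberately incomplete keyword list; enough for the slide examples (assumption).
    private static final Set<String> KEYWORDS =
            Set.of("for", "if", "else", "while", "new", "return");
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Type-1 view: strip all whitespace so layout differences disappear. */
    static String type1Key(String code) {
        return code.replaceAll("\\s+", "");
    }

    /** Type-2 view: blank string literals, rename every non-keyword identifier, then strip layout. */
    static String type2Key(String code) {
        String noLiterals = code.replaceAll("\"[^\"]*\"", "\"\"");
        String renamed = IDENTIFIER.matcher(noLiterals)
                .replaceAll(m -> KEYWORDS.contains(m.group()) ? m.group() : "ID");
        return type1Key(renamed);
    }
}
```

Comparing `type2Key` of two of the slide fragments (written with straight quotes) yields the same key even though variable names and string contents differ. Note that this sketch also renames method names, which the talk later argues should be preserved for precision.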
![Page 15: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/15.jpg)
Clone Search
Query:
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!");
Code Database:
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (Attribute attribute:es1.getAttributes()) System.out.println("Test");
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("The end");
![Page 16: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/16.jpg)
Clone Search
Query
Answer
![Page 17: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/17.jpg)
Internet-scale Clone Search
Query
for (Attribute attribute:exampleSet.getAttributes()) System.out.println(“Hello!");
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
![Page 18: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/18.jpg)
Internet-scale Real-time Clone Search
![Page 19: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/19.jpg)
Internet-scale Real-time Clone Search
Requirements?
![Page 20: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/20.jpg)
Internet-scale Real-time Clone Search
Requirements:
• Millions LOC (~300 MLOC)
![Page 21: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/21.jpg)
Internet-scale Real-time Clone Search
Requirements:
• Millions LOC
• 100 Milliseconds
![Page 22: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/22.jpg)
Internet-scale Real-time Clone Search
Requirements:
• Millions LOC
• 100 Milliseconds
• Precision
• Recall
• Type-1, 2, 3…
for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (Attribute attribute:es1.getAttributes()) System.out.println("Test");
![Page 23: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/23.jpg)
Internet-scale Real-time Clone Search
Requirements:
• Millions LOC
• 100 Milliseconds
• Precision
• Recall
• Type-1, 2, 3…
![Page 24: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/24.jpg)
Research Question #1
Is it actually possible?
Real-time answer (faster than 100 ms)
![Page 25: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/25.jpg)
Our Initial Analysis
• SeClone: An Internet-scale Real-time Clone Search Engine
[Diagram labels: Search, Analysis, Phase 1, Phase 2]
[Keivanloo, ICPC’11]
![Page 26: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/26.jpg)
Inside SeClone
Phase 1
• Syntactical pattern matching
[Diagram labels: Phase 1 (Pattern Matching), Phase 2]
![Page 27: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/27.jpg)
Inside SeClone
Phase 2
• Information Retrieval & Clustering algorithm
1 for (Attribute attribute:exampleSet.getAttributes()) System.out.println(“The end");
2 for (Attribute attribute:es1.getAttributes()) System.out.println(“Test");
3 for (AttributeEntity theAttributeEntity:aTableEntity.ge…System.out.println(“Hello!");
4 for (JAttribute attribute:formType.getAttributes()) {System.out.println(“Test");
5 for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
Phase 1: Pattern Matching
Phase 2: Semantic Matching
![Page 28: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/28.jpg)
Research Question #2
The Dilemma
How to distribute the 100 milliseconds between phases?
[Scale from 0 to 100 milliseconds (ticks at 0, 25, 50, 75, 100), split between Pattern Matching and Semantic Matching]
[Keivanloo, WCRE’11]
![Page 29: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/29.jpg)
Our Further Analysis [WCRE’11]
Requirements
• 100 Milliseconds
• Millions LOC
• Precision
• Recall
• Type-1, 2, 3…
The Dilemma
[Scale from 0 to 100 milliseconds (ticks at 0, 25, 50, 75, 100), split between Pattern Matching and Semantic Matching]
Constraints
SeClone [ICPC’11]
Data Characteristics
O(p * log n)
![Page 30: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/30.jpg)
Source Code Characteristics
![Page 31: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/31.jpg)
Analysis of the Data Characteristics: Dataset preparation
• Name: IJaDataset
– Comprehensive (inter-project)
• To avoid project-specific results
– ~18,000 projects
– 1,500,000 unique Java classes
• No duplicate, empty, or buggy files
– ~300 MLOC
• Online at http://aseg.cs.concordia.ca/seclone
![Page 32: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/32.jpg)
Analysis of the Data Characteristics: Granularity Effect
• Three Level Similarity (TLS): Set of similar three-line fragments
• First Level Similarity (FLS): single-line patterns
![Page 33: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/33.jpg)
Analysis of the Data Characteristics: Clone frequency
• How many code fragments are analyzed by each query?
• Answer: 3 (Average)
![Page 34: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/34.jpg)
Analysis of the Data Characteristics: Clone frequency
• Observation result:
– TLS distributes the candidates into 3.9 times more groups than FLS
– Its groups are 6 times smaller than those of FLS
![Page 35: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/35.jpg)
Analysis of the Data Characteristics: Clone frequency
• Conclusion:
– The TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly
– Why?
• (1) each TLS group has 2.37 members on average
• (2) it distributes candidates into small groups
• (3) for each query, only one group must be evaluated
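The reason a query only has to evaluate one group can be sketched with a hash-keyed index over three-line windows. This is a hypothetical illustration under stated assumptions: the class `ThreeLineIndex`, the whitespace-only normalization, and the use of `String.hashCode()` as the 32-bit key are not taken from the talk, which does not show its transformation rules or hash function.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index over three-line windows (TLS-style grouping); not the SeClone code. */
final class ThreeLineIndex {
    // 32-bit key -> locations ("file:lineOffset") of windows that produced that key.
    private final Map<Integer, List<String>> groups = new HashMap<>();

    /** Assumed normalization: drop whitespace so layout-only differences share a pattern. */
    static String normalize(String line) {
        return line.replaceAll("\\s+", "");
    }

    /** 32-bit key for three consecutive normalized lines. */
    static int key(String a, String b, String c) {
        return (normalize(a) + "\n" + normalize(b) + "\n" + normalize(c)).hashCode();
    }

    /** Adds every three-line sliding window of a file to the index. */
    void add(String file, List<String> lines) {
        for (int i = 0; i + 2 < lines.size(); i++) {
            int k = key(lines.get(i), lines.get(i + 1), lines.get(i + 2));
            groups.computeIfAbsent(k, unused -> new ArrayList<>()).add(file + ":" + i);
        }
    }

    /** A query is answered from exactly one group: the one stored under its key. */
    List<String> lookup(String a, String b, String c) {
        return groups.getOrDefault(key(a, b, c), List.of());
    }
}
```

With small groups, a lookup is one hash probe plus a scan of a handful of candidates, which is why group size matters more than corpus size in this design. A real index would also have to deal with key collisions and with the outlier groups discussed next.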
![Page 36: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/36.jpg)
What Does an Outlier Look Like?
• Outlier Definition: patterns with more than 2,000 occurrences
• Observation result:
• Only ~1000 patterns out of 30M
• ~ 0.01% patterns
• Mostly insignificant code patterns
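The outlier definition above amounts to a frequency threshold over patterns. The following is a small sketch of that idea; the class name `OutlierFilter` and the representation of patterns as plain strings are assumptions, and the threshold is passed in rather than fixed at 2,000.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper that flags patterns occurring more often than a threshold. */
final class OutlierFilter {
    /** Returns the distinct patterns that occur more than {@code threshold} times. */
    static Set<String> outliers(List<String> patterns, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : patterns) {
            counts.merge(p, 1, Integer::sum); // count occurrences per pattern
        }
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

Calling `OutlierFilter.outliers(patterns, 2000)` would reproduce the slide's definition. How the flagged patterns are then handled (skipped, capped, or indexed separately) is not stated in the slides.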
![Page 37: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/37.jpg)
Analysis of the Data Characteristics: Sampling efficiency
• Can sampling be used to reduce the amount of data being analyzed?
• Answer: Yes (e.g., 33% contains 91% of popular patterns)
![Page 38: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/38.jpg)
Analysis of the Data Characteristics: Indexing
• Can 32-bit hash keys (versus MD5) be used without affecting index quality?
abc 123 abc 123 aXc 456 aXc 123
• Answer: Yes (0.002% error rate)
Only 10 cases where three distinct strings share the same key
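The kind of measurement behind this answer can be sketched by counting 32-bit keys that are shared by distinct pattern strings. This is only an illustration: the talk does not name its 32-bit hash function, so `String.hashCode()` stands in for it here, and `HashKeyCheck` is a hypothetical name.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical collision check for 32-bit keys; String.hashCode() is a stand-in hash. */
final class HashKeyCheck {
    /** Counts the 32-bit keys that are shared by at least two distinct strings. */
    static int collidingKeys(Collection<String> patterns) {
        Map<Integer, Set<String>> byKey = new HashMap<>();
        for (String p : patterns) {
            // Identical strings collapse into one set entry; only distinct strings count.
            byKey.computeIfAbsent(p.hashCode(), unused -> new HashSet<>()).add(p);
        }
        int colliding = 0;
        for (Set<String> sameKey : byKey.values()) {
            if (sameKey.size() > 1) {
                colliding++;
            }
        }
        return colliding;
    }
}
```

For example, `"Aa"` and `"BB"` have the same `hashCode()` in Java, so a collection containing both reports one colliding key, while repeated copies of `"abc"` do not count as a collision.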
![Page 39: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/39.jpg)
Are Method Names Reliable?
• Input Data: Koders 1-year query log
– ~10M records
• Observation purpose:
– Importance of method names
• Observation result:
– 98% success rate vs. 69%
• Result interpretation:
– Method names in this context are a reliable source of information
– They must be preserved to increase precision
![Page 40: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/40.jpg)
Source Code Search Framework
![Page 41: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/41.jpg)
Internet-scale Real-time Code Clone Search via Multi-level Indexing
– Internet-scale & Speed
• 32-bit hash values
– Type-3 clone
• Multi-level indexing
– Customized for Internet-scale Code Search
• Special transformation rule
![Page 42: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/42.jpg)
Response Time (Pattern Matching) [WCRE’11]
• Regular queries
– 25 microseconds
• 99.99% of queries
– 900 microseconds
![Page 43: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/43.jpg)
Conclusion
![Page 44: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/44.jpg)
Answer: Research Question #1
Is Internet-scale Real-time Code Search Possible?
YES
![Page 45: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/45.jpg)
Answer: Research Question #2
The Dilemma
How to distribute the 100 milliseconds between phases?
• Pattern Matching: 1 millisecond
• Semantic Matching: 99 milliseconds
![Page 46: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/46.jpg)
Research Opportunity
Semantic Matching: 99 milliseconds
Analysis
![Page 47: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/47.jpg)
Summary
Step 1
• Studied characteristics of source code on the Internet
– Unique pattern distribution (sampling application)
– Pattern frequencies (multi-level search)
– 32-bit hashing strength (code pattern)
– Outlier patterns
– Method name importance
Step 2
• Designed an Internet-scale clone search
– Customized for code search (precision)
– Fine granularity
– Multi-level indexing approach (Type-3 clone)
– Microsecond-range response time (up to 10 times faster)
![Page 48: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/48.jpg)
Publication
Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/)
• Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland.
• Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada.
Source Code Sharing using Linked Data (secold.org)
• Iman Keivanloo, Chris Forbes, Juergen Rilling, Philippe Charland. Towards Sharing Source Code Facts Using Linked Data. ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2011.
Source Code Search (http://aseg.cs.concordia.ca/codesearch)
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco, USA.
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.
![Page 49: Source Code Clone Search (Iman keivanloo PhD seminar)](https://reader037.fdocuments.net/reader037/viewer/2022103000/556ae267d8b42a86218b45e5/html5/thumbnails/49.jpg)
Questions?
Thank you for your kind attention