Post on 03-Mar-2017
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
Software Evolution and Architecture Lab
University of Zurich, Switzerland
{alexandru,panichella,gall}@ifi.uzh.ch
22.02.2017
SANER’17
Klagenfurt, Austria
The Problem Domain
• Static analysis (e.g. #Attr., McCabe, coupling...)
v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0
• Many revisions, fine-grained historical data
A Typical Analysis Process
• Select project (www) → clone
• Select revision → checkout
• Apply analysis tool → store analysis results (Res)
• More revisions? → select the next revision
• More projects? → select the next project
Redundancies all over...
Redundancies in historical code analysis
Across Revisions
• Few files change
• Only small parts of a file change
• Changes may not even affect results
Impact on Code Study Tools
• Repeated analysis of "known" code
• Storing redundant results
Across Languages
• Each language has its own toolchain
• Yet they share many metrics
Impact on Code Study Tools
• Re-implementing identical analyses
• Generalizability is expensive
Important!
Techniques implemented in LISA
Your favourite analysis features
Pick what you like!
#1: Avoid Checkouts
With checkouts: clone → checkout (read + write) → read → analyze
• For every file: 2 read ops + 1 write op
• Checkout includes irrelevant files
• Need 1 CWD for every revision to be analyzed in parallel
Without checkouts: clone → read → analyze
• Only read relevant files in a single read op
• No write ops
• No overhead for parallelization
Git → File Abstraction Layer → Analysis Tool
E.g. for the JDK Compiler:
import java.net.URI
import java.nio.CharBuffer
import javax.tools.SimpleJavaFileObject
import javax.tools.JavaFileObject.Kind
class JavaSourceFromCharrArray(name: String, val code: CharBuffer)
    extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {
  override def getCharContent(ignoreEncodingErrors: Boolean): CharSequence = code
}
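The file abstraction idea can be tried out in plain Java. The following is a minimal sketch, not LISA's implementation: it hands source text that lives only in memory to the JDK compiler and parses it without touching the working directory. The class names `InMemorySource` and `ParseWithoutCheckout` and the method `topLevelTypeNames` are made up for illustration; in a real setup the string would come from a Git blob rather than a literal.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

/** Hypothetical helper: a source file that exists only in memory. */
final class InMemorySource extends SimpleJavaFileObject {
    private final CharSequence code;

    InMemorySource(String name, CharSequence code) {
        super(URI.create("string:///" + name), JavaFileObject.Kind.SOURCE);
        this.code = code;
    }

    @Override
    public CharSequence getCharContent(boolean ignoreEncodingErrors) {
        return code; // served from memory, no file system access
    }
}

final class ParseWithoutCheckout {
    /** Parses the given source text and returns the names of its top-level types. */
    static List<String> topLevelTypeNames(String fileName, String source) {
        // Requires a JDK; getSystemJavaCompiler() returns null on a runtime without a compiler.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new InMemorySource(fileName, source)));
        List<String> names = new ArrayList<>();
        try {
            // parse() only builds syntax trees; no class files are written.
            for (CompilationUnitTree unit : task.parse()) {
                for (Tree decl : unit.getTypeDecls()) {
                    if (decl instanceof ClassTree) {
                        names.add(((ClassTree) decl).getSimpleName().toString());
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String source = "public class Demo { public void run() { } }";
        System.out.println(topLevelTypeNames("Demo.java", source));
    }
}
```

Because every source object is independent, several revisions can be parsed in parallel without one working directory per revision.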
#2: Use a multi-revision representation of your sources
Merge ASTs
[Figure: the ASTs of rev. 1 to rev. 4 merged into a single graph; nodes are labelled with revision ranges, e.g. rev. range [1-4], rev. range [1-2]]
AspectJ (~440k LOC):
• 1 commit: 2.2M nodes
• All >7000 commits: 6.5M nodes
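A rough sketch of the merging idea, under simplifying assumptions: each revision is reduced to a set of node identifiers (plain strings here, which is an assumption, not how LISA identifies nodes), and each node in the merged structure carries the revision ranges in which it exists. The names `MergedAst`, `Range` and `merge` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MergedAst {
    /** Inclusive revision range [start-end] in which a node exists. */
    static final class Range {
        final int start;
        int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "-" + end + "]";
        }
    }

    /**
     * Merges per-revision node sets into one map from node id to revision ranges.
     * Revision r is stored at revisions.get(r - 1).
     */
    static Map<String, List<Range>> merge(List<Set<String>> revisions) {
        Map<String, List<Range>> merged = new TreeMap<>();
        for (int rev = 1; rev <= revisions.size(); rev++) {
            for (String node : revisions.get(rev - 1)) {
                List<Range> ranges = merged.computeIfAbsent(node, k -> new ArrayList<>());
                Range last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
                if (last != null && last.end == rev - 1) {
                    last.end = rev;                  // node also existed in the previous revision: extend
                } else {
                    ranges.add(new Range(rev, rev)); // node is new or re-appeared: open a new range
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Set<String>> revisions = List.of(
                Set.of("Demo", "Demo.run"),
                Set.of("Demo", "Demo.run"),
                Set.of("Demo", "Demo.stop"),
                Set.of("Demo", "Demo.stop"));
        // Four separate trees hold 8 nodes in total; the merged map holds 3.
        merge(revisions).forEach((node, ranges) -> System.out.println(node + " " + ranges));
    }
}
```

The merged structure grows only with the amount of change between revisions, which is consistent with the AspectJ node counts quoted on the slide.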
#3: Store AST nodes only if they're needed for analysis
public class Demo {
  public void run() {
    for (int i = 1; i < 100; i++) {
      if (i % 3 == 0 || i % 5 == 0) {
        System.out.println(i);
      }
    }
  }
}
What's the complexity (1+#forks) and name for each method and class?
parse: 140 AST nodes (using ANTLR)
CompilationUnit
  TypeDeclaration
    Modifiers
      public
    Name
      Demo
    Members
      Method
        Modifiers
          public
        Name
          run
        Parameters
        ReturnType
          PrimitiveType
            VOID
        Body
          Statements
            ...
  ...
filtered parse: 7 AST nodes (using ANTLR) instead of 140
TypeDeclaration
  Name
    Demo
  Method
    Name
      run
    ForStatement
    IfStatement
    ConditionalExpression
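The filtering step can be sketched as follows. This is an illustration under assumptions, not LISA's code: the tree is built by hand and heavily abbreviated (19 nodes rather than the 140 ANTLR produces), and `FilteredAst`, `Node`, `RELEVANT` and `FORKS` are hypothetical names. Nodes whose type the analysis does not need are dropped and their kept descendants are re-attached to the nearest kept ancestor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class FilteredAst {
    /** Minimal AST node: a type label plus children. */
    static final class Node {
        final String type;
        final List<Node> children = new ArrayList<>();

        Node(String type, Node... kids) {
            this.type = type;
            children.addAll(List.of(kids));
        }
    }

    /** Node types this particular analysis needs (names taken from the slide). */
    static final Set<String> RELEVANT = Set.of(
            "TypeDeclaration", "Method", "Name",
            "ForStatement", "IfStatement", "ConditionalExpression");
    static final Set<String> FORKS = Set.of(
            "ForStatement", "IfStatement", "ConditionalExpression");

    /** Keeps relevant nodes; a dropped node passes its kept descendants up to its parent. */
    static List<Node> filter(Node n) {
        List<Node> keptChildren = new ArrayList<>();
        for (Node c : n.children) {
            keptChildren.addAll(filter(c));
        }
        if (!RELEVANT.contains(n.type)) {
            return keptChildren;
        }
        Node copy = new Node(n.type);
        copy.children.addAll(keptChildren);
        return List.of(copy);
    }

    static int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }

    static int forks(Node n) {
        int f = FORKS.contains(n.type) ? 1 : 0;
        for (Node c : n.children) f += forks(c);
        return f;
    }

    /** Complexity as defined on the slide: 1 + #forks. */
    static int complexity(Node n) {
        return 1 + forks(n);
    }

    /** Abbreviated, hand-built tree for the Demo class. */
    static Node demo() {
        Node ifStmt = new Node("IfStatement",
                new Node("ConditionalExpression"),
                new Node("Block", new Node("ExpressionStatement")));
        Node forStmt = new Node("ForStatement", new Node("Block", ifStmt));
        Node method = new Node("Method",
                new Node("Modifiers"), new Node("Name"), new Node("Parameters"),
                new Node("ReturnType", new Node("PrimitiveType")),
                new Node("Body", new Node("Statements", forStmt)));
        Node type = new Node("TypeDeclaration",
                new Node("Modifiers"), new Node("Name"), new Node("Members", method));
        return new Node("CompilationUnit", type);
    }

    public static void main(String[] args) {
        Node full = demo();
        Node kept = filter(full).get(0);
        System.out.println(size(full) + " nodes parsed, " + size(kept) + " nodes kept, complexity "
                + complexity(kept));
    }
}
```

In this sketch the filter runs on an already built tree to keep the example short; the slide describes a filtered parse, so the dropped nodes would ideally never be stored in the first place.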
#4: Use non-duplicative data structures to store your results
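[Figure: results for one node stored per revision range ([1-1], [2-3], [4-4]) instead of per revision (rev. 1 to rev. 4). Attributes: label, #attr, mcc. Values shown: label InnerClass; #attr 0, 4; mcc 1, 2, 4]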
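One way to store results without duplication is to record a value only when it differs from the previous revision, so that each stored entry implicitly covers a range of revisions. The sketch below assumes integer metric values and revisions arriving in ascending order; `RangeStore` and its methods are hypothetical, and the sample values in `main` are illustrative rather than read off the figure.

```java
import java.util.Map;
import java.util.TreeMap;

final class RangeStore {
    // First revision of a range -> value that holds from that revision onwards.
    private final TreeMap<Integer, Integer> valueFrom = new TreeMap<>();

    /** Records the value computed for a revision; revisions must arrive in ascending order. */
    void put(int revision, int value) {
        Map.Entry<Integer, Integer> last = valueFrom.lastEntry();
        if (last == null || last.getValue() != value) {
            valueFrom.put(revision, value); // value changed: open a new range
        }
        // Otherwise nothing is stored; the current range simply covers one more revision.
    }

    /** Returns the value valid at the given revision, or null if none was recorded yet. */
    Integer get(int revision) {
        Map.Entry<Integer, Integer> e = valueFrom.floorEntry(revision);
        return e == null ? null : e.getValue();
    }

    int storedEntries() {
        return valueFrom.size();
    }

    public static void main(String[] args) {
        RangeStore mcc = new RangeStore();
        mcc.put(1, 1);
        mcc.put(2, 2);
        mcc.put(3, 2); // same as rev. 2, so no new entry
        mcc.put(4, 4);
        System.out.println(mcc.storedEntries() + " entries for 4 revisions, mcc at rev. 3 = " + mcc.get(3));
    }
}
```

With these sample values the store ends up with three ranges, [1-1], [2-3] and [4-4], which mirrors the shape of the figure.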
LISA also does:
#5: Parallel parsing
#6: Asynchronous graph computation
#7: Generic graph computations applying to ASTs from compatible languages
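To illustrate #7, here is a small sketch of how one analysis could serve several languages through a mapping from a generic concept to language-specific node labels. Everything in it is an assumption for illustration: the class `GenericForkCount`, the concept name "fork", and in particular the JavaScript labels, which are invented and not taken from any real grammar or from LISA.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GenericForkCount {
    /** Hypothetical mappings: generic concept -> language-specific AST node labels. */
    static final Map<String, Set<String>> JAVA = Map.of(
            "fork", Set.of("IfStatement", "ForStatement", "ConditionalExpression"));
    static final Map<String, Set<String>> JAVASCRIPT = Map.of(
            "fork", Set.of("ifStatement", "iterationStatement", "logicalOrExpression"));

    /** Language-independent analysis: 1 + number of nodes that map to the "fork" concept. */
    static int complexity(List<String> nodeLabels, Map<String, Set<String>> mapping) {
        Set<String> forks = mapping.getOrDefault("fork", Set.of());
        int count = 0;
        for (String label : nodeLabels) {
            if (forks.contains(label)) {
                count++;
            }
        }
        return 1 + count;
    }

    public static void main(String[] args) {
        List<String> javaMethod = List.of("Method", "Name", "ForStatement", "IfStatement",
                "ConditionalExpression");
        List<String> jsFunction = List.of("functionDeclaration", "ifStatement");
        System.out.println(complexity(javaMethod, JAVA));
        System.out.println(complexity(jsFunction, JAVASCRIPT));
    }
}
```

The analysis itself never mentions a language; supporting another language means adding a mapping, not another implementation.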
To Summarize...
The LISA Analysis Process
• Select project (www) → clone
• Parallel parse into merged graph
• Async. compute: the analysis runs on the graph
• Store analysis results (Res)
• More projects? → select the next project
• ANTLRv4 grammar: generates the parser; used by the language mappings
• Language mappings (grammar to analysis): determine which AST nodes are loaded; used by the analysis
• Analysis formulated as graph computation: determines which data is persisted
How well does it work, then?
Marginal cost for +1 revision
Average parsing + computation time per revision when analyzing n revisions of AspectJ (10 common metrics)
# of revisions    Seconds per revision
1                 41.670
10                4.633
100               0.525
1000              0.109
2000              0.082
3000              0.071
4000              0.052
5000              0.041
6000              0.032
7000              0.033
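To put the per-revision averages into perspective, the total time for n revisions is simply n times the average. The sketch below works this out for the largest measurement and compares it with a purely hypothetical baseline in which every revision costs as much as the single-revision measurement; that baseline is an assumption for illustration, not a number reported in the talk. `MarginalCost` and `totalSeconds` are made-up names.

```java
final class MarginalCost {
    /** Total seconds for n revisions, given the average seconds per revision. */
    static double totalSeconds(int revisions, double avgSecondsPerRevision) {
        return revisions * avgSecondsPerRevision;
    }

    public static void main(String[] args) {
        // Measured average when analyzing 7000 revisions together.
        double merged = totalSeconds(7000, 0.033);
        // Assumption: each revision analyzed in isolation costs as much as the n = 1 case.
        double isolated = totalSeconds(7000, 41.670);
        System.out.printf("merged: %.0f s, isolated (assumed): %.0f s, ratio: %.0f%n",
                merged, isolated, isolated / merged);
    }
}
```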
Overall Performance Stats
Language                  Java            C#            JavaScript
#Projects                 100             100           100
#Revisions                646'261         489'764       204'301
#Files (parsed!)          3'235'852       3'234'178     507'612
#Lines (parsed!)          1'370'998'072   961'974'773   194'758'719
Total Runtime (RT)¹       18:43h          52:12h        29:09h
Median RT¹                2:15min         4:54min       3:43min
Tot. Avg. RT per Rev.²    84ms            401ms         531ms
Med. Avg. RT per Rev.²    30ms            116ms         166ms
¹ Including cloning and persisting results
² Excluding cloning and persisting results
What's the catch? (There are a few...)
The (not so) minor stuff
• Must implement analyses from scratch
• No help from a compiler
• Non-file-local analyses need some effort
• Moved files/methods etc. add overhead
• Uniquely identifying files/entities is hard
• (No impact on results, though)
Language matters
[Figure: comparison of JavaScript, C# and Java]
E.g.: JavaScript takes longer because:
• Larger files, less modularization
• Slower parser (automatic semicolon insertion)
LISA is EXTREME
• complex / feature-rich / heavyweight
• simple / generic / lightweight
Thank you for your attention
SANER ‘17, Klagenfurt, 22.02.2017
Read the paper: http://t.uzh.ch/Fj
Try the tool: http://t.uzh.ch/Fk
Get the slides: http://t.uzh.ch/Fm
Contact me: alexandru@ifi.uzh.ch