Java Unit Testing Tool Competition — Fifth Round
SVV.lu software verification & validation
Annibale Panichella, Urko Rueda Molina
Previous Editions
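| Round | Year | Venue | Coverage tool | Mutation tool | #CUTs | #Projects | #Participants | Statistical tests |
|---|---|---|---|---|---|---|---|---|
| Round 1 | 2013 | ICST | Cobertura | Javalanche | 77 | 5 | 2 | ✗ |
| Round 2 | 2014 | FITTEST | JaCoCo | PITest | 63 | 9 | 4 | ✗ |
| Round 3 | 2015 | SBST | JaCoCo | PITest | 63 | 9 | 8 | ✗ |
| Round 4 | 2016 | SBST | Defects4J (real faults) | Defects4J (real faults) | 68 | 5 | 4 | ✗ |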
New Edition
| Round | Year | Venue | Coverage tool | Mutation tool | #CUTs | #Projects | #Participants | Statistical tests |
|---|---|---|---|---|---|---|---|---|
| Round 5 | 2017 | SBST | JaCoCo | PITest + Our Env. | 69 | 8 | 2+2 | ✓ |
The Infrastructure
• The previous edition used Defects4J to detect flaky tests and to measure effectiveness
• In the new edition, we modified the infrastructure to work with libraries that are not in Defects4J
• We developed our own tool to detect flaky tests
• Effectiveness is based on mutation analysis: PITest + JaCoCo
Test Management
Flaky tests:
• Pass during generation but fail when re-executed
• Detection mechanism: we run each test suite five times
• Ignored when computing the coverage scores
Non-compiling tests:
• Generated test suites were re-compiled in our own execution environment
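The detection mechanism for flaky tests can be sketched as follows. This is a minimal illustration, not the competition's actual tool: the function names, the `run_test` callback, and the scripted runner are hypothetical, and the sketch assumes that any failure during the five re-executions marks a test as flaky because it passed at generation time.

```python
from typing import Callable, Dict, List, Set

RERUNS = 5  # the slides state that each test suite is run five times


def find_flaky_tests(
    test_names: List[str],
    run_test: Callable[[str], bool],
    reruns: int = RERUNS,
) -> List[str]:
    """Return the tests that fail in at least one of `reruns` re-executions.

    Assumption: every test passed when the tool generated it, so a single
    failure on re-execution is enough to call the test flaky.
    """
    flaky = []
    for name in test_names:
        outcomes = [run_test(name) for _ in range(reruns)]
        if not all(outcomes):
            flaky.append(name)
    return flaky


def make_scripted_runner(failing_runs: Dict[str, Set[int]]) -> Callable[[str], bool]:
    """Hypothetical stand-in for a real JUnit launcher.

    `failing_runs` maps a test name to the 1-based execution numbers on
    which that test should fail.
    """
    counts: Dict[str, int] = {}

    def run(name: str) -> bool:
        counts[name] = counts.get(name, 0) + 1
        return counts[name] not in failing_runs.get(name, set())

    return run


runner = make_scripted_runner({"testParse": {3}})
print(find_flaky_tests(["testAdd", "testParse"], runner))  # prints ['testParse']
```

Tests returned by `find_flaky_tests` would then be left out before coverage scores are computed.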
Metric Computation
Code Coverage:
• Statement coverage
• Condition coverage
Mutation Score:
• We did not use PITest’s running engine since it gave errors for test cases with ad-hoc/non-standard JUnit runners (e.g., in EvoSuite)
• We only use the PITest engine for the generation of mutants
• Combining PITest with JaCoCo: we execute only the mutants that infect covered lines
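The idea of combining the two tools, running only mutants that sit on covered lines, can be sketched as below. The `Mutant` record and the `(class name, line)` coverage set are hypothetical simplifications; the real PITest and JaCoCo report formats are not modelled here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Mutant:
    """Hypothetical, simplified view of one generated mutant."""
    mutant_id: str
    class_name: str
    line: int


def split_by_coverage(
    mutants: Iterable[Mutant],
    covered_lines: Set[Tuple[str, int]],
) -> Tuple[List[Mutant], List[Mutant]]:
    """Split mutants into (to_execute, skipped).

    A mutant is executed only if the line it modifies was covered by the
    generated tests; a mutant on an uncovered line cannot be killed, so
    running the tests against it would be wasted effort.
    """
    to_execute: List[Mutant] = []
    skipped: List[Mutant] = []
    for mutant in mutants:
        if (mutant.class_name, mutant.line) in covered_lines:
            to_execute.append(mutant)
        else:
            skipped.append(mutant)
    return to_execute, skipped


mutants = [
    Mutant("m1", "Parser", 10),
    Mutant("m2", "Parser", 42),
    Mutant("m3", "Lexer", 7),
]
covered = {("Parser", 10), ("Lexer", 7)}
to_execute, skipped = split_by_coverage(mutants, covered)
print([m.mutant_id for m in to_execute])  # prints ['m1', 'm3']
print([m.mutant_id for m in skipped])     # prints ['m2']
```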
Scoring Formula
We apply the same formula used in the last competition since it combines coverage metrics, effectiveness, execution time, and the number of flaky/non-compiling tests.
T = generated test, B = search budget, C = class under test, r = independent run
$\mathit{Cov}_i$ = statement coverage, $\mathit{Cov}_b$ = branch coverage, $\mathit{Cov}_m$ = strong mutation coverage
$$\mathit{covScore}_{\langle T,B,C,r\rangle} = 1 \times \mathit{Cov}_i + 2 \times \mathit{Cov}_b + 4 \times \mathit{Cov}_m$$
$$\mathit{tScore}_{\langle T,B,C,r\rangle} = \mathit{covScore}_{\langle T,B,C,r\rangle} \times \min\left(1, \frac{L}{\mathit{genTime}}\right)$$
where $\mathit{genTime}$ = generation time and $L = 2 \times B$.
$$\mathit{Score}_{\langle T,B,C,r\rangle} = \mathit{tScore}_{\langle T,B,C,r\rangle} + \mathit{penalty}_{\langle T,B,C,r\rangle}$$
where penalty = percentage of flaky tests and non-compiling tests.
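As a worked illustration of the three formulas, the sketch below computes the score for one tool, budget, class, and run. It assumes coverage values are given as fractions in [0, 1] and treats the penalty as an already-computed input that is added as written on the slide; the function names are illustrative only.

```python
def cov_score(cov_i: float, cov_b: float, cov_m: float) -> float:
    """Weighted coverage: statement x1, branch x2, strong mutation x4."""
    return 1 * cov_i + 2 * cov_b + 4 * cov_m


def t_score(cov: float, budget_s: float, gen_time_s: float) -> float:
    """Scale the coverage score down when generation exceeds L = 2 x B."""
    if gen_time_s <= 0:
        raise ValueError("generation time must be positive")
    limit = 2 * budget_s  # L = 2 x B
    return cov * min(1.0, limit / gen_time_s)


def score(cov_i: float, cov_b: float, cov_m: float,
          budget_s: float, gen_time_s: float, penalty: float) -> float:
    """Score = tScore + penalty, with the penalty supplied by the caller.

    Assumption: the sign convention of the penalty is carried by the
    value passed in; the slide only states that it is added.
    """
    return t_score(cov_score(cov_i, cov_b, cov_m), budget_s, gen_time_s) + penalty


# 80% statements, 50% branches, 25% mutants killed: 0.8 + 1.0 + 1.0 = 2.8
base = cov_score(0.8, 0.5, 0.25)
# Budget 60s, so L = 120s. Finishing in 90s keeps the full coverage score.
on_time = t_score(base, budget_s=60, gen_time_s=90)
# Taking 150s scales the score by 120 / 150 = 0.8.
late = t_score(base, budget_s=60, gen_time_s=150)
print(round(base, 2), round(on_time, 2), round(late, 2))  # prints 2.8 2.8 2.24
```

With these weights the largest possible coverage score for a single run is 1 + 2 + 4 = 7, reached when all three coverage values equal 1.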
The Competition
The Tools
• EvoSuite
• JTExpert
• Randoop: automatic unit test generation for Java
• T3
Selection of the Benchmark Classes
| Source | Application | Domain | # Classes | # Selected Classes |
|---|---|---|---|---|
| Apache Commons | BCEL | Bytecode manipulation | 431 | 10 |
| Apache Commons | Jxpath | Java Beans manipulation with XPath syntax | 180 | 10 |
| Apache Commons | Imaging | Framework to write/read images with various formats | 427 | 4 |
| Google | Gson | Conversion of Java objects into their JSON representation and vice versa | 174 | 9 |
| Google | Re2j | Regular expression engine for time-linear regular expression matching | 47 | 8 |
| Java Analysis Studio | Freehep | Open-source repository providing Java utilities for high energy physics applications | 180 | 10 |
| GitHub | LA4j | Linear algebra primitives (matrices and vectors) and algorithms | 208 | 10 |
| GitHub | Okhttp | HTTP and HTTP/2 client for Android and Java applications | 193 | 8 |
Selection Procedure
HOW:
• Computing McCabe’s cyclomatic complexity (MCC) for all methods in each Java library
• Filtering out all trivial classes, i.e., classes that contain only methods with an MCC < 3
• Random sampling from the pruned projects
WHAT/WHY:
• Removing (likely) trivial classes that are not challenging for the tools
• Developers may use automated tools for complex classes
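The selection procedure above can be sketched as a filter followed by random sampling. The input format (a mapping from class name to the MCC values of its methods), the function names, and the fixed seed are assumptions made for illustration; how the complexity values were actually computed is not shown in the slides.

```python
import random
from typing import Dict, List

MCC_THRESHOLD = 3  # classes whose methods all have MCC < 3 count as trivial


def is_trivial(method_complexities: List[int], threshold: int = MCC_THRESHOLD) -> bool:
    """A class is trivial if every method has an MCC below the threshold.

    A class with no methods is also treated as trivial (all() of an empty
    list is True), which is an assumption of this sketch.
    """
    return all(mcc < threshold for mcc in method_complexities)


def select_classes(
    classes: Dict[str, List[int]],
    sample_size: int,
    seed: int = 0,
) -> List[str]:
    """Drop trivial classes, then sample at random from what remains."""
    candidates = sorted(
        name for name, mccs in classes.items() if not is_trivial(mccs)
    )
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(candidates, min(sample_size, len(candidates)))


project = {
    "Getters": [1, 1, 2],    # trivial: every method has MCC < 3
    "Parser": [1, 12, 4],    # kept: at least one method with MCC >= 3
    "Matcher": [3],          # kept: MCC of 3 is not below the threshold
    "Marker": [],            # trivial: no methods at all
}
print(sorted(select_classes(project, sample_size=10)))  # prints ['Matcher', 'Parser']
```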
Benchmark Statistics
Largest Class: Name = XPathParserTokenManager, Project = JXPATH, N. Statements = 1029, N. Branches = 872
Smallest Class: Name = ForwardBackSubstitutionSolver, Project = LA4J, N. Statements = 26, N. Branches = 20
[Figure: histograms of # Branches and # Statements per class (y-axis: Frequency)]
The Methodology
• Search Budgets = 10s, 30s, 60s, 120s, 240s, 300s, 480s
• Number of CUTs = 69
• Number of repetitions = 3
• All tools have been executed in parallel (multi-threading) on the same machine
• Statistical analysis:
  • Friedman’s test: non-parametric test for multiple-problem analysis
  • Post-hoc Conover’s procedure for pairwise multiple comparisons
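To make the Friedman step concrete, the sketch below computes the average rank of each tool across problems and the classical Friedman chi-square statistic. It is a simplified illustration: it applies no tie correction, does not compute a p-value, and does not implement Conover's post-hoc procedure. The function names and the toy scores are made up for the example.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Rank tools on each problem (1 = highest score) and average the ranks.

    Tools with equal scores on a problem share the mean of the ranks they
    would otherwise occupy.
    """
    tools = list(scores)
    n_problems = len(scores[tools[0]])
    if any(len(values) != n_problems for values in scores.values()):
        raise ValueError("every tool needs one score per problem")
    totals = {tool: 0.0 for tool in tools}
    for p in range(n_problems):
        ordered = sorted(tools, key=lambda tool: scores[tool][p], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of tools tied with position i.
            while j + 1 < len(ordered) and scores[ordered[j + 1]][p] == scores[ordered[i]][p]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for tool in ordered[i:j + 1]:
                totals[tool] += shared_rank
            i = j + 1
    return {tool: totals[tool] / n_problems for tool in tools}


def friedman_statistic(avg_ranks: Dict[str, float], n_problems: int) -> float:
    """Friedman chi-square: 12n / (k(k+1)) * (sum(R_j^2) - k(k+1)^2 / 4)."""
    k = len(avg_ranks)
    sum_sq = sum(rank * rank for rank in avg_ranks.values())
    return 12 * n_problems / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Toy data: three hypothetical tools scored on three problems.
toy = {"toolA": [3.0, 3.0, 3.0], "toolB": [2.0, 2.0, 2.0], "toolC": [1.0, 1.0, 1.0]}
ranks = average_ranks(toy)
print(ranks)                         # prints {'toolA': 1.0, 'toolB': 2.0, 'toolC': 3.0}
print(friedman_statistic(ranks, 3))  # prints 6.0
```

The average ranks of k tools always sum to k(k+1)/2, which gives a quick sanity check on any reported ranking.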
The Results
Coverage Results
[Figure: coverage results for search budgets of 10s, 30s, 60s, and 480s]
There are 43 classes out of 69 (≈ 62%) for which at least one of the two participant tools could not generate any test case.
What happens if we consider only classes for which both EvoSuite and JTExpert could generate tests?
[Figure: filtered results with search budget = 480s]
Scalability
[Figure: % branch coverage and % strong mutation coverage against search budget (10s, 30s, 60s, 120s, 240s, 300s, 480s) for EvoSuite, JTExpert, T3, and Randoop]
Comparison for the class Parser.java extracted from the library Re2j. N. Statements = 760, N. Branches = 565, N. Mutants = 203
Scoring
[Figure: score against search budget (10s, 30s, 60s, 120s, 240s, 300s, 480s) for EvoSuite, JTExpert, T3, and Randoop]
Generated vs. Manually-written Tests
Comparison of the scores achieved by:
• EvoSuite after 480s
• JTExpert after 480s
• T3 after 480s
• Randoop after 480s
• Manually-written tests
• Optimal score
N.B.: We only considered the 63 subjects for which we found developer-written tests.
[Figure: bar chart of scores (axis from 0 to 500) for Optimal, EvoSuite, JTExpert, T3, Randoop, and Manual; bar labels recovered from the slide: 268, 61, 78, 125, 251]
Statistical Analysis
| Tool | Total Score | St. Dev. | Friedman’s Test: Rank | Friedman’s Test: Score | Statistically better than (Conover’s procedure) |
|---|---|---|---|---|---|
| EvoSuite | 1457 | 193 | 1 | 1.55 | JTExpert, T3, Randoop |
| JTExpert | 849 | 102 | 2 | 2.71 | T3, Randoop |
| T3 | 526 | 82 | 3 | 2.81 | Randoop |
| Randoop | 448 | 34 | 4 | 2.92 | |
Lessons Learnt
• Using multi-problem statistical tests
• Selection procedure to filter out (likely) trivial classes
• Subject categories: string manipulation, computationally intensive, object manipulation, etc.
• What next:
  • Publishing the benchmark infrastructure
  • Performing a more in-depth analysis for each subject category
  • More tools, new languages (e.g., C, C#)?