A Taxonomy of Evaluation Approaches in
Software Engineering
A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, E. Stiakakis
University of Macedonia, Greece
BCI 2015, Craiova, Romania, September 2015
"…we regret to inform you…"
"…the evaluation of your approach is rather weak…"
"…unfortunately we had to reject a number of good papers…"
"…the proposed approach lacks a thorough evaluation…"
"…we would like to thank you for your submission, BUT…"
"…further evaluation is required…"
"…Congratulations, your paper has been accepted…"
"…evaluation is backed up by systematic statistical results…"
Taxonomies
Taxonomy: Τάξις (arrangement) + Νόμος (law, method)
“aims at organizing a collection of objects in a hierarchical manner to provide a conceptual framework for discussion and analysis”
Context of Study
3 PhD students, 3 faculty members
TSE: IEEE Transactions on Software Engineering; TOSEM: ACM Transactions on Software Engineering and Methodology; JSS: Elsevier's Journal of Systems and Software
Articles that appeared in the corresponding 2012 volumes
Context of Study (2)
Recorded for each article:
• Title, authors, journal, issue
• Free keywords & classification (ACM)
• Employed evaluation approach
• Pages devoted to the evaluation
• Total number of pages
Initial pool: TSE: 81 articles, TOSEM: 24 articles, JSS: 207 articles (312 in total)
Filtered out: articles that clearly did not belong to the SE domain, as well as empirical studies (systematic literature reviews, surveys, mapping studies)
After filtering: TSE: 58 articles, TOSEM: 22 articles, JSS: 53 articles
133 articles in total
Key Terms
Performance: The most typical definition of performance originates from computer architecture: performance refers to the amount of work that a system/computer/program can perform in a given time or with given resources.
Effectiveness: By effectiveness we refer to the extent to which a proposed technique/methodology accomplishes the desired goal. For example, a testing approach is effective if it reveals a large number of bugs.
Benchmark: A benchmark is a standard, acknowledged data set (consisting of tasks, collections of items, software, etc.) designed to be representative of problems that occur frequently in real domains.
Proposed Taxonomy

Classification axes: Quantitative/Qualitative, Use of Benchmarks, Research Questions, Human Involvement, Expertise, Quality Properties, Functional/Qualitative Properties, Evaluation Strategy.

The taxonomy tree (root: Evaluation Approach):
• E1 Comparison to similar approaches
  • E1.1 Qualitative comparison (listing of pros/cons)
  • E1.2 Quantitative comparison
    • E1.2.1 Non-benchmark based
      • E1.2.1.1 Performance analysis (ad hoc samples)
      • E1.2.1.2 Effectiveness analysis (ad hoc samples)
        • E1.2.1.2.1 Explicit research questions
        • E1.2.1.2.2 No explicit research questions
    • E1.2.2 Benchmark based
      • E1.2.2.1 Performance analysis (benchmarks)
      • E1.2.2.2 Effectiveness analysis (benchmarks)
        • E1.2.2.2.1 Explicit research questions
        • E1.2.2.2.2 No explicit research questions
• E2 Formal proof
  • E2.1 Properties fulfilment
  • E2.2 Theorem proving
• E3 Case studies
  • E3.1 Demonstration
  • E3.2 Performance analysis (case studies)
  • E3.3 Effectiveness analysis (case studies)
    • E3.3.1 Non-human evaluation
      • E3.3.1.1 Explicit research questions
      • E3.3.1.2 No explicit research questions
    • E3.3.2 Human evaluation
      • E3.3.2.1 By experts
        • E3.3.2.1.1 Explicit research questions
        • E3.3.2.1.2 No explicit research questions
      • E3.3.2.2 By non-experts
        • E3.3.2.2.1 Explicit research questions
        • E3.3.2.2.2 No explicit research questions
E1 Comparison to similar approaches: the goal is to make clear the advantages and disadvantages over previous work and, usually, to highlight the added value of the proposed technique.
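Since the taxonomy is a hierarchy of dotted codes, it maps naturally onto a small data structure. Below is a minimal sketch (a hypothetical encoding, not part of the paper) that stores a subset of the categories as a flat code-to-label mapping and walks a code's ancestor chain:

```python
# Hypothetical encoding of the taxonomy: dotted codes map to labels.
# Only a subset of the categories is shown; the hierarchy is implicit in the code.
TAXONOMY = {
    "E1": "Comparison to similar approaches",
    "E1.1": "Qualitative comparison (pros/cons)",
    "E1.2": "Quantitative comparison",
    "E2": "Formal proof",
    "E2.2": "Theorem proving",
    "E3": "Case studies",
    "E3.3": "Effectiveness analysis (case studies)",
    "E3.3.1": "Non-human evaluation",
    "E3.3.1.1": "Explicit research questions",
}

def ancestors(code: str) -> list[str]:
    """Return the parent chain of a category code, e.g. E3.3.1.1 -> E3.3.1, E3.3, E3."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

# A paper classified as E3.3.1.1 also falls under E3.3.1, E3.3 and E3,
# which is consistent with the aggregated frequencies reported later.
for parent in ancestors("E3.3.1.1"):
    print(parent, "-", TAXONOMY[parent])
```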
E2 Formal proof: by formal treatment we mean the use of a mathematically based approach for proving theorems, properties, invariants, or the correctness of a system. Not all software engineering research can benefit from the application of formal methods. A further criterion is the completeness of the proof: either (1) the mathematical reasoning validates the entire approach (E2.2 theorem proving), or (2) it ensures the fulfilment of certain properties (E2.1 properties fulfilment).
E3 Case studies: application of the proposed tool, algorithm, or technique to artificially constructed or selected case studies. Results are obtained and discussed to demonstrate the feasibility, performance, or effectiveness of the approach. Papers refer to this strategy under many names: "empirical evaluation", "case studies", "case study evaluation", "empirical results", "experiments", "experimental results", …
Relative Frequency (count | percentage of the 133 articles; an article may fall under several categories)
• E1 Comparison to similar approaches (48 | 36%)
  • E1.1 Qualitative comparison (19 | 14%)
  • E1.2 Quantitative comparison (34 | 26%)
    • E1.2.1 Non-benchmark based (21 | 16%)
      • E1.2.1.1 Performance analysis (5 | 4%)
      • E1.2.1.2 Effectiveness analysis (18 | 14%)
        • E1.2.1.2.1 Explicit research questions (3 | 2%)
        • E1.2.1.2.2 No explicit research questions (15 | 11%)
    • E1.2.2 Benchmark based (12 | 9%)
      • E1.2.2.1 Performance analysis (8 | 6%)
      • E1.2.2.2 Effectiveness analysis (6 | 5%)
        • E1.2.2.2.1 Explicit research questions (1 | 1%)
        • E1.2.2.2.2 No explicit research questions (5 | 4%)
• E2 Formal proof (18 | 14%)
  • E2.1 Properties fulfilment (6 | 5%)
  • E2.2 Theorem proving (13 | 10%)
• E3 Case studies (113 | 85%)
  • E3.1 Demonstration (33 | 25%)
  • E3.2 Performance analysis (33 | 25%)
  • E3.3 Effectiveness analysis (75 | 56%)
    • E3.3.1 Non-human evaluation (60 | 45%)
      • E3.3.1.1 Explicit research questions (27 | 20%)
      • E3.3.1.2 No explicit research questions (33 | 25%)
    • E3.3.2 Human evaluation (24 | 18%)
      • E3.3.2.1 By experts (9 | 7%)
        • E3.3.2.1.1 Explicit research questions (3 | 2%)
        • E3.3.2.1.2 No explicit research questions (6 | 5%)
      • E3.3.2.2 By non-experts (16 | 12%)
        • E3.3.2.2.1 Explicit research questions (9 | 7%)
        • E3.3.2.2.2 No explicit research questions (7 | 5%)
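The percentages above are simply category counts over the 133 classified articles; because an approach may combine several evaluation types, top-level percentages sum to more than 100%. A quick sketch of the computation:

```python
# Relative frequency = category count / 133 classified articles.
# Counts below are the top-level figures from the slide.
TOTAL = 133
counts = {
    "E1 Comparison to similar approaches": 48,
    "E2 Formal proof": 18,
    "E3 Case studies": 113,
}
for label, n in counts.items():
    print(f"{label}: {n} articles ({n / TOTAL:.0%})")
# -> 36%, 14%, 85%, matching the slide.
```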
Extent of Evaluation
Papers devoting just one page to the evaluation, as well as papers devoting as many as 24 pages, were encountered.
[Histogram: number of papers (0-40) per extent-of-evaluation bucket: 0%-10%, 11%-20%, 21%-30%, 31%-40%, >40% of the paper's pages.]
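For the histogram, each paper's extent of evaluation is the share of its pages devoted to the evaluation, binned into the five ranges on the x-axis. A minimal sketch (the page counts below are made up, not the study's data):

```python
from collections import Counter

# (pages devoted to evaluation, total pages) -- illustrative values only.
papers = [(1, 10), (2, 12), (5, 14), (4, 11), (24, 38)]

def extent_bucket(eval_pages: int, total_pages: int) -> str:
    """Bin the evaluation share into the histogram ranges from the slide."""
    pct = 100 * eval_pages / total_pages
    if pct <= 10:
        return "0%-10%"
    if pct <= 20:
        return "11%-20%"
    if pct <= 30:
        return "21%-30%"
    if pct <= 40:
        return "31%-40%"
    return ">40%"

print(Counter(extent_bucket(e, t) for e, t in papers))
```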
Availability of Data
• Dataset publicly available: 42%; dataset not available: 58%
• Tool publicly available: 24%; tool not available: 76%
Validation of the Taxonomy
• By definition, it is difficult to assess whether taxonomies are valid, since their construction relies on the subjective interpretation of categories
• We applied the taxonomy to articles that were not considered during its development
• We classified the papers from the Main Track of the 34th International Conference on Software Engineering (ICSE'2012)
• 87 articles were considered. For each, we recorded:
  a) whether the paper actually introduces any technique,
  b) whether the paper could be mapped to any of the derived classification categories,
  c) the corresponding category code.
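The per-article record from this validation step maps onto a small data structure. Here is a sketch with hypothetical field names (the slide only lists what was recorded, not how):

```python
from dataclasses import dataclass, field

@dataclass
class ValidationRecord:
    title: str
    introduces_technique: bool  # (a) does the paper introduce a technique?
    mappable: bool              # (b) can it be mapped to a derived category?
    category_codes: list[str] = field(default_factory=list)  # (c) e.g. ["E3.2"]

# Example entry (fictitious paper):
record = ValidationRecord(
    title="Some ICSE'12 paper",
    introduces_technique=True,
    mappable=True,
    category_codes=["E3.2", "E1.1"],
)
```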
Validation of the Taxonomy (2)
[Bar chart: percentage of evaluation approaches (0%-23%) per type of evaluation, comparing the initial taxonomy sample with the ICSE articles.]
Correlation between evaluation and area
RQ1: Is the evaluation approach correlated to the area of research?
H0: Variables "Area of Research" and "Evaluation Type" are independent.
H1: Variables "Area of Research" and "Evaluation Type" are dependent.
Areas of research correspond to a second-level classification based on the 2012 ACM Computing Classification System.
A chi-square test revealed that there is no statistically significant correlation between "Evaluation Type" and "Area of Research".
[Heatmap: number of articles (2-22) per evaluation type (E1.1, E1.2, E2, E3.1, E3.2, E3.3) and research area: formal language definitions; collaboration in software development; context-specific languages; contextual software domains; designing software; development frameworks and environments; distributed systems organizing principles; extra-functional properties; general programming languages; software development process management; software development techniques; software functional properties; software post-development issues; software system structures; software verification and validation; system description languages.]
In Software Testing there is a tendency to employ case studies and analysis of effectiveness (i.e. how well a testing strategy achieves its goals)
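For RQ1, the independence test can be reproduced with a standard chi-square test on the area-by-evaluation-type contingency table. A hedged sketch using scipy (the counts below are illustrative, not the paper's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: research areas; columns: evaluation types E1.1, E1.2, E2, E3.1, E3.2, E3.3.
# Illustrative counts only.
table = np.array([
    [2, 5, 1, 4, 3, 8],
    [1, 3, 2, 6, 2, 7],
    [0, 2, 3, 1, 1, 4],
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# p >= 0.05: fail to reject H0 (independence), matching the paper's finding.
```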
Correlation between evaluation and its extent
RQ2: Is the extent of the evaluation correlated to the evaluation approach?
H0: The distribution of "Extent of Evaluation" is the same across categories of "Evaluation Type".
H1: The distribution of "Extent of Evaluation" is not the same across categories of "Evaluation Type".
We applied the non-parametric independent-samples Kruskal-Wallis test to compare the distributions across the groups formed by the evaluation-type variable.
The result is significant at the 0.05 level; in other words, the extent of evaluation is affected by the employed evaluation strategy.
Evaluation of effectiveness on case studies relying on explicitly stated research questions (E3.3.1.1) devotes a particularly large percentage of the paper to the evaluation.
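The RQ2 comparison is a standard Kruskal-Wallis H test over per-group extent values. A minimal sketch with made-up extents (fraction of pages devoted to the evaluation):

```python
from scipy.stats import kruskal

# Extent of evaluation per paper, grouped by evaluation type -- illustrative data.
extent_by_type = {
    "E3.1 Demonstration": [0.05, 0.08, 0.10, 0.07],
    "E3.2 Performance analysis": [0.15, 0.20, 0.12, 0.18],
    "E3.3.1.1 Effectiveness, explicit RQs": [0.30, 0.35, 0.28, 0.40],
}
h, p = kruskal(*extent_by_type.values())
print(f"H = {h:.2f}, p = {p:.3f}")
# p < 0.05 would indicate the extent distributions differ across evaluation
# types, in line with the paper's conclusion.
```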
Conclusion
In software engineering there is a vast number of different evaluation techniques, designed and executed to serve the needs of each particular research effort.
We have attempted to introduce a taxonomy of evaluation approaches.
We identified 17 evaluation types, which any approach can adopt either individually or in combination with other types, and 8 axes along which evaluation approaches can be classified.
…We are glad to inform you that your paper…
has been ACCEPTED by the BCI 2015 Program Committee
Review 1
"…the authors have done a good job in supporting their methodology by a convincing evaluation approach…"
So, the next time you receive a review pointing to the strengths or weaknesses of your evaluation approach,
you might be able to classify your approach based on the proposed taxonomy!