A Taxonomy of Evaluation Approaches in
Software Engineering
A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, E. Stiakakis
University of Macedonia, Greece
BCI 2015, Craiova, Romania, September 2015
"…we regret to inform you…"
"…the evaluation of your approach is rather weak…"
"…unfortunately we had to reject a number of good papers…"
"…the proposed approach lacks a thorough evaluation…"
"…we would like to thank you for your submission, BUT…"
"…further evaluation is required…"
"…Congratulations, your paper has been accepted…"
"…evaluation is backed up by systematic statistical results…"
Taxonomies
Taxonomy: Τάξις (arrangement) + Νόμος (law, method)
“aims at organizing a collection of objects in a hierarchical manner to provide a conceptual framework for discussion and analysis”
Context of Study
3 PhD students, 3 faculty members
TSE: IEEE Transactions on Software Engineering; TOSEM: ACM Transactions on Software Engineering and Methodology; JSS: Elsevier's Journal of Systems and Software
Articles that appeared in the corresponding 2012 volumes
Context of Study (2)
Recorded for each article:
• Title, authors, journal, issue
• Free keywords & classification (ACM)
• Employed evaluation approach
• Pages devoted to the evaluation
• Total number of pages
Initial pool: TSE: 81 articles, TOSEM: 24 articles, JSS: 207 articles (312 in total)
Filtered out: articles that clearly did not belong to the SE domain, as well as empirical studies (systematic literature reviews, surveys, mapping studies)
After filtering: TSE: 58 articles, TOSEM: 22 articles, JSS: 53 articles
133 articles in total
Key Terms
Performance: The most typical definition of performance originates from computer architecture: performance refers to the amount of work that a system/computer/program can perform in a given time or with given resources.
Effectiveness: By effectiveness we refer to the extent to which a proposed technique/methodology accomplishes the desired goal. For example, a testing approach is effective if it reveals a large number of bugs.
Benchmark: A benchmark is a standard, acknowledged data set (consisting of tasks, collections of items, software, etc.) designed to be representative of problems that occur frequently in real domains.
Proposed Taxonomy

Classification axes: Quantitative/Qualitative, Use of Benchmarks, Research Questions, Human Involvement, Expertise, Quality Properties, Functional/Qualitative Properties, Evaluation Strategy.

The taxonomy tree (root: Evaluation Approach):
• E1 Comparison to similar approaches
  • E1.1 Qualitative comparison (listing of pros/cons)
  • E1.2 Quantitative comparison
    • E1.2.1 Non-benchmark based
      • E1.2.1.1 Performance analysis (ad hoc samples)
      • E1.2.1.2 Effectiveness analysis (ad hoc samples)
        • E1.2.1.2.1 Explicit research questions
        • E1.2.1.2.2 No explicit research questions
    • E1.2.2 Benchmark based
      • E1.2.2.1 Performance analysis (benchmarks)
      • E1.2.2.2 Effectiveness analysis (benchmarks)
        • E1.2.2.2.1 Explicit research questions
        • E1.2.2.2.2 No explicit research questions
• E2 Formal proof
  • E2.1 Properties fulfilment
  • E2.2 Theorem proving
• E3 Case studies
  • E3.1 Demonstration
  • E3.2 Performance analysis (case studies)
  • E3.3 Effectiveness analysis (case studies)
    • E3.3.1 Non-human evaluation
      • E3.3.1.1 Explicit research questions
      • E3.3.1.2 No explicit research questions
    • E3.3.2 Human evaluation
      • E3.3.2.1 By experts
        • E3.3.2.1.1 Explicit research questions
        • E3.3.2.1.2 No explicit research questions
      • E3.3.2.2 By non-experts
        • E3.3.2.2.1 Explicit research questions
        • E3.3.2.2.2 No explicit research questions
E1 Comparison to similar approaches: the goal is to make clear the advantages and disadvantages over previous work and, usually, to highlight the added value of the proposed technique.
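Since the taxonomy is a hierarchy of dotted codes, it maps naturally onto a small data structure. Below is a minimal sketch (a hypothetical encoding, not part of the paper) that stores a subset of the categories as a flat code-to-label mapping and walks a code's ancestor chain:

```python
# Hypothetical encoding of the taxonomy: dotted codes map to labels.
# Only a subset of the categories is shown; the hierarchy is implicit in the code.
TAXONOMY = {
    "E1": "Comparison to similar approaches",
    "E1.1": "Qualitative comparison (pros/cons)",
    "E1.2": "Quantitative comparison",
    "E2": "Formal proof",
    "E2.2": "Theorem proving",
    "E3": "Case studies",
    "E3.3": "Effectiveness analysis (case studies)",
    "E3.3.1": "Non-human evaluation",
    "E3.3.1.1": "Explicit research questions",
}

def ancestors(code: str) -> list[str]:
    """Return the parent chain of a category code, e.g. E3.3.1.1 -> E3.3.1, E3.3, E3."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

# A paper classified as E3.3.1.1 also falls under E3.3.1, E3.3 and E3,
# which is consistent with the aggregated frequencies reported later.
for parent in ancestors("E3.3.1.1"):
    print(parent, "-", TAXONOMY[parent])
```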
E2 Formal proof: by formal treatment we mean the use of a mathematically based approach for proving theorems, properties, invariants, or the correctness of a system. Not all software engineering research can benefit from the application of formal methods. A further criterion is the completeness of the proof: either (1) the mathematical reasoning validates the entire approach (E2.2 theorem proving), or (2) it ensures the fulfilment of certain properties (E2.1 properties fulfilment).
E3 Case studies: application of the proposed tool, algorithm, or technique to artificially constructed or selected case studies. Results are obtained and discussed to demonstrate the feasibility, performance, or effectiveness of the approach. Papers refer to this strategy under many names: "empirical evaluation", "case studies", "case study evaluation", "empirical results", "experiments", "experimental results", …
Relative Frequency (count | percentage of the 133 articles; an article may fall under several categories)
• E1 Comparison to similar approaches (48 | 36%)
  • E1.1 Qualitative comparison (19 | 14%)
  • E1.2 Quantitative comparison (34 | 26%)
    • E1.2.1 Non-benchmark based (21 | 16%)
      • E1.2.1.1 Performance analysis (5 | 4%)
      • E1.2.1.2 Effectiveness analysis (18 | 14%)
        • E1.2.1.2.1 Explicit research questions (3 | 2%)
        • E1.2.1.2.2 No explicit research questions (15 | 11%)
    • E1.2.2 Benchmark based (12 | 9%)
      • E1.2.2.1 Performance analysis (8 | 6%)
      • E1.2.2.2 Effectiveness analysis (6 | 5%)
        • E1.2.2.2.1 Explicit research questions (1 | 1%)
        • E1.2.2.2.2 No explicit research questions (5 | 4%)
• E2 Formal proof (18 | 14%)
  • E2.1 Properties fulfilment (6 | 5%)
  • E2.2 Theorem proving (13 | 10%)
• E3 Case studies (113 | 85%)
  • E3.1 Demonstration (33 | 25%)
  • E3.2 Performance analysis (33 | 25%)
  • E3.3 Effectiveness analysis (75 | 56%)
    • E3.3.1 Non-human evaluation (60 | 45%)
      • E3.3.1.1 Explicit research questions (27 | 20%)
      • E3.3.1.2 No explicit research questions (33 | 25%)
    • E3.3.2 Human evaluation (24 | 18%)
      • E3.3.2.1 By experts (9 | 7%)
        • E3.3.2.1.1 Explicit research questions (3 | 2%)
        • E3.3.2.1.2 No explicit research questions (6 | 5%)
      • E3.3.2.2 By non-experts (16 | 12%)
        • E3.3.2.2.1 Explicit research questions (9 | 7%)
        • E3.3.2.2.2 No explicit research questions (7 | 5%)
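The percentages above are simply category counts over the 133 classified articles; because an approach may combine several evaluation types, top-level percentages sum to more than 100%. A quick sketch of the computation:

```python
# Relative frequency = category count / 133 classified articles.
# Counts below are the top-level figures from the slide.
TOTAL = 133
counts = {
    "E1 Comparison to similar approaches": 48,
    "E2 Formal proof": 18,
    "E3 Case studies": 113,
}
for label, n in counts.items():
    print(f"{label}: {n} articles ({n / TOTAL:.0%})")
# -> 36%, 14%, 85%, matching the slide.
```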
Extent of Evaluation
Papers devoting just one page to the evaluation, as well as papers devoting as many as 24 pages, were encountered.
[Histogram: number of papers (0-40) per extent-of-evaluation bucket: 0%-10%, 11%-20%, 21%-30%, 31%-40%, >40% of the paper's pages.]
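For the histogram, each paper's extent of evaluation is the share of its pages devoted to the evaluation, binned into the five ranges on the x-axis. A minimal sketch (the page counts below are made up, not the study's data):

```python
from collections import Counter

# (pages devoted to evaluation, total pages) -- illustrative values only.
papers = [(1, 10), (2, 12), (5, 14), (4, 11), (24, 38)]

def extent_bucket(eval_pages: int, total_pages: int) -> str:
    """Bin the evaluation share into the histogram ranges from the slide."""
    pct = 100 * eval_pages / total_pages
    if pct <= 10:
        return "0%-10%"
    if pct <= 20:
        return "11%-20%"
    if pct <= 30:
        return "21%-30%"
    if pct <= 40:
        return "31%-40%"
    return ">40%"

print(Counter(extent_bucket(e, t) for e, t in papers))
```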
Availability of Data
• Dataset publicly available: 42%; dataset not available: 58%
• Tool publicly available: 24%; tool not available: 76%
Validation of the Taxonomy
• By definition, it is difficult to assess whether taxonomies are valid, since their construction relies on the subjective interpretation of categories
• We applied the taxonomy to articles that were not considered during its development
• We classified the papers from the Main Track of the 34th International Conference on Software Engineering (ICSE'2012)
• 87 articles were considered. For each, we recorded:
  a) whether the paper actually introduces any technique,
  b) whether the paper could be mapped to any of the derived classification categories,
  c) the corresponding category code.
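The per-article record from this validation step maps onto a small data structure. Here is a sketch with hypothetical field names (the slide only lists what was recorded, not how):

```python
from dataclasses import dataclass, field

@dataclass
class ValidationRecord:
    title: str
    introduces_technique: bool  # (a) does the paper introduce a technique?
    mappable: bool              # (b) can it be mapped to a derived category?
    category_codes: list[str] = field(default_factory=list)  # (c) e.g. ["E3.2"]

# Example entry (fictitious paper):
record = ValidationRecord(
    title="Some ICSE'12 paper",
    introduces_technique=True,
    mappable=True,
    category_codes=["E3.2", "E1.1"],
)
```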
Validation of the Taxonomy (2)
[Bar chart: percentage of evaluation approaches (0%-23%) per type of evaluation, comparing the initial taxonomy sample with the ICSE articles.]
Correlation between evaluation and area
RQ1: Is the evaluation approach correlated to the area of research?
H0: Variables "Area of Research" and "Evaluation Type" are independent.
H1: Variables "Area of Research" and "Evaluation Type" are dependent.
Areas of research correspond to a second-level classification based on the 2012 ACM Computing Classification System.
A chi-square test revealed that there is no statistically significant correlation between "Evaluation Type" and "Area of Research".
[Heatmap: number of articles (2-22) per evaluation type (E1.1, E1.2, E2, E3.1, E3.2, E3.3) and research area: formal language definitions; collaboration in software development; context-specific languages; contextual software domains; designing software; development frameworks and environments; distributed systems organizing principles; extra-functional properties; general programming languages; software development process management; software development techniques; software functional properties; software post-development issues; software system structures; software verification and validation; system description languages.]
In Software Testing there is a tendency to employ case studies and analysis of effectiveness (i.e. how well a testing strategy achieves its goals)
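For RQ1, the independence test can be reproduced with a standard chi-square test on the area-by-evaluation-type contingency table. A hedged sketch using scipy (the counts below are illustrative, not the paper's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: research areas; columns: evaluation types E1.1, E1.2, E2, E3.1, E3.2, E3.3.
# Illustrative counts only.
table = np.array([
    [2, 5, 1, 4, 3, 8],
    [1, 3, 2, 6, 2, 7],
    [0, 2, 3, 1, 1, 4],
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# p >= 0.05: fail to reject H0 (independence), matching the paper's finding.
```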
Correlation between evaluation and its extent
RQ2: Is the extent of the evaluation correlated to the evaluation approach?
H0: The distribution of "Extent of Evaluation" is the same across categories of "Evaluation Type".
H1: The distribution of "Extent of Evaluation" is not the same across categories of "Evaluation Type".
We applied the non-parametric independent-samples Kruskal-Wallis test to compare the distributions across the groups formed by the evaluation-type variable.
The result is significant at the 0.05 level; in other words, the extent of evaluation is affected by the employed evaluation strategy.
Evaluation of effectiveness on case studies relying on explicitly stated research questions (E3.3.1.1) devotes a particularly large percentage of the paper to the evaluation.
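The RQ2 comparison is a standard Kruskal-Wallis H test over per-group extent values. A minimal sketch with made-up extents (fraction of pages devoted to the evaluation):

```python
from scipy.stats import kruskal

# Extent of evaluation per paper, grouped by evaluation type -- illustrative data.
extent_by_type = {
    "E3.1 Demonstration": [0.05, 0.08, 0.10, 0.07],
    "E3.2 Performance analysis": [0.15, 0.20, 0.12, 0.18],
    "E3.3.1.1 Effectiveness, explicit RQs": [0.30, 0.35, 0.28, 0.40],
}
h, p = kruskal(*extent_by_type.values())
print(f"H = {h:.2f}, p = {p:.3f}")
# p < 0.05 would indicate the extent distributions differ across evaluation
# types, in line with the paper's conclusion.
```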
Conclusion
In software engineering there is a vast number of different evaluation techniques, designed and executed to serve the needs of each particular research effort.
We have attempted to introduce a taxonomy of evaluation approaches.
We identified 17 evaluation types, which any approach can adopt either individually or in combination with other types, and 8 axes along which evaluation approaches can be classified.
…We are glad to inform you that your paper…
has been ACCEPTED by the BCI 2015 Program Committee
Review 1
"…the authors have done a good job in supporting their methodology by a convincing evaluation approach…"
So, the next time you receive a review pointing to the strengths or weaknesses of your evaluation approach,
you might be able to classify your approach based on the proposed taxonomy!