Experimental Software Engineering


Prof. Marcos Kalinowski

Introductions

Marcos Kalinowski

• Software Engineering Professor at PUC-Rio

• Member of the ISERN

• Main research interests:

– Empirical Software Engineering

– Software Quality Improvement

• Further information:

– www.inf.puc-rio.br/~kalinowski

• Who are you?

– Background, interests, ...

Marcos Kalinowski 2Experimental Software Engineering

• Course topics:

– Experimental Software Engineering: Overview and Research Opportunities

– Empirical Strategies

– Measurement Concepts

– Systematic Literature Reviews and Mapping Studies

– Surveys

– Case Studies

– Controlled Experiments

• Experiment Process: Scoping, Planning, Operation, Analysis and Interpretation, Presentation and Package

– Design Science Research

– Qualitative Methods

– Theory Building



• Assessment

– Evaluation 1 = Topic presentation and participation in classroom discussions

– Evaluation 2 = Secondary study plan

– Evaluation 3 = Primary study plan

– Evaluation 4 = Paper with 8 to 16 pages in Springer LNCS format

Grade = (Evaluation 1 + Evaluation 2 + Evaluation 3 + (2 x Evaluation 4)) / 5

– Success: (Presence >= 75%) AND (Grade >= 6)

– Fail: otherwise
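The grading rule can be sketched in a few lines of Python. The function names `final_grade` and `passes` are illustrative only, and attendance is assumed to be given as a fraction between 0 and 1.

```python
def final_grade(e1: float, e2: float, e3: float, e4: float) -> float:
    """Weighted mean of the four evaluations; Evaluation 4 counts twice."""
    return (e1 + e2 + e3 + 2 * e4) / 5


def passes(attendance: float, grade: float) -> bool:
    """Success requires attendance >= 75% AND grade >= 6."""
    return attendance >= 0.75 and grade >= 6


# Hypothetical marks, used only to exercise the formula.
grade = final_grade(8.0, 7.0, 6.0, 7.0)
print(grade, passes(0.80, grade))
```

Because Evaluation 4 is weighted twice, the paper accounts for 40% of the final grade.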




• Textbook

– Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

• Additional references

– Kitchenham, B.A., Budgen, D., Brereton, P., Evidence-Based Software Engineering and Systematic Reviews, Chapman and Hall/CRC, 2015.

– Kitchenham, B.A., Charters, S., Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007–001, Keele University and Durham University Joint Report, 2007.

– Runeson, P., Höst, M., Rainer, A.W., Regnell, B., Case Study Research in Software Engineering – Guidelines and Examples. Wiley, 2012.

– Wieringa, R., Design Science Methodology for Information Systems and Software Engineering. Springer, 2014.

– Scientific Papers



• Important Dates

– 30/04 – Deadline for delivering the secondary study plan

– 11/06 – Deadline for delivering the primary study plan

– 02/07 – Deadline for delivering the paper

• Others

– 23/04 – Holiday


INTRODUCTION


Introduction

• The story of the Denver International Airport ...


DEMARCO, T.; LISTER, T. (2003) Waltzing with Bears – Managing Risk on Software Projects. Dorset House. (ISBN: 978-0932633606).

“Software Engineering discipline remains years – perhaps decades – short of the mature engineering discipline needed to meet the demands of an information age society”.

Silver Bullets in Software Engineering?


Introduction

• Software development depends on different technologies

– Usually there is no evidence available concerning:

• Benefits

• Limitations

• Risks


Introduction

• During projects, software engineers need to answer questions like:

– Which software technology should I consider for my project?

– How much training/investment is needed to introduce the technology into my process?

– When and how can I observe the return on investment?

– Under which circumstances does the technology perform best?


• We need knowledge of our software technologies (methods, techniques and tools) to understand the situations in which they really work, their limits and how we can evolve them. (Basili, 1996)


BASILI, V. R. (1996) The role of experimentation in software engineering: past, current, and future. IEEE International Conference on Software Engineering (ICSE), pp. 442-449.

Introduction

Obtaining Knowledge

• Building theories, models, experimentation and learning

– Understanding a discipline involves building theories and models

– To verify if our understanding is correct, we need to:

• Conduct experiments on our theories and models



Experimentation is fundamental to both academia and industry!

Software Engineering

• Software Engineering involves development and is not manufacturing

– Involves reasoning and human elements (e.g., developers)

– There are several variables that can lead to differences in measurements

• Current Scenario:

– Limited number of theories and models

– Lack of knowledge on the limits of existing technologies for certain development contexts



• Experimental Studies

– Discovering something or testing hypotheses

– May involve different types of analysis: quantitative and/or qualitative

• Studies may be:

– Primary

– Secondary (aggregate results of primary studies)





• Studies may be:




Measuring Variables





• Studies may be:




Understanding causes and effects of collected data

Classification of Experiments

– In Vivo: no model needed

– In Vitro: environment needs to be modelled

– In Virtuo: computational models of the object and the environment

– In Silico: computational models of the participant behaviour, object and environment


TRAVASSOS, G. H.; BARROS, M. O. (2003) Contributions of In Virtuo and In Silico Experiments for the Future of Empirical Studies in Software Engineering. In: 2nd Workshop on Empirical Software Engineering: The Future of Empirical Studies in Software Engineering, 2003, Rome.

Required Reading

• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

– Chapter 1 – Introduction

– Chapter 2 – Empirical Strategies


PRIMARY STUDIES


Primary Study Types

• Controlled Experiment

– An experiment that allows controlling and manipulating variables.

• Case Study

– Investigates a phenomenon in a real context. Typically conducted during software development or maintenance projects. Part of the behavior cannot be manipulated.


• Survey

– Conducted after an event has occurred, aiming at identifying some evidence.

– Does not allow control.

• Action Research

– Research method that combines theory (research) and practice (action), bringing together researchers and practitioners to solve a problem.


Primary Study Types

Controlled Experiment

Characteristics

– Investigate testable hypotheses

– Independent variables are manipulated to measure their effects on dependent variables


Examples:

– Which technique is more effective for software inspection: checklist-based reading or perspective-based reading?



[Figure: experiment principles. Theory level: cause and effect. Observation level: a treatment (independent variable) is applied through the experiment operation and yields a result (dependent variable).]


WOHLIN, C., RUNESON, P., HÖST, M., OHLSSON, M., REGNELL, B., WESSLÉN, A. (2012) Experimentation in Software Engineering. Springer.


Threats to Validity

• Results of experiments should be reported considering their validity

– Internal

– External

– Construct

– Conclusion


BIFFL, S.; KALINOWSKI, M.; EKAPUTRA, F.; ANDERLIN-NETO, A.; CONTE, T.; WINKLER, D. (2014) Towards a semantic knowledge base on threats to validity and control actions in controlled experiments. In: 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Torino, Italy.

Experiment Process

Scoping

Planning

Operation

Analysis


• Scoping

– Identification of the study goals

– Identification of the objects and groups

• Planning

– Formulation of hypotheses

– Identification of dependent variables (response variables)

– Identification of independent variables (factors)

– Selection of subjects

– Experiment design

– Selection of the analysis methods

– Instrumentation

– Validity evaluation (threats to validity)


Experiment Process

• Operation

– Training and preparation

– Execution of the study by the participants

• Analysis

– Descriptive statistics

– Graphical visualization

– Elimination of outliers

– Analysis of the distribution

– Statistical hypothesis testing

• Packaging

– Presentation of the results

– Preparation of the package to repeat the study
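As a minimal sketch of the first analysis steps, the snippet below computes descriptive statistics and flags candidate outliers with Tukey's interquartile-range rule. The choice of that rule, the helper names and the sample data are assumptions made for illustration; the course material does not prescribe a specific outlier criterion.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical task-completion times in minutes (made-up data).
times = [31, 28, 35, 30, 33, 29, 32, 90]
print(describe(times))
print(iqr_outliers(times))
```

A flagged value is only a candidate: whether it is removed should be decided and reported case by case, since an extreme observation may be legitimate data.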


Experiment Process

Case Study

• Definition:

“A method that investigates a phenomenon within its real context, especially when the boundaries and/or the context of the phenomenon are not well defined”

• Mainly used when the use of controlled experiments is not possible, because:

– The context is important and difficult to separate from the problem or to simulate

– Several effects are expected and observing them might require a longer period of time


RUNESON, P., HOST, M., RAINER, A., REGNELL, B. (2012) Case Study Research in Software Engineering: Guidelines and Examples. John Wiley & Sons.

Case Study

Types of Case Studies

– Exploratory

• Used in initial investigations of phenomena

• Aim at deriving new ideas and hypotheses (formulating theories)

– Descriptive

• Describe a situation or phenomenon

– Explanatory

• Search for an explanation for a situation or problem

• Mainly, but not necessarily, in the form of a causal relationship

– Confirmatory

• Used to test/refute theories


RUNESON, P., HOST, M., RAINER, A., REGNELL, B. (2012) Case Study Research in Software Engineering: Guidelines and Examples. John Wiley & Sons.

Survey

• Retrospective study (descriptive, explanatory, or exploratory) aiming at identifying characteristics and/or opinions of a large population

• Representative sample selection for a certain population plays a key role in survey research

– Data analysis techniques are used to generalize from the sample to the population
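One common way to generalize a sample proportion to the population is a confidence interval. The sketch below uses the normal-approximation (Wald) interval; this particular technique, the function name `proportion_ci` and the survey numbers are assumptions for illustration, and the interval is only reasonable for a random sample that is not too small.

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) interval for a population proportion.

    z = 1.96 corresponds to a confidence level of roughly 95%.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical survey: 120 of 200 respondents report using code reviews.
low, high = proportion_ci(120, 200)
print(f"{low:.3f} to {high:.3f}")
```

The interval says nothing about selection bias: if the sample is not representative, the estimate can be precise and still wrong.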


Action-Research

• Characteristics:

– Researcher interferes on the study object with the purpose of improving it

• Goals:

– Promote improvements, and

– Contribute to scientific knowledge


SANTOS, P.S.M.; TRAVASSOS, G.H.; ZELKOWITZ, M.V. (2011) Action research can swing the balance in experimental software engineering, Advances in Computers, vol. 83, 205-276.

Comparison of the Primary Studies



Exercises



• What is the difference between qualitative and quantitative research?

• What is a survey? Give examples of different types of surveys in software engineering.

• What role do replication and systematic literature reviews play in building empirical knowledge?

• How can the Experience Factory be combined with the Goal/Question/Metric method and empirical studies in a technology transfer context?

• What are the key ethical principles to observe when conducting experiments?

Required Reading

• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

– Chapter 3 – Measurement

• Optional reading:

– Basili, V., Caldiera, G., Rombach, D. Goal Question Metric Paradigm, Encyclopaedia of Software Engineering (Marciniak J. editor), vol. 1, John Wiley & Sons, 1994, p. 528-532.

– Basili, V., Trendowicz, A., Kowalczyk, M., Heidrich, J., Seaman, C., Münch, J., Rombach, D. Aligning Organizations through Measurement - The GQM+Strategies Approach. Springer-Verlag, 2014.

– Fenton, N.E.; Bieman, J.; Software Metrics: A Rigorous and Practical Approach; 3rd edition, Kindle edition; Boca Raton, FL: CRC Press Taylor & Francis Group; 2015 ISBN 978-1-4398-3823-5


MEASUREMENT


• Basic Concepts

– Scale Types

– Objective and Subjective Measures

– Direct or Indirect Measures

• Measurement in Software Engineering

• Measurement in Practice

• Exercises

Agenda


• “You can't control what you can't measure”

Tom DeMarco

• Measure x Measurement x Metric

• Measurement activities need clear goals

Basic Concepts


Measurement Goals

• Measurement activities need clear goals

– GQM: characterize, understand, evaluate, predict, improve?

• Goal/Question/Metric GQM (Basili and Rombach)


• Nominal

– Least powerful scale, based on nominal classification

– Example: Defect types

• Ordinal

– Ranks entities according to an ordering criterion

– Example: Software complexity levels, Likert scales

Scale Types


• Interval

– Used when the distance between two measures is meaningful, but not the value itself

– Example: Temperatures measured in Celsius or Fahrenheit

• Ratio

– If there exists a meaningful zero value and the ratio between two measures is meaningful, a ratio scale can be used

– Example: Effort invested in a development activity

Scale Types
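The scale type limits which statistics are meaningful. The sketch below encodes a common textbook mapping (mode from nominal upwards, median from ordinal, arithmetic mean from interval, ratios of values only on a ratio scale). The names `Scale`, `MIN_SCALE` and `meaningful_statistics` are illustrative, and the table is deliberately incomplete.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Scale types ordered from least to most powerful."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Statistic -> weakest scale on which it is meaningful (assumed mapping).
MIN_SCALE = {
    "mode": Scale.NOMINAL,
    "median": Scale.ORDINAL,
    "mean": Scale.INTERVAL,
    "ratio of two values": Scale.RATIO,
}


def meaningful_statistics(scale: Scale):
    """List the statistics that are meaningful for the given scale type."""
    return [name for name, needed in MIN_SCALE.items() if scale >= needed]


print(meaningful_statistics(Scale.ORDINAL))
print(meaningful_statistics(Scale.RATIO))
```

For example, averaging Likert responses treats an ordinal scale as if it were interval, which is why the median is usually the safer summary for such data.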


• Objective Measures

– There is no judgement in the measurement value, which therefore depends only on the object being measured

– Can be measured several times and will always provide the same value, within the measurement error

– Example: Lines of Code

• Subjective Measures

– The person making the measurement contributes by making some sort of judgement

– Mostly of nominal or ordinal scale types

– Example: Usability

Objective and Subjective Measures


• Direct Measures

– Gathered directly

– Example: Lines of Code

• Indirect Measures

– Involve the measurement of other attributes

– Example: Defects/LOC, LOC/Hour

Direct or Indirect Measures
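The two indirect measures named above are simple ratios of direct measures. A minimal sketch, with illustrative function names and made-up project numbers:

```python
def defect_density(defects: int, loc: int) -> float:
    """Defects per line of code: an indirect measure from two direct ones."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return defects / loc


def productivity(loc: int, hours: float) -> float:
    """Lines of code produced per hour of effort."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return loc / hours


# Hypothetical project: 30 defects, 12,000 LOC, 400 hours of effort.
print(defect_density(30, 12_000) * 1000, "defects per KLOC")
print(productivity(12_000, 400), "LOC per hour")
```

Any error in a direct measure (for example, an inconsistent rule for counting lines) propagates into every indirect measure derived from it.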


• Objects of Interest:

– Process

• Activities

– Product

• Artefacts

– Resources

• Human, Hardware and Software

Measurement in Software Engineering


• Internal Attributes

– Obtained directly from the process, product or resource

– Example: Size of a software product

• External Attributes

– Can only be measured with respect to how the object relates to other entities of its environment

– Example: Software reliability

Measurement in Software Engineering


• Measurement Approaches

– In software development processes

• Metrics are defined by the SEPG (Software Engineering Process Group) and are then collected for each software development project

• Goal Question Metric Paradigm (GQM).

• Practical Software Measurement (PSM).

– In experimental studies

• Metrics are defined by the researcher and then collected during the study operation phase.

• Goal Question Metric Paradigm (GQM).

Measurement in Practice


• Defines a way to plan and execute measurement and analysis activities;

– Starts with the declaration of the measurement Goals;

– From the goals, the Questions that we would like to answer through data interpretation are defined;

– Finally, from the questions, the Metrics and the data to be collected are defined.

• Example of a real GQM-based Measurement Plan

GQM


Examples of Experimental Study Goals

• GQM Template:

“Analyze <object of study> for the purpose of <purpose> with respect to <quality focus> from the point of view of the <perspective> in the context of <context>”.
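The goal template has five slots, so it can be captured as a small record that renders the sentence. The class name `GqmGoal`, its field names and the example values below are hypothetical and serve only to show how the slots are filled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GqmGoal:
    """The five slots of a GQM goal definition."""
    object_of_study: str
    purpose: str
    quality_focus: str
    perspective: str
    context: str

    def render(self) -> str:
        """Fill the goal template with the five slot values."""
        return (
            f"Analyze {self.object_of_study} "
            f"for the purpose of {self.purpose} "
            f"with respect to {self.quality_focus} "
            f"from the point of view of the {self.perspective} "
            f"in the context of {self.context}."
        )


# Made-up example goal for an inspection experiment.
goal = GqmGoal(
    object_of_study="checklist-based and perspective-based reading",
    purpose="evaluation",
    quality_focus="defect detection effectiveness",
    perspective="researcher",
    context="graduate students inspecting a requirements document",
)
print(goal.render())
```

Writing the goal as structured data makes a missing slot obvious before any questions or metrics are derived from it.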



CARNEIRO, G.; LAIGNER, R.; KALINOWSKI, M.; WINKLER, D.; AND BIFFL, S. Investigating the influence of inspector learning styles on design inspections: Findings of a quasi-experiment. In CIbSE 2017 - XX Ibero-American Conference on Software Engineering, pages 222-235, 2017.

Analyze the documentation debt related to the use of agile requirements (AR), i.e., user stories

for the purpose of characterizing

with respect to the impacts that it can cause on the project in terms of extra effort and cost

from the viewpoint of the project manager

in the context of an industrial software development project.


MENDES, T. S.; DE FREITAS FARIAS, M. A.; MENDONÇA, M. G.; SOARES, H. F.; KALINOWSKI, M.; AND SPÍNOLA, R. O. Impacts of agile requirements documentation debt on software projects: a retrospective study. In Proceedings ACM Symposium on Applied Computing, Pisa, Italy, April 4-8, 2016, pages 1290-1295, 2016.



ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), Belo Horizonte, Brazil, 2015.

Exercises

• What are measure, measurement and metric, and how do they relate?

• Which are the four main measurement scale types?

• What is the difference between a direct and an indirect measure?

• Which three classes are measurements in software engineering divided into?

• What are internal and external attributes and how are they mostly related to direct and indirect measures?


Required Reading

• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

– Chapter 4 – Systematic Literature Reviews

• Kitchenham, B., Charters, S. Guidelines for performing systematic literature reviews in software engineering. Technical Report, Keele University and University of Durham, 2007.

• Petersen, K., Vakkalanka, S., Kuzniarz, L., Guidelines for conducting systematic mapping studies in software engineering: An update. Information & Software Technology 64: 1-18, 2015.

• Optional Reading (examples):

– E. Mendes, M. Kalinowski, D. Martins, F. Ferrucci and F. Sarro, Cross- vs. Within-Company Cost Estimation Studies Revisited: An Extended Systematic Review, In: Proc. International Conference on Evaluation and Assessment in Software Engineering (EASE), London, UK, 2014.

– Alves N. S. R., Mendes, T. S., Mendonca, M. G., Spínola R.O., Shull, F., Seaman, C.B.: Identification and management of technical debt: A systematic mapping study. Information & Software Technology 70: 100-121 (2016).


SECONDARY STUDIES


Knowledge Acquisition in Software Engineering Studies

The experimentation process has a recursive nature

Knowledge acquired in primary studies feeds secondary studies, which enable identifying the need for new primary studies...


TRAVASSOS, G. H.; SANTOS, P. S. M.; MIAN, P.; DIAS NETO, A. C.; BIOLCHINI, J. (2008). An Environment to Support Large Scale Experimentation in Software Engineering. In: Proc. of XIII IEEE International Conference on Engineering of Complex Computer Systems, Belfast.

Secondary Studies

• Secondary studies are studies that review primary studies concerning a specific research question with the goal of providing a research synthesis of the existing evidence.

– Aim at identifying, evaluating and interpreting all relevant results on a given research topic.

– Examples: systematic reviews.


Systematic Literature Reviews (SLRs)

• Literature review that aims at being:

– ...fair (not biased)

– ...rigorous (defined process)

– ...open (transparent)

– ...objective (reproducible)

• Used in many research areas

– Social sciences, health and education

– Very common in medicine


KITCHENHAM, B.; CHARTERS, S. (2007) Guidelines for performing Systematic Literature Review in Software Engineering. Keele University Technical Report - EBSE-2007-01.

Reasons for Conducting Reviews

• Academia:

– Experimental characterization of different technologies.

– Repetition of studies in different contexts to acquire knowledge incrementally.

• Industry:

– Experimental results may indicate the impact of using technologies in different contexts.

– Decision support.

– Decision support.


Advantages of Conducting SLRs

Characteristic              Traditional Review                 Systematic Review
Question                    Usually broadly scoped             Focused on research questions
Identification of research  Not specified, potentially biased  Several sources and well-defined search strategy
Selection                   Not specified, potentially biased  Selection based on explicit criteria
Evaluation                  Variable                           Rigorous assessment
Synthesis                   Frequently a qualitative summary   Qualitative and quantitative
Inferences                  Sometimes based on evidence        Usually based on evidence


SLR

[Figure: SLR selection funnel. Primary studies (surveys, case studies, experiments) pass through a first filter and then a second filter, after which data is extracted.]


Systematic Mapping Study (SMS)

• Secondary study approach

• Rigorous review, that uses a formal process to:

– Identify all relevant research on a specific topic

– SMSs are conducted to identify and categorize existing studies

• Provide only an overview of the research topic

• There is no comparison of results of methods or techniques


PETERSEN, K., FELDT, R., MUJTABA, S., MATTSSON, M. (2008) Systematic Mapping Studies in Software Engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering.

Discussion of the Papers: Best Practices and Examples


Required Reading

• Kuhrmann, M., Fernández, D.M. and Daneva, M., 2017. On the pragmatic design of literature studies in software engineering: an experience-based guideline. Empirical software engineering, 22(6), pp.2852-2891.

• Cruzes, D.S. and Dybå, T., 2011. Research synthesis in software engineering: A tertiary study. Information and Software Technology, 53(5), pp.440-455.


EXPERIMENT PROCESS, SCOPING AND PLANNING


Required Reading

• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

– Chapters 6 (Experiment Process), 7 (Scoping), and 8 (Planning).

• Optional Reading (examples of experiments):

– ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte.

– RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.


• Experimentation Process

• Experiment Scoping

• Experiment Planning– Context Selection

– Hypotheses Formulation

– Variable Selection

– Participant Selection

– Experiment Design

– Instrumentation

– Threats to Validity

Agenda


Experimentation Process

Scoping

Planning

Execution

Analysis



Experiment Scoping

• Identify the Goal and the Context of the Study

GQM template:

“Analyze <Object(s) of study> for the purpose of <Purpose> with respect to their <Quality focus> from the point of view of the <Perspective> in the context of <Context>”.

• Identify the objects and study groups (control and experimental group)




Experiment Planning


Context Selection

• Four dimensions:

– Off-line vs on-line;

– Students vs professionals;

– Toy vs real problems;

– Specific vs general.


Hypothesis Formulation

• Null Hypothesis;

• Alternative Hypotheses.


Variable Selection

• Dependent Variables (Response Variables);

• Independent Variables (including Factors).


Participant Selection

• Sample selection.

– Selecting subjects at random is not always possible


Experiment Design

• Principles:

– Randomization;

– Blocking;

– Balancing;

• Design Types:

– Number of factors;

– Number of treatments.


Instrumentation

• Instruments should be completely developed before conducting the experiment and ideally evaluated through a pilot study.

• Examples: Consent form (agreement to participate), subject characterization form, study objects, task description, measurement instruments, follow-up questionnaire.


Threats to Validity

• Conclusion Validity;

• Internal Validity;

• Construct Validity;

• External Validity.



Exercises



• What are a null hypothesis and an alternative hypothesis?

• What are type-I and type-II errors, respectively? Which is worse, and why?

• In which different ways may subjects be sampled?

• What different types of experiment designs are available, and how does the design relate to the statistical methods to apply in the analysis?

• What are the types of threats to validity? Provide one example threat for each type.

EXPERIMENT DESIGN: ADVANCED CONCEPTS


Required Reading

• ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte, Brazil.

• RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.


ADVANCED CONCEPTS (Let's plan and manage more complex studies)

Based on material kindly provided by Prof. Guilherme Horta Travassos


Principles of Experimental Designs

• Simple designs help to make the experiment practical

– minimizing use of time, money, personnel and experimental resources

– easier to analyze

• Maximizing information yields more complete understanding

– allows generalization to the widest possible situations

• Consider several issues to simplify and maximize:

– experimental error

– replication

– randomization

– local control


Factors and Experimental Design

• A factor is an independent variable in the design.

– Examples: To determine the effects of experience and language on productivity, design may have two independent variables: experience and language. Dependent variable is productivity.

• Values or classifications for each factor are called levels of the factor.

• Levels can be continuous or discrete, quantitative or qualitative.

– Example: Number of years of experience


Experimental Error

• Experimental error describes the failure of two identically treated experimental units to yield identical results

– reflects errors of experimentation

– reflects errors of observation

– reflects errors of measurement

– reflects the variation in experimental resources

– reflects the combined effects of confounding factors that can influence the characteristics under study but which have not been singled out for attention in the investigation

• Example: Error may be due to

– mind wandering

– timer measured elapsed time inexactly

– distractions: loud noises in next room

– …


How to Control Error

• Control as many variables as possible

• Minimize variability among participants

• Minimize effects of irrelevant variables

• Try to use design to distribute effects of irrelevant variables equally across all experimental conditions

• Techniques for controlling error in the design

– Replication

– Randomization

– Local control


Replication

• Represents the repetition of the basic experiment

• It means repeating an experiment under identical conditions, rather than repeating measurements on the same unit

• It provides an estimate of experimental error that acts as a basis for assessing the importance of observed differences in an independent variable (that is, how much confidence we can have in the results)

• It enables us to estimate the mean effect of any experimental factor


Confounding

• Two or more variables are confounded if it is impossible to separate their effects when the subsequent analysis is performed.

– Example: you are comparing the use of a new tool with your existing tool. Programmer A uses the new tool in your development environment, while B uses the existing tool. If you compare measures of quality in the resulting code, the difference is due to the tools only if you have accounted for differences in skill of the programmers. Otherwise, the effects of tools and programmer skill are confounded.

• Confounding is introduced when there is no control for other variables.

• Sequence can also confound (learning effect): Test team uses technique X to test, then technique Y.


Randomization

• Replication allows us to know the statistical significance of the results, but not the validity. That is, we want to be sure that the results followed from the treatments applied. For this, we distribute the observations independently.

• Randomization is the random assignment of subjects to groups or of treatments to experimental units, so that we can assume independence and thus validity of results.

• Randomization does not guarantee independence but keeps variation of bias to a minimum.
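A minimal sketch of random assignment of subjects to treatments: shuffle the subjects, then deal them out round-robin so that group sizes differ by at most one. The function name `randomize`, the subject identifiers and the fixed seed (used only so the plan can be reproduced and audited) are assumptions for illustration.

```python
import random


def randomize(subjects, treatments, seed=None):
    """Shuffle subjects, then deal them round-robin to the treatments."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


# Hypothetical subjects and treatments (techniques X and Y).
subjects = [f"s{i}" for i in range(1, 11)]
groups = randomize(subjects, ["X", "Y"], seed=7)
for treatment, members in groups.items():
    print(treatment, members)
```

Letting a random number generator make the allocation, rather than the experimenter, is what removes the experimenter's (possibly unconscious) preferences from the assignment.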


Local Control

• Reflects how much control you have over the placement of subjects in experimental units and the organization of those units.

• Makes the design more efficient by reducing the magnitude of experimental error.

• Two aspects of local control:

– blocking

– balancing the units


Blocking

• allocates experimental units to blocks or groups so that the units within a block are relatively homogenous

• predictable variation among units is confounded with the effects of the blocks

• Example: investigating the effects of three design techniques on code quality.

– Teach techniques to 12 developers, measure number of defects per thousand lines of code

– If the 12 graduated from 3 universities, training at each university may affect the way the design technique is understood or used

– To eliminate the effects of this, define three blocks: first has all developers from university X, second from university Y, third from university Z

– Then assign treatments randomly to the developers within each block
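The blocking example can be sketched as a randomized block assignment: developers are shuffled separately inside each university block and then dealt out to the three techniques. The distribution of the 12 developers over the universities (6, 3 and 3), the identifiers and the function name `blocked_assignment` are assumptions; the example above does not state how many developers come from each university.

```python
import random


def blocked_assignment(blocks, treatments, seed=None):
    """Randomly assign treatments to units separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment


# Hypothetical data: 12 developers grouped by the university they attended.
blocks = {
    "X": ["d1", "d2", "d3", "d4", "d5", "d6"],
    "Y": ["d7", "d8", "d9"],
    "Z": ["d10", "d11", "d12"],
}
techniques = ["T1", "T2", "T3"]
plan = blocked_assignment(blocks, techniques, seed=42)
for developer, (block, technique) in sorted(plan.items()):
    print(developer, block, technique)
```

Because every technique appears in every block, differences between universities affect all techniques alike and can be separated from the treatment effect in the analysis.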


Balancing

• blocking and assigning treatments so that an equal number of subjects is assigned to each treatment, whenever possible

• simplifies statistical analysis

• designs can range from being completely balanced to little or no balance

• If a design has no blocks, it must be completely randomized.


Types of Experimental Designs

• Type of design can constrain the analysis.

– For example, the way to perform an analysis of variance depends on number of variables and the way in which subjects are grouped and balanced.

• Measurement scale can constrain the analysis.

– Nominal scales divide data into categories, while ordinal scales permit rank ordering and more powerful tests. Parametric tests such as analysis of variance require at least interval scale.

• Sampling can constrain the analysis.

– Degree of randomization

– Distribution of data

• Normal or near-normal and homoscedastic distributions can use parametric tests; otherwise, non-parametric tests are preferable.
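The constraints above amount to a simple decision rule, sketched below. The function name `suggest_test_family` and its boolean inputs are illustrative; in practice normality and homoscedasticity are themselves judged from the data (for example, with plots or dedicated tests), which this sketch takes as given.

```python
SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]


def suggest_test_family(scale: str, near_normal: bool, homoscedastic: bool) -> str:
    """Suggest parametric tests only when all three conditions hold."""
    at_least_interval = SCALE_ORDER.index(scale) >= SCALE_ORDER.index("interval")
    if at_least_interval and near_normal and homoscedastic:
        return "parametric"
    return "non-parametric"


print(suggest_test_family("ratio", near_normal=True, homoscedastic=True))
print(suggest_test_family("ordinal", near_normal=True, homoscedastic=True))
print(suggest_test_family("ratio", near_normal=False, homoscedastic=True))
```

The rule is conservative by design: whenever an assumption is in doubt it falls back to non-parametric tests, which make fewer assumptions at the cost of some statistical power.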


EXPERIMENT ANALYSIS AND INTERPRETATION


Required Reading

• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.

– Chapters 9 (Operation) and 10 (Analysis and Interpretation)

• ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte, Brazil.

• RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.


ADVANCED CONCEPTS
(Let's talk about statistics and data analysis)

Based on material kindly provided by Prof. Guilherme Horta Travassos


Experimentation Process

Definition

Planning

Execution

Analysis

Statistical Inference Techniques


Hypotheses, Variables and Scales

• Planning and Hypotheses

• Hypotheses

• Choosing the variables

• Scales

• Scales’ information level

• Scales and basic operations


Planning and Hypotheses

• Planning


– Dependent variables identification (responses)

– Independent variables identification (factors)

– Participant selection

– Study Design

– Selection of Analysis Methods

– Instruments Definition

– Threats to validity (experiment risks)


Hypothesis

• A hypothesis is a theory or supposition that can explain a given behavior of research interest

• An experimental study aims at collecting data, in a controlled environment, to support or refute the hypothesis

“Developers using the technique Y can conclude the task of requirements analysis in less time and produce a more complete requirements set than when using the technique X”


Hypotheses and Variables

• Hypotheses guide the definition of variables

• Independent Variables (become factors when controlled)

– Relate to process inputs. Can be controlled.

– Represent the causes that are expected to affect the results. When controlled, their values are called treatments.

• Dependent Variables

– Relate to process outputs and are affected throughout the experimentation process.

– Represent the effect of the combination of the independent variables' values (including the factors). Their possible values are called results.


Hypotheses and Variables


Independent Variables

Used technique (treatments: Y and X)

Developers Characterization

Application Characterization

Dependent Variables

Time to execute the task

% of right requirements defined


Variables and their values

• Study variables can be:

– Qualitative: the values (treatments) represent types

– Quantitative: the values represent levels for the variable application

• The values of the variables are collected in scales:

– There are different scales that can be used to collect and represent these values: nominal, ordinal, interval and ratio.

– The scales specify the operations that can be applied to the variables values


Nominal Scale

• Nominal scale values represent different types of an element, without numerical interpretation nor ordering among them.

• Examples in software include:

– Names of different measures of software size (lines of code, function points, use case points, ...)

– Names of different programming languages (Java, C++, C#, Pascal, ...)

• The scale does not allow us to say, for instance, that lines of code is greater than function points, nor that Java is less than C#


Ordinal Scale

• Ordinal Scale values represent different element types that can be ordered with no numerical interpretation

• Examples in software include:

– Different CMMI levels (1, ..., 5) or MPS.BR levels (G, ..., A)

• The scale allows us to say, for instance, that CMMI 2 is less than CMMI 3, but does not allow us to say that the quality difference between companies at CMMI 2 and CMMI 3 is the same as between CMMI 3 and CMMI 4.


Interval Scale

• Interval scale values can be ordered and the distance between consecutive values has the same meaning everywhere on the scale; however, the ratio between values has no meaning.

• For instance: although we can say that 2011 is one year after 2010 and one year before 2012, there is no meaning in calculating the ratio between 2011 and 2012.

• Ratios have no meaning because every interval scale has an arbitrary zero point (in the case of dates, the year 0)


Interval Scale

• Likert scales are an example of a scale commonly treated as interval in software-related studies

– Using a Likert scale we can define different names to represent, in general, the intensity of a property that cannot be directly measured.

– For instance, we can build a Likert scale to evaluate risk impact using the following values: very high, high, medium, low and very low.

– Although the distance between values cannot be verified in the real world, it is assumed that consecutive values are approximately equidistant.


Ratio Scale

• Ratio scale values can be ordered, the distance between consecutive values has the same meaning, and the ratio between values can be interpreted.

• Examples in software include software size, effort and project execution time.

• The ratio scale allows us to say, for instance, that a software with X lines of code is half the size of a software with 2X lines of code

• In a ratio scale, 0 (zero) means absence of the measured property.


Scales Information

Nominal: values can be counted

Ordinal: values can be counted and ordered

Interval: values can be counted and ordered; distance between values can be interpreted

Ratio: values can be counted and ordered; distance between values can be interpreted; ratio between values can be interpreted

(More information from Nominal to Ratio)


Scales and Characteristics

Scale                          Nominal  Ordinal  Interval  Ratio
Values Counting                X        X        X         X
Values Ordering                         X        X         X
Equidistant Intervals                            X         X
Adding and Subtracting values                              X
Values Division                                            X

• According to the variable scales, we can explore different characteristics of their values


Example


Independent Variables

Used technique (treatments: Y and X): Nominal Scale with 2 treatments

Developers and Application Characterization: Nominal or Ordinal Scale

Dependent Variables

Time to execute the task: Ratio Scale

% of right requirements defined: Ratio Scale


Tabulation and Graphics

• Variables and execution

• Tabulation

• Graphical Analysis

• Histograms

• Pie Charts

• Dispersion Charts

• Control Charts


Variables and Execution

• The execution of an experimental study consists of a series of trials

– In each trial, a participant applies one treatment for each independent variable and produces a result for each dependent variable

– These results are collected in tuples of type Ai = {Ti, Ri}, where Ti is the ordered set of treatments (one per independent variable) applied by participant i and Ri is the ordered set of results obtained by the same participant for each dependent variable

– These tuples are the input to the data analysis in the experimental study.


Variables and Execution

• Some tabulated data after the execution of a hypothetical study. These data will be used in the next slides.

Participant  Technique  Time (days)  % Right Found
1            Y          10           83%
2            Y          13           73%
3            Y          12           87%
4            Y          13           78%
5            Y          10           74%
6            Y          14           74%
7            Y          14           87%
8            Y          13           75%
9            Y          14           86%
10           Y          14           82%
11           Y          13           77%
12           X          13           90%
13           X          9            89%
14           X          11           88%
15           X          14           87%
16           X          9            97%
17           X          12           81%
18           X          9            82%
19           X          12           86%
20           X          11           92%
21           X          14           96%
22           X          13           98%
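As a rough illustration of the tuples Ai = {Ti, Ri}, the tabulated data can be held as one record per trial. The `Trial` type and its field names are hypothetical choices made for this sketch; the values are the ones from the table.

```python
from typing import NamedTuple

class Trial(NamedTuple):
    participant: int
    technique: str   # T_i: treatment of the single independent variable
    time_days: int   # R_i: first dependent variable
    pct_right: int   # R_i: second dependent variable, in percent

rows = [
    (1, "Y", 10, 83), (2, "Y", 13, 73), (3, "Y", 12, 87), (4, "Y", 13, 78),
    (5, "Y", 10, 74), (6, "Y", 14, 74), (7, "Y", 14, 87), (8, "Y", 13, 75),
    (9, "Y", 14, 86), (10, "Y", 14, 82), (11, "Y", 13, 77),
    (12, "X", 13, 90), (13, "X", 9, 89), (14, "X", 11, 88), (15, "X", 14, 87),
    (16, "X", 9, 97), (17, "X", 12, 81), (18, "X", 9, 82), (19, "X", 12, 86),
    (20, "X", 11, 92), (21, "X", 14, 96), (22, "X", 13, 98),
]
trials = [Trial(*row) for row in rows]

# Group one dependent variable by treatment, ready for later analysis.
time_by_technique = {}
for trial in trials:
    time_by_technique.setdefault(trial.technique, []).append(trial.time_days)

print(time_by_technique)
```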

Variables and Execution

• After data tabulation, central tendency measurements, dispersion and dependency can be used together with graphical analysis to better “understand” the data.

• This understanding is important when selecting and applying the statistical inference techniques, that will support the hypothesis testing.


Graphical Visualization

• A chart visually represents the tabulated information

– Charts are usually easier to understand when compared to large tabulated data sets

– The spatial data presentation helps in the identification of groups and the visualization of relationships among them

– In general, charts can be quickly read

• Methods for graphical representation of data

– Histograms

– Pie Charts

– Dispersion charts


Graphical Visualization

• The graphical visualization methods can depend on the variables classification (continuous, discrete)

• Discrete variables can assume any value in a defined finite set of values

– They are more common in nominal or ordinal scales. However, they can also occur in the interval and ratio scales

• Continuous variables can assume any value in an interval with an infinite set of values

– They are common in the interval and ratio scales.


Histograms

• It shows the observed values of one specific variable in the frequency domain

– The frequency indicates the number or percentage of occurrences of each value in the collected values set

– If the data is discrete, each value is presented as a bar as high as the number of times that the value occurs in the value set

– If the data is continuous, it must first be made discrete, that is, split into equally wide intervals. Then the number of collected values falling in each interval is counted, and a bar is drawn for each interval as for discrete data.


Histograms

• It is a common representation method for data in any scale, because it involves only counting.

• Histograms also allow relating the observed data to known frequency distributions

– These distributions have mathematical properties from which the statistical inference tests are derived

• If the observed data do not follow these properties (normality, for instance), we cannot be confident in the results of the tests. In these cases, other types of statistical tests must be used


Histograms

• Histogram of time spent by the participants in the analysis activity, according to the technique used

Time (days)  Technique Y  Technique X
9            0            3
10           2            0
11           0            2
12           1            2
13           4            2
14           4            2

* Data Distribution Table

[Figure: histogram; x-axis: Time (days); y-axis: # participants]
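The data distribution table can be reproduced by simple counting. A minimal sketch, assuming the times are copied from the tabulated study data; the variable names are illustrative.

```python
from collections import Counter

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
time_x = [13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]

# A Counter returns 0 for values that never occur, which suits a histogram.
freq_y = Counter(time_y)
freq_x = Counter(time_x)

print("Time (days)  Technique Y  Technique X")
for day in range(min(time_y + time_x), max(time_y + time_x) + 1):
    print(f"{day:<11}  {freq_y[day]:<11}  {freq_x[day]}")
```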


Cumulative Histogram

• A cumulative histogram shows the frequency of occurrence of values less than or equal to a specific value.

– Each bar in the graph represents the sum of the corresponding bar and all previous bars of the conventional histogram

– In some configurations, observing the cumulative histogram can suggest whether the hypothesis will be accepted or rejected (however, only statistical testing can confirm it!)

– Because data must be ordered, cumulative histograms cannot be used with nominal scale values.


Cumulative Histogram

• Cumulative histogram of time spent by the participants in the analysis activities with techniques X and Y

Time (days)  Technique Y  Technique X
9            0            3
10           2            3
11           2            5
12           3            7
13           7            9
14           11           11

* Data Distribution Table

[Figure: cumulative histogram for Technique Y and Technique X; x-axis: Time (days), 9 to 14; y-axis: # participants, 0 to 12]
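The cumulative counts are running sums of the ordinary histogram counts. A small sketch under the same assumptions as before (times copied from the study table, illustrative names):

```python
from collections import Counter
from itertools import accumulate

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
time_x = [13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]
days = list(range(9, 15))

freq_y, freq_x = Counter(time_y), Counter(time_x)
# accumulate() yields the running total of the per-day frequencies.
cumulative_y = list(accumulate(freq_y[day] for day in days))
cumulative_x = list(accumulate(freq_x[day] for day in days))

for day, cy, cx in zip(days, cumulative_y, cumulative_x):
    print(day, cy, cx)
```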


Pie Chart

• A pie (pizza) chart shows the relative frequency (or percentage) of data occurrence, dividing the data by a set of distinct classes and presenting them as proportional slices in the circle.

[Figure: pie chart for Technique X showing the % of participants per time in days: 9 days 28%, 11 days 18%, 12 days 18%, 13 days 18%, 14 days 18%]


Dispersion Diagram

• It shows the observed values of two or more variables through Cartesian graphics.

– Each axis represents one of the variables, composing tuples (two or more dimensions)

– This representation format helps in the identification of patterns that can suggest relations between variables

– Dispersion Diagrams also help to identify the values that are different from normal behavior (outliers). Outliers can distort statistical analysis and shall usually be eliminated before statistical tests.


Dispersion Diagram

• Dispersion between the percentage of right requirements found and the execution time for activities with techniques X and Y

[Figure: dispersion diagram for techniques Y and X; x-axis: Time, 8 to 16; y-axis: % right requirements found, 60% to 100%]


Control Charts

• Statistical tool allowing the observation of quantitative data behavior representing the characteristics under investigation

• A typical control graph presents 3 parallel lines:

– A central line, representing the mean behavior presented by the data

– A high extreme limit, called UCL – Upper Control Limit

– A low extreme limit, called LCL – Lower Control Limit


Control Charts

[Figure: control chart of the number of defects (y-axis, 0 to 70) per version (x-axis, 1 to 21), with UCL = 26.81, Mean = 15.14 and LCL = 3.46]

• If the behavior of the characteristic is under control, its values will fluctuate around the center line (for instance, the mean number of defects per software version), within the UCL and LCL limits.

• While the behavior of the characteristic is under control, the probability of getting a value outside the limits is very low.
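The slides do not state how the limits are computed. The limits shown in the chart are approximately the mean plus and minus three times the square root of the mean (15.14 ± 11.67), which is the usual rule for a chart of counts (c-chart). The sketch below assumes that rule and uses hypothetical defect counts, not the data behind the chart.

```python
import math

# Hypothetical defect counts per software version (illustrative only).
defects_per_version = [12, 18, 15, 9, 22, 14, 16, 11, 19, 30, 13]

center_line = sum(defects_per_version) / len(defects_per_version)
# Assumption: c-chart limits, i.e. center line +/- 3 * sqrt(center line).
spread = 3 * math.sqrt(center_line)
ucl = center_line + spread
lcl = max(0.0, center_line - spread)  # a count cannot be negative

out_of_control = [
    (version, count)
    for version, count in enumerate(defects_per_version, start=1)
    if count > ucl or count < lcl
]
print(f"LCL = {lcl:.2f}, mean = {center_line:.2f}, UCL = {ucl:.2f}")
print("Out-of-control points:", out_of_control)
```

Points flagged this way are candidates for investigation, not automatically errors.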


Descriptive Statistics

• Objectives

• Central Tendency Measures

• Dispersion Measures

• Frequency Distribution

• Example

• Dependency Measurements


Objectives

• To describe the behavior and trends of the data collected in the experimental study through statistical methods

– Together with graphical analysis, it allows the initial analysis of the data and the measuring of dependencies and relationships among data.

• It aims to give a general view of the distribution of the data set.


Central Tendency Measures

• Show the middle values of the observed data set

– Mean (arithmetic): meaningful for the interval and ratio scales

– Median: represents the middle value of an ordered data set, so that the number of samples higher than the median is the same as the number of samples lower than the median
  • Odd number of samples: the median is the middle sample
  • Even number of samples: the median is the mean of the two middle samples

– Mode: represents the most commonly occurring sample. It is meaningful for the nominal, ordinal, interval and ratio scales.
  • Well defined when just one value has the highest count
  • Odd number of samples: the middle value of the most common samples with the same occurrence count can be taken as the mode (not valid for nominal scale)


Central Tendency Measures

• Other relevant measures

– Minimum Value: the lowest observed value in the collected data set

– Maximum Value: the highest observed value in the collected data set

– Percentile: considering a sample with 100 values, the X% percentile is the value that splits the sample into X values lower than it and (100-X) values greater than it. The median is a special case of the percentile, namely the 50% percentile

– Quartile: values representing the 25% percentile (1st Quartile), the median (2nd Quartile) and the 75% percentile (3rd Quartile).


Dispersion Measures

• Measure the level of variation from the central tendency, i.e., show how spread out or concentrated the data is

– Range: represents the distance between the maximum and minimum data values

– Variance: the mean of the squared distances from the sample mean. It is meaningful for the interval and ratio scales.

– Standard deviation: the square root of the variance, having the same dimension (unit of measure) as the data values themselves.

[Figure: two frequency curves, each marked with its mean]



[Figure: bar chart representing the time (in days, y-axis 0 to 15) consumed by each participant (x-axis, 1 to 11) that applied technique Y in the analysis activity]

Measures of Tendency

Measure             Value
Mean                12.73
Median              13
Modes               13 and 14
Range               4
Minimum             10
Maximum             14
1st Quartile        12.5
3rd Quartile        14
Variance            2.22
Standard Deviation  1.49

[Figure: Technique Y, histogram of time; x-axis: Time (days), 9 to 14; y-axis: # participants, 0 to 5]



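Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Standard Deviation: $s = \sqrt{s^2}$

There are other measures (such as kurtosis, asymmetry, geometric mean, ...) but they are out of the scope of our discussion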
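The measures reported for technique Y can be recomputed with Python's standard `statistics` module. This is a sketch, not the tool used for the slides; the choice of the "inclusive" quartile method is an assumption made because it reproduces the quartiles in the table.

```python
import statistics

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]

mean = statistics.mean(time_y)
median = statistics.median(time_y)
modes = statistics.multimode(time_y)      # every value with the highest count
value_range = max(time_y) - min(time_y)
# Assumption: "inclusive" interpolation between sorted values for quartiles.
q1, _, q3 = statistics.quantiles(time_y, n=4, method="inclusive")
variance = statistics.variance(time_y)    # sample variance, divides by n - 1
std_dev = statistics.stdev(time_y)

print(f"mean={mean:.2f} median={median} modes={modes} range={value_range}")
print(f"Q1={q1} Q3={q3} variance={variance:.2f} sd={std_dev:.2f}")
```

Note that `statistics.variance` divides by n - 1, which is the convention that matches the variance of 2.22 shown above.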


Frequency Distributions

• As seen, histograms can represent data in the frequency domain

• Histograms allow verifying whether the data distribution follows a classical distribution, such as normal, uniform, beta, among others.

• The normal distribution, in particular, is important for some statistical tests, which require that analyzed data follow a normal distribution.


Frequency Distributions

• The normal distribution has a bell shape, with the left and right tails extending from the central point.

– The curve is symmetric about its mean and its width is proportional to its standard deviation

– In this way, the curve can be defined by its mean and standard deviation.

http://en.wikipedia.org/wiki/Image:Normal_distribution_pdf.png

Normal Distribution

• If a numerical data set follows the normal distribution, it is possible to claim:

– 68% of all observations are within the mean ± 1 standard deviation

– 95.5% of all observations are within the mean ± 2 standard deviations

– 99.7% of all observations are within the mean ± 3 standard deviations
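These percentages can be checked numerically from the cumulative distribution function of the standard normal distribution. A minimal sketch using `statistics.NormalDist` (available in Python 3.8 or later):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Probability mass between mean - k*sd and mean + k*sd.
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within mean ± {k} sd: {coverage[k]:.2%}")
```

The same proportions hold for any mean and standard deviation, since the interval is expressed in units of the standard deviation.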


http://en.wikipedia.org/wiki/Image:Standard_deviation_diagram.png


Measures of Dependency

• When two or more variables are related, it can be useful to calculate the level of dependency among them.

• Measures of dependency quantify the strength and direction of the relationship between two or more variables.

– The most used measure of dependency is CORRELATION

– Correlation between two variables is represented by a number

– Correlation among more than two variables is represented through a correlation matrix



The CORRELATION between two variables ranges from -1 to 1

A correlation of -1 indicates that a high value in one variable corresponds to a low value in the other one

A correlation of 1 indicates that a high value in one variable corresponds to a high value in the other one

A correlation of 0 (or near 0) indicates that no relationship can be inferred from the data

CAUTION: CORRELATION alone does not imply CAUSATION!


Pearson Correlation

• Most common correlation coefficient

– Quantifies the strength of the linear association between two variables and describes how well a straight line could be fitted to these points

– The coefficient assumes the data distribution is normal

• When the data are normally distributed, this condition is indicated by an elliptical cloud of points in the dispersion diagram of these values



[Figure: three dispersion diagrams of variables A and B, with CORREL(A,B) = 0.02, CORREL(A,B) = 0.98 and CORREL(A,B) = -0.98]
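The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below implements that textbook formula directly; the function name `pearson` is illustrative, and the data are the time and percentage values of the 22 participants in the study table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

time_days = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13,
             13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]
pct_right = [83, 73, 87, 78, 74, 74, 87, 75, 86, 82, 77,
             90, 89, 88, 87, 97, 81, 82, 86, 92, 96, 98]

r = pearson(time_days, pct_right)
print(f"Pearson correlation between time and % right: {r:.2f}")
```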


Spearman Correlation

• Another example of a correlation coefficient

– This method is based on the ranking of the collected values and not on the values themselves

– In this way, it can also be used for variables in ordinal scales

• It can also be used when the distribution is not normal
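One common way to compute the Spearman coefficient is to replace each value by its rank (tied values share the average rank) and then apply the Pearson formula to the ranks. The sketch below follows that approach; the helper names and the small data set are illustrative.

```python
import math

def average_ranks(values):
    """Ranks from 1 to n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative monotonic but non-linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(f"Pearson: {pearson(xs, ys):.3f}  Spearman: {spearman(xs, ys):.3f}")
```

Because only the ordering matters, a monotonic but curved relationship still yields a Spearman coefficient of 1, while the Pearson coefficient stays below 1.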


Regression Analysis

• Regression analysis extends the representation of dependency, providing an equation to describe the nature of the relationship.

• In simple regression analysis, the interest is in predicting the value of a dependent variable based on the value of an independent variable

[Figure: number of defects (y-axis, -10 to 60) per version (x-axis, 0 to 30), showing the observed values and the fitted linear regression line]
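A simple linear regression can be fitted with ordinary least squares. The sketch below is a minimal version of that technique; the function `fit_line` and the defect counts per version are hypothetical and are not the data plotted in the figure.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: number of defects found in each version.
versions = [1, 2, 3, 4, 5, 6, 7, 8]
defects = [5, 9, 8, 14, 15, 21, 20, 26]

slope, intercept = fit_line(versions, defects)
predicted_next = slope * 9 + intercept  # prediction for version 9
print(f"defects ≈ {slope:.2f} * version + {intercept:.2f}")
print(f"predicted defects for version 9: {predicted_next:.1f}")
```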



• According to the variable scales, we can calculate

Scale               Nominal  Ordinal  Interval  Ratio
Mean                                  X         X
Median                       X        X         X
Mode                X*       X        X         X
Range                        X        X         X
Variance                              X         X
Standard Deviation                    X         X
Corr Pearson                          X         X
Corr Spearman                X        X         X

* Remember restrictions for nominal scale!


Outliers Analysis

• Concepts

• Conditions of Occurrence

• Visual Identification

• Numerical Identification


Outliers Removal

• Extreme values (outliers) are observed values that are too distant from the other values in the data set

– They can represent errors in the data set and usually must be removed before statistical analysis

– They can occur due to problems in the study execution, typing, interpretation or participants' motivation

– It is important to verify the origin of each outlier, because it can represent a valid observation that should be kept in the data set (false positive)


Visual Identification

• Outliers can be visually identified through dispersion diagrams or box plots

– Box plots were designed to show the distribution of quantitative data

– They make use of measures of central tendency and dispersion to characterize the distribution

[Figure: box plot labeled, from top to bottom: Maximum Value, 3rd Quartile (mean + X standard deviation), Median, 1st Quartile (mean - X standard deviation), Minimum Value]


Numerical Identification

• Outlier removal methods usually remove values whose distance from the mean or median exceeds a given threshold

– Values near the limits do not necessarily need to be removed from the data set (subjectivity)

– The distance is usually determined as one quartile, one percentile or a specified number of standard deviations

• Quartile Method

– Lower outliers: values below Q1 - 1.5*IQ

– Upper outliers: values above Q3 + 1.5*IQ

– Where IQ = Q3 - Q1.
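The quartile method can be sketched as follows. The function name `quartile_outliers` is illustrative, the "inclusive" quartile interpolation is an assumption (other conventions give slightly different fences), and the extra value 40 is a hypothetical typing error added only to show a detection.

```python
import statistics

def quartile_outliers(values, k=1.5):
    """Values below Q1 - k*IQ or above Q3 + k*IQ, where IQ = Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    lower_fence, upper_fence = q1 - k * iq, q3 + k * iq
    return [v for v in values if v < lower_fence or v > upper_fence]

# % of right requirements found by the participants using technique Y.
pct_right_y = [83, 73, 87, 78, 74, 74, 87, 75, 86, 82, 77]

print(quartile_outliers(pct_right_y))           # study data as tabulated
print(quartile_outliers(pct_right_y + [40]))    # with a hypothetical bad entry
```

A flagged value should still be traced back to its origin before it is removed, as noted earlier.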


Numerical Identification

• Removing outliers of the percentage of right requirements found by the participants that applied the techniques X and Y

Participant  Time (days)  % right requirements
1            10           83%
2            13           73%
3            12           87%
4            13           78%
5            10           74%
6            14           74%
7            14           87%
8            13           75%
9            14           86%
10           14           82%
11           13           77%

Measure      Value
Minimum      73%
Mean - 1sd   74%
Mean         80%
Mean + 1sd   86%
Maximum      87%


Hypothesis Testing

• Experimental Studies Types

• Hypothesis Testing

• Errors, power and p-value

• Hypothesis Testing Types

• T-test

• Mann-Whitney

• ANOVA, Tukey

• Kruskal-Wallis


Experimental Studies Types

• Hypothesis Testing

– Normal Distribution Data
  • 2 groups: t-test, paired Student's t-test
  • 3+ groups: ANOVA, Tukey

– Non Normal Distribution Data
  • 2 groups: Mann-Whitney (Wilcoxon rank-sum test), Wilcoxon signed-rank test
  • 3+ groups: Kruskal-Wallis

• Relationship Exploration

– Normal Distribution Data: Pearson, Linear Regression

– Non Normal Distribution Data: Spearman, Non-Linear Regression


Hypothesis Testing

• As seen, an experimental study aims to collect data to confirm or refute a hypothesis

• In general, two hypotheses are defined:

– Null Hypothesis (H0): states that the observed differences are coincidental. This is the hypothesis the researcher would most like to reject with high confidence

– Alternative Hypothesis (H1): the hypothesis opposed to the null one, in favor of which H0 is rejected.

• Statistical tests support the decision to reject or not reject the null hypothesis


Hypothesis Testing

• In general, tests in Software Engineering compare the means of different groups of participants applying different treatments

Null Hypothesis: μ(TimeY) = μ(TimeX)

Alternative Hypothesis: μ(TimeY) ≠ μ(TimeX)


Types of Errors

• Hypothesis verification always involves some risk, implying that an analysis error can happen

– Type I (α): happens when the statistical test indicates the existence of a relationship between cause and effect that actually does not exist

– Type II (β): happens when the statistical test does not indicate a relationship between cause and effect that actually does exist

α = P(type-I error) = P(H0 is rejected | H0 is true)

β = P(type-II error) = P(H0 is not rejected | H0 is false)


Types of Errors

• The null hypothesis is usually built to minimize type I errors

– Consider:
  • H0: medicine A = medicine B
  • H1: medicine A is better than medicine B

– Errors:
  • Type I: concluding that medicine A is better than B when this is not true (they are equal)
  • Type II: concluding that medicine A is equal to medicine B when this is not true (A is better)


Power of Testing

• Indicates the probability of rejecting the null hypothesis when it is false, that is, the probability of correctly deciding in favor of the alternative hypothesis

– The probability of a type II error (β) depends on the power of the test

– The power of a test is the probability that the test finds the relationship when the null hypothesis is false

– The statistical test with the highest power should be used to evaluate the hypothesis.

Power = 1 - β

Power = P(H0 is rejected | H0 is false)


Significance Level

• Gives the probability of a type-I error

– Most common significance levels (α): 10%, 5%, 1% and 0.1%

– The p-value is the lowest significance level at which the null hypothesis can be rejected

– We say there is statistical significance when the calculated p-value is lower than the adopted significance level

– For instance, when p=0.0001 one can say that the result is highly significant, because this value is much lower than the commonly used significance levels.

– However, if p=0.048 the result should be interpreted with caution. Although the value is lower than 5%, it is very close to this significance level.


Significance Level

• The decision-making process for a hypothesis test can be based on the probability value (p-value) for the given test:

– If the p-value is less than or equal to a predetermined level of significance (α-level), then you reject the null hypothesis and claim support for the alternative hypothesis

– If the p-value is greater than the α-level, you fail to reject the null hypothesis and cannot claim support for the alternative hypothesis

[Figure: regions of acceptance (non-rejection) of H0 and rejection of H0, relative to the significance level]


Degrees of Freedom

• Degrees of freedom (DF) is the amount of information (values) free to vary in the calculation of a statistic (formula)

• It is the number of independent values used in the estimation of a parameter

• In general, the number of degrees of freedom of an estimate equals the number of values used in the estimate minus the number of parameters estimated in the intermediate calculations needed to obtain it

• So, calculating the mean of a sample of size "n" uses "n" observations, and this statistic has "n" degrees of freedom

• The estimate of the variance from a sample of size "n" has "n - 1" degrees of freedom, because the sample mean must be calculated first to obtain the sample variance


Hypothesis Testing Types

• Hypothesis tests can be parametric or non-parametric

• Parametric Tests

– They use specific formulas, derived from known frequency distributions. Therefore, the data set to be tested must present a distribution that is:

• Normal: symmetric distribution

AND

• Homoscedastic: the variance is constant


Hypothesis Testing Types

• Non-Parametric Tests

– Should be used when the data distribution does not meet the requirements of parametric tests (normality, homoscedasticity)

– They are less powerful than parametric tests, and do not assume any probability distribution in the data

– They use the ranking of the values instead of the values themselves


Normality

• Frequency distribution graphs of the normal curve (in blue) and some hypothetical data (red vertical lines)

Data with distribution similar to normal

Data with non normal distribution


Normality Testing

• Kolmogorov-Smirnov (K-S) testing

– Evaluates whether two samples have similar distributions or whether one sample presents a distribution similar to normal

– Frequently used to check normality in samples with at least 30 values

– Detects differences related to central tendency, dispersion and symmetry, but is very sensitive to long tails (high standard deviation values)


Normality Testing

• Shapiro-Wilk (S-W) testing

– Calculates the W value, indicating whether the sample xi follows the normal distribution

– Frequently used to check normality in samples with fewer than 50 values

– Small W values indicate the distribution is not normal

– Used in small data sets, where extreme values can make K-S hard to use


Homoscedasticity

• A set of variables is homoscedastic if the variables have similar variances

– A classical example of lack of homoscedasticity is the relationship between the type of food consumed and salary:

• As a person's salary increases, the variety of food types the person can consume also increases

• A poor person usually spends a constant amount on food, consuming similar products

• A richer person may sometimes consume simpler products, but can also consume more sophisticated ones

• In this way, the richer a person is, the more types of food he or she can consume


Homoscedasticity

• Observed values for a hypothetical study, showing heteroscedasticity between two groups

[Figure: two bar charts, Group I (y-axis 0% to 90%) and Group II (y-axis 0% to 45%), over the categories Very High, High, Medium, Low and Very Low]


Homoscedasticity Testing

• Levene's Test

– Consider a variable Y, with N values divided into K groups, where Ni is the number of values in group i

– Levene's test rejects the hypothesis that the variances are homogeneous if its p-value is less than the significance level; otherwise, homogeneity of variances is not rejected


Hypothesis Testing Types

Experimental Design                   Parametric Test  Non-Parametric Test
One factor, one treatment             -                Binomial
One factor, two random treatments     T Test           Mann-Whitney
One factor, two paired treatments     Paired T Test    Wilcoxon signed-rank
One factor, more than two treatments  ANOVA            Kruskal-Wallis


T Test or Student-T

• Parametric test used to compare the means of two samples

– It is a family of tests; different variants are applied according to the sample variances (homoscedastic or not)

– Different variants are also applied depending on whether the samples are independent or paired

– Two samples are paired when there is a unique relation between each value in one sample and a value in the other one.
  • Example: a sample before training and a sample after training

– All T tests assume that the data follow a normal distribution
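For two independent samples with similar variances, the test statistic can be computed by hand. The sketch below assumes the pooled-variance (Student) variant and applies it to the times of techniques Y and X from the study table; the function name is illustrative, and normality and homoscedasticity would still have to be checked before trusting the result.

```python
import math
import statistics

def pooled_t_statistic(sample_a, sample_b):
    """Student's t statistic for two independent samples, equal variances assumed."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variances (n - 1)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t, df

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
time_x = [13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]

t, df = pooled_t_statistic(time_y, time_x)
print(f"t = {t:.3f}, degrees of freedom = {df}")
```

The statistic is then compared with the critical value of the t distribution for the chosen significance level (about 2.086 for a two-sided test at 5% with 20 degrees of freedom), or converted into a p-value with a statistics package.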


Mann-Whitney Test

• Represents a non-parametric alternative to the T test

– Requires independent samples, with continuous data in ordinal, interval or ratio scales

– To perform the comparison, the samples are combined and ordered

– The values are converted into ranks within the combined group, and the sum of the ranks of the smaller sample (T) is calculated

– Finally, the test statistic is calculated and compared with a table of critical values
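The ranking steps above can be sketched as follows. This version computes the U statistic from the rank sum, using average ranks for ties; the helper names are illustrative, and the comparison against a table of critical values (or a p-value) is left out.

```python
def average_ranks(values):
    """Ranks from 1 to n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Smaller of the two U statistics of two independent samples."""
    ranks = average_ranks(list(sample_a) + list(sample_b))  # combine, then rank
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
time_x = [13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]

u = mann_whitney_u(time_y, time_x)
print(f"U = {u}")
```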


ANOVA – Variance Analysis

• Statistical technique for testing the equality of the means of two or more groups

– Allows the comparison of means from different treatments, being used as an extension of the T test

– Evaluates whether the variability within the groups is greater than the variability among the groups

– The technique assumes independence, normality and homoscedasticity of the group samples.

• As its aim is to evaluate whether the means are equal regardless of the factor, the ANOVA null hypothesis states that the variation due to the factor is equal to zero.
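The comparison of between-group and within-group variability can be sketched as a one-way F statistic. The function name is illustrative; the example reuses the two groups of the study table only because no three-group data set is given, in which case F equals the square of the pooled t statistic.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability among the groups (explained by the factor).
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variability within the groups (not explained by the factor).
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
time_x = [13, 9, 11, 14, 9, 12, 9, 12, 11, 14, 13]

f, df_between, df_within = one_way_anova_f([time_y, time_x])
print(f"F = {f:.3f} with ({df_between}, {df_within}) degrees of freedom")
```

With three or more groups the same function applies unchanged; a large F suggests that at least one mean differs, and a post-hoc test such as Tukey is then needed to find which.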


Tukey Test

• Test to compare means

– Usually used when ANOVA identifies a difference among the means of multiple samples

• ANOVA shows that the means are different, but does not show which ones are different

– The Tukey test supports the identification of the means that do differ


Kruskal-Wallis Test

• A non-parametric alternative to analysis of variance

– Like most non-parametric tests, this test is based on the ranking of the values rather than on the values themselves.


