Experimental Software Engineering
Prof. Marcos Kalinowski
Introductions
Marcos Kalinowski
• Software Engineering Professor at PUC-Rio
• Member of ISERN
• Main research interests:
– Empirical Software Engineering
– Software Quality Improvement
• Further information:
– www.inf.puc-rio.br/~kalinowski
• Who are you?
– Background, interests, ...
Marcos Kalinowski 2Experimental Software Engineering
• Discipline topics:
– Experimental Software Engineering: Overview and Research Opportunities
– Empirical Strategies
– Measurement Concepts
– Systematic Literature Reviews and Mapping Studies
– Surveys
– Case Studies
– Controlled Experiments
• Experiment Process: Scoping, Planning, Operation, Analysis and Interpretation, Presentation and Package
– Design Science Research
– Qualitative Methods
– Theory Building
• Assessment
– Evaluation 1 = Topic presentation and participation in classroom discussions
– Evaluation 2 = Secondary study plan
– Evaluation 3 = Primary study plan
– Evaluation 4 = Paper with 8 to 16 pages in Springer LNCS format
Grade = (Evaluation 1 + Evaluation 2 + Evaluation 3 + (2 × Evaluation 4)) / 5
Success – (Presence >= 75%) AND (Grade >= 6)
Fail – Otherwise
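The grading rule can be sketched in a few lines of Python. This is only an illustration: the function names `final_grade` and `passes` are hypothetical, grades are assumed to be on a 0 to 10 scale, and attendance is assumed to be given as a fraction between 0 and 1.

```python
def final_grade(e1: float, e2: float, e3: float, e4: float) -> float:
    """Weighted mean of the four evaluations; Evaluation 4 (the paper) counts twice."""
    return (e1 + e2 + e3 + 2 * e4) / 5


def passes(attendance: float, grade: float) -> bool:
    """Success requires attendance of at least 75% AND a grade of at least 6."""
    return attendance >= 0.75 and grade >= 6
```

Because the paper has weight 2 out of 5, it alone accounts for 40% of the final grade.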
• Textbook
– Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
• Additional references
– Kitchenham, B.A., Budgen, D., Brereton, P., Evidence-Based Software Engineering and Systematic Reviews, Chapman and Hall/CRC, 2015.
– Kitchenham, B.A., Charters, S., Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007–001, Keele University and Durham University Joint Report, 2007.
– Runeson, P., Höst, M., Rainer, A.W., Regnell, B., Case Study Research in Software Engineering – Guidelines and Examples. Wiley, 2012.
– Wieringa, R., Design Science Methodology for Information Systems and Software Engineering. Springer, 2014.
– Scientific Papers
• Important Dates
– 30/04 – Deadline for delivering the secondary study plan
– 11/06 – Deadline for delivering the primary study plan
– 02/07 – Deadline for delivering the paper
• Others
– 23/04 – Holiday
INTRODUCTION
Introduction
• The story of the Denver International Airport ...
DEMARCO, T.; LISTER, T. (2003) Waltzing with Bears – Managing Risk on Software Projects. Dorset House. (ISBN: 978-0932633606).
“Software Engineering discipline remains years – perhaps decades – short of the mature engineering discipline needed to meet the demands of an information age society”.
Silver Bullets in Software Engineering?
Introduction
• Software development depends on different technologies
– Usually there is no evidence available concerning:
• Benefits
• Limitations
• Risks
Introduction
• During projects, software engineers need to answer questions like:
– Which software technology should I consider for my project?
– How much training/investment is needed to introduce the technology into my process?
– When and how can I observe the return on investment?
– Under which circumstances does the technology perform best?
• We need knowledge about our software technologies (methods, techniques and tools) to understand the situations in which they really work, their limits and how we can evolve them. (Basili, 1996)
BASILI, V. R. (1996) The role of experimentation in software engineering: past, current, and future. IEEE International Conference on Software Engineering (ICSE), pp. 442-449.
Introduction
Obtaining Knowledge
• Building theories, models, experimentation and learning
– Understanding a discipline involves building theories and models
– To verify if our understanding is correct, we need to:
• Conduct experiments on our theories and models
Experimentation is fundamental to both academia and industry!
Software Engineering
• Software Engineering involves development and is not manufacturing
– Involves reasoning and human elements (e.g., developers)
– There are several variables that can lead to differences in measurements
• Current Scenario:
– Limited number of theories and models
– Lack of knowledge on the limits of existing technologies for certain development contexts
Experimental Software Engineering
• Experimental Studies
– Discovering something or testing hypotheses
– May involve different types of analysis: quantitative and/or qualitative
• Studies may be:
– Primary
– Secondary (aggregate results of primary studies)
Measuring Variables
Understanding causes and effects of collected data
Classification of Experiments
• In Vivo: no model needed
• In Vitro: environment needs to be modelled
• In Virtuo: computational models of the object and the environment
• In Silico: computational models of the participant behaviour, object and environment
TRAVASSOS, G. H.; BARROS, M. O. (2003) Contributions of In Virtuo and In Silico Experiments for the Future of Empirical Studies in Software Engineering. In: 2nd Workshop on Empirical Software Engineering: The Future of Empirical Studies in Software Engineering, 2003, Rome.
Required Reading
• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
– Chapter 1 – Introduction
– Chapter 2 – Empirical Strategies
PRIMARY STUDIES
Primary Study Types
• Controlled Experiment
– An experiment that allows controlling and manipulating variables.
• Case Study
– Investigates a phenomenon in a real context. Typically conducted during software development or maintenance projects. Part of the behavior cannot be manipulated.
• Survey
– Conducted after an event has occurred, aiming to identify evidence about it.
– Does not allow control.
• Action Research
– Research method that combines theory (research) and practice (action), bringing together researchers and practitioners to solve a problem.
Controlled Experiment
Characteristics
– Investigate testable hypotheses
– Independent variables are manipulated to measure their effects on dependent variables
Examples:
– Which technique is more effective for software inspection: checklist-based reading or perspective-based reading?
Controlled Experiment
[Figure: experiment principles. At the theory level, a cause leads to an effect. At the observation level, a treatment (independent variable) leads to a result (dependent variable) through the experiment operation.]
WOHLIN, C., RUNESON, P., HÖST, M., OHLSSON, M., REGNELL, B., WESSLÉN, A. (2012) Experimentation in Software Engineering. Springer.
Controlled Experiment
Threats to Validity
• Results of experiments should be reported considering their validity
– Internal
– External
– Construct
– Conclusion
BIFFL, S.; KALINOWSKI, M.; EKAPUTRA, F.; ANDERLIN-NETO, A.; CONTE, T.; WINKLER, D. (2014) Towards a semantic knowledge base on threats to validity and control actions in controlled experiments. In: 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Torino, Italy.
Experiment Process
Scoping
Planning
Operation
Analysis
• Scoping
– Identification of the study goals
– Identification of the objects and groups
• Planning
– Formulation of hypotheses
– Identification of dependent variables (response variables)
– Identification of independent variables (factors)
– Selection of subjects
– Experiment design
– Selection of the analysis methods
– Instrumentation
– Validity evaluation (threats to validity)
• Operation
– Training and preparation
– Execution of the study by the participants
• Analysis
– Descriptive statistics
– Graphical visualization
– Elimination of outliers
– Analysis of the distribution
– Statistical hypothesis testing
• Packaging
– Presentation of the results
– Preparation of the package to repeat the study
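Three of the analysis steps (descriptive statistics, elimination of outliers, statistical hypothesis testing) can be sketched with the Python standard library only. This is a minimal sketch under stated assumptions: the function names are hypothetical, outliers are identified with Tukey's fences (one common convention, not one prescribed by the slides), and the hypothesis test is a permutation test on the difference of means, chosen here because it needs no external package.

```python
import random
import statistics


def describe(sample):
    """Descriptive statistics for one treatment group."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
    }


def remove_outliers(sample, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if low <= x <= high]


def permutation_test(a, b, rounds=10_000, seed=42):
    """Two-sided permutation test on the difference of means.

    Returns the estimated p-value under the null hypothesis that the
    treatment labels are exchangeable (the treatment has no effect).
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    return hits / rounds
```

A typical use would be to call `remove_outliers` on each group, inspect `describe` for both, and then reject the null hypothesis if `permutation_test` returns a value below the significance level chosen during planning.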
Case Study
• Definition:
“A method that investigates a phenomenon within its real context, especially when the boundaries and/or the context of the phenomenon are not well defined”
• Mainly used when controlled experiments are not possible, because:
– The context is important and difficult to separate from the problem or to simulate
– Several effects are expected and observing them might require a longer period of time
RUNESON, P., HOST, M., RAINER, A., REGNELL, B. (2012) Case Study Research in Software Engineering: Guidelines and Examples. John Wiley & Sons.
Types of Case Studies
– Exploratory
• Used in initial investigations of phenomena
• Aim at deriving new ideas and hypotheses (formulating theories)
– Descriptive
• Describe a situation or phenomenon
– Explanatory
• Search for an explanation for a situation or problem
• Mainly, but not necessarily, in the form of a causal relationship
– Confirmatory
• Used to test/refute theories
Survey
• Retrospective (descriptive, explanatory, or exploratory) aiming at identifying characteristics and/or opinions of a large population
• Representative sample selection for a certain population plays a key role in survey research
– Data analysis techniques are used to generalize from the sample to the population
Action-Research
• Characteristics:
– Researcher intervenes in the study object with the purpose of improving it
• Goals:
– Promote improvements, and
– Contribute to scientific knowledge
SANTOS, P.S.M.; TRAVASSOS, G.H.; ZELKOWITZ, M.V. (2011) Action research can swing the balance in experimental software engineering, Advances in Computers, vol. 83, 205-276.
Comparison of the Primary Studies
WOHLIN, C., RUNESON, P., HÖST, M., OHLSSON, M., REGNELL, B., WESSLÉN, A. (2012) Experimentation in Software Engineering. Springer.
Exercises
WOHLIN, C., RUNESON, P., HÖST, M., OHLSSON, M., REGNELL, B., WESSLÉN, A. (2012) Experimentation in Software Engineering. Springer.
• What is the difference between qualitative and quantitative research?
• What is a survey? Give examples of different types of surveys in software engineering.
• What role do replication and systematic literature reviews play in building empirical knowledge?
• How can the Experience Factory be combined with the Goal/Question/Metric method and empirical studies in a technology transfer context?
• Which are the key ethical principles to observe when conducting experiments?
Required Reading
• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
– Chapter 3 – Measurement
• Optional reading:
– Basili, V., Caldiera, G., Rombach, D. Goal Question Metric Paradigm, Encyclopaedia of Software Engineering (Marciniak J. editor), vol. 1, John Wiley & Sons, 1994, p. 528-532.
– Basili, V., Trendowicz, A., Kowalczyk, M., Heidrich, J., Seaman, C., Münch, J., Rombach, D. Aligning Organizations through Measurement - The GQM+Strategies Approach. Springer-Verlag, 2014.
– Fenton, N.E.; Bieman, J.; Software Metrics: A Rigorous and Practical Approach; 3rd edition, Kindle edition; Boca Raton, FL: CRC Press Taylor & Francis Group; 2015 ISBN 978-1-4398-3823-5
MEASUREMENT
Agenda
• Basic Concepts
– Scale Types
– Objective and Subjective Measures
– Direct or Indirect Measures
• Measurement in Software Engineering
• Measurement in Practice
• Exercises
Basic Concepts
• “You can't control what you can't measure” (Tom DeMarco)
• Measure vs. Measurement vs. Metric
• Measurement activities need clear goals
Measurement Goals
• Measurement activities need clear goals
– GQM: characterize, understand, evaluate, predict, improve?
• Goal/Question/Metric (GQM) (Basili and Rombach)
Scale Types
• Nominal
– Least powerful scale, based on nominal classification
– Example: Defect Types
• Ordinal
– Ranks entities according to an ordering criterion
– Example: Software complexity levels, Likert scales
• Interval
– Used when the distance between two measures is meaningful, but not the value itself
– Example: Temperatures measured in Celsius or Fahrenheit
• Ratio
– If there exists a meaningful zero value and the ratio between two measures is meaningful, a ratio scale can be used
– Example: Effort invested in a development activity
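The scale type limits which statistics are meaningful. The sketch below illustrates this for measures of central tendency, following the usual convention (mode for nominal, median for ordinal, mean for interval and ratio). The names `Scale` and `central_tendency` are hypothetical, and ordinal values are assumed to be encoded as integer ranks so that sorting reflects the ordering criterion.

```python
import statistics
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


def central_tendency(values, scale):
    """Pick a measure of central tendency that is meaningful for the scale."""
    if scale is Scale.NOMINAL:
        # Only classification is meaningful: report the most frequent class.
        return statistics.mode(values)
    if scale is Scale.ORDINAL:
        # Ordering is meaningful, distances are not: report an observed middle value.
        return statistics.median_low(values)
    # Distances are meaningful on interval and ratio scales: the mean is allowed.
    return statistics.mean(values)


def ratio_is_meaningful(scale):
    """Statements such as 'twice as much' require a meaningful zero."""
    return scale is Scale.RATIO
```

For instance, effort of 20 hours is twice an effort of 10 hours (ratio scale), whereas 20 °C is not "twice as warm" as 10 °C (interval scale).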
Objective and Subjective Measures
• Objective Measures
– There is no judgement in the measurement value, which therefore depends only on the object being measured
– Can be measured several times and will always provide the same value, within the measurement error
– Example: Lines of Code
• Subjective Measures
– The person making the measurement contributes by making some sort of judgement
– Mostly of nominal or ordinal scale types
– Example: Usability
Direct or Indirect Measures
• Direct Measures
– Gathered directly
– Example: Lines of Code
• Indirect Measures
– Involve the measurement of other attributes
– Example: Defects/LOC, LOC/Hour
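The two indirect measures given as examples are simply computed from direct ones. A minimal sketch, with hypothetical function names:

```python
def defect_density(defects: int, loc: int) -> float:
    """Indirect measure Defects/LOC, derived from two direct measures."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return defects / loc


def productivity(loc: int, hours: float) -> float:
    """Indirect measure LOC/Hour, derived from size and effort."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return loc / hours
```

Any measurement error in the direct measures (for example, an inconsistent LOC counting rule) propagates into the indirect ones.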
Measurement in Software Engineering
• Objects of Interest:
– Process
• Activities
– Product
• Artefacts
– Resources
• Human, Hardware and Software
Measurement in Software Engineering
• Internal Attributes
– Obtained directly from the process, product or resource
– Example: Size of a software product
• External Attributes
– Can only be measured with respect to how the object relates to other entities of its environment
– Example: Software reliability
Measurement in Practice
• Measurement Approaches
– In software development processes
• Metrics are defined by the SEPG (Software Engineering Process Group) and are then collected for each software development project
• Goal Question Metric Paradigm (GQM).
• Practical Software Measurement (PSM).
– In experimental studies
• Metrics are defined by the researcher and then collected during the study operation phase.
• Goal Question Metric Paradigm (GQM).
GQM
• Defines a way to plan and execute measurement and analysis activities:
– Starts with the declaration of the measurement Goals;
– From the goals, the Questions that we would like to answer with the data interpretation are defined;
– Finally, from the questions, the Metrics and the data to be collected are defined.
• Example of a real GQM-based Measurement Plan
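The top-down structure of a GQM plan (goal, then questions, then metrics) can be sketched as a small data model. The class names and the `metrics_to_collect` helper are hypothetical and serve only to illustrate the refinement; they do not reproduce the real measurement plan mentioned on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    name: str
    unit: str


@dataclass
class Question:
    text: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Goal:
    statement: str
    questions: List[Question] = field(default_factory=list)

    def metrics_to_collect(self) -> List[str]:
        """Distinct metric names needed to answer all questions, in first-seen order."""
        seen = {}
        for question in self.questions:
            for metric in question.metrics:
                seen.setdefault(metric.name, metric)
        return list(seen)
```

Deriving the metric list from the questions, rather than writing it by hand, keeps every collected metric traceable to a question and therefore to the goal.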
Examples of Experimental Study Goals
• GQM Template:
“Analyze <object of study> with the purpose of <goal> with respect to <quality focus> from the point of view of the <perspective> in the context of <context>”.
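Filling in the template is a plain string substitution. The sketch below uses hypothetical names (`GQM_TEMPLATE`, `gqm_goal`); the five parameters correspond to the five placeholders of the template.

```python
GQM_TEMPLATE = (
    "Analyze {object_of_study} with the purpose of {goal} "
    "with respect to {quality_focus} "
    "from the point of view of the {perspective} "
    "in the context of {context}."
)


def gqm_goal(object_of_study, goal, quality_focus, perspective, context):
    """Instantiate the GQM goal template with the five study-specific parts."""
    return GQM_TEMPLATE.format(
        object_of_study=object_of_study,
        goal=goal,
        quality_focus=quality_focus,
        perspective=perspective,
        context=context,
    )
```

For example, calling `gqm_goal` with "the documentation debt related to the use of AR (user stories)", "characterizing", "the impacts that it can cause on the project in terms of extra effort and cost", "project manager" and "an industrial software development project" yields a goal statement equivalent to the one shown in the examples that follow.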
Examples of Experimental Study Goals
CARNEIRO, G.; LAIGNER, R.; KALINOWSKI, M.; WINKLER, D.; AND BIFFL, S. Investigating the influence of
inspector learning styles on design inspections: Findings of a quasi-experiment. In CIbSE 2017 - XX Ibero-American
Conference on Software Engineering, pages 222-235, 2017.
Analyze the documentation debt related to the use of agile requirements (user stories)
for the purpose of characterizing
with respect to the impacts that it can cause on the project in terms of extra effort and cost
from the viewpoint of the project manager
in the context of an industrial software development project.
MENDES, T. S.; DE FREITAS FARIAS, M. A.; MENDONÇA, M. G.; SOARES, H. F.; KALINOWSKI, M.; AND
SPÍNOLA, R. O. Impacts of agile requirements documentation debt on software projects: a retrospective study. In
Proceedings ACM Symposium on Applied Computing, Pisa, Italy, April 4-8, 2016, pages 1290-1295, 2016.
Examples of Experimental Study Goals
ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C.
Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In:
Simpósio Brasileiro de Engenharia de Software (SBES), Belo Horizonte, Brazil, 2015.
Exercises
• What are measure, measurement and metric, and how do they relate?
• Which are the four main measurement scale types?
• What is the difference between a direct and an indirect measure?
• Which three classes are measurements in software engineering divided into?
• What are internal and external attributes and how are they mostly related to direct and indirect measures?
Required Reading
• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
– Chapter 4 – Systematic Literature Reviews
• Kitchenham, B., Charters, S. Guidelines for performing systematic literature reviews in software engineering. Technical Report, Keele University and University of Durham, 2007.
• Petersen, K., Vakkalanka, S., Kuzniarz, L., Guidelines for conducting systematic mapping studies in software engineering: An update. Information & Software Technology 64: 1-18, 2015.
• Optional Reading (examples):
– E. Mendes, M. Kalinowski, D. Martins, F. Ferrucci and F. Sarro, Cross- vs. Within-Company Cost Estimation Studies Revisited: An Extended Systematic Review, In: Proc. International Conference on Evaluation and Assessment in Software Engineering (EASE), London, UK, 2014.
– Alves N. S. R., Mendes, T. S., Mendonca, M. G., Spínola R.O., Shull, F., Seaman, C.B.: Identification and management of technical debt: A systematic mapping study. Information & Software Technology 70: 100-121 (2016).
SECONDARY STUDIES
Knowledge Acquisition in Software Engineering Studies
The experimentation process has a recursive nature
Knowledge acquired in primary studies feeds secondary studies, which enable identifying the need for new primary studies...
TRAVASSOS, G. H.; SANTOS, P. S. M.; MIAN, P.; DIAS NETO, A. C.; BIOLCHINI, J. (2008). An Environment to Support Large Scale Experimentation in Software Engineering. In: Proc. of XIII IEEE International Conference on Engineering of Complex Computer Systems, Belfast.
Secondary Studies
• Secondary studies review primary studies concerning a specific research question with the goal of providing a research synthesis of the existing evidence.
– Aim at identifying, evaluating and interpreting all relevant results on a given research topic.
– Examples: systematic reviews.
Systematic Literature Reviews (SLRs)
• Literature review that aims at being:
– ...fair (not biased)
– ...rigorous (defined process)
– ...open (transparent)
– ...objective (reproducible)
• Used in many research areas
– Social sciences, health and education
– Very common in medicine
KITCHENHAM, B.; CHARTERS, S. (2007) Guidelines for performing Systematic Literature Review in Software
Engineering. Keele University Technical Report - EBSE-2007-01.
Reasons for Conducting Reviews
• Academia:
– Experimental characterization of different technologies.
– Repetition of studies in different contexts to acquire knowledge incrementally.
• Industry:
– Experimental results may indicate the impact of using technologies in different contexts.
– Decision support.
Advantages of Conducting SLRs
| Characteristic | Traditional Review | Systematic Review |
| Question | Usually broadly scoped | Focused on research questions |
| Identification of research | Not specified, potentially biased | Several sources and well-defined search strategy |
| Selection | Not specified, potentially biased | Selection based on explicit criteria |
| Evaluation | Variable | Rigorous assessment |
| Synthesis | Frequently a qualitative summary | Qualitative and quantitative |
| Inferences | Sometimes based on evidence | Usually based on evidence |
SLR
[Figure: primary studies (surveys, case studies, experiments) pass through a first filter and then a second filter, from which the data are extracted.]
Systematic Mapping Study (SMS)
• Secondary study approach
• Rigorous review that uses a formal process to:
– Identify all relevant research on a specific topic
– Identify and categorize existing studies
• Provides only an overview of the research topic
• There is no comparison of results of methods or techniques
PETERSEN, K., FELDT, R., MUJTABA, S., MATTSSON, M. (2008) Systematic Mapping Studies in Software Engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering.
Discussion of the Papers: Best Practices and Examples
Required Reading
• Kuhrmann, M., Fernández, D.M. and Daneva, M., 2017. On the pragmatic design of literature studies in software engineering: an experience-based guideline. Empirical software engineering, 22(6), pp.2852-2891.
• Cruzes, D.S. and Dybå, T., 2011. Research synthesis in software engineering: A tertiary study. Information and Software Technology, 53(5), pp.440-455.
EXPERIMENT PROCESS, SCOPING AND PLANNING
Required Reading
• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
– Chapters 6 (Experiment Process), 7 (Scoping), and 8 (Planning).
• Optional Reading (examples of experiments):
– ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte.
– RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.
Agenda
• Experimentation Process
• Experiment Scoping
• Experiment Planning
– Context Selection
– Hypotheses Formulation
– Variable Selection
– Participant Selection
– Experiment Design
– Instrumentation
– Threats to Validity
Experimentation Process
Scoping
Planning
Execution
Analysis
• Scoping
– Identification of the study goals
– Identification of the objects and groups
• Planning
– Formulation of hypotheses
– Identification of dependent variables (response variables)
– Identification of independent variables (factors)
– Selection of subjects
– Experiment design
– Selection of the analysis methods
– Instrumentation
– Validity evaluation (threats to validity)
• Operation
– Training and preparation
– Execution of the study by the participants
• Analysis
– Descriptive statistics
– Graphical visualization
– Elimination of outliers
– Analysis of the distribution
– Statistical hypothesis testing
• Packaging
– Presentation of the results
– Preparation of the package to repeat the study
Experiment Scoping
• Identify the Goal and the Context of the Study
GQM template:
“Analyze <Object(s) of study> for the purpose of <Purpose> with respect to their <Quality focus> from the point of view of the <Perspective> in the context of <Context>”.
• Identify the objects and study groups (control and experimental group)
Experiment Planning
Context Selection
• Four dimensions:
– Off-line vs on-line;
– Students vs professionals;
– Toy vs real problems;
– Specific vs general.
Hypothesis Formulation
• Null Hypothesis;
• Alternative Hypotheses.
Variable Selection
• Dependent Variables (Response Variables);
• Independent Variables (including Factors).
Participant Selection
• Sample selection.
– Selecting subjects at random is not always possible
Experiment Design
• Principles:
– Randomization;
– Blocking;
– Balancing;
• Design Types:
– Number of factors;
– Number of treatments.
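The three design principles can be combined in one assignment procedure: subjects are grouped into blocks, shuffled within each block (randomization), and then dealt out to the treatments in turn so that group sizes within a block differ by at most one (balancing). The function name `assign_treatments` and its parameters are hypothetical; this is a sketch of one possible procedure, not the only valid one.

```python
import random
from collections import defaultdict


def assign_treatments(subjects, treatments, block_of=None, seed=None):
    """Randomly assign subjects to treatments, balanced within each block.

    subjects:   list of subject identifiers
    treatments: list of treatment names
    block_of:   optional function mapping a subject to its block
                (for example, an experience level); None means a single block
    seed:       optional seed, so that the assignment can be reproduced
    """
    rng = random.Random(seed)

    # Blocking: group subjects that share a nuisance characteristic.
    blocks = defaultdict(list)
    for subject in subjects:
        key = block_of(subject) if block_of else None
        blocks[key].append(subject)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # Randomization within the block.
        for index, subject in enumerate(members):
            # Balancing: deal subjects out to the treatments in turn.
            assignment[subject] = treatments[index % len(treatments)]
    return assignment
```

Recording the seed in the experiment package makes the assignment reproducible when the study is replicated or audited.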
Instrumentation
• Instruments should be completely developed before conducting the experiment and ideally evaluated through a pilot study.
• Examples: Agreement to participate, subject characterization form, study objects, task description, measurement instruments, follow-up questionnaire.
Threats to Validity
• Conclusion Validity;
• Internal Validity;
• Construct Validity;
• External Validity.
Exercises
WOHLIN, C., RUNESON, P., HÖST, M., OHLSSON, M., REGNELL, B., WESSLÉN, A. (2012) Experimentation in Software Engineering. Springer.
• What are a null hypothesis and an alternative hypothesis?
• What are type-I and type-II errors, respectively? Which is worse, and why?
• In which different ways may subjects be sampled?
• What different types of experiment designs are available, and how does the design relate to the statistical methods to apply in the analysis?
• What are the types of threats to validity? Provide one example threat for each type.
EXPERIMENT DESIGN: ADVANCED CONCEPTS
Required Reading
• ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte, Brazil.
• RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.
ADVANCED CONCEPTS
(Let's plan and manage more complex studies)
Based on material kindly provided by Prof. Guilherme Horta Travassos
Principles of Experimental Designs
• Simple designs help to make the experiment practical
– minimizing use of time, money, personnel and experimental resources
– easier to analyze
• Maximizing information yields more complete understanding
– allows generalization to the widest possible situations
• Consider several issues to simplify and maximize:
– experimental error
– replication
– randomization
– local control
Factors and Experimental Design
• A factor is an independent variable in the design.
– Example: To determine the effects of experience and language on productivity, the design may have two independent variables (factors): experience and language. The dependent variable is productivity.
• Values or classifications for each factor are called levels of the factor.
• Levels can be continuous or discrete, quantitative or qualitative.
– Example: Number of years of experience
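When every level of one factor is combined with every level of the others (a full factorial arrangement, one common option), the treatment combinations can be enumerated mechanically. In this sketch the function name `full_factorial` is hypothetical, and the levels used in the usage note are illustrative assumptions.

```python
from itertools import product


def full_factorial(factors):
    """Enumerate all treatment combinations.

    factors: dict mapping each factor name to the list of its levels.
    Returns one dict per combination, mapping factor name to chosen level.
    """
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combination)) for combination in product(*level_lists)]
```

With the factors experience (levels "novice" and "expert") and language (levels "Java", "Python" and "C"), this yields 2 × 3 = 6 combinations, which shows how quickly the number of required subjects grows as factors and levels are added.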
Experimental Error
• Experimental error describes the failure of two identically treated experimental units to yield identical results– reflects errors of experimentation
– reflects errors of observation
– reflects errors of measurement
– reflects the variation in experimental resources
– reflects the combined effects of confounding factors that can influence the characteristics under study but which have not been singled out for attention in the investigation
• Example: Error may be due to
– mind wandering
– timer measured elapsed time inexactly
– distractions: loud noises in next room
– …
How to Control Error
• Control as many variables as possible
• Minimize variability among participants
• Minimize effects of irrelevant variables
• Try to use design to distribute effects of irrelevant variables equally across all experimental conditions
• Techniques for controlling error in the design
– Replication
– Randomization
– Local control
Replication
• Represents the repetition of the basic experiment
• It means repeating an experiment under identical conditions, rather than repeating measurements on the same unit
• It provides an estimate of experimental error that acts as a basis for assessing the importance of observed differences in an independent variable (that is, how much confidence we can have in the results)
• It enables us to estimate the mean effect of any experimental factor
Confounding
• Two or more variables are confounded if it is impossible to separate their effects when the subsequent analysis is performed.
– Example: you are comparing the use of a new tool with your existing tool. Programmer A uses the new tool in your development environment, while B uses the existing tool. If you compare measures of quality in the resulting code, the difference is due to the tools only if you have accounted for differences in skill of the programmers. Otherwise, the effects of tools and programmer skill are confounded.
• Confounding is introduced when there is no control for other variables.
• Sequence can also confound (learning effect): Test team uses technique X to test, then technique Y.
Randomization
• Replication allows us to know the statistical significance of the results, but not the validity. That is, we want to be sure that the results followed from the treatments applied. For this, we distribute the observations independently.
• Randomization is the random assignment of subjects to groups or of treatments to experimental units, so that we can assume independence and thus validity of results.
• Randomization does not guarantee independence but keeps variation of bias to a minimum.
Local Control
• Reflects how much control you have over the placement of subjects in experimental units and the organization of those units.
• Makes the design more efficient by reducing the magnitude of experimental error.
• Two aspects of local control:
– blocking
– balancing the units
Blocking
• Allocates experimental units to blocks or groups so that the units within a block are relatively homogeneous
• Predictable variation among units is confounded with the effects of the blocks
• Example: investigating the effects of three design techniques on code quality.
– Teach techniques to 12 developers, measure number of defects per thousand lines of code
– If the 12 graduated from 3 universities, training at each university may affect the way the design technique is understood or used
– To eliminate the effects of this, define three blocks: first has all developers from university X, second from university Y, third from university Z
– Then assign treatments randomly to the developers within each block
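The blocking example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the developer names, the technique labels T1 to T3 and the split of 6, 3 and 3 developers per university are hypothetical (the slide does not say how many developers come from each university), and `blocked_assignment` is a helper written for this sketch.

```python
import random
from collections import Counter, defaultdict

def blocked_assignment(units, block_of, treatments, seed=None):
    """Randomly assign treatments to units within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # randomization inside the block
        # Cycling over the treatments keeps the block balanced
        # whenever its size is a multiple of the number of treatments.
        for position, unit in enumerate(shuffled):
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Hypothetical data: 12 developers, 6 from university X, 3 from Y, 3 from Z.
developers = [f"dev{i:02d}" for i in range(1, 13)]
university = {}
for i, dev in enumerate(developers):
    university[dev] = "X" if i < 6 else ("Y" if i < 9 else "Z")

techniques = ["T1", "T2", "T3"]  # the three design techniques (labels assumed)
assignment = blocked_assignment(developers, university, techniques, seed=42)
per_treatment = Counter(assignment.values())
print(per_treatment)
```

Because every block size here is a multiple of three, each technique is used equally often inside each block and four times overall, which also satisfies the balancing goal.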
Balancing
• blocking and assigning treatments so that an equal number of subjects is assigned to each treatment, whenever possible
• simplifies statistical analysis
• designs can range from being completely balanced to little or no balance
• If a design has no blocks, it must be completely randomized.
Types of Experimental Designs
• Type of design can constrain the analysis.
– For example, the way to perform an analysis of variance depends on number of variables and the way in which subjects are grouped and balanced.
• Measurement scale can constrain the analysis.
– Nominal scales divide data into categories, while ordinal scales permit rank ordering and more powerful tests. Parametric tests such as analysis of variance require at least interval scale.
• Sampling can constrain the analysis.
– Degree of randomization
– Distribution of data
• Normal or near-normal and homoscedastic distributions can use parametric tests; otherwise, non-parametric tests are preferable.
EXPERIMENT ANALYSIS AND INTERPRETATION
Required Reading
• Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., Wesslén, A., Experimentation in Software Engineering, Springer, 2012.
– Chapters 9 (Operation) and 10 (Analysis and Interpretation)
• ESTÁCIO, B., OLIVEIRA, R., MARCZAK, S., KALINOWSKI, M., GARCIA, A., PRIKLADNICKI, R., LUCENA, C. Evaluating Collaborative Practices in Acquiring Programming Skills: Findings of a Controlled Experiment. In: Simpósio Brasileiro de Engenharia de Software (SBES), 2015, Belo Horizonte, Brazil.
• RIVERO, L., KALINOWSKI, M., CONTE, T. Practical Findings from Applying Innovative Design Usability Evaluation Technologies for Mockups of Web Applications. In: 47th Hawaii International Conference on System Sciences (HICSS), 2014.
ADVANCED CONCEPTS (Let's talk about statistics and data analysis)
Based on material kindly provided by Prof. Guilherme Horta Travassos
Experimentation Process
Definition → Planning → Execution → Analysis
Statistical Inference Techniques
Hypotheses, Variables and Scales
• Planning and Hypotheses
• Hypotheses
• Choosing the variables
• Scales
• Scales’ information level
• Scales and basic operations
Planning and Hypotheses
• Planning
– Hypotheses Formulation
– Dependent Variables Identification (responses)
– Independent Variables Identification (factors)
– Participants Selection
– Study Design
– Selection of Analysis Methods
– Instruments Definition
– Threats to validity (experiment risks)
Hypothesis
• A hypothesis is a theory or supposition that can explain a given behavior of research interest
• An experimental study aims to collect data, in a controlled environment, to support confirming or refuting the hypothesis
“Developers using the technique Y can conclude the task of requirements analysis in less time and produce a more complete requirements set than when using the technique X”
Hypotheses and Variables
• Hypotheses guide the definition of variables
• Independent Variables (become factors when controlled)
– Relate to process inputs. Can be controlled.
– Represent the causes that are expected to affect the results. When controlled, their values are called treatments.
• Dependent Variables
– Relate to process outputs and are affected throughout the experimentation process.
– Represent the effect of the combination of the independent variables' values (including the factors). Their possible values are called results.
Hypotheses and Variables
“Developers using the technique Y can conclude the task of requirements analysis in less time and produce a more complete requirements set than when using the technique X”
Independent Variables
– Technique used (treatments: Y and X)
– Developers Characterization
– Application Characterization
Dependent Variables
– Time to execute the task
– % of right requirements defined
Variables and their values
• Study variables can be:
– Qualitative: the values (treatments) represent types
– Quantitative: the values represent levels at which the variable is applied
• The values of the variables are collected in scales:
– There are different scales that can be used to collect and represent these values: nominal, ordinal, interval and ratio.
– The scales specify the operations that can be applied to the variables values
Nominal Scale
• Nominal scale values represent different types of an element, without numerical interpretation nor ordering among them.
• Examples in software include:
– Names of different measures of software size (lines of code, function points, use case points, ...)
– Names of different programming languages (Java, C++, C#, Pascal, ...)
• The scale does not allow us to say, for instance, that lines of code is greater than function points nor that Java is less than C#
Ordinal Scale
• Ordinal scale values represent different element types that can be ordered, with no numerical interpretation
• Examples in software include:
– Different CMMI levels (1, ..., 5) or MPS.BR levels (G, ..., A)
• The scale allows us to say, for instance, that CMMI 2 is less than CMMI 3, but does not allow us to say that the quality difference between companies at CMMI 2 and CMMI 3 is the same as between CMMI 3 and CMMI 4.
Interval Scale
• Interval scale values can be ordered and the distance between consecutive values is always the same; however, the ratio between values has no meaning.
• For instance: although we can say that 2011 is one year after 2010 and one year before 2012, there is no meaning in calculating the ratio between 2011 and 2012.
• The ratio has no meaning because every interval scale has an arbitrary zero point (in the case of dates, the year 0)
Interval Scale
• Likert scales are an example of a scale treated as interval that is widely used in software-related studies
– Using a Likert scale we can define different names to represent, in general, the intensity of a property that cannot be directly measured.
– For instance, we can build a Likert scale to evaluate risk impact using the following values: very high, high, medium, low and very low.
– Although it is impossible to verify the distance between values in the real world, it is assumed that consecutive values are equally spaced.
Ratio Scale
• Ratio scale values can be ordered, the distance between consecutive values has the same meaning and the ratio between values can be interpreted.
• Examples in software include software size, effort and project execution time.
• The ratio scale allows us to say, for instance, that software with X lines of code is half the size of software with 2X lines of code
• In a ratio scale, 0 (zero) means absence of the measured attribute.
Scales Information
Nominal: values can be counted
Ordinal: values can be counted and ordered
Interval: values can be counted and ordered; distance between values can be interpreted
Ratio: values can be counted and ordered; distance between values can be interpreted; ratio between values can be interpreted
More information from Nominal to Ratio.
Scales and Characteristics
• According to the variable scales, we can explore different characteristics of their values
Scale                          Nominal  Ordinal  Interval  Ratio
Values Counting                X        X        X         X
Values Ordering                -        X        X         X
Equidistant Intervals          -        -        X         X
Adding and Subtracting values  -        -        -         X
Values Division                -        -        -         X
Example
“Developers using the technique Y can conclude the task of requirements analysis in less time and produce a more complete requirements set than when using the technique X”
Independent Variables
– Technique used (treatments: Y and X): Nominal Scale with 2 treatments
– Developers and Application Characterization: Nominal or Ordinal Scale
Dependent Variables
– Time to execute the task: Ratio Scale
– % of right requirements defined: Ratio Scale
Tabulation and Graphics
• Variables and execution
• Tabulation
• Graphical Analysis
• Histograms
• Pie Charts
• Dispersion Charts
• Control Charts
Variables and Execution
• The execution of an experimental study consists of a series of trials
– In each trial, a participant applies one treatment from the independent variables set and produces results for each dependent variable
– These results are collected in tuples of type Ai = {Ti, Ri}, where Ti is the ordered set of treatments (one per independent variable) applied by participant i and Ri is the ordered set of results (one per dependent variable) obtained by the same participant
– These results are the input for data analysis in the experimental study.
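One possible way to hold the tuples Ai = {Ti, Ri} in code is sketched below. The `Trial` class and its field names are assumptions made for this illustration; the two records use the values of participants 1 and 12 from the hypothetical study tabulated in these slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trial:
    participant: int
    treatments: tuple  # T_i: one treatment per independent variable
    results: tuple     # R_i: one result per dependent variable

# (time in days, fraction of right requirements found)
trials = [
    Trial(1, ("Y",), (10, 0.83)),
    Trial(12, ("X",), (13, 0.90)),
]

# Group the results by the treatment of the first independent variable.
by_technique = {}
for trial in trials:
    by_technique.setdefault(trial.treatments[0], []).append(trial.results)
print(by_technique)
```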
Variables and Execution
• Some tabulated data after the execution of a hypothetical study. These data will be used in the next slides.
Participant  Technique  Time (days)  % Right Found
1            Y          10           83%
2            Y          13           73%
3            Y          12           87%
4            Y          13           78%
5            Y          10           74%
6            Y          14           74%
7            Y          14           87%
8            Y          13           75%
9            Y          14           86%
10           Y          14           82%
11           Y          13           77%
12           X          13           90%
13           X          9            89%
14           X          11           88%
15           X          14           87%
16           X          9            97%
17           X          12           81%
18           X          9            82%
19           X          12           86%
20           X          11           92%
21           X          14           96%
22           X          13           98%
Variables and Execution
• After data tabulation, central tendency measurements, dispersion and dependency can be used together with graphical analysis to better “understand” the data.
• This understanding is important when selecting and applying the statistical inference techniques, that will support the hypothesis testing.
Graphical Visualization
• A chart visually represents the tabulated information
– Charts are usually easier to understand when compared to large tabulated data sets
– The spatial data presentation helps in the identification of groups and the visualization of relationships among them
– In general, charts can be quickly read
• Methods for graphical representation of data
– Histograms
– Pie Charts
– Dispersion charts
Graphical Visualization
• The graphical visualization methods can depend on the variables classification (continuous, discrete)
• Discrete variables can assume any value in a defined finite set of values
– They are more common in nominal or ordinal scales. However, they can also occur in the interval and ratio scales
• Continuous variables can assume any value in an interval with an infinite set of values
– They are common in the interval and ratio scales.
Histograms
• It shows the observed values regarding one specific variable in the frequency domain
– The frequency indicates the number or percentage of occurrences for each value from the collected values set
– If data is discrete, each value is presented as a bar whose height is the number of times that the value occurs in the value set
– If data is continuous, it must first be made discrete, that is, split into equal-width regions. Then count how many of the collected values fall in each region and draw one bar per region, as for discrete data.
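The counting behind both kinds of histogram can be sketched as follows, using the technique Y values from the hypothetical study. The helper names, the bin width of 5 percentage points and the starting point of 70 are choices made for this illustration only.

```python
from collections import Counter

def discrete_histogram(values):
    """Frequency of each distinct value, ordered by value."""
    return dict(sorted(Counter(values).items()))

def binned_histogram(values, bin_width, start=None):
    """Frequencies over equal-width regions [low, low + bin_width)."""
    low = min(values) if start is None else start
    counts = Counter(int((v - low) // bin_width) for v in values)
    return {
        (low + k * bin_width, low + (k + 1) * bin_width): counts[k]
        for k in range(max(counts) + 1)
    }

# Technique Y, participants 1 to 11 of the hypothetical study.
time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]
pct_right_y = [83, 73, 87, 78, 74, 74, 87, 75, 86, 82, 77]

time_hist = discrete_histogram(time_y)
pct_hist = binned_histogram(pct_right_y, bin_width=5, start=70)
print(time_hist)  # {10: 2, 12: 1, 13: 4, 14: 4}
print(pct_hist)   # {(70, 75): 3, (75, 80): 3, (80, 85): 2, (85, 90): 3}
```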
Histograms
• It is a common representation method for numerical data in any scale, because it involves only counting.
• Histograms also allow us to relate observed data to known frequency distributions
– These distributions have mathematical properties from which the statistical inference tests are derived
• If the observed data do not follow these properties (normality, for instance), we cannot be confident in the results of the tests. In these cases, other types of statistical tests must be used
Histograms
• Histogram of time spent by the participants in the analysis activity, according to the technique used
Data Distribution Table (# participants)
Time (days)  Technique Y  Technique X
9            0            3
10           2            0
11           0            2
12           1            2
13           4            2
14           4            2
[Figure: histogram; x-axis: Time (days); y-axis: # participants]
Cumulative Histogram
• A cumulative histogram shows the frequency of occurrence of values less than or equal to a specific value.
– Each bar in the graph represents the sum of that bar and all previous bars in a conventional histogram
– In some configurations, observing the cumulative histogram can suggest whether the hypothesis will be accepted or rejected (however, only statistical testing can confirm it!)
– Because data must be ordered, cumulative histograms cannot be used with nominal scale values.
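As a small sketch, the cumulative frequencies are just running sums of the ordinary histogram frequencies. The lists below are the per-day participant counts of the hypothetical study (techniques Y and X); the variable names are chosen for this illustration.

```python
from itertools import accumulate

days = [9, 10, 11, 12, 13, 14]
freq_y = [0, 2, 0, 1, 4, 4]  # participants per day, technique Y
freq_x = [3, 0, 2, 2, 2, 2]  # participants per day, technique X

# Each cumulative bar adds its own frequency to all previous ones.
cum_y = list(accumulate(freq_y))
cum_x = list(accumulate(freq_x))

for day, y, x in zip(days, cum_y, cum_x):
    print(day, y, x)
```

The last cumulative value of each series equals the number of participants in that group (11).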
Cumulative Histogram
• Cumulative histogram of time spent by the participants in the analysis activities with techniques X and Y
Data Distribution Table (cumulative # participants)
Time (days)  Technique Y  Technique X
9            0            3
10           2            3
11           2            5
12           3            7
13           7            9
14           11           11
[Figure: cumulative histogram; x-axis: Time (days), 9 to 14; y-axis: # participants, 0 to 12; series: Technique Y, Technique X]
Pie Chart
• A pie (pizza) chart shows the relative frequency (or percentage) of data occurrence, dividing the data by a set of distinct classes and presenting them as proportional slices in the circle.
[Figure: pie chart for Technique X, % participants per number of days]
9 days: 28%
11 days: 18%
12 days: 18%
13 days: 18%
14 days: 18%
Dispersion Diagram
• It shows the observed values of two or more variables in a Cartesian plot.
– Each axis represents one of the variables, composing tuples (two or more dimensions)
– This representation helps to identify patterns that can suggest relationships between variables
– Dispersion diagrams also help to identify values that deviate from normal behavior (outliers). Outliers can distort statistical analysis and should usually be eliminated before statistical tests.
Dispersion Diagram
• Dispersion between the percentage of right requirements found and execution time for activities with techniques X and Y
[Figure: dispersion diagram; x-axis: Time, 8 to 16; y-axis: % right requirements found, 60% to 100%; series: Y, X]
Control Charts
• Statistical tool that allows observing the behavior of quantitative data representing the characteristics under investigation
• A typical control chart presents 3 parallel lines:
– A central line, representing the mean behavior presented by the data
– An upper limit, called UCL (Upper Control Limit)
– A lower limit, called LCL (Lower Control Limit)
Control Charts
[Figure: control chart of the number of defects (y-axis, 0 to 70) per version (x-axis, 1 to 21); UCL = 26.81, Mean = 15.14, LCL = 3.46]
• If the characteristic behavior is under control, its values will fluctuate around the center line (for instance, the mean number of defects per software version), within the range between LCL and UCL.
• Once the characteristic behavior is under control, the probability of getting a value out of limits is very low.
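The slides do not state how the control limits in the chart were computed. The sketch below assumes the rule often used for defect counts (a c-chart: mean ± 3 times the square root of the mean), which approximately reproduces the UCL and LCL shown for a mean of 15.14. The defect counts passed to `out_of_control` are hypothetical.

```python
import math

def c_chart_limits(mean_count):
    """Return (LCL, center line, UCL) under the assumed c-chart rule."""
    spread = 3 * math.sqrt(mean_count)
    return max(0.0, mean_count - spread), mean_count, mean_count + spread

def out_of_control(counts, lcl, ucl):
    """List (version, count) pairs that fall outside the control limits."""
    return [(i, c) for i, c in enumerate(counts, start=1) if c < lcl or c > ucl]

lcl, center, ucl = c_chart_limits(15.14)
print(round(lcl, 2), center, round(ucl, 2))

# Hypothetical defect counts for five versions.
flagged = out_of_control([12, 18, 9, 31, 15], lcl, ucl)
print(flagged)  # [(4, 31)]
```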
Descriptive Statistics
• Objectives
• Central Tendency Measures
• Dispersion Measures
• Frequency Distribution
• Example
• Dependency Measurements
Objectives
• To describe the behavior and trends of the data collected in the experimental study through statistical methods
– Together with graphical analysis, it allows an initial analysis of the data and the measurement of dependencies and relationships among data.
• It aims to give a general view of the distribution of the data set.
Central Tendency Measures
• Show the middle values of the observed data set
– Mean (arithmetic): meaningful for the interval and ratio scales
– Median: represents the middle value of an ordered data set, such that the number of samples higher than the median is the same as the number of samples lower than the median
• Odd number of samples: the median is the middle sample
• Even number of samples: the median is the mean of the two middle samples
– Mode: represents the most commonly occurring sample. It is meaningful for the nominal, ordinal, interval and ratio scales.
• Well defined when just one value has the highest count
• Odd number of values tied for the highest count: the middle one of the tied values can be taken as the mode (not valid for nominal scale)
Central Tendency Measures
• Other relevant measures
– Minimum Value: the lowest observed value in the collected data set
– Maximum Value: the highest observed value in the collected data set
– Percentile: considering a sample with 100 values, the X% percentile is the value that splits the sample into X values lower than it and (100-X) values greater than it. The median is a special case of the percentile, namely the 50% percentile
– Quartile: values representing the 25% percentile (1st Quartile), the median (2nd Quartile) and 75% percentile (3rd Quartile).
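These measures can be computed with Python's standard `statistics` module, as sketched here for the 11 completion times recorded for technique Y in the hypothetical study. Several quartile conventions exist; the "inclusive" interpolation method is an assumption of this sketch, not something the slides specify.

```python
import statistics

# Time (days) for participants 1 to 11, technique Y.
time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]

mean = statistics.mean(time_y)
median = statistics.median(time_y)
modes = statistics.multimode(time_y)  # all values tied for the highest count
minimum, maximum = min(time_y), max(time_y)
q1, q2, q3 = statistics.quantiles(time_y, n=4, method="inclusive")

print(round(mean, 2), median, modes)  # 12.73 13 [13, 14]
print(minimum, maximum)               # 10 14
print(q1, q2, q3)                     # 12.5 13.0 14.0
```

Note that the second quartile coincides with the median, as stated above.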
Dispersion Measures
• Measure the level of variation from the central tendency, i.e. show how spread out or concentrated the data is
– Range: represents the distance between the maximum and minimum data values
– Variance: the mean of the squared distance from the sample mean. It is meaningful for the interval and ratio scales.
– Standard deviation: the square root of the variance, having the same dimension (unit of measure) as the data values themselves.
[Figure: two frequency curves with different spread around the mean; y-axis: frequency]
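A short sketch of the dispersion measures for the technique Y completion times of the hypothetical study. It assumes the sample (n - 1) form of the variance, which is what `statistics.variance` computes; the population form (dividing by n) is shown next to it for comparison.

```python
import statistics

# Time (days) for participants 1 to 11, technique Y.
time_y = [10, 13, 12, 13, 10, 14, 14, 13, 14, 14, 13]

value_range = max(time_y) - min(time_y)
sample_variance = statistics.variance(time_y)       # divides by n - 1
sample_stdev = statistics.stdev(time_y)             # square root of the above
population_variance = statistics.pvariance(time_y)  # divides by n

print(value_range)                    # 4
print(round(sample_variance, 2))      # 2.22
print(round(sample_stdev, 2))         # 1.49
print(round(population_variance, 2))  # 2.02
```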
Descriptive Statistics
[Figure: bar chart; x-axis: participants 1 to 11; y-axis: Activity Y (time in days), 0 to 15]
Bar chart representing the time consumed by each participant that applied technique Y in the analysis activity
Descriptive Statistics
Measures of Tendency
Mean                12.73
Median              13
Modes               13 and 14
Range               4
Minimum             10
Maximum             14
1st Quartile        12.5
3rd Quartile        14
Variance            2.22
Standard Deviation  1.49
[Figure: Technique Y, histogram of time; x-axis: Time (days), 9 to 14; y-axis: # participants, 0 to 5]
Descriptive Statistics
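Mean:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation:
$$s = \sqrt{s^2}$$
There are other measures (such as kurtosis, skewness, geometric mean, ...), but they are out of the scope of our discussion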
Frequency Distributions
• As seen, histograms can represent data in the frequency domain
• Histograms allow verifying whether the data distribution follows a classical distribution, such as normal, uniform or beta, among others.
• The normal distribution, in particular, is important for some statistical tests, which require that analyzed data follow a normal distribution.
Frequency Distributions
• The normal distribution has a bell shape, with the left and right tails extending from the central point.
– The curve is symmetric about its mean and its width is proportional to its standard deviation
– In this way, the curve can be defined by its mean and standard deviation.
Normal Distribution
• If a numerical data set follows the normal distribution, it is possible to claim:
– 68% of all observations are within the mean ± 1 standard deviation
– 95.5% of all observations are within the mean ± 2 standard deviations
– 99.7% of all observations are within the mean ± 3 standard deviations
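These percentages can be checked against the standard normal distribution using `statistics.NormalDist` from the Python standard library. The `coverage` helper is a name chosen for this sketch.

```python
from statistics import NormalDist

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))  # 0.6827, 0.9545, 0.9973
```

The result does not depend on the particular mean and standard deviation, since any normal distribution can be rescaled to the standard one.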
Measures of Dependency
• When two or more variables are related, it can be useful to calculate the dependency level among them.
• The Measures of Dependency define the strength and direction of the relationship among two or more variables when quantitatively evaluated.
– The most used Measure of Dependency is CORRELATION
– Correlation between two variables is represented by a number
– Correlation among more than two variables is represented through a correlation matrix
Measures of Dependency
The CORRELATION between two variables ranges from -1 to 1
A correlation of -1 indicates that a high value in one variable corresponds to a low value in the other one
A correlation of 1 indicates that a high value in one variable corresponds to a high value in the other one
A correlation of 0 (or near 0) indicates that no relationship between the variables can be inferred
CAUTION: CORRELATION alone does not imply CAUSATION!
Pearson Correlation
• Most common correlation coefficient
– Quantifies the linear association strength between two variables and describes how much a straight line could be adjusted to fit these points
– The coefficient assumes the data distribution is normal
• When the data is normally distributed, the points form an elliptical cloud in the dispersion diagram, which gives a visual indication that this condition holds
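A minimal sketch of the Pearson coefficient, computed directly from its definition (covariance divided by the product of the standard deviations). The data series are hypothetical and chosen so that the relationships are perfectly linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data.
a = [1, 2, 3, 4, 5]
b_up = [2, 4, 6, 8, 10]    # grows linearly with a
b_down = [10, 8, 6, 4, 2]  # falls linearly with a

r_up = pearson(a, b_up)
r_down = pearson(a, b_down)
print(round(r_up, 6), round(r_down, 6))
```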
Measures of Dependency
[Figure: three dispersion diagrams of B against A, with CORREL(A,B) = 0.02, CORREL(A,B) = 0.98 and CORREL(A,B) = -0.98]
Spearman Correlation
• Another example of a correlation coefficient
– This method is based on the ranks of the collected values and not on the values themselves
– In this way, it can be also used for variables in ordinal scales
• It can also be used when the distribution is not normal
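The rank-based idea can be sketched by replacing each value with its rank (tied values share the average of their ranks) and then applying the Pearson formula to the ranks. The helper names and the data are assumptions made for this illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, counted from 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def spearman(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman(x, y)
tied = average_ranks([10, 20, 20, 30])
print(round(rho, 6), tied)
```

Because only the ordering matters, a monotonic but non-linear relationship still yields a coefficient of 1.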
Regression Analysis
• Regression analysis extends the capacity of representation of dependency, providing an equation to describe the nature of the relationship.
• In simple regression analysis, the interest is in predicting the value of a dependent variable based on the value of an independent variable
[Figure: number of defects (y-axis, -10 to 60) per version (x-axis, 0 to 30); series: Observed, Linear]
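A simple linear regression can be sketched with the ordinary least squares formulas, as below. The version numbers and defect counts are hypothetical (they are not the data behind the figure), and `fit_line` is a helper written for this illustration.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: defects observed in five consecutive versions.
versions = [1, 2, 3, 4, 5]
defects = [5, 7, 9, 11, 13]

slope, intercept = fit_line(versions, defects)
predicted_v6 = slope * 6 + intercept  # prediction for the next version
print(slope, intercept, predicted_v6)  # 2.0 3.0 15.0
```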
Descriptive Statistics
• According to the variable scales, we can calculate:
Scale               Nominal  Ordinal  Interval  Ratio
Mean                -        -        X         X
Median              -        X        X         X
Mode                X*       X        X         X
Range               -        X        X         X
Variance            -        -        X         X
Standard Deviation  -        -        X         X
Corr Pearson        -        -        X         X
Corr Spearman       -        X        X         X
* Remember restrictions for nominal scale!
Outliers Analysis
• Concepts
• Conditions of Occurrence
• Visual Identification
• Numerical Identification
Outliers Removal
• Extreme values (outliers) are observed values that are too distant from the other values in the data set
– They can represent errors in the data set and usually must be removed before statistical analysis
– They can occur due to problems in study execution, typing or interpretation, or due to participants' motivation
– It is important to verify the origin of each outlier, because some are valid observations that should be kept in the data set (false positives)
Visual Identification
• Outliers can be visually identified through dispersion diagrams or box plots
– Box plots were designed to show the distribution of quantitative data
– They make use of measures of central tendency and dispersion to characterize the distribution
[Figure: box plot labeled, from top to bottom: Maximum Value, 3rd Quartile (mean + X standard deviations), Median, 1st Quartile (mean - X standard deviations), Minimum Value]
Numerical Identification
• Outlier removal methods usually remove values whose distance from the mean or median exceeds a given limit
– Values near the limits do not necessarily need to be removed from the data set (subjectivity)
– The distance is usually determined as one quartile, one percentile or a specified number of standard deviations
• Quartile Method
– Lower outliers: values below Q1 - 1.5*IQ
– Upper outliers: values above Q3 + 1.5*IQ
– Where IQ = Q3 - Q1.
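The quartile method can be sketched as follows, applied to the percentage of right requirements found with technique Y in the hypothetical study. The quartile convention ("inclusive" interpolation) is an assumption of this sketch, and the extra value of 40 is a made-up observation added only to show an outlier being flagged.

```python
import statistics

def quartile_fences(values):
    """Return (lower, upper) limits: Q1 - 1.5*IQ and Q3 + 1.5*IQ."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    return q1 - 1.5 * iq, q3 + 1.5 * iq

def find_outliers(values):
    lower, upper = quartile_fences(values)
    return [v for v in values if v < lower or v > upper]

# % of right requirements, participants 1 to 11 (technique Y).
pct_right_y = [83, 73, 87, 78, 74, 74, 87, 75, 86, 82, 77]

fences = quartile_fences(pct_right_y)
print(fences)                            # (59.5, 99.5)
print(find_outliers(pct_right_y))        # []
print(find_outliers(pct_right_y + [40])) # [40], hypothetical extra value
```

As noted above, a flagged value should still be traced back to its origin before it is removed.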
Numerical Identification
• Removing outliers of the percentage of right requirements found by the participants that applied the techniques X and Y
Participant  Time (days)  % right requirements
1            10           83%
2            13           73%
3            12           87%
4            13           78%
5            10           74%
6            14           74%
7            14           87%
8            13           75%
9            14           86%
10           14           82%
11           13           77%
Measure     Value
Minimum     73%
Mean - 1sd  74%
Mean        80%
Mean + 1sd  86%
Maximum     87%
Hypothesis Testing
• Experimental Studies Types
• Hypothesis Testing
• Errors, power and p-value
• Hypothesis Testing Types
• T-test
• Mann-Whitney
• ANOVA, Tukey
• Kruskal-Wallis
Experimental Studies Types
• Hypothesis Testing
  – Normal Distribution Data
    • 2 groups: t-test, paired Student's t-test
    • 3+ groups: ANOVA, Tukey
  – Non-Normal Distribution Data
    • 2 groups: Mann-Whitney (Wilcoxon rank-sum test), Wilcoxon signed-rank test
    • 3+ groups: Kruskal-Wallis
• Relationship Exploration
  – Normal Distribution Data: Pearson, Linear Regression
  – Non-Normal Distribution Data: Spearman, Non-Linear Regression
Hypothesis Testing
• As seen, an experimental study aims to collect data to confirm or refute a hypothesis
• In general, two hypotheses are defined:
– Null Hypothesis (H0): indicates that the observed differences are coincidental. This is the hypothesis the researcher would most like to reject with high confidence
– Alternative Hypothesis (H1): the hypothesis contrary to the null one, in favor of which the null hypothesis is rejected.
• Statistical tests support the decision to reject or not reject the null hypothesis
Hypothesis Testing
• In general, Software Engineering tests compare the mean between different groups of participants applying different treatments
“Developers using the technique Y can conclude the task of requirements analysis in less time and produce a more complete requirements set than when using the technique X”
Null Hypothesis: μ(TimeY) = μ(TimeX)
Alternative Hypothesis: μ(TimeY) < μ(TimeX)
Types of Errors
• Hypothesis verification always involves some risk, implying that an analysis error can happen
– Type I (α): it happens when the statistical test indicates a relationship between cause and effect that actually does not exist
– Type II (β): it happens when the statistical test does not indicate a relationship between cause and effect that actually does exist
α = P (type I error) = P (H0 is rejected | H0 is true)
β = P (type II error) = P (H0 is not rejected | H0 is false)
Types of Errors
• The null hypothesis is usually built to minimize type I errors
– Consider:
• H0: medicine A = medicine B
• H1: medicine A is better than medicine B
– Errors:
• Type I: concluding that medicine A is better than B when this is not true (they are equal)
• Type II: concluding that medicine A is equal to medicine B when this is not true (A is better)
Power of Testing
• Indicates the probability of rejecting the null hypothesis when it is false, that is, the probability of correctly deciding in favor of the alternative hypothesis
– The size of the type II error depends on the power of the test
– The power of a test is the probability that the test finds the relationship when the null hypothesis is false
– The statistical test with the highest power should be used to evaluate the hypothesis
Power = 1 - β
Power = P(H0 is rejected | H0 is false)
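The relation Power = 1 - β can be made concrete with a small calculation. The sketch below assumes a one-sided z-test with known standard deviation, a case chosen only because its power has a closed form; the function name and the example values (a difference of 0.5, standard deviation 1.0) are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu > mu0.

    effect is the true difference mu - mu0; sigma is assumed to be known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect / (sigma / n ** 0.5)      # how far the true effect moves the statistic
    beta = std_normal.cdf(z_crit - shift)    # P(H0 is not rejected | H0 is false)
    return 1 - beta

# Power grows with the number of participants for the same true difference.
for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

When the true difference is zero the function returns α itself, which matches the definition of the type I error.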
Significance Level
• Indicates the probability of a type I error occurring
– Most common significance levels (α): 10%, 5%, 1% and 0.1%
– The p-value is the lowest significance level at which the null hypothesis can be rejected
– There is statistical significance when the calculated p-value is lower than the adopted significance level
– For instance, when p = 0.0001 the result is clearly significant, because this value is much lower than the commonly used significance levels
– However, if p = 0.048 one cannot be so sure. Although the value is lower than 5%, it is very close to this significance level
Significance Level
• The decision-making process for a hypothesis test can be based on the probability value (p-value) for the given test:
– If the p-value is less than or equal to a predetermined level of significance (α-level), then you reject the null hypothesis and claim support for the alternative hypothesis
– If the p-value is greater than the α-level, you fail to reject the null hypothesis and cannot claim support for the alternative hypothesis
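The decision rule can be written as a very small function. This is only a sketch: the function name `decide` and the returned strings are hypothetical, and the p-values are the ones used as examples in these slides.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule for a chosen significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

# p = 0.0001 is rejected at every usual level; p = 0.048 depends on the level chosen.
for p in (0.0001, 0.048):
    for alpha in (0.10, 0.05, 0.01):
        print(f"p = {p}, alpha = {alpha}: {decide(p, alpha)}")
```

The significance level has to be fixed before the data are analysed; otherwise a borderline value such as 0.048 can be made to look significant or not after the fact.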
Significance Level
[Figure: regions of H0 acceptance and H0 rejection]
Degrees of Freedom
• Degrees of freedom (DF) is the amount of information (values) free to vary in the calculation of a statistic (formula)
• It is the number of independent values used in the estimation of a parameter
• In general, the number of degrees of freedom of an estimate equals the number of values used in the estimate minus the number of parameters estimated in the intermediate calculations needed to obtain it
• Thus, calculating the mean of a sample of size n requires all n observations, so this statistic has n degrees of freedom
• The estimate of the variance from a sample of size n has n - 1 degrees of freedom, because the sample mean must be calculated first in order to obtain the sample variance
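A short numeric check of the n - 1 rule, using only the Python standard library. The five sample values are made up for the illustration.

```python
from statistics import mean, pvariance, variance

sample = [12.0, 15.0, 11.0, 18.0, 14.0]   # hypothetical measurements
n = len(sample)
m = mean(sample)

# The deviations from the sample mean always sum to zero, so only n - 1
# of them are free to vary once the mean has been estimated.
deviations = [x - m for x in sample]
squared_sum = sum(d ** 2 for d in deviations)

divided_by_n = squared_sum / n                # treats the data as a whole population
divided_by_n_minus_1 = squared_sum / (n - 1)  # sample variance, n - 1 degrees of freedom

print(sum(deviations))                               # 0.0
print(divided_by_n, pvariance(sample))               # 6.0 6.0
print(divided_by_n_minus_1, variance(sample))        # 7.5 7.5
```

`statistics.variance` divides by n - 1 and `statistics.pvariance` divides by n, which is why the two results differ for the same data.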
Hypothesis Testing Types
• Hypothesis tests can be parametric or non-parametric
• Parametric Tests
– They use specific formulas, derived from known frequency distributions. Therefore, the data set to be tested must present a distribution that is:
• Normal: symmetric distribution
AND
• Homoscedastic: the variance is constant
Hypothesis Testing Types
• Non-Parametric Tests
– Should be used when the data distribution does not meet the requirements of parametric tests (normality, homoscedasticity)
– They are less powerful than parametric tests and do not assume any particular probability distribution in the data
– They use the ranks of the values instead of the values themselves
Normality
• Frequency distribution graphs of the normal curve (in blue) and some hypothetical data (red vertical lines)
[Figure panels: data with a distribution similar to normal; data with a non-normal distribution]
Normality Testing
• Kolmogorov-Smirnov (K-S) test
– Evaluates whether two samples have similar distributions, or whether one sample has a distribution similar to normal
– Frequently used to check normality in samples with at least 30 values
– Detects differences related to central tendency, dispersion and symmetry, but is very sensitive to long tails (high standard deviations)
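The K-S statistic is the largest vertical gap between the empirical distribution of the sample and the reference distribution. The sketch below computes it against a fully specified normal distribution. The data are artificial, the function name is hypothetical, and the threshold 1.36 / sqrt(n) is the usual large-sample approximation for α = 0.05; when the mean and standard deviation are estimated from the same sample, that threshold is no longer exact and a corrected table is needed.

```python
from statistics import NormalDist

def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of sample and the CDF of dist."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

n = 40
reference = NormalDist(mu=0.0, sigma=1.0)
# Artificial data: evenly spaced quantiles of the reference distribution,
# and the same values shifted by one standard deviation.
close_to_normal = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
shifted = [x + 1.0 for x in close_to_normal]

critical = 1.36 / n ** 0.5   # approximate critical value for alpha = 0.05
for name, data in (("close to normal", close_to_normal), ("shifted", shifted)):
    d = ks_statistic(data, reference)
    verdict = "reject H0" if d > critical else "fail to reject H0"
    print(f"{name}: D = {d:.3f}, {verdict}")
```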
Normality Testing
• Shapiro-Wilk (S-W) test
– Calculates the W value, indicating whether the sample xi follows the normal distribution
– Frequently used to check normality in samples with fewer than 50 values
– Small W values indicate that the distribution is not normal
– Used for small data sets, where extreme values can make K-S hard to use
Homoscedasticity
• A set of variables is homoscedastic if the variables have similar variances
– A classical example of lack of homoscedasticity is the relationship between the type of food consumed and salary:
• As a person's salary increases, the variety of food types the person can consume also increases
• A poor person usually spends a constant amount on food, consuming similar products
• A richer person may sometimes consume simpler products, but can also consume more sophisticated ones
• Thus, the richer a person is, the more types of food they can consume
Homoscedasticity
• Observed values for a hypothetical study, showing heteroscedasticity between two groups
[Figure: two charts, Group I (vertical axis from 0% to 90%) and Group II (vertical axis from 0% to 45%), each over the categories Very High, High, Medium, Low and Very Low]
Homoscedasticity Testing
• Levene’s Test
– Consider a variable Y with N values divided into K groups, where Ni is the number of values in group i
– Levene’s test accepts the hypothesis that the variances are homogeneous when its W statistic is less than the critical value for the chosen significance level, that is, when the p-value is greater than the significance level
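A minimal sketch of the Levene statistic, assuming the common variant based on absolute deviations from each group mean: W is then a one-way analysis of variance computed on those deviations. The function name and the two groups of values are hypothetical.

```python
def levene_w(groups):
    """Levene's W computed from absolute deviations to each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - mean of group i|
    z = []
    for g in groups:
        group_mean = sum(g) / len(g)
        z.append([abs(y - group_mean) for y in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Same mean in both groups, but clearly different spread.
group_i = [20.0, 22.0, 21.0, 23.0, 19.0]
group_ii = [14.0, 28.0, 21.0, 35.0, 7.0]
w = levene_w([group_i, group_ii])
print(f"W = {w:.3f} with ({2 - 1}, {10 - 2}) degrees of freedom")
```

W is compared with the critical value of the F distribution with K - 1 and N - K degrees of freedom, taken from a table or a statistics package; a W above that value indicates heterogeneous variances.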
Hypothesis Testing Types
Experimental Design | Parametric Test | Non-Parametric Test
One factor, one treatment | - | Binomial
One factor, two random treatments | T Test | Mann-Whitney
One factor, two paired treatments | Paired T Test | Wilcoxon signed-rank
One factor, more than two treatments | ANOVA | Kruskal-Wallis
T Test or Student-T
• Parametric test used to compare the means of two independent samples
– It is a family of tests: different variants are applied depending on whether the sample variances are similar (homoscedastic) or not
– Different variants are also applied depending on whether the samples are independent or paired
– Two samples are paired when there is a one-to-one relation between each value in one sample and a value in the other
• Example: a sample before training and a sample after training
– All T tests assume that the data are normally distributed
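The variants mentioned above differ only in how the standard error and the degrees of freedom are computed. The sketch below shows the three statistics side by side. The function names and data are hypothetical, and "Welch" is the usual name for the unequal-variance variant, which the slides do not name explicitly. The resulting t value is then compared with a Student t table for the given degrees of freedom.

```python
from statistics import mean, variance

def student_t(x, y):
    """Independent samples, equal variances assumed (pooled variance)."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / (pooled * (1 / n1 + 1 / n2)) ** 0.5
    return t, n1 + n2 - 2

def welch_t(x, y):
    """Independent samples, variances not assumed to be equal."""
    a, b = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (a + b) ** 0.5
    df = (a + b) ** 2 / (a ** 2 / (len(x) - 1) + b ** 2 / (len(y) - 1))
    return t, df

def paired_t(first, second):
    """Paired samples: a one-sample test on the pairwise differences."""
    diffs = [u - v for u, v in zip(first, second)]
    t = mean(diffs) / (variance(diffs) / len(diffs)) ** 0.5
    return t, len(diffs) - 1

# Hypothetical task times (minutes) for the same five people before and after training.
before = [10.0, 12.0, 9.0, 14.0, 11.0]
after = [8.0, 11.0, 9.0, 12.0, 10.0]
print("pooled:", student_t(before, after))
print("welch: ", welch_t(before, after))
print("paired:", paired_t(before, after))
```

For paired data the paired statistic is usually much larger in magnitude than the independent ones, because differences between people are removed from the error term.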
Mann-Whitney Test
• A non-parametric alternative to the T test
– Requires independent samples, with data on an ordinal, interval or ratio scale
– To perform the comparison, the samples are pooled and ordered
– The values are replaced by their ranks within the pooled group, and the rank sum of the smaller sample (T) is calculated
– Finally, the test statistic is calculated and compared with a table of critical values
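The ranking steps listed above can be sketched as follows. Tied values receive the average of the ranks they occupy, which is the usual convention. The function names and sample values are hypothetical, and the final comparison with a table of critical values is left out.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def rank_sum_of_smaller(x, y):
    """Return (T, U): rank sum of the smaller sample and its U statistic."""
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    ranks = average_ranks(list(small) + list(large))   # pool, then rank
    t = sum(ranks[:len(small)])                        # ranks of the smaller sample
    u = t - len(small) * (len(small) + 1) / 2
    return t, u

# Hypothetical defect counts found with two inspection techniques.
technique_x = [3, 5, 4]
technique_y = [6, 8, 7, 9]
print(rank_sum_of_smaller(technique_x, technique_y))
```

A rank sum at its minimum possible value (U = 0) means every value of the smaller sample lies below every value of the other sample, which is the strongest possible evidence of a difference for those sample sizes.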
ANOVA – Analysis of Variance
• Statistical technique for testing the equality of the means of two or more groups
– Allows the comparison of means from different treatments, being used as an extension of the T test
– Evaluates whether the variability within the groups is greater than the variability among the groups
– The technique assumes independence, normality and homoscedasticity of the group samples
• Since its aim is to evaluate whether the means are equal regardless of the factor, the ANOVA null hypothesis states that the variation due to the factor is equal to zero
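For one factor, the comparison between the two sources of variability reduces to the F statistic: the mean square among groups divided by the mean square within groups. A minimal sketch with made-up data and a hypothetical function name:

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-factor design."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variability among the groups (explained by the factor).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variability within the groups (not explained by the factor).
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical effort (hours) under three treatments.
treatments = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(treatments))   # (3.0, 2, 6)
```

F is compared with the critical value of the F distribution for the two degrees of freedom. A value close to 1 means the factor explains no more variation than chance, while a large value speaks against the null hypothesis.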
Tukey Test
• Test to compare means
– Usually used when ANOVA identifies a difference among the means of multiple samples
• ANOVA shows that the means differ, but not which ones differ
– The Tukey test supports the identification of the means that do differ
Kruskal-Wallis Test
• A non-parametric alternative to analysis of variance
– Like most non-parametric tests, it is based on the ranks of the values rather than on the values themselves
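A sketch of the rank-based H statistic, assuming the standard formula without the correction for ties. The function names and data are hypothetical. H is usually compared with a chi-square table with k - 1 degrees of freedom, where k is the number of groups.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def kruskal_wallis_h(groups):
    """H statistic computed on the ranks of the pooled data (no tie correction)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])   # rank sum of this group
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical scores under three treatments.
h = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])
print(f"H = {h:.3f} with {3 - 1} degrees of freedom")
```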