1 Experimentation in Computer Science – Part 3. 2 Experimentation in Software Engineering ---...


Experimentation in Computer Science – Part 3


Experimentation in Software Engineering --- Outline

Empirical Strategies

Measurement

Experiment Process (Continued)

Experiment Process: Phases

[Figure: an experiment idea enters the experiment process, which consists of five phases (Experiment Definition, Experiment Planning, Experiment Operation, Analysis & Interpretation, Presentation & Package) and yields conclusions.]


Experiment Planning: Overview

[Figure: the steps of Experiment Planning, which feed into Experiment Operation:]

Context Selection

Hypothesis Formulation

Variables Selection

Selection of Subjects

Experiment Design

Instrumentation

Validity Evaluation

Experiment Planning: Instrumentation

Instrumentation types:

  Objects (e.g., specs, code)
  Guidelines (e.g., process descriptions, checklists, tutorial documents)
  Measurement instruments (surveys, forms, automated data collection tools)

Overall goal of instrumentation: facilitate the performance of the experiment without affecting its control (instrumentation must not affect outcomes)

Experiment Planning: Validity Evaluation

Threats to external validity concern the ability to generalize results outside the experimental setting

Threats to internal validity concern the ability to conclude that a causal effect exists between independent and dependent variables

Threats to construct validity concern the extent to which variables and measures accurately reflect the constructs under study.

Threats to conclusion validity concern issues that affect our ability to draw accurate statistical conclusions

Experiment Planning: Process and Threats Related

[Figure: at the theory (hypothesis) level, a cause construct is related to an effect construct (cause-effect construct). At the observation level, a treatment (independent variable) is related to an outcome (dependent variable) (treatment-outcome construct). A second version of the figure marks where the four kinds of validity threat arise: external, construct (twice), internal, and conclusion.]

Experiment Planning: Threats to External Validity

Population: subject population not representative of population we wish to generalize to

Place: experimental setting or materials not representative of setting we wish to generalize to

Time: experiment is conducted at a time that affects results

Reduce external validity threats in a given experiment by making the environment as realistic as possible; however, reality is not homogeneous, so it is important to report environment characteristics.

Reduce external validity threats long-term through replication.

Experiment Planning: Threats to Internal Validity

Instrumentation: measurement tools report inaccurately or affect results

Selection: groups selected are not equivalent

Learning: subjects learn over the course of the experiment, altering later results

Mortality: subjects drop out of the experiment

Social Effects: e.g., control group resents treatment group (demoralization or rivalry)

Reduce internal threats through careful experiment design.
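One common design measure against the selection threat is random assignment of subjects to groups. The sketch below is only an illustration: the function name assign_groups, the subject IDs, and the treatment labels are hypothetical and do not come from the slides.

```python
import random


def assign_groups(subjects, treatments, seed=None):
    """Randomly assign subjects to treatments in (near-)equal group sizes."""
    rng = random.Random(seed)        # seeded so the assignment can be reproduced
    shuffled = list(subjects)
    rng.shuffle(shuffled)            # random order removes experimenter choice
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        # Deal subjects round-robin so group sizes differ by at most one.
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


if __name__ == "__main__":
    # Hypothetical subject IDs and treatment labels.
    subjects = [f"S{i}" for i in range(1, 11)]
    for name, members in assign_groups(subjects, ["treatment", "control"], seed=42).items():
        print(name, members)
```

Recording the seed alongside the results lets a replication reproduce the exact assignment.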

Experiment Planning: Threats to Construct Validity

Inadequate preoperational explication of constructs: theory isn’t clear enough (e.g. what is “better”)

Mono-operation or mono-method bias: using a single independent variable, case, subject, treatment, or measure may under-represent constructs

Levels of constructs: using incorrect levels of constructs may confound presence of construct with its level

Integration of testing and treatment: testing itself makes subjects sensitive to treatment; test is part of treatment

Social effects: experimenter expectancy, evaluation apprehension, hypothesis guessing

Reduce construct threats through careful design and replication.

Experiment Planning: Threats to Conclusion Validity

Low statistical power: increases risk of being unable to reject a false null hypothesis

Violated assumptions of statistical tests: some tests have assumptions, e.g. about normally distributed and independent samples

Fishing: searching for a specific result causes analyses to not be independent, and researchers may influence results by seeking specific outcomes

Reliability of measures: if you can’t measure the result twice with equal outcomes, measures aren’t reliable

Reduce conclusion validity threats through careful design, and perhaps through consultation with statistical experts
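To get a feel for the low-power threat before running an experiment, power can be roughly estimated from the expected effect size and the group size. The sketch below uses a normal approximation for a two-sided comparison of two equally sized, independent groups; the function name approx_power and the example numbers are assumptions for illustration, and a statistical package should be preferred for real planning.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    effect_size is the standardized mean difference (difference in means
    divided by the common standard deviation). Uses a z-test approximation,
    so it is slightly optimistic for small groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                 # rejection threshold
    shift = effect_size * (n_per_group / 2) ** 0.5    # mean of the test statistic
    # Probability that the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical planning question: how does power grow with group size?
    for n in (10, 20, 40, 80):
        print(n, round(approx_power(0.5, n), 2))
```

With an effect size of zero the function returns alpha, which is a quick sanity check: when H0 is true, the chance of rejecting it equals the type I error rate.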

Experiment Planning: Priorities Among Validity Threats

Decreasing some types of threats may cause others to increase. (E.g. using CS students increases group size, reduces heterogeneity, aids conclusion validity, reduces external validity.)

Tradeoffs need to be considered for the type of study:

  Theory testing is more interested in internal and construct validity than external
  Applied experimentation is more interested in external and possibly conclusion validity


Experiment Operation: Overview

Experiment operation: carrying out the actual experiment and collecting data

Three phases:

  Preparation
  Execution
  Data validation


Experiment Operation: Preparation

Locate participants

Offer inducements to obtain participants

Obtain participant consent, maybe also IRB approval

Consider confidentiality (maintain it, inform participants about it)

Avoid deception where it affects participants; if deception is used, reveal it later and discuss why it was necessary (beware validity tradeoffs; providing information is good but may affect results)

Prepare instrumentation:

  Objects, guidelines, tools, forms
  Use pilot studies and walkthroughs to reduce threats


Experiment Operation: Execution

Execution might take place over a small set of specified occasions, or across a long time span

Data collection takes place: subjects or interviewers fill out forms, tools collect metrics

Consider interaction between experiment and environment, e.g., if experiment is being performed in-vivo, watch for confounding effects (experiment process altering behavior)


Experiment Operation: Data Validation

Verify that data has been collected correctly

Verify that data is reasonable

Consider whether outliers exist and should be removed (must be for good reasons)

Verify that experiment was conducted as intended

Post-experiment questionnaires can assess whether subjects understood instructions


Analysis and Interpretation: Overview

Quantitative interpretation can include:

  Descriptive statistics: describe and graphically present the data set; used before hypothesis testing to better understand the data and identify outliers
  Data set reduction: locate and possibly remove anomalous data points
  Hypothesis testing: apply statistical tests to determine whether the null hypothesis can be rejected
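The descriptive-statistics step can be sketched with Python's standard library alone. The helper name describe and the sample measurements are hypothetical; which summary values to report is a choice made here for illustration, not something the slides prescribe.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numeric measurements."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),          # sample standard deviation
    }


if __name__ == "__main__":
    # Hypothetical data: defects found per subject in an inspection task.
    defects_found = [7, 9, 6, 8, 10, 7, 8, 21]
    for name, value in describe(defects_found).items():
        print(f"{name:>6}: {value}")
```

A large gap between mean and median, or a maximum far above the third quartile, is a hint to look at the data more closely before any hypothesis test.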


Analysis and Interpretation: Visualizing Data Sets

Graphs are effective ways to provide an overview of a data set

Basic graph types for use in visualization:

  Scatter plots
  Box plots
  Line plots
  Bar charts
  Cumulative bar charts
  Pie charts


Analysis and Interpretation: Data Set Reduction

Hypothesis testing techniques depend on quality of data set; data set reduction improves data set quality by removing anomalous data (outliers)

Outliers can be removed, but only for reasons such as that they represent rare events not likely to occur again

  Scatter plots can help find outliers
  Statistical tests can determine probabilities that points are outliers

Highly redundant data can be hard to analyze; factor analysis and principal components analysis can identify orthogonal factors with which to replace the redundant factors
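A simple way to locate candidate outliers is the box-plot fence rule (values more than 1.5 interquartile ranges outside the quartiles). This is a common heuristic swapped in here for illustration, not a method the slides name; the function name flag_outliers and the data are hypothetical. The sketch only flags points: whether to remove them still needs a good reason.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside the box-plot fences.

    Fences are q1 - k*IQR and q3 + k*IQR, where IQR = q3 - q1.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical task completion times in minutes; one subject took far longer.
    times = [10, 12, 11, 13, 12, 11, 95]
    print("candidates for review:", flag_outliers(times))
```

Flagged points should be checked against the experiment log (e.g., an interruption during the session) before any decision to drop them.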


Analysis and Interpretation: Hypothesis Testing

Hypothesis testing: can we reject H0?

  If statistical tests say we can’t, we draw no conclusions
  If tests say we can, we reject H0 at a given significance level α

α = P(type-I-error) = P(reject H0 | H0 is true).

We also calculate the p-value: the lowest possible significance level at which we can reject H0

Typically, α is 0.05; to claim significance, the p-value must be < α
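The decision rule can be illustrated with a small permutation test, which estimates the p-value directly as the share of random regroupings that produce a difference at least as large as the observed one. The permutation test is a stand-in chosen because it needs only the standard library; it is not one of the tests the slides list. The function name permutation_test and both data sets are hypothetical.

```python
import random


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: how often a random relabeling of the
    pooled data gives a mean difference at least as extreme as observed.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # relabel subjects at random
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed - 1e-12:             # tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    ALPHA = 0.05
    # Hypothetical defect counts found with two inspection techniques.
    technique_1 = [12, 14, 15, 13, 16, 14, 15, 13]
    technique_2 = [9, 11, 10, 12, 10, 9, 11, 10]
    p = permutation_test(technique_1, technique_2)
    print(f"p-value = {p:.4f}; reject H0: {p < ALPHA}")
```

If the p-value is not below α, the slide's rule applies: H0 is not rejected and no conclusion is drawn.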


Analysis and Interpretation: Statistical Tests per Design

Design | Parametric | Non-parametric
One factor, one treatment | (none) | Chi-2, Binomial test
One factor, two treatments, completely randomized | t-test, F-test | Mann-Whitney, Chi-2
One factor, two treatments, paired comparison | paired t-test | Wilcoxon, Sign test
One factor, more than two treatments | ANOVA | Kruskal-Wallis, Chi-2
More than one factor | ANOVA | (none)
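As one concrete instance from the table, the sign test for a paired comparison can be computed exactly from the binomial distribution. This is a minimal sketch: the function name sign_test and the before/after scores are hypothetical, and tied pairs are simply dropped, which is one common convention.

```python
from math import comb


def sign_test(first, second):
    """Exact two-sided sign test for paired measurements.

    Under H0, a positive and a negative difference are equally likely,
    so the number of positive differences follows Binomial(n, 0.5).
    """
    diffs = [s - f for f, s in zip(first, second) if s != f]   # drop ties
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    tail = min(positives, n - positives)
    # Probability of a split at least this lopsided, counted in both directions.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


if __name__ == "__main__":
    # Hypothetical scores for the same ten subjects under two treatments.
    treatment_a = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5]
    treatment_b = [7, 8, 6, 8, 7, 7, 6, 6, 8, 7]
    print("p-value:", sign_test(treatment_a, treatment_b))
```

The sign test uses only the direction of each difference, so it makes few assumptions but also has less power than the paired t-test or Wilcoxon test when their assumptions hold.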


Analysis and Interpretation: Statistical Tests

Important to choose the right test; the test must be appropriate for the type of data:

  are data items paired or not?
  is data normally distributed or not?
  are data sets completely independent or not?

Take a stats course, see texts such as Montgomery, consult with statisticians, use statistical packages


Analysis and Interpretation: Statistical vs Practical Significance

Statistical significance does not imply practical importance. E.g. if T1 is shown with statistical significance to be 1% more effective than T2, it must still be decided whether 1% matters

Lack of statistical significance does not imply lack of practical importance. The fact that H0 cannot be rejected at level α does not mean that H0 is true, and results of high practical importance may justify using a less stringent significance level
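One way to report practical importance next to a p-value is an effect size. The sketch below computes Cohen's d (the mean difference in units of the pooled standard deviation) and the relative difference in means; both are conventions brought in here for illustration rather than taken from the slides, and the function names and data are hypothetical.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def relative_difference(a, b):
    """Difference in means as a fraction of the mean of b."""
    return (mean(a) - mean(b)) / mean(b)


if __name__ == "__main__":
    # Hypothetical effectiveness scores for treatments T1 and T2.
    t1 = [81, 79, 83, 80, 82, 78]
    t2 = [80, 78, 82, 79, 81, 77]
    print("Cohen's d:", round(cohens_d(t1, t2), 2))
    print("relative difference:", f"{relative_difference(t1, t2):.1%}")
```

Whether a given effect size matters is a domain judgment; the numbers only make that judgment explicit.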


Presentation: An Outline for an Experiment Report

1. Introduction, Motivation
2. Background, Prior Work
3. Empirical Study
  3.0 Research Questions
  3.1 Objects of analysis
    3.1.1 participants
    3.1.2 objects
  3.2 Variables and measures
    3.2.1 independent variables
    3.2.2 dependent variables
    3.2.3 other factors
  3.3 Experiment setup
    3.3.1 setup details
    3.3.2 operational details
  3.4 Analysis strategy
  3.5 Threats to validity
  3.6 Data and analysis
4. Interpretation
5. Conclusions

Presentation Issues

• Supporting replicability
• What to say and what not to say?
• How much to say?
• Describing design decisions
