Asking Questions Handouts

Page 1: Asking Questions Handouts

The Right Questions about Statistics Maths Learning Centre

By Dr David Butler © 2012 The University of Adelaide 1

The purpose of Statistics is to ANSWER QUESTIONS USING DATA

Know the type of question and you can choose what type of statistics...

The purpose of Statistics is to ANSWER QUESTIONS USING DATA

Know more about your data and you can choose what statistical method...

HOW THE DATA IS COLLECTED

what is done to the subjects?

when is information recorded?

how are the subjects chosen?

VARIABLES IN THE DATA

how to measure?

what type?

defining groups or measurements?

what distribution?

HOW MUCH DATA

lots of things recorded per subject?

lots of subjects?

missing data?

Aim: DESCRIBE

Type of question: What's going on?

Examples:

How many chapters do novels have?

What possibilities are there for body temperature after a meal with or without chilli?

What sort of relationship might the amount of sleep a student gets have with their grades?

What sorts of things might be related to whether a person does volunteer work?

Type of Statistics: Descriptive statistics: graphs and basic numbers

Aim: DECIDE

Type of question: Yes or no?

Examples:

Is the median number of chapters in a novel 20?

Is your body temperature higher after a meal if it has chilli in it?

Does getting more sleep affect a student’s grades?

Are women more likely to participate in volunteer work than men?

Type of Statistics: Hypothesis tests (p-values)

Aim: ESTIMATE

Type of question: What's this number?

Examples:

What is the median number of chapters in a novel?

How much higher is your body temperature after a chilli meal compared to one without?

On average, how much of an effect does 30 minutes more sleep have on a student’s grades?

How much more (or less) likely is a woman to participate in volunteer work than a man?

Type of Statistics: Confidence intervals

Aim: PREDICT / EXPLAIN

Type of question: What's the formula?

Examples:

How can I explain a person’s body temperature after a meal using their temperature before and the chilli content of the meal?

How can I calculate a student’s grade based on their number of hours of sleep during semester?

How can I use a person’s gender, age, income and religion to predict their chances of participating in volunteer work?

Type of Statistics: Modelling and regression

TYPES OF VARIABLES (things you record)

Numerical / Quantitative / Scale (numbers: how far apart has meaning)

o Continuous (measured)

o Discrete (counted)

Categorical / Qualitative (words: how far apart has no meaning)

o Nominal (names: more or less has no meaning)

o Ordinal (ordered: more or less has meaning)

DISTRIBUTIONS OF NUMERICAL VARIABLES (how the possible values are spread out)

Approximately normal – parametric tests will be fine

Skewed or worse – non-parametric tests might be better

WHAT EXPLANATORY CATEGORICAL VARIABLES DEFINE:

Independent Groups OR Repeated Measures (matched pairs)

DATA ENTRY

Statisticians say: "PLEASE make it consistent!"

Each subject's records (eg: subject 1: age = 18, gender = M, chilli = Y, temp = 37) BECOMES... one row of a table:

   gender  age  chilli  temp
1  M       18   Y       37
2  M       25   N       36
3  F       19   Y       38
4  F       21   N       35

Independent Groups (eg: subject 1: chilli = Y, temp = 38) BECOMES...

   chilli  temp
1  Y       38
2  N       37
3  Y       36
4  N       36

Repeated Measures (matched pairs) (eg: subject 1: (chilli = Y) temp = 38, (chilli = N) temp = 37) BECOMES...

   (chilli = Y) temp  (chilli = N) temp
1  38                 37
2  36                 37
3  37                 36
4  37                 35
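The two data-entry layouts can be sketched in code. This is only an illustration, not part of the handout: the values are copied from the chilli tables, while the key names (`temp_chilli`, `temp_no_chilli`) and the helper functions are hypothetical. It shows why the layout matters: repeated measures give one difference per subject, while independent groups give one list of values per group.

```python
# Repeated measures: one row per subject, one column per measurement.
repeated_measures = [
    {"subject": 1, "temp_chilli": 38, "temp_no_chilli": 37},
    {"subject": 2, "temp_chilli": 36, "temp_no_chilli": 37},
    {"subject": 3, "temp_chilli": 37, "temp_no_chilli": 36},
    {"subject": 4, "temp_chilli": 37, "temp_no_chilli": 35},
]

# Independent groups: one row per subject, with a column naming the group.
independent_groups = [
    {"subject": 1, "chilli": "Y", "temp": 38},
    {"subject": 2, "chilli": "N", "temp": 37},
    {"subject": 3, "chilli": "Y", "temp": 36},
    {"subject": 4, "chilli": "N", "temp": 36},
]


def paired_differences(rows):
    """Difference (chilli minus no chilli) within each subject."""
    return [row["temp_chilli"] - row["temp_no_chilli"] for row in rows]


def temps_by_group(rows):
    """Collect the temperatures separately for each chilli group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["chilli"], []).append(row["temp"])
    return groups


differences = paired_differences(repeated_measures)
groups = temps_by_group(independent_groups)
print(differences)
print(groups)
```

Keeping every row in the same shape is what makes helpers like these (and statistical software) work without special cases.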

HOW HYPOTHESIS TESTING WORKS

A hypothesis test is designed to DECIDE the answer to a YES OR NO question using DATA.

This is how to do a hypothesis test:

Have a yes-or-no question.

Collect data.

Calculate a test statistic.

Figure out the distribution if you assume a particular answer.

Calculate a p-value.

Decide the answer based on the p-value.

This is what a hypothesis test means:

It tells you if your data is likely or unlikely given a particular situation (the “null hypothesis”).

A low p-value means your data is unlikely and you don’t believe you’re in that situation.

A high p-value means your data is likely and you do believe you could be in that situation.
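The steps above can be sketched in code. This is a minimal illustration rather than part of the handout: it uses the sign test (the handout's method for "Is the median equal to #?") because its p-value can be computed exactly from the binomial distribution with the Python standard library. The function name `sign_test_p_value`, the chapter counts and the 5% cut-off are assumptions made for the example.

```python
from math import comb


def sign_test_p_value(data, null_median):
    """Two-sided sign test p-value for 'the population median equals null_median'."""
    # Values equal to the hypothesised median carry no sign, so they are not counted.
    above = sum(1 for x in data if x > null_median)
    below = sum(1 for x in data if x < null_median)
    n = above + below
    # Test statistic: the smaller of the two counts.
    k = min(above, below)
    # If the null hypothesis is true, each sign is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Made-up chapter counts for 12 novels (illustrative only, not real data).
chapters = [12, 35, 28, 41, 22, 30, 25, 19, 33, 27, 45, 24]

p = sign_test_p_value(chapters, null_median=20)
print(f"p-value = {p:.4f}")
print("Decide: median is not 20" if p < 0.05 else "Decide: median could be 20")
```

With these made-up numbers, ten novels are above 20 and two are below, which would be unusual if the median really were 20, so the p-value comes out low.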

HOW CONFIDENCE INTERVALS WORK

A confidence interval is designed to give a RANGE of possible answers for a “WHAT’S THE NUMBER?” question, using DATA from a sample.

This is how to find a confidence interval:

Have a “what’s the number?” question.

Collect data.

Choose a matching hypothesis test.

Work backwards to calculate two ends.

The confidence interval is between these two values.

This is what a confidence interval means:

The values in the CI would be retained with a matching hypothesis test.

The values in the CI have a high chance of producing data like yours.

The values in the CI are those you are “happy to believe” based on your data.
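"Working backwards" can be illustrated by trying many candidate values and keeping the ones a matching hypothesis test would retain. The sketch below is an assumption-laden illustration, not the handout's procedure: it uses a sign test for the median, checks only whole-number candidates, and uses a 5% significance level. Statistical software normally finds the two ends directly instead of scanning.

```python
from math import comb


def sign_test_p_value(data, null_median):
    """Two-sided sign test p-value for 'the population median equals null_median'."""
    above = sum(1 for x in data if x > null_median)
    below = sum(1 for x in data if x < null_median)
    n = above + below
    k = min(above, below)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def retained_medians(data, candidates, alpha=0.05):
    """Candidate medians that the sign test would NOT reject at level alpha."""
    return [m for m in candidates if sign_test_p_value(data, m) >= alpha]


# Made-up chapter counts for 12 novels (illustrative only, not real data).
chapters = [12, 35, 28, 41, 22, 30, 25, 19, 33, 27, 45, 24]

candidates = range(min(chapters), max(chapters) + 1)
retained = retained_medians(chapters, candidates)
interval = (min(retained), max(retained))
print(f"Approximate 95% confidence interval for the median: {interval[0]} to {interval[1]}")
```

Every value inside the interval is one you would be "happy to believe" based on this data; every value outside it would be rejected by the test.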

HOW REGRESSION WORKS

Regression is a method designed to create a FORMULA that uses some information to PREDICT/EXPLAIN an outcome, using DATA.

This is how to perform regression:

Have a “what’s the formula?” question.

Collect data.

Look at the pattern – usually with a scatterplot – to choose a formula.

Get a computer to calculate the numbers and p-values.

Check the p-values.

Choose your final formula.

This is what regression means:

It tells you a formula for how an outcome varies based on other information.

It does NOT tell you if some things CAUSE others, only how to calculate them as accurately as possible.

The computer output will tell you p-values and confidence intervals to answer other types of questions.

More details:

DESCRIBING A RELATIONSHIP:

o Scatterplot describes relationship – and helps choose a good formula.

o Correlation coefficient (r) measures how strong a linear relationship is. Ranges from -1 (perfect negative) to 0 (no relationship) to 1 (perfect positive). Ignores how steep the slope is, only says how close to a line.

FINDING AND INTERPRETING THE FORMULA:

o Computer program will use the data to find the numbers that make the formula fit best.

o The coefficient says how much the outcome changes (on average) for a change of 1 in the explanatory variable.

LOOKING AT P-VALUES:

o The p-value that goes with the F-statistic in the ANOVA table tells you whether all the variables at once have a relationship with the outcome. Low p-value means the relationship is “significant”.

o The p-value for each coefficient tells you whether that explanatory variable appears to have a relationship with the outcome. Low p-value means the effect is “significant”.

LOOKING AT CONFIDENCE INTERVALS:

o The confidence interval that goes with an explanatory variable tells you how large or small the real effect could be.

NOTE: Regression has assumptions that must be checked in order to use it properly, especially if you plan to use the p-values and confidence intervals.
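To make "find the numbers that make the formula fit best" concrete, here is a small sketch of least-squares fitting for one explanatory variable, together with the correlation coefficient r. It is only an illustration: the function name `fit_line` and the sleep/grade numbers are made up, and it does not compute the p-values or confidence intervals that real statistical software reports.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # change in outcome for a change of 1 in x
    intercept = mean_y - slope * mean_x    # predicted outcome when x = 0
    r = sxy / sqrt(sxx * syy)              # how close the points are to a line
    return intercept, slope, r


# Made-up data: hours of sleep and grade for five students (illustrative only).
sleep_hours = [5, 6, 7, 8, 9]
grades = [55, 62, 66, 73, 79]

intercept, slope, r = fit_line(sleep_hours, grades)
print(f"grade = {intercept:.1f} + {slope:.1f} * sleep_hours, r = {r:.3f}")
```

Note that r only says how closely the points follow a line; the slope is what says how much the grade changes per extra hour of sleep.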

Turning a research question into a statistical question.

ORIGINAL QUESTION:

TYPE OF QUESTION:

DESCRIBE – what’s going on?

DECIDE – yes or no?

ESTIMATE – what’s this number?

PREDICT/EXPLAIN – what’s the formula?

ABOUT ONE CONCEPT OR ABOUT RELATIONSHIPS BETWEEN CONCEPTS

TYPES OF VARIABLES:

Each Concept BECOMES... a Variable: NUMERICAL OR CATEGORICAL

WHAT EXPLANATORY CATEGORICAL VARIABLES DEFINE:

Independent Groups OR Repeated Measures (matched pairs)

DISTRIBUTION OF OUTCOME NUMERICAL VARIABLE:

Note: This probably doesn’t matter if you have a lot of data.

STATISTICAL QUESTION:

eg: DESCRIBE, with one Variable: NUMERICAL

eg: DECIDE, with Variables: NUMERICAL and CATEGORICAL, Independent Groups

Note: In the list below, the outcome variables are usually assumed to be normal.

Statistical methods for statistical questions

Variables: NUMERICAL

DESCRIBE: Numbers: Mean & standard deviation (skewed or worse: median & IQR).
    Graphs: Histogram / Boxplot.
DECIDE: “Is the mean equal to #?” – one sample t-test.
    “Is the median equal to #?” – sign test.
ESTIMATE: “What is the mean?” – confidence interval for a mean.

Variables: CATEGORICAL

DESCRIBE: Numbers: Table of percentages or proportions.
    Graphs: Histogram.
DECIDE: “Is this percentage equal to #?” – z-test for a single proportion.
    “Are percentages distributed according to #, #, #?” – chi-squared test for goodness of fit.
ESTIMATE: “What is this percentage?” – confidence interval for a proportion.

Variables: NUMERICAL and CATEGORICAL (2 categories), Independent Groups

DESCRIBE: Numbers: Means & standard deviations for each group (skewed or worse: medians & IQRs for each category).
    Graphs: Histograms on same scale / side-by-side boxplots.
DECIDE: “Are the means equal?” – unpaired t-test (skewed or worse: Mann-Whitney U-test or Wilcoxon rank-sum test).
ESTIMATE: “What is the difference between the means?” – confidence interval for the difference in means.

Variables: NUMERICAL and CATEGORICAL (2 categories), Repeated Measures

DESCRIBE: Numbers: Mean & standard deviation of differences between measurements.
    Graphs: Histogram of the differences between measurements.
DECIDE: “Is there a mean difference?” – paired t-test (skewed or worse: Wilcoxon signed ranks test).
ESTIMATE: “What is the mean difference?” – confidence interval for the mean difference.

Variables: NUMERICAL and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Numbers: Mean & standard deviation of each group.
    Graphs: Histograms/boxplots on the same scale. Line graph showing mean of each group.
DECIDE: “Are the means equal?” – one-way analysis of variance (ANOVA) with post-hoc t-tests (skewed or worse: Kruskal-Wallis test).
ESTIMATE: “What are the differences between means?” – confidence intervals for each difference in means.
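The DESCRIBE numbers for a numerical variable split into groups can be computed with Python's standard `statistics` module. This is a sketch under assumptions: the temperatures are made up, and the helper name `describe` and the group labels are hypothetical. The quartiles come from `statistics.quantiles` with its default method, and other software may use slightly different quartile rules.

```python
import statistics


def describe(values):
    """Mean & standard deviation, plus median & IQR for skewed data."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),       # sample standard deviation
        "median": statistics.median(values),
        "iqr": q3 - q1,                       # interquartile range
    }


# Made-up body temperatures for two independent groups (illustrative only).
temps = {
    "chilli": [36.8, 37.1, 37.4, 37.0, 37.6],
    "no chilli": [36.5, 36.9, 36.7, 37.0, 36.6],
}

summary = {group: describe(values) for group, values in temps.items()}
for group, numbers in summary.items():
    print(group, {name: round(value, 2) for name, value in numbers.items()})
```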

Variables: NUMERICAL and CATEGORICAL (any# categories), Repeated Measures

DESCRIBE: Graphs: Line graph for each subject showing changing value of variable.
DECIDE: “On average, does the value change for each person across categories?” – repeated measures ANOVA with post-hoc paired t-tests / mixed effects regression.
ESTIMATE: “What are the mean differences between categories?” – confidence intervals for mean differences.

Variables: CATEGORICAL (2 categories) and CATEGORICAL (2 categories), Independent Groups

DESCRIBE: Numbers: Two-way table of counts or %s. Odds ratios.
    Graphs: Histogram for each explanatory category.
DECIDE: “Is the outcome just as likely for both explanatory categories?”, “Are the two variables associated?” – chi-squared test for independence (small amount of data: Fisher’s exact test).
ESTIMATE: “How much more likely is the outcome in this category?” – confidence interval for difference in proportions, confidence interval for odds ratio.

Variables: CATEGORICAL (2 categories) and CATEGORICAL (2 categories), Repeated Measures

DESCRIBE: Numbers: Two-way table of counts or %s.
    Graphs: Histogram for each explanatory category.
DECIDE: “Is the outcome just as likely for both explanatory categories?” – McNemar’s test.
ESTIMATE: “How much more likely is the outcome in one category compared to the other?” – confidence interval for difference in proportions.

Variables: CATEGORICAL (any# categories) and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Numbers: Two-way table of counts or %s.
    Graphs: Histogram for each explanatory category.
DECIDE: “Do the percentages in the outcome change across the explanatory categories?”, “Are the two variables associated?” – chi-squared test for independence.

Variables: CATEGORICAL (2 categories) and CATEGORICAL (any# categories), Repeated Measures

DESCRIBE: Numbers: Two-way table of counts or %s.
    Graphs: Histogram for each explanatory category.
DECIDE: “Do the percentages in the outcome change across the explanatory categories?”, “Are the two variables associated?” – Cochran’s Q-test.
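For two categorical variables with 2 categories each and independent groups, the chi-squared test for independence and the odds ratio can be sketched with the standard library alone. This is an illustration, not the handout's own calculation: the counts are made up, the function name `chi_squared_2x2` is hypothetical, no continuity correction is applied, and the p-value shortcut via `erfc` only works for a 2×2 table (1 degree of freedom).

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Chi-squared statistic, p-value and odds ratio for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were not associated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    odds_ratio = (a * d) / (b * c)
    return statistic, p_value, odds_ratio


# Made-up counts (illustrative only).
# Rows: women, men. Columns: volunteers, does not volunteer.
counts = [[30, 70], [20, 80]]

statistic, p_value, odds_ratio = chi_squared_2x2(counts)
print(f"chi-squared = {statistic:.2f}, p-value = {p_value:.3f}, odds ratio = {odds_ratio:.2f}")
```

With small counts the expected values get too small for this test to be trusted, which is why the handout points to Fisher's exact test for a small amount of data.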

Variables: NUMERICAL and NUMERICAL

DESCRIBE: Numbers: Correlation coefficient (r).
    Graphs: Scatterplot.
DECIDE: “Does a relationship exist?” – linear regression: t-test on coefficient.
ESTIMATE: “How much does the output variable change when the explanatory variable changes?” – linear regression: confidence interval for slope.
PREDICT: “How can you calculate the output knowing the explanatory variable?” – linear regression formula: y = β0 + β1 x.
NOTE: May need to do a nonlinear regression if the scatterplot indicates a curved sort of relationship.

Variables: NUMERICAL and CATEGORICAL (2 categories)

DESCRIBE: Numbers: Mean & standard deviation for each category of the outcome.
    Graphs: Histograms/boxplots on the same scale.
DECIDE: “Does the numerical variable have an effect on the chances of the outcome?” – unpaired t-test using the outcome to define the two groups.
ESTIMATE: “How much does a change in the numerical variable affect the chances of the outcome?” – logistic regression: confidence interval for odds ratio.
PREDICT: “How can you calculate the chances of the outcome knowing the value of the explanatory variable?” – logistic regression formula: log(odds of y) = β0 + β1 x.

Variables: Time to event NUMERICAL (Possible missing data!) and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Numbers: Proportion reaching event at certain time (eg 5-year survival), median times to reach event.
    Graphs: Kaplan-Meier curve showing survival percentages.
DECIDE: “Is the time to reach the event the same in all groups?” – survival analysis: log-rank test.
ESTIMATE: “What is the difference in proportions reaching the end point at this particular time?” – confidence interval for the difference in proportions.
    “How much more at risk of the event is this group than this group?” – Cox regression: confidence interval for relative hazard.
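The logistic regression formula gives the log of the odds, so one more step is needed to turn it into "the chances of the outcome". The sketch below shows that step. The coefficient values are hypothetical (in practice they come from fitting the model with statistical software) and the function names are made up for the example.

```python
from math import exp


def predicted_probability(beta0, beta1, x):
    """Chance of the outcome from the formula log(odds of y) = beta0 + beta1 * x."""
    log_odds = beta0 + beta1 * x
    # Undo the log-odds: probability = odds / (1 + odds) = 1 / (1 + e^(-log_odds)).
    return 1 / (1 + exp(-log_odds))


def odds_ratio_per_unit(beta1):
    """How many times bigger the odds get for a change of 1 in x."""
    return exp(beta1)


# Hypothetical coefficients, chosen only to illustrate the arithmetic.
beta0, beta1 = -4.0, 0.5

for x in (4, 8, 12):
    print(f"x = {x}: chance of outcome = {predicted_probability(beta0, beta1, x):.2f}")
print(f"odds ratio per unit of x = {odds_ratio_per_unit(beta1):.2f}")
```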

Variables: NUMERICAL, NUMERICAL and NUMERICAL

DESCRIBE: Graphs: Scatterplot for each explanatory variable with the outcome variable.
    Numbers: multiple linear regression: R² value.
DECIDE: “Does a relationship exist with any of the variables at all?” – multiple linear regression: F-test.
    “Does a relationship exist with this variable, taking into account the others?” – multiple linear regression: t-test on one coefficient.
ESTIMATE: “How much does the output variable change when this explanatory variable changes?” – multiple linear regression: confidence interval for one slope.
PREDICT: “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2.
NOTE: This can be done for many explanatory variables.

Variables: NUMERICAL, NUMERICAL and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Graphs: Scatterplot of both numerical variables for each category.
    Numbers: multiple regression: R² value.
DECIDE: See above for multiple regression.
ESTIMATE: See above for multiple regression.
PREDICT: See above for multiple regression.
NOTE: This can be done for many explanatory variables of both types.

Variables: NUMERICAL, CATEGORICAL (any# categories) and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Graphs: Histogram for each combination of explanatory categories. Line graph showing mean of each group.
DECIDE: “Does a relationship exist with any of the variables at all?” – two-way ANOVA: F-test.
    “Does a relationship exist with this variable, taking into account the others?” – two-way ANOVA: F-test for one effect.
    Note: both can also be answered with multiple regression (see above).
PREDICT: “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2.

Variables: CATEGORICAL (2 categories), CATEGORICAL (any# categories) and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Graphs: Histogram for each combination of explanatory categories.
DECIDE: “Does a relationship exist with any of the variables at all?” – multiple logistic regression: chi-squared test for covariates.
    “Does a relationship exist with this variable, taking into account the others?” – multiple logistic regression: Wald test.
ESTIMATE: “How much does the chance of the outcome change when this explanatory variable changes?” – multiple logistic regression: confidence interval for odds ratio.
PREDICT: “How can you calculate the chances of the outcome knowing the explanatory variables?” – multiple logistic regression formula: log(odds of y) = β0 + β1 x1 + β2 x2.
NOTE: This can be done with many explanatory variables – even if some of them are numerical.

Variables: NUMERICAL, NUMERICAL and CATEGORICAL (any# categories), Repeated Measures

DESCRIBE: Numbers: multiple linear regression: R² value.
DECIDE: “Does a relationship exist with any of the variables at all?” – mixed effects regression: F-test.
    “Does a relationship exist with this variable, taking into account the others?” – mixed effects linear regression: t-test on one coefficient.
ESTIMATE: “How much does the output variable change when this explanatory variable changes?” – mixed effects regression: confidence interval for one coefficient.
PREDICT: “How can you calculate the output knowing the explanatory variables?” – mixed effects regression formula.
NOTE: “mixed effects” may also be called “random effects”.
NOTE: This can be done for many explanatory variables, of both types, and with a mixture of repeated-measures and independent-groups.

Variables: NUMERICAL, NUMERICAL and NUMERICAL

DECIDE: “Does one variable change the way the other affects the outcome?” – multiple linear regression: t-test on the interaction effect.
ESTIMATE: “How much does the second variable change the effect of the first on the outcome?” – multiple linear regression: confidence interval for the interaction effect.
PREDICT: “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2 + β12 x1x2.
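The interaction formula can be read as "the effect of x1 depends on the value of x2". The short sketch below works through that arithmetic. The coefficient values are hypothetical and chosen only to make the numbers easy to follow; real values would come from fitting the regression.

```python
def predict(beta0, beta1, beta2, beta12, x1, x2):
    """Outcome from the formula y = beta0 + beta1*x1 + beta2*x2 + beta12*x1*x2."""
    return beta0 + beta1 * x1 + beta2 * x2 + beta12 * x1 * x2


def effect_of_x1(beta1, beta12, x2):
    """Change in y for a change of 1 in x1, when x2 is held at a given value."""
    return beta1 + beta12 * x2


# Hypothetical coefficients (illustrative only).
b0, b1, b2, b12 = 1.0, 2.0, 3.0, 0.5

for x2 in (0, 4):
    change = predict(b0, b1, b2, b12, 3, x2) - predict(b0, b1, b2, b12, 2, x2)
    print(f"x2 = {x2}: raising x1 by 1 changes y by {change}")
```

When the interaction coefficient β12 is zero, the effect of x1 is the same at every value of x2, which is the situation the t-test on the interaction effect checks for.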

Variables: NUMERICAL, NUMERICAL and CATEGORICAL (any# categories), Independent Groups

DESCRIBE: Graphs: Scatterplot for each category, showing line of best fit in each case.
DECIDE: “Does one variable change the way the other affects the outcome?” – Analysis of Covariance (ANCOVA) / multiple linear regression: t-test on the interaction effect.
ESTIMATE: “How much does the second variable change the effect of the first on the outcome?” – multiple linear regression: confidence interval for the interaction effect.
PREDICT: “How can you calculate the output knowing the explanatory variables?” – multiple linear regression formula: y = β0 + β1 x1 + β2 x2 + β12 x1x2.
NOTE: This can be done for many explanatory variables of both types. ANCOVA refers specifically to the case where the interaction variable is categorical.

NOTE: There are many other methods dealing with more specific and difficult questions including (but definitely not limited to):

“Does this variable affect the variance of the outcome?” – F-test for two variances

“Do these variables affect this categorical outcome (which has several categories)?” – Multinomial regression

“Does the data come from a normal distribution?” – Investigate normal quantile-quantile plot; Shapiro-Wilk test

“To what degree do these two measuring systems agree?” – Intraclass correlation coefficient

“What is the best cut-off for this measurement in order to say someone needs medical attention?” – ROC analysis

“Do all these measurements vary together so that they could be considered as measuring some smaller number of underlying concepts?” – Factor analysis / Principal Component Analysis

“Can the subjects be grouped into a few similar groups based on the similarity in their measurements?” – Cluster analysis

and so on ...

SAMPLE SIZE CALCULATIONS

FOR HYPOTHESIS TESTS:

The following five things affect the sample size you need:

1. Which hypothesis test you plan to use

Hypothesis test based on categorical outcomes (as opposed to numerical outcomes) → BIGGER sample size

Hypothesis test uses independent groups (as opposed to repeated measures) → BIGGER sample size

2. Size of the difference you are looking for

Most hypothesis tests concern the differences between means or percentages.

The difference you would like to see is often called:

o Clinically significant difference

o Practically significant difference

Choosing how big this difference is requires KNOWLEDGE OF YOUR AREA OF RESEARCH.

Looking for a SMALL DIFFERENCE → BIGGER sample size

3. Variability of the results

HIGH VARIABILITY means many options for what could happen in a sample of a particular size

eg: for the CHI-SQUARED TEST

very high or very low expected percentage → low variability

medium expected percentage → high variability

eg: for t-tests or ANOVA

large standard deviation → high variability

You usually get this information from previous research or a pilot study.

HIGH VARIABILITY → BIGGER sample size

4. Significance level

The cut-off for saying when a p-value is significant. Usually 5%.

Also known as α (alpha) or the “Type I Error rate”.

LOW SIGNIFICANCE LEVEL → BIGGER sample size

5. Power

The probability of getting a significant result if in fact there IS a difference in the population.

Usually you set this at 80%.

Power = 100% - Type II Error rate (the Type II Error rate is also known as β (beta)).

HIGH POWER → BIGGER sample size

[ Note that a high dropout rate also increases sample size ]

FOR CONFIDENCE INTERVALS:

Confidence intervals are related to hypothesis tests, so the considerations above are used for confidence intervals too.

NOTE: Significance level = 100% - Confidence Level (so for a 95% confidence interval, the significance level is 5%)

NOTE: The “difference you are looking for” is half the width of the confidence interval, also known as the “margin of error”.

FOR REGRESSION:

Rule of thumb: at least 10 times as many subjects as there are explanatory variables.

eg: 2 explanatory variables (X1, X2) for outcome Y: at least 2×10 = 20 subjects

eg: 5 explanatory variables (X1, X2, X3, X4, X5) for outcome Y: at least 5×10 = 50 subjects

Proper calculations are based on the t-tests involved to see if slope is significant.

SOME TERMINOLOGY:

Type I Error:

NO difference in the population

BUT the sample shows a (significant) difference

(the chance of this is the significance level, alpha α)

Type II Error:

There IS a difference in the population

BUT the sample shows NO (significant) difference

(the chance of this is beta β, the opposite of power)

PERFORMING THE CALCULATIONS :

Russ Lenth has created a comprehensive suite of online calculators:

http://homepage.stat.uiowa.edu/~rlenth/Power

You need all the information mentioned above in order to use the calculators.

There are also simple formulas for the t-tests and chi-squared tests in Chapter 36 of “Medical Statistics at a Glance” by Aviva Petrie and Caroline Sabin.

You need all the information mentioned above in order to use the formulas.
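To show how the pieces of information fit together, here is a rough sketch of two common textbook approximations based on the normal distribution. These are assumptions made for illustration: they are not necessarily the formulas used in the book or the online calculators mentioned above, the function names are hypothetical, and a proper calculation may give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(difference, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two means (independent groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # from the significance level
    z_power = NormalDist().inv_cdf(power)           # from the power
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


def n_for_proportion_margin(margin, expected_proportion=0.5, confidence=0.95):
    """Approximate subjects needed so a proportion's confidence interval has this margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variability = expected_proportion * (1 - expected_proportion)
    return ceil(z ** 2 * variability / margin ** 2)


# Looking for a difference of half a standard deviation, 5% significance, 80% power.
print(n_per_group_two_means(difference=0.5, sd=1.0))

# Margin of error of 5 percentage points around an expected 50%, 95% confidence.
print(n_for_proportion_margin(margin=0.05))
```

Trying different inputs reproduces the patterns listed earlier: a smaller difference, higher variability, lower significance level or higher power all give a bigger sample size.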