+ Refresher in inferential statistics Tim.bates@ed.ac.uk stats.

Post on 26-Dec-2015


+

Refresher in inferential statistics

Tim.bates@ed.ac.uk

http://www.psy.ed.ac.uk/events/research_seminars/psychstats

+Resources

http://www.statmethods.net

+Our basic question…

Did something occur?

Importantly, did what we predicted would occur actually transpire? I.e., is the world as we predicted?

Why does this require statistics?

+Is Breastfeeding good for Baby’s brains?

The association between breastfeeding and IQ is moderated by a genetic polymorphism (rs174575) in the FADS2 gene

Caspi A et al. PNAS 2007;104:18860-18865

©2007 by National Academy of Sciences

+Overview

Hypothesis testing

p-values

Type I vs. Type II errors

Power

Correlation

Fisher’s exact test

T-test

Linear regression

Non-parametric statistics (mostly for you to go over in your own time)

+Hypothesis testing

1. Propose a null and an experimental hypothesis.

Mistakes here may make the experiment un-analysable

2. Consider the assumptions of the test: Are they met?

Statistical independence of observations

Distributions of the observations.

Student's t distribution, normal distribution etc.

3. Compute the relevant test statistic.

Student’s t-test gives t; ANOVA gives F; the chi-square test gives χ2

4. Compute the likelihood of the test statistic under the null hypothesis:

Does it exceed your chosen threshold?

Either reject or fail to reject the null hypothesis
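The four steps can be sketched in R as follows. This is a minimal illustration on simulated data: the group sizes, means, standard deviation, and the 0.05 threshold are assumptions chosen for the example, not values from the slides.

```r
set.seed(1)                                    # make the simulated data reproducible
control   <- rnorm(30, mean = 100, sd = 15)    # hypothetical control scores
treatment <- rnorm(30, mean = 110, sd = 15)    # hypothetical treatment scores

# 1. Null: the group means are equal. Experimental: they differ.
# 2. Assumptions: independent observations, roughly normal scores.
# 3. Compute the test statistic (t.test uses Welch's t by default).
result <- t.test(treatment, control)

# 4. Compare the p-value with the chosen threshold (alpha).
alpha <- 0.05
decision <- if (result$p.value < alpha) "reject the null" else "fail to reject the null"
print(result$statistic)
print(decision)
```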

+What mistakes can we make?

Your Decision | “The World”: Yes | “The World”: No

Yes | correct detection | false positive

No | false negative | correct rejection

+Starting to make inferences…the Binomial

Toss a coin

+Dropping lots of coins...

Pachinko

+Normal compared to Binomial

n = 6

p = .5
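A short sketch of the comparison for n = 6 and p = .5: the exact binomial probabilities are set beside the density of a normal distribution with the same mean (np) and standard deviation (sqrt(np(1-p))). The variable names are illustrative only.

```r
n <- 6
p <- 0.5
k <- 0:n                                        # possible numbers of heads
binom_prob  <- dbinom(k, size = n, prob = p)    # exact binomial probabilities
normal_dens <- dnorm(k, mean = n * p,
                     sd = sqrt(n * p * (1 - p)))  # normal with matching mean and sd
print(round(cbind(heads = k, binomial = binom_prob, normal = normal_dens), 3))
```

Even with only six tosses, the two columns are already close, which is why the normal is often used as an approximation to the binomial.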

+Distributions

normal (µ, σ)

binomial (n, p)

+Distributions

Poisson (λ): accidents in a period of time

Power: publication rates

+Testing what distribution you have

+Why are things normal?

+Central limit theorem

The mean of a large number of independent random variables is distributed approximately normally.
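A small simulation makes the theorem concrete. The sketch below assumes an exponential population (strongly skewed, mean 1, standard deviation 1) and samples of 50; both choices are arbitrary.

```r
set.seed(42)
# Means of 5000 samples, each of 50 draws from a skewed exponential distribution
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
print(mean(sample_means))   # close to the population mean of 1
print(sd(sample_means))     # close to 1 / sqrt(50)
```

Calling hist(sample_means) on the result shows a roughly bell-shaped histogram, even though the individual observations are far from normal.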

+Hypothesis testing

Making statistical decisions using experimental data.

Need to form a null hypothesis

(we can reject, but not confirm hypotheses)

A result is “significant” if it is unlikely to have occurred by chance.

Ronald Fisher “We may discover whether a second sample is or is not significantly different from the first”.


+Error

Type-I error: false alarm, a bogus effect

Reject the null hypothesis when it is really true

Much of published science is Type-I error (Ioannidis, 2008)

Type-II error: miss a real effect

Fail to reject our null hypothesis when it is false

Many small projects have this problem

Type-III error: :-) laziness, incompetence, or willful ignorance of the truth

+p-values

Almost any difference (a count, a difference in means, a difference in variances) can be found with some probability, irrespective of the true situation.

All we can do is to set a threshold likelihood for deciding that an event occurred by chance.

p = .05 means that, when the null hypothesis is true, a result at least this large would occur by chance 1 time in 20.
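The "1 time in 20" reading can be checked by simulation. In this sketch, every experiment is run with the null hypothesis true (both groups drawn from the same normal population); the group size of 20 and the 2000 repetitions are arbitrary assumptions.

```r
set.seed(2015)
# 2000 experiments in which the null is true: both groups come from N(0, 1)
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)
false_alarm_rate <- mean(p_values < 0.05)   # share of "significant" results
print(false_alarm_rate)                     # expected to be near 0.05
```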

+Type I vs. Type II errors

Type I: False positive

Likelihood of Type I = α

p = .05 means setting α to .05

Type II: False negative

Likelihood of Type II = β

Power = 1-β

You | World: Yes | World: No

Yes | Correct detection (power) | Type I (α)

No | Type II (β) | Correct rejection

+P-values

The p-value is the likelihood of mean differences as large as or larger than those observed in the data occurring by chance when the null hypothesis is true

p-value criteria (alpha) allow us a binary answer to our questions

Questions: is a smaller p-value:

“More” significant?

Indicative of a “bigger” effect? (if so, when?)

And how could we measure “effect”?

+Compare these two statements

It’s ‘significant’, but how big is the effect?

I can see it’s big: but what is the p-value?

+Confidence Intervals

Range of values within a given likelihood threshold (for instance 95%)

Closely related to p-values: α = 1 - CI level

i.e., if p < .05, the 95% CI will not include 0 (no difference)

Would you rather have a CI or a p-value? Why?

What is an effect size?

+P and CI

You can’t go from p to CI!

You can go from CI to p

At p = .05, the 95% CIs will overlap less than 25%

At p = .01, the 95% CI bars just touch

+Units of a Confidence Interval

Unlike p, CIs are given in the units of the DV (Cumming and Finch, 2005)

BMI in people on a low carb diet might be 19-23 kg/m2

Cumming, G. and Finch S.(2005). Inference by eye: confidence intervals and how to read pictures of data. American Psychologist. 60:170-80. PMID: 15740449

+Standard Errors and Standard Deviations

SE is (typically) the standard error of the mean

The precision with which we have estimated the population mean based on our sample

Computationally, it is σ/sqrt(n)

A 95% confidence interval is the mean ± 1.96 SE
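As a minimal sketch, the standard error and the normal-approximation 95% interval can be computed directly. The sample here is simulated, and its mean of 50 and SD of 10 are assumptions made for the example; the sample SD stands in for the unknown σ.

```r
set.seed(7)
scores <- rnorm(100, mean = 50, sd = 10)    # hypothetical sample of 100 scores
se <- sd(scores) / sqrt(length(scores))     # standard error of the mean
ci <- mean(scores) + c(-1.96, 1.96) * se    # approximate 95% confidence interval
print(se)
print(ci)
```

With small samples, the 1.96 multiplier would normally be replaced by the matching quantile of the t-distribution, e.g. qt(0.975, df = n - 1).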

+Example: coin toss

Random sample of 100 coin tosses, of a coin believed to be fair

We observed 45 heads and 55 tails: Is the coin fair?

+Binomial test

binom.test(x=45, n=100, p=.5, alternative="two.sided")

number of successes = 45, number of trials = 100

p-value = 0.3682

alternative hypothesis: true probability of success != 0.5

95 percent confidence interval: 0.3503 0.5527

sample estimates: probability of success: 0.45

+Categorical Data

Fisher’s Exact Test

Categorical data resulting from classifying objects in one of two ways

Tests significance of the observed "contingency" of the two outcomes.

Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1), 87-94.

+The Lady Drinking Tea

Question: Does tea taste better if the milk is added to the tea, or vice versa?

Null Hypothesis: The drinker cannot tell

Subjects: Ms Bristol

Experiment: 8 "trials" (cups): 4 made each way, in random order

DV: Milk-first versus milk-second discrimination

Enter data into 2 x 2 contingency table

+Fisher Contingency Table

A = c(1, 1, 1, 0, 1, 0, 0, 0) # vector of guesses

B = c(1, 1, 1, 1, 0, 0, 0, 0) # vector of Teas

guessTable <- table(A,B) # contingency table

labels = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")) # make labels

dimnames(guessTable)= labels # add label

fisher.test(guessTable, alternative = "greater") # test

Truth | Guess: Milk | Guess: Tea

Milk | 3 | 1

Tea | 1 | 3

+Can she tell?

Fisher's Exact Test for Count Data

p-value = 0.24 # association could not be established

Alternative hypothesis: true odds ratio is greater than 1

95% confidence interval: 0.313 – Inf

Sample odds ratio: 6.40
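The one-sided p-value can be reproduced by hand. With four cups of each kind and four guesses of each kind, the number of correct calls follows a hypergeometric distribution, so the p-value is the chance of 3 or 4 correct calls out of 4. The sketch below rebuilds the 2 x 2 table from its counts rather than from the vectors on the earlier slide.

```r
# P(3 or 4 correct calls out of 4), with both margins of the table fixed at 4
p_by_hand <- (choose(4, 3) * choose(4, 1) + choose(4, 4) * choose(4, 0)) / choose(8, 4)
print(p_by_hand)   # 17/70

# The same one-sided p-value from fisher.test on the table of counts
guessTable <- matrix(c(3, 1, 1, 3), nrow = 2)
p_fisher <- fisher.test(guessTable, alternative = "greater")$p.value
print(p_fisher)
```

Only a perfect score (4 of 4, probability 1/70) would have fallen below .05 with eight cups.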

+What if we have two continuous variables?

Are they related?

Q: If you have continuous depression scores and cut-off scores, which is more powerful?

+Correlation of two continuous variables: Pearson’s r

All variables continuous

Pearson

+Correlation: what are the maximum and minimum correlations?

+Power (1-β)

Probability that a test will correctly reject the null hypothesis.

Complement of the false negative rate, β

False negative = missing a real effect

1-β = p (correctly reject a false null hypothesis)

+Power and how to get it

Probability of rejecting the null hypothesis when it is false

Whence comes power?

+Power applied to a correlation

Samples of n=30 from a population in which two normal traits correlate 0.3

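library(MASS) # provides mvrnorm()
library(ggplot2) # provides qplot()

r = 0.3 # population correlation
# Draw 30 pairs from a bivariate normal with unit variances and correlation r
xy = mvrnorm(n=30, mu=rep(0,2), Sigma=matrix(c(1,r,r,1), nrow=2, ncol=2))
xy = data.frame(xy)
names(xy) <- c("x", "y")
# Scatterplot with a fitted regression line
qplot(x, y, data = xy, geom = c("point", "smooth"), method = lm)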

+Power of a correlation test

library(pwr)

pwr.r.test(n = 30, r = .3, sig.level = 0.05)

n = 30

r = 0.3

sig.level = 0.05

power = 0.359

alternative = two.sided
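Power can also be estimated by brute force: draw many samples of n = 30 from a population with correlation 0.3 and count how often the test comes out significant. This sketch uses base R only (the construction of y from x is an assumption made to avoid extra packages), and cor.test uses a t-based test, so the estimate should land near, not exactly on, the pwr.r.test figure.

```r
set.seed(30)
n   <- 30
rho <- 0.3
significant <- replicate(2000, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # y correlates rho with x in the population
  cor.test(x, y)$p.value < 0.05               # TRUE when this sample is "significant"
})
estimated_power <- mean(significant)
print(estimated_power)
```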

+Power: r = .3

+t-test

When we wish to compare means in a sample, we must estimate the standard deviation from the sample

Student's t-distribution describes the standardized mean of small samples from normally distributed populations when the standard deviation is estimated in this way

+t-distribution function

t is defined as the ratio:

t = Z / sqrt(V/ν)

Z is normally distributed with expected value 0 and variance 1

V has a chi-square distribution with ν degrees of freedom

Z and V are independent
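The definition can be checked numerically: build the ratio from simulated normal and chi-square draws and compare its upper quantile with R's built-in t-distribution. The choice of 5 degrees of freedom and 100,000 draws is arbitrary.

```r
set.seed(3)
nu <- 5
z <- rnorm(100000)              # standard normal numerator
v <- rchisq(100000, df = nu)    # independent chi-square draws for the denominator
t_sim <- z / sqrt(v / nu)       # the ratio that defines t
sim_q    <- unname(quantile(t_sim, 0.975))   # simulated 97.5th percentile
theory_q <- qt(0.975, df = nu)               # theoretical value from Student's t
print(c(simulated = sim_q, theoretical = theory_q))
```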

+Normal and t-distributions

Normal is in blue

Green = t with df = 1

Red = t with df = 3 (far right = df increasing to 30)

+Power of t-test

power.t.test(n=15, delta=.5)

Two-sample t test power calculation

n = 15 ; delta = 0.5 ; sd = 1; sig.level = 0.05

power = 0.26

alternative = two.sided

NOTE: n is number in *each* group
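The same function can be turned around to ask how many participants are needed. In this sketch the target power of 0.80 is an assumption (a common convention, not a value from the slides); leaving n out makes power.t.test solve for it.

```r
# Solve for n per group, given the effect size and the desired power
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(needed$n)   # round up to whole participants
print(n_per_group)
```

Set against the power of 0.26 with 15 per group, this shows how far a typical small study falls short for a half-SD effect.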

+Linear regression

fit = lm(y ~ x1 + x2 + x3, data=mydata)

summary(fit) # show results

anova(fit) # anova table

coefficients(fit) # model coefficients

confint(fit, level=0.95) # CIs for model parameters

fitted(fit) # predicted values

residuals(fit) # residuals

influence(fit) # regression diagnostics
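Since mydata is not defined on the slide, here is a runnable sketch of the same workflow using the mtcars data set that ships with R. The choice of mpg as the outcome and wt and hp as predictors is purely illustrative.

```r
# mtcars ships with R; mpg, wt, and hp stand in for y, x1, and x2
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(fit)$coefficients)     # estimates, standard errors, t values, p-values
print(confint(fit, level = 0.95))    # 95% CIs for the intercept and slopes
```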

+

Nonparametric Statistics

Timothy C. Bates

tim.bates@ed.ac.uk

+Bootstrapping: Kurtosis differences

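library(psych) # provides kurtosi()

# Bootstrap the difference in kurtosis between two samples x and y (data not shown on the slide)
kurtosisDiff <- function(x, y, B = 1000){
  kx <- replicate(B, kurtosi(sample(x, replace = TRUE))) # kurtosis of resampled x
  ky <- replicate(B, kurtosi(sample(y, replace = TRUE))) # kurtosis of resampled y
  return(kx - ky) # B bootstrap differences
}

kurtDiff <- kurtosisDiff(x, y, B = 10000)
mean(kurtDiff > 0) # proportion of differences above 0: p = 0.205, NS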

+Parametric Statistics 1

Assume data are drawn from populations with a certain distribution (usually normal)

Compute the likelihood that groups are related/unrelated or same/different given that underlying model

t-test, Pearson’s correlation, ANOVA…

+Parametric Statistics 2

Assumptions of parametric statistics

1. Observations are independent

2. Your data are normally distributed

3. Variances are equal across groups

Can be modified to cope with unequal σ2

+Non-parametric Statistics?

Non-parametric statistics do not assume any underlying distribution

They compute the likelihood that your groups are the same or different by comparing the ranks of subjects across the range of scores.

+Non-parametric Statistics

Assumptions of non-parametric statistics

1. Observations are independent

+Non-parametric Statistics?

Non-parametric statistics do not assume any underlying distribution

Because they do not estimate or model this distribution, they have less power to detect effects…

So don’t use them unless you have to

+Why use a Non-parametric Statistic?

Very small samples

These lead to Type-1 (false alarm) errors

Outliers more often lead to spurious Type-1 (false alarm) errors in parametric statistics.

Nonparametric statistics reduce data to an ordinal rank, which reduces the impact or leverage of outliers.

+Non-parametric Choices

Data type?
  discrete: χ2
  continuous: Question?
    association: Spearman’s Rank
    different central value: Number of groups?
      two groups: Mann-Whitney U / Wilcoxon’s Rank Sums
      more than 2: Kruskal-Wallis test
    difference in σ2: Brown-Forsythe

+Non-parametric Choices

Data type?
  discrete: χ2 (no alternative)
  continuous: Question?
    association: Spearman’s Rank (like a Pearson’s r)
    different central value: Number of groups?
      two groups: Mann-Whitney U / Wilcoxon’s Rank Sums (like Student’s t)
      more than 2: Kruskal-Wallis test (like ANOVA)
    difference in σ2: Brown-Forsythe (like F-test)

+Binomial test

binom.test(45, 100, .5, alternative="two.sided")

number of successes = 45, number of trials = 100,

p-value = 0.3682

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval: 0.350 0.5527

Sample estimates: probability of success 0.45

binom.test(51,235,(1/6),alternative="greater")

+Spearman Rank test (ρ (rho))

Named after Charles Spearman,

Non-parametric measure of correlation

Assesses how well an arbitrary monotonic function describes the relationship between two variables

Does not require the relationship be linear

Does not require interval measurement

+Spearman Rank (ρ rho)

d = difference in rank of a given pair

n = number of pairs

Alternative test = Kendall's Tau (Kendall's τ)
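The formula itself did not survive in the transcript. Assuming it was the usual textbook form for data without ties, ρ = 1 - 6Σd²/(n(n² - 1)), the sketch below computes it from d and n and checks it against R's built-in Spearman correlation. The scores are made up for the example.

```r
x <- c(10, 20, 30, 40, 50, 60)    # hypothetical scores, no ties
y <- c(12, 25, 22, 48, 45, 70)
d <- rank(x) - rank(y)            # difference in rank for each pair
n <- length(x)                    # number of pairs
rho_by_hand <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(x, y, method = "spearman")
print(c(by_hand = rho_by_hand, builtin = rho_builtin))
```

The shortcut formula only holds without ties; cor(..., method = "spearman") handles ties by correlating the average ranks.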

+Mann-Whitney U

AKA: “Wilcoxon rank-sum test” (Mann & Whitney, 1947; Wilcoxon, 1945)

Non-parametric test for difference in the medians of two independent samples

Assumptions:

Samples are independent

Observations can be ranked (ordinal or better)

+Mann-Whitney U

U tests the difference in the medians of two independent samples

n1 = number of obs in sample 1

n2 = number of obs in sample 2

R = sum of ranks of the lower-ranked sample

+Mann-Whitney U or t?

Should you use it over the t-test?

Yes, if you have a very small sample (<20) (central limit assumptions not met)

Yes, if your data are really ordinal

Otherwise, probably not.

It is less prone to type-I error (spurious significance) due to outliers.

But it does not in fact handle comparisons of samples whose variances differ very well (use the unequal-variance t-test with rank data)

+Wilcoxon signed-rank test (related samples)

Same idea as the Mann-Whitney U, generalized to matched samples

Non-parametric counterpart of the non-independent (paired) samples t-test

+Kruskal-Wallis

Non-parametric one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis)

tests equality of medians across groups.

It is an extension of the Mann-Whitney U test to 3 or more groups.

Does not assume a normal population,

Assumes population variances among groups are equal.
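In R the test is available as kruskal.test. The sketch below runs it on three simulated groups; the group sizes and the upward shift of the third group are assumptions made for the example.

```r
set.seed(11)
# Three hypothetical groups of 10 scores; the third is shifted upwards
scores <- c(rnorm(10, mean = 0), rnorm(10, mean = 0), rnorm(10, mean = 1.5))
group  <- factor(rep(c("a", "b", "c"), each = 10))
kw <- kruskal.test(scores ~ group)
print(kw$statistic)   # Kruskal-Wallis chi-squared
print(kw$parameter)   # degrees of freedom = number of groups - 1
print(kw$p.value)
```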

+Aesop: Mann-Whitney U Example

Suppose that Aesop is dissatisfied with his classic experiment in which one tortoise was found to beat one hare in a race.

He decides to carry out a significance test to discover whether the results could be extended to tortoises and hares in general…

+Aesop 2: Mann-Whitney U

He collects a sample of 6 tortoises and 6 hares, and makes them all run his race. The order in which they reach the finishing post (their rank order) is as follows:

tort = c(1, 7, 8, 9, 10, 11)

hare = c(2, 3, 4, 5, 6, 12)

Original tortoise still goes at warp speed, original hare is still lazy, but the others run truer to stereotype.

+Aesop 3: Mann-Whitney U

wilcox.test(tort, hare)

W = 25, p-value = 0.31

Tortoises and hares do not differ

tort = c(1, 7, 8, 9, 10, 11) (n2 = 6)

hare = c(2, 3, 4, 5, 6, 12) (n1 = 6, R1 = 32)
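The statistic can be reproduced by hand from n1, n2, and R1. The U formula on the earlier slide was lost in the transcript, so this sketch assumes the common textbook form U = n1*n2 + n1(n1 + 1)/2 - R1 and compares the result with the W reported by wilcox.test.

```r
tort <- c(1, 7, 8, 9, 10, 11)
hare <- c(2, 3, 4, 5, 6, 12)
n1 <- length(hare)
n2 <- length(tort)
R1 <- sum(hare)                            # finishing positions are already ranks
U  <- n1 * n2 + n1 * (n1 + 1) / 2 - R1     # 36 + 21 - 32
print(U)
W <- unname(wilcox.test(tort, hare)$statistic)   # W as reported by R
print(W)
```

R computes W as the rank sum of the first sample minus n(n + 1)/2 for that sample, which here gives the same value of 25.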