Dealing With Statistical Uncertainty Richard Mott Wellcome Trust Centre for Human Genetics.

Dealing With Statistical Uncertainty

Richard MottWellcome Trust Centre for Human

Genetics

Synopsis

• Hypothesis testing • P-values• Confidence intervals• T-test• Non-parametric tests• Sample size calculations • Permutations• Bootstrap

Hypothesis testing

Null Hypothesis H0

Examples of H0:– Mean of a population is 3.0

– In a genetic association study, there is no association between disease state and the genotypes of a particular SNP

Alternative Hypothesis H1

• Examples of H1:– Mean of population > 3.0

– There is an association between disease and genotype

P-values• The P-value of a statistic Z is the

probability of sampling a value more extreme than that observed value z if the null hypothesis is true.

• For a one-sided test, the p-value would be = Prob( Z > z | H0)

• For a two-sided test it is = Prob( |Z| > z | H0)

• Note this is not the probability density at x, but the tails of the cumulative distribution function

Upper and lower 2.5% tails of the N(0,1) distribution, corresponding to |z|=1.96

Power• The power of a statistic Z is the

probability of sampling a value more extreme than that observed value z if the alternative hypothesis is true.

• For a one-sided test, the power would be = Prob( Z > z | H1)

• For a two-sided test it is = Prob( |Z| > z | H1)

• Note this is not the probability density at x, but the tails of the cumulative distribution function

H0: N(1,0) H1: N(1.5,1) H2: N(4,1)0.05 = Prob( Z> 1.644|H0) (5% upper tail)0.44 = Prob( Z> 1.644|H1)0.99 = Prob( Z> 1.644|H2)

The Likelihood• Likelihood = “probability of the data given the model”• Basis of parametric statistical inference

• Different hypotheses can often be expressed in terms of different values of the parameters

• Hypothesis testing equivalent to comparing the likelihoods for different €

L(θ) = Pr(y i |θ)i

∏

The Likelihood Ratio Test• General Framework for constructing hypothesis tests:

S = Likelihood( data | H1) / Likelihood( data | H0)

Reject H0 if S > s(H0, H1)

Threshold s is chosen such that there is a probability of making a false rejection under H0.

is the size of the test (or false positive rate or Type I error) e.g. =0.05 is the power of the test, the probability of correctly rejecting H0 when H1 is true e.g. =0.81- is the type II error, the false negative rate

Generally, for fixed sample size n, if we fix then we can’t fix If we fix and then we must let n vary.

The Neyman Pearson Lemma states that the likelihood ratio test is the most powerful of all tests of a given size

Type III error: "correctly rejecting the null hypothesis for the wrong reason".

Example: The mean of a Normal Distribution

H0: mean = 0 vs H1: mean = 1

data: y1, … yn iid N(,2) Assume variance is the same and is known

Therefore we base all inferences on the sample mean compared to the difference in the hypothesised means

The Normal and T distributions• For a sample from a Normal distribution with known mean and variance, the sample mean can be

standardised to follow the standard Normal distribution

• And the 95% confidence interval for the mean is given by

• But we often wish to make inferences about samples from a Normal Distribution when the variance is unknown: The variance must be estimated from the data as the sample variance

• Because of this uncertainty the distribution of the sample mean is broader, and follows the Tn-1 distribution

The T distribution

• The Tn and Normal distributions are almost identical for n>20,

• As a rule of thumb, an approximate 95% confidence interval for n>20 is

T-tests• The T-test compares the means of samples

• The T-test is optimal (in LR sense) if the data are sampled from a Normal distribution

• Even if the data are not Normal the test is useful in large samples

• There are several versions, depending on the details

T tests for the comparison of sample meansOne Sample Test

• One sample y1 ….. yn– H0: the sample mean = 0

– H1: the sample mean != 0

– Reject H0 if

– tn-1() is the quantile of the Tn-1 distribution such that the probability of exceeding tn-1() is

– Two sided test

– in R, tn-1(0.025) = pt(0.025,n-1)

T tests for the comparison of sample means Paired T-test

• Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn)

• Example:– two experimental replicates taken from n individuals, we

wish to test if the replicates are statistically similar• H0: mean(x) = mean(y)• Take differences d1= x1 - y1 … dn= xn - yn

• H0: mean(d) = 0• One sample T-test with 0=0

T tests for the comparison of sample means Two-Sample T-test

• Two samples of unpaired data x1, x2 ….. xn

y1, y2 ….. ym

• H0: mean(x) = mean(y)• If we assume the variance is the same in each group then we can estimate the variance as

the pooled estimator

• The test statistic is

(The case with unequal variances is more complicated)

T tests in Rt.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

• x a numeric vector of data values. • y an optional numeric vector data values. • alternative a character string specifying the alternative hypothesis, must be one of

"two.sided" (default), "greater" or "less". • mu a number indicating the true value of the mean • paired a logical indicating whether you want a paired t-test. • var.equal a logical variable indicating whether to treat the two variances as being

equal. • conf.level confidence level of the interval. • formula a formula of the form lhs ~ rhs where lhs is a numeric variable giving the

data values and rhs a factor with two levels giving the corresponding groups. • data an optional data frame containing the variables in the model formula. • subset an optional vector specifying a subset of observations to be used.

T tests in R

samp1 <- rnorm( 20, 0, 1) # sample 20 numbers from N(0,1) distribution

samp2 <- rnorm( 20, 1.5, 1) # sample from N(1.5,1)

t.test( samp1, samp2,var.equal=TRUE )

Two Sample t-test

data: samp1 and samp2 t = -3.5892, df = 38, p-value = 0.000935alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.9272037 -0.5372251 sample estimates:mean of x mean of y 0.4040969 1.6363113

Outliers A small number of contaminating observations with different distribution

• Outliers can destroy statistical genuine significance

Reason: the estimate of the variance is inflated, reducing the T statistic

• And sometimes create false significance : Type III error: "correctly rejecting the null hypothesis for the wrong reason".

• There are Robust Alternatives to the T-test: Wilcoxon tests

> samp3 <- samp2> samp3[1] = 40 # add one outlier

> t.test( samp1, samp3,var.equal=TRUE )

Two Sample t-test

data: samp1 and samp3 t = -1.5911, df = 38, p-value = 0.1199alternative hypothesis: true difference in means is not

equal to 0 95 percent confidence interval: -7.0519594 0.8450837 sample estimates:mean of x mean of y 0.4040969 3.5075347

> var(samp2)[1] 1.160251> var(samp3)[1] 74.88977

Non-parametric tests:Wilcoxon signed rank test

(alternative to Paired T-test)

• Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn)

• Take differences d1= x1 - y1 … dn= xn – yn

• H0: distribution of d’s is symmetric about 0

• Compute the ranks of the absolute differences, rank(|d1|) etc

• Compute the signs of the differences• Compute W, the sum of the ranks with positive signs• If H0 true then W should be close to the mean rank n/2

• The distribution of W is known and for n>20 a Normal approximation is used.

• in R wilcox.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0,

paired = TRUE, exact = NULL, correct = TRUE, conf.int = FALSE, conf.level = 0.95, ...)

Wilcoxon rank sum test(alternative to unpaired T-test)

• Two samples of unpaired data x1, x2 ….. xn

y1, y2 ….. ym

• Arrange all the observations into a single ranked series of length N = n+m. • R1 = sum of the ranks for the observations which came from sample 1. The sum of

ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N + 1)/2

• Compute U = min(R1 ,R2) - n(n + 1)/2.

• z = (U –)/ Normal approximation = nm/2 = nm(n=m+1)/12

• The Wilcoxon Rank sum test is only about 5% less efficient than the unpaired T-test (in terms of the sample size required to achieve the same power) in large samples.

Wilcoxon rank sum test(alternative to unpaired T-test)

>wilcox.test(samp1,samp2)

Wilcoxon rank sum test

data: samp1 and samp2 W = 82, p-value = 0.001041alternative hypothesis: true mu is not equal to 0

> wilcox.test(samp1,samp3) # with outliers

Wilcoxon rank sum test

data: samp1 and samp3 W = 82, p-value = 0.001041alternative hypothesis: true mu is not equal to 0

Binomial Sign Test(T-test on one sample)

• One sample y1 ….. yn

H0: the sample mean = 0

• Count M, the number of y’s that are > 0

• If H0 is true, this should be about n/2, and should be distributed as a Binomial random variable B(n/2,n)

Contingency Tables• Data are a table of counts, with the row and column

margins fixed.

• Ho: the counts in each cell are consistent with the rows and columns acting independently

n11

r1c1/N

n12

r1c2/N

r1 = n11+ n12

n21

r2c1/N

n22

r2c2/N

r2 = n21+ n22

c1

= n11+ n21

c2

= n12+ n22

N

The Chi-Squared statistics

€

X 2 = (Oij − Eijij

∑ )2 /Eij

logLikRatio= 2 Oij log(Oij /Eij )ij

∑

€

≈χ(R−1)(C −1)2

Fisher’s Exact Testfor 2 x 2 Contingency tables

• Similar idea to a permutation test (see below)• Given the marginal totals in a 2x2 table, we can

compute the probability P of getting the observed cell counts assuming the null hypothesis of no associations between classifying factors. – hypergeometric distribution.

• Find all tables with same margins and with probabilities no bigger than P

• P-value of the table is the sum of these probabilities

Fisher’s Exact Test

Smoker Non-smokerMale a b a+bFemale c d c+d

a+c b+d n

€

P =

a+ b

a

⎛

⎝ ⎜

⎞

⎠ ⎟c + d

c

⎛

⎝ ⎜

⎞

⎠ ⎟

n

a+ c

⎛

⎝ ⎜

⎞

⎠ ⎟

R implementation:

m <- matrix(c(10,10,5,12),nrow=2)fisher.test(m)

Fisher's Exact Test for Count Data

data: m p-value = 0.3152alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.513082 11.942154 sample estimates:odds ratio 2.342796

>

Smoker Non-smokerMale 10 10 20Female 5 12 17

15 22 37

Permutation TestsAnother non-parametric Alternative

• Permutation is a general principle with many applications to hypothesis testing.

• But it is not useful for parameter estimation• The method is best understood by an example:

– Suppose a data set is classified into N groups. We are interested in whether the differences between group means (eg if N=2 then the two sample T test or Wilcoxon Rank Sum test is appropriate)

– Compute a statistic S (eg the difference between the maximum and minimum group mean)

– If there are no differences between groups then any permutation of the group labellings between individuals should produce much the same result for S

– Repeat i = 1 to M times:• permute the data• compute the statistic on the permuted data, Si

– The Permutation p-value is the fraction of permutations that exceed S• It’s that simple!

Bootstrapping• Given a set of n things S = {x1 …. x2}, a bootstrap sample is formed by sampling with

replacement from S, to create a new set S* of size n

• Original• Resample

• Some elements of S will be missing in S* and some will be present multiply often (about 1/e = 63% of the elements will be present in S* on average)

• S* can be used as a replacement for S in any algorithm that works on S, and we can repeatedly draw further bootstrap samples.

• Any statistic Z that can be calculated on S can also be calculated on S*, and so we can construct the bootstrap sampling distribution of Z from repeated samples.

• From this we can make inferences about the variation of Z, eg construct a 95% confidence interval (if Z is numeric)

• Can be used for parameter estimation

Bootstrapping• Bootstrapping is a general way to evaluate the uncertainty in estimates of

statistics whilst making few assumptions about the data• It is particularly useful for complex statistics whose distribution cannot be

evaluated analytically• Example:

– In a gene expression study, we wish to find a minimal set of SNPs that in combination will predict the variation in a phenotype.

– Suppose we have a deterministic algorithm A that will find us a plausible set of SNPs (ie a model). • This is likely to be a complex programme and contain some arbitrary steps. • It is also unlikely to give any measure of uncertainty to the model.

– There may be many possible sets of SNPs with similar claims, so we would like a way to average across different models, and evaluate the evidence for a SNP as the fraction of models in which the SNP occurs• To do this, we need a way to perturb the data, to generate new data sets on which we

can apply the procedure A• Bootstrapping is one way to do this

Dealing With Statistical Uncertainty Richard Mott Wellcome Trust Centre for Human Genetics.

Documents

Transcript of Dealing With Statistical Uncertainty Richard Mott Wellcome Trust Centre for Human Genetics.