Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the...
Transcript of Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the...
![Page 1: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/1.jpg)
1/ 36
Lecture 2
Introduction to statistics: The samplingdistribution, t-test
Shravan Vasishth
Universitat [email protected]
http://www.ling.uni-potsdam.de/∼vasishth
April 10, 2020
1 / 36
![Page 2: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/2.jpg)
2/ 36
Lecture 2
The sampling distribution of the mean
Sampling from the normal distribution
The sampling distribution of the mean
When we have a single sample, we know how to compute MLEsof the sample mean and standard deviation, µ and σ.Suppose now that you had many repeated samples; from eachsample, you can compute the mean each time. We can simulatethis situation:
x<-rnorm(100,mean=500,sd=50)
mean(x)
## [1] 502.92
x<-rnorm(100,mean=500,sd=50)
mean(x)
## [1] 497.16
2 / 36
![Page 3: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/3.jpg)
3/ 36
Lecture 2
The sampling distribution of the mean
Sampling from the normal distribution
The sampling distribution of the mean
Let’s repeatedly simulate sampling 1000 times:
nsim<-1000
n<-100
mu<-500
sigma<-100
samp_distrn_means<-rep(NA,nsim)
samp_distrn_sd<-rep(NA,nsim)
for(i in 1:nsim){x<-rnorm(n,mean=mu,sd=sigma)
samp_distrn_means[i]<-mean(x)
samp_distrn_sd[i]<-sd(x)
}
3 / 36
![Page 4: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/4.jpg)
4/ 36
Lecture 2
The sampling distribution of the mean
Sampling from the normal distribution
The sampling distribution of the mean
Plot the distribution of the means under repeated sampling:
Samp. distrn. means
µ
density
470 480 490 500 510 520 530
0.0
00.0
10.0
20.0
30.0
4
4 / 36
![Page 5: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/5.jpg)
5/ 36
Lecture 2
The sampling distribution of the mean
Sampling from the exponential distribution
The sampling distribution of the mean
Interestingly, it is not necessary that the distribution that we aresampling from be the normal distribution.
for(i in 1:nsim){x<-rexp(n)
samp_distrn_means[i]<-mean(x)
samp_distrn_sd[i]<-sd(x)
}
5 / 36
![Page 6: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/6.jpg)
6/ 36
Lecture 2
The sampling distribution of the mean
Sampling from the exponential distribution
The sampling distribution of the mean
Samp. distrn. means
µ
density
0.8 1.0 1.2 1.4
01
23
6 / 36
![Page 7: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/7.jpg)
7/ 36
Lecture 2
The sampling distribution of the mean
The central limit theorem
The central limit theorem
1. For large enough sample sizes, the sampling distribution of themeans will be approximately normal, regardless of theunderlying distribution (as long as this distribution has a meanand variance defined for it).
2. This will be the basis for statistical inference.
7 / 36
![Page 8: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/8.jpg)
8/ 36
Lecture 2
The sampling distribution of the mean
Standard error
The sampling distribution of the mean
We can compute the standard deviation of the samplingdistribution of means:
## estimate from simulation:
sd(samp_distrn_means)
## [1] 0.10191
8 / 36
![Page 9: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/9.jpg)
9/ 36
Lecture 2
The sampling distribution of the mean
Standard error
The sampling distribution of the mean
A further interesting fact is that we can compute this standarddeviation of the sampling distribution from a single sample ofsize n:σ√n
x<-rnorm(100,mean=500,sd=100)
hat_sigma<-sd(x)
hat_sigma/sqrt(n)
## [1] 9.9872
9 / 36
![Page 10: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/10.jpg)
10/ 36
Lecture 2
The sampling distribution of the mean
Standard error
The sampling distribution of the mean
1. So, from a sample of size n, and sd σ or an MLE σ, we cancomputethe standard deviation of the sampling distribution of themeans.
2. We will call this standard deviation the estimated standarderror.SE = σ√
n
I say estimated because we are estimating SE using an anestimate of σ.
10 / 36
![Page 11: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/11.jpg)
11/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
Confidence intervals
The standard error allows us to define a so-called 95% confidenceinterval:
µ± 2SE (1)
So, for the mean, we define a 95% confidence interval as follows:
µ± 2σ√n
(2)
11 / 36
![Page 12: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/12.jpg)
12/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
Confidence intervals
In our example:
## lower bound:
mu-(2*hat_sigma/sqrt(n))
## [1] 480.03
## upper bound:
mu+(2*hat_sigma/sqrt(n))
## [1] 519.97
12 / 36
![Page 13: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/13.jpg)
13/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
The meaning of the 95% CI
If you take repeated samples and compute the CI each time, 95%of those CIs will contain the true population mean.
lower<-rep(NA,nsim)
upper<-rep(NA,nsim)
for(i in 1:nsim){x<-rnorm(n,mean=mu,sd=sigma)
lower[i]<-mean(x) - 2 * sd(x)/sqrt(n)
upper[i]<-mean(x) + 2 * sd(x)/sqrt(n)
}
13 / 36
![Page 14: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/14.jpg)
14/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
The meaning of the 95% CI
## check how many CIs contain mu:
CIs<-ifelse(lower<mu & upper>mu,1,0)
table(CIs)
## CIs
## 0 1
## 61 939
## approx. 95% of the CIs contain true mean:
table(CIs)[2]/sum(table(CIs))
## 1
## 0.939
14 / 36
![Page 15: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/15.jpg)
15/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
The meaning of the 95% CI
0 20 40 60 80 100
56
58
60
62
64
95% CIs in 100 repeated samples
i−th repeated sample
Score
s
15 / 36
![Page 16: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/16.jpg)
16/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
The meaning of the 95% CI
1. The 95% CI from a particular sample does not mean that theprobability that the true value of the mean lies inside thatparticular CI.
2. Thus, the CI has a very confusing and (not very useful!)interpretation.
3. In Bayesian statistics we use the credible interval, which has amuch more sensible interpretation.
However, for large sample sizes, the credible and confidenceintervals tend to be essentially identical.For this reason, the CI is often treated (this is technicallyincorrect!) as a way to characterize uncertainty about our estimateof the mean.
16 / 36
![Page 17: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/17.jpg)
17/ 36
Lecture 2
The sampling distribution of the mean
Confidence intervals
Main points from this lecture
1. We compute maximum likelihood estimates of the meanx = µ and standard deviation σ to get estimates of the truebut unknown parameters.
x =∑n
i=1 xin
2. For a given sample, having estimated σ, we estimate thestandard error:SE = σ/
√n
3. This allows us to define a 95% CI about the estimated mean:µ± 2× SE
From here, we move on to statistical inference and null hypothesissignificance testing (NHST).
17 / 36
![Page 18: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/18.jpg)
18/ 36
Lecture 2
The story so far
1. We defined random variables.
2. We learnt about pdfs and cdfs, and learnt how to computeP (X < x).
3. We learnt about Maximum Likelihood Estimation.
4. We learnt about the sampling distribution of the samplemeans.
This prepares the way for null hypothesis significance testing(NHST).
18 / 36
![Page 19: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/19.jpg)
19/ 36
Lecture 2
Statistical inference
Hypothesis testing
Suppose we have a random sample of size n, and the data comefrom a N(µ, σ) distribution.We can estimate sample mean x = µ and σ, which in turn allowsus to estimate the sampling distribution of the mean under(hypothetical) repeated sampling:
N(x,σ√n
) (3)
19 / 36
![Page 20: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/20.jpg)
20/ 36
Lecture 2
Statistical inference
The one-sample hypothesis test
Imagine taking an independent random sample from a randomvariable X that is normally distributed, with mean 12 and standarddeviation 10, sample size 11. We estimate the mean and SE:
sample <- rnorm(11,mean=12,sd=10)
(x_bar<-mean(sample))
## [1] 5.4785
(SE<-sd(sample)/sqrt(11))
## [1] 3.0762
20 / 36
![Page 21: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/21.jpg)
21/ 36
Lecture 2
Statistical inference
The one-sample t-test
The one-sample test
The NHST approach is to set up a null hypothesis that µ has somefixed value. For example:
H0 : µ = 0 (4)
This amounts to assuming that the true distribution of samplemeans is (approximately*) normally distributed and centeredaround 0, with the standard error estimated from the data.
* I will make this more precise in a minute.
21 / 36
![Page 22: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/22.jpg)
22/ 36
Lecture 2
Statistical inference
The one-sample t-test
Null hypothesis distribution
−20 −10 0 10 20
0.0
0.1
0.2
0.3
0.4
x
density
sample mean
22 / 36
![Page 23: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/23.jpg)
23/ 36
Lecture 2
Statistical inference
The one-sample t-test
NHST
The intuitive idea is that
1. if the sample mean x is near the hypothesized µ (here, 0), thedata are (possibly) “consistent with” the null hypothesisdistribution.
2. if the sample mean x is far from the hypothesized µ, the dataare inconsistent with the null hypothesis distribution.
We formalize “near” and “far” by determining how many standarderrors the sample mean is from the hypothesized mean:
t× SE = x− µ (5)
This quantifies the distance of sample mean from µ in SE units.
23 / 36
![Page 24: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/24.jpg)
24/ 36
Lecture 2
Statistical inference
The one-sample t-test
NHST
So, given a sample and null hypothesis mean µ, we can computethe quantity:
t =x− µSE
(6)
Call this the t-value. Its relevance will just become clear.
24 / 36
![Page 25: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/25.jpg)
25/ 36
Lecture 2
Statistical inference
The one-sample t-test
NHST
The quantity
T =X − µSE
(7)
has a t-distribution, which is defined in terms of the sample size n.We will express this as: T ∼ t(n− 1)Note also that, for large n, T ∼ N(0, 1).
25 / 36
![Page 26: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/26.jpg)
26/ 36
Lecture 2
Statistical inference
The one-sample t-test
NHST
Thus, given a sample size n, and given our null hypothesis, we candraw t-distribution corresponding to the null hypothesisdistribution.For large n, we could even use N(0,1), although it is traditional inpsychology and linguistics to always use the t-distribution nomatter how large n is.
26 / 36
![Page 27: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/27.jpg)
27/ 36
Lecture 2
Statistical inference
The one-sample t-test
The t-distribution vs the normal
1. The t-distribution takes as parameter the degrees of freedomn− 1, where n is the sample size (cf. the normal, which takesthe mean and variance/standard deviation).
2. The t-distribution has fatter tails than the normal for small n,say n < 20, but for large n, the t-distribution and the normalare essentially identical.
27 / 36
![Page 28: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/28.jpg)
28/ 36
Lecture 2
Statistical inference
The one-sample t-test
The t-distribution vs the normal
−4 −2 0 2 4
0.0
0.1
0.2
0.3
0.4
df= 2
−4 −2 0 2 4
0.0
0.1
0.2
0.3
0.4
df= 5
−4 −2 0 2 4
0.0
0.1
0.2
0.3
0.4
df= 15
−4 −2 0 2 4
0.0
0.1
0.2
0.3
0.4
df= 20
28 / 36
![Page 29: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/29.jpg)
29/ 36
Lecture 2
Statistical inference
The one-sample t-test
t-test: Rejection region
So, the null hypothesis testing procedure is:
1. Define the null hypothesis: for example, H0 : µ = 0.
2. Given data of size n, estimate x, standard deviation s,standard error s/
√n.
3. Compute the t-value:
t =x− µs/√n
(8)
4. Reject null hypothesis if t-value is large (to be made moreprecise next).
29 / 36
![Page 30: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/30.jpg)
30/ 36
Lecture 2
Statistical inference
The one-sample t-test
t-test
How to decide when to reject the null hypothesis? Intuitively, whenthe t-value from the sample is so large that we end up far in eithertail of the distribution.
30 / 36
![Page 31: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/31.jpg)
31/ 36
Lecture 2
Statistical inference
The one-sample t-test
t-test
−20 −10 0 10 20
0.0
0.1
0.2
0.3
0.4
t(n−1)
x
density
sample mean
31 / 36
![Page 32: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/32.jpg)
32/ 36
Lecture 2
Statistical inference
The one-sample t-test
Rejection region
1. For a given sample size n, we can identify the “rejectionregion” by using the qt function (see lecture 1).
2. Because the shape of the t-distribution depends on the degreesof freedom (n-1), the critical t-value beyond which we rejectthe null hypothesis will change depending on sample size.
3. For large sample sizes, say n > 50, the rejection point isabout 2.
abs(qt(0.025,df=15))
## [1] 2.1314
abs(qt(0.025,df=50))
## [1] 2.0086
32 / 36
![Page 33: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/33.jpg)
33/ 36
Lecture 2
Statistical inference
The one-sample t-test
t-test: Rejection region
Consider the t-value from our sample in our running example:
## null hypothesis mean:
mu<-0
(t_value<-(x_bar-mu)/SE)
## [1] 1.781
Recall that the t-value from the sample is simply telling you thedistance of the sample mean from the null hypothesis mean µ instandard error units.
t =x− µs/√n
or ts√n
= x− µ (9)
33 / 36
![Page 34: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/34.jpg)
34/ 36
Lecture 2
Statistical inference
The one-sample t-test
t-test: Rejection region
So, for large sample sizes, if | t |> 2 (approximately), we can rejectthe null hypothesis.For a smaller sample size n, you can compute the exact criticalt-value:
qt(0.025,df=n-1)
This is the critical t-value on the left-hand side of thet-distribution. The corresponding value on the right-hand side is:
qt(0.975,df=n-1)
Their absolute values are of course identical (the distribution issymmetric).
34 / 36
![Page 35: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/35.jpg)
35/ 36
Lecture 2
Statistical inference
The one-sample t-test
The t-distribution vs the normal
Given the relevant degrees of freedom, one can compute the areaunder the curve as for the Normal distribution:
pt(-2,df=10)
## [1] 0.036694
pt(-2,df=20)
## [1] 0.029633
pt(-2,df=50)
## [1] 0.025474
Notice that with large degrees of freedom, the area under thecurve to the left of -2 is approximately 0.025.
35 / 36
![Page 36: Introduction to statistics: The sampling distribution, t-test · The sampling distribution of the mean Sampling from the normal distribution The sampling distribution of the mean](https://reader035.fdocuments.net/reader035/viewer/2022071008/5fc5971f1fde8307790e0d94/html5/thumbnails/36.jpg)
36/ 36
Lecture 2
Statistical inference
The one-sample t-test
The t.test function
The t.test function in R delivers the t-value:
## from t-test function:
## t-value
t.test(sample)$statistic
## t
## 1.781
36 / 36