Post on 14-Jul-2020
Stat 204, Part 4 Inference
Chapter 8: Sampling Variability and Sampling Distributions
These notes reflect material from our text, Statistics, Learning from Data, First Edition, by Roxy Peck,published by CENGAGE Learning, 2015.
Sampling distributions
Three distributions : population, data, sampling
Sampling distribution of the sample proportion
Sampling distribution of the sample mean
10 15 20 25 30 35 40
0.00
0.05
0.10
0.15
0.20
Population distribution vs. sampling distribution of sample mean
Frequency
populationsample means
LLN and CLT
LLN : Xn → µ as n gets larger.
CLT : the distribution of Xn → Normal Distribution as n gets larger.
Both theorems require certain conditions to be satisfied for the theorem to be applicable. The CLTrequires
• independent sample elements (from less than 10% of the population)
• relatively large sample size (for instance, at least 30 elements)
• data which are not highly skewed and without extreme outliers
in order to conclude that the sample mean is approximately normally distributed with standard deviationapproximately SE (OIS, p.168).
Spring 2016 Page 1 of 6
Stat 204, Part 4 Inference
Sampling distributions of the sample mean
Uniform distribution.
0.0 0.2 0.4 0.6 0.8 1.0
0.6
0.8
1.0
1.2
1.4
Uniform Distribution, U[0,1]
xs
Probability
n=2, n.samples=10000
Sample mean
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0200
400
600
800
n=5, n.samples=10000
Sample mean
Frequency
0.2 0.4 0.6 0.8 1.0
0500
1000
1500
n=30, n.samples=10000
Sample mean
Frequency
0.3 0.4 0.5 0.6 0.7
0500
1000
1500
Exponential distribution.
0.0 0.2 0.4 0.6 0.8 1.0
0.4
0.6
0.8
1.0
Exponential Distribution
xs
Probability
n=2, n.samples=10000
Sample mean
Frequency
0 1 2 3 4 5
0500
1500
2500
n=5, n.samples=10000
Sample mean
Frequency
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0500
1000
1500
n=30, n.samples=10000
Sample mean
Frequency
0.5 1.0 1.5
0500
1000
2000
Spring 2016 Page 2 of 6
Stat 204, Part 4 Inference
Normal distribution. The sampling distribution of a normal distribution is a normal distribution, evenfor small sample sizes.
-3 -2 -1 0 1 2 3
0.0
0.1
0.2
0.3
0.4
Normal Distribution
xs
Probability
n=2, n.samples=10000
Sample mean
Frequency
-3 -2 -1 0 1 2 3
0500
1500
2500
n=5, n.samples=10000
Sample mean
Frequency
-1.5 -0.5 0.0 0.5 1.0 1.5
0500
1000
1500
n=30, n.samples=10000
Sample mean
Frequency
-0.5 0.0 0.5 1.0
0500
1000
1500
2000
Sampling distributions of counts and proportions
Imagine taking a sample of size 100 from a Bernoulli population with p = 0.6. Do this 1000 times andmake a histogram of the counts of successes. Take another 1000 samples of size 100 and make a histogramof the proportions of successes in the samples. What do you observe?
0.0
0.4
0.8
Population Distribution(p=0.6)
Outcome
Pro
babi
lity
●
●
0 1
Sampling DistributionCounts
results
Fre
quen
cy
40 50 60 70 80
010
020
030
0
Sampling DistributionProportions
results
Fre
quen
cy
0.4 0.6 0.8
010
020
030
0
Distribution of a statistic (mean or proportion)
Two scenarios : (1) If we are studying a quantitative variable and we know the population parametersµ and σ, what can be said of the sample statistics? (2) If we know the sample statistics x and s, what canbe said of the population parameters? The first question is easiest to answer, but is rarely the case. Thesecond question leads to much of contemporary statistics.
Spring 2016 Page 3 of 6
Stat 204, Part 4 Inference
We might be studying a quantitative variable in a population (normal or not) with population pa-rameters µ and σ, and we wish to know the sampling distribution of the sample mean x. Or we mightbe studying a categorical variable in a population with proportion p, and we wish to know the samplingdistribution of the sample proportion p̂. The following table summarizes the sample distributions in bothcases (Probability and Statistics, Open Learning Initiative, CMU).
Variable Statistic Shape Center Standard Error Conditions
quantitative (σ known) x Normal µ σ/√n n ≥ 30 or approx. normal
quantitative (σ unknown) x t µ s/√n n ≥ 30 or approx. normal
categorical p̂ Normal p√
p(1−p)n min(np, n(1− p)) ≥ 15
Barrels of marbles and beans
Three distributions : population, data, sampling
Here is a more leisurely view of sampling distributions. Suppose that in the middle of the classroomwe have a big barrel of red and white marbles. A proportion p of this large population of marbles is red. Ifwe actually knew exactly how many red and how many white marbles were in the barrel, we could make asmall table displaying the number of red and white marbles in the barrel. Agresti and Franklin would callthis table the population distribution. Beside that barrel is another large barrel full of bright green limabeans. The beans vary in size from large to small, but most are medium-sized. We are actually interestedin the weights of the beans, so it is interesting to know that the average weight of the beans in this barrelis µ and the standard deviation of their weights is σ. A graph of the actual distribution of weights of thelima beans in the barrel would be an illustration of a population distribution.
Sampling distribution of the sample proportion
Next to the barrels is a small scoop. We begin by sampling from the barrel with the marbles. Scoopout a small but fixed number of marbles, n, and count the proportion of red marbles in your scoop, p̂.Make a small table displaying the number of red and the number of white marbles in your scoop. Agrestiand Franklin would call this table the data distribution of the marbles in your sample. Another studenttakes a scoop of the same number n of marbles, and calculates the statistic p̂ for the marbles in her scoop.We might expect these two statistics to be similar, but probably not exactly equal. The question now is,if a large number of students took similar samples of size n from the barrel of marbles, and calculatedthe proportion p̂ of red marbles in each scoop, what would the collection of statistics p̂ look like. Theanswer to that question is called the sampling distribution of p̂. Since our samples are independent, theCLT says that the sampling distribution is approximately normal. It has mean µ and standard deviation√p(1− p)/n.
Sampling distribution of the sample mean
Now turn to the barrel of lima beans. We want to study the weights of the beans in our samples, so weborrow a precision balance from the chemistry department. Scoop out a small but fixed number of limabeans, n, weigh each bean in your scoop, and calculate the average weight of the beans in your scoop, x̄.Make a stripchart (dot plot) of the weights of the beans in your sample. This strip chart is an illustration ofthe data distribution of the beans in your sample. Now the next student repeats the process, and calculatesthe average weight of the beans in the second scoop, x̄. We might expect these two statistics to be similar,but probably not exactly equal. The question now is, if a large number of students took similar samples of
Spring 2016 Page 4 of 6
Stat 204, Part 4 Inference
size n from the barrel of lima beans, and calculated the average weight of the beans x̄ in each scoop, whatwould the collection of statistics x̄ look like. The answer to that question is called the sampling distributionof x̄. Since our samples are independent, the CLT says that the sampling distribution is approximatelynormal, and in fact if the distribution of the population weights is actually normal, then the samplingdistribution is also normal (not just approximately normal). In either case, the sampling distribution hasmean µ and standard deviation σ/
√n.
Standard Error
The standard deviation of a statistic is called a standard error, so in the sequel we will write
SEx̄ = σ/√n
orSEp̂ =
√p(1− p)/n.
Cherry Blossom Run, 2012
16,924 runners participated in the Cherry Blossom 10 Mile Run in 2012. Take the population to be allof the runners, and estimate their average time to complete the race and the average age of the participantsby taking 1000 samples of 100 runners each and calculating the corresponding sample statistics. The resultsare illustrated in the population and sampling distributions shown below. (OpenIntro Statistics, 2nd ed.,pp.159-164).
Population Distribution (time)
Time (Min)
Frequency
40 60 80 120 160
02000
Sampling Distribution (time)
Sample mean
Frequency
40 60 80 120 160
0100200
Population Distribution (age)
Age
Frequency
0 20 40 60 80
01000
3000
Sampling Distribution (age)
Sample mean
Frequency
0 20 40 60 80
050
100
150
Spring 2016 Page 5 of 6
Stat 204, Part 4 Inference
Inference
Outline for one-variable inference for sample data (Peck, chapters 7–13):
The goal is to generalize from a sample to learn about a population
• categorical variable
- one proportion
- confidence interval - one-sample z CI for a proportion
- hypothesis testing - one-sample z HT for a proportion
- difference of two proportions
- confidence interval - two-sample z CI for a difference in proportions
- hypothesis testing - two-sample z HT for a difference in proportions
• continuous variable
- one mean
- confidence interval - one-sample t CI for a mean
- hypothesis testing - one-sample t HT for a mean
- difference of paired means
- confidence interval - paired t CI for a difference in means
- hypothesis testing - paired t HT for a difference in means
- difference of independent means
- confidence interval - two-sample t CI for a difference in means
- hypothesis testing - two-sample t HT for a difference in means
Exercises
We will attempt to solve some of the following exercises as a community project in class today. Finish thesesolutions as homework exercises, write them up carefully and clearly, and hand them in at the beginningof class next Friday.
Homework 8a – sampling distributions
Exercises from Chapter 8:8.2 (histograms), 8.3 (imports), 8.15 (sampling distribution), 8.16 (normal), 8.26 (sampling distribution)
Homework 8b – sampling distributions
Exercises from Chapter 8:8.27 (sampling distribution), 8.29 (polling), 8.30 (credit card), 8.50 (sample size), 8.52 (hurricane)
Spring 2016 Page 6 of 6