STAT 113 Inferential Statistics I
Inference Goals Parameter Estimation Sampling Distributions
Foundational Concepts
Colin Reimer Dawson
March 30, 2022
Sampling and Inference: The “Big Picture”
Population (AKA the “Process” or the “Phenomenon”): All potential cases that we are interested in saying something about, or, equivalently, the process that generated the data (and could generate a different dataset if the study were repeated).
Sample (AKA the Data Set): The set of cases we actually have data for (a subset of the population, or a single snapshot produced by the data-generating process).
Statistical Inference: Using a representative snapshot (a data set) to say something about an underlying population/process/phenomenon.
Outline
Inference Goals
Parameter Estimation
Sampling Distributions
Two Main Goals of Inference
1. Estimating unknown quantities in a population using a dataset (by reporting confidence intervals)
2. Assessing strength of evidence about “yes/no” questions (by carrying out hypothesis tests)
Statistics vs. Parameters
• Summary values (like mean, median, standard deviation) exist for both Populations/Processes/Phenomena and for Samples/dataSets/Snapshots.
• In a Population/Process, such a summary value is called a Parameter.
• In a Sample/dataSet/Snapshot, these values are called Statistics, and are used to make inferences about the corresponding parameter.

Notation: Parameters and Statistics
Summary Value          Parameter    Statistic
Mean                   µ            X̄
Proportion             p            p̂
Correlation            ρ            r
Slope of a Line        β            b̂
Difference in Means    µ1 − µ2      X̄1 − X̄2
...                    ...          ...
Using Data to Make Inferences About Phenomena
• I want to know the mean flavor-life (in minutes) of gumballs produced by my gumball factory.
• Gumball production at my factory is the process/phenomenon of interest (the population being all potential gumballs produced).
• The mean flavor-life of all gumballs produced from the factory is a parameter (write µ for the pop. mean).
• I can only test a subset (a sample/snapshot), ideally a random one.
• The mean flavor-life of the gumballs tested is a statistic (write x̄ for the sample mean).
• A Statistic is a summary of a Sample/DataSet/Snapshot.
• A Parameter is a summary of the Population/Process/Phenomenon itself.
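The gumball setup above can be mimicked in a few lines of Python. This is only an illustrative sketch: the population of 10,000 gumballs, its normal shape, and the standard deviation of 4 minutes are assumptions made up for the example (only the mean of 66.8 echoes the slides). It computes the parameter µ from the whole population and the statistic x̄ from one random sample of 10.

```python
import random
import statistics

random.seed(113)  # fixed seed so the sketch is reproducible

# Hypothetical population: flavor lives (minutes) of 10,000 gumballs.
# The normal shape, the mean of 66.8, and the SD of 4 are assumptions.
population = [random.gauss(66.8, 4.0) for _ in range(10_000)]

# Parameter: a summary of the whole population (normally unknown to us).
mu = statistics.mean(population)

# Statistic: the same summary computed on a random sample of 10 gumballs.
sample = random.sample(population, 10)
x_bar = statistics.mean(sample)

print(f"parameter mu    = {mu:.1f}")
print(f"statistic x_bar = {x_bar:.1f}")
```

In practice only the last two steps are available to us: we see the sample and x̄, and µ has to be inferred.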
Variability due to Sampling
If all potential gumballs have a flavor-life distribution represented by this histogram...
[Figure: histogram of flavor life (minutes), axis 55 to 80; Process Mean = 66.8]
then a random sample of 10 gumballs might have flavor lives like...
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 65.7]
Another sample of 10 gumballs might have different flavor-lives:
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 66.5]
Yet another sample of 10 gumballs might look like this:
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 66.5]
Or this:
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 66.3]
We could get this one, but it’s less likely:
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 70.7]
On the other hand, if the process produced gumballs with greater longevity in general...
[Figure: histogram of flavor life (minutes), axis 55 to 80; Process Mean = 71.8]
then the first four sample means (65.7, 66.5, 66.5, 66.3) are unlikely, whereas the last one (70.7) is more likely.
[Figure: flavor life (minutes), axis 55 to 80; Sample Mean = 70.7]
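The comparison between the two candidate processes can be roughed out by simulation. The sketch below is not part of the original slides: it assumes flavor lives are normally distributed with a standard deviation of 4 minutes (both assumptions), and the helper names `sample_mean` and `share_at_least` are hypothetical. As one way to express "how likely", it estimates how often a sample of 10 gumballs has a mean of at least 70.7 under each process mean.

```python
import random
import statistics


def sample_mean(process_mean, rng, n=10, sd=4.0):
    """Mean flavor life of one random sample of n gumballs.

    The normal shape and sd=4.0 are assumptions for illustration.
    """
    return statistics.mean([rng.gauss(process_mean, sd) for _ in range(n)])


def share_at_least(threshold, process_mean, reps=5_000, seed=113):
    """Fraction of simulated sample means that are >= threshold."""
    rng = random.Random(seed)
    means = [sample_mean(process_mean, rng) for _ in range(reps)]
    return sum(m >= threshold for m in means) / reps


low = share_at_least(70.7, process_mean=66.8)
high = share_at_least(70.7, process_mean=71.8)
print(f"share of sample means >= 70.7 if process mean is 66.8: {low:.3f}")
print(f"share of sample means >= 70.7 if process mean is 71.8: {high:.3f}")
```

Under these assumptions, a sample mean as high as 70.7 is rare when the process mean is 66.8 and common when it is 71.8, which is the contrast the slides draw.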
Variability due to Sampling
• Each potential dataset (sample) is an imperfect/incomplete snapshot of the underlying population/process/phenomenon.
• Therefore, statistics are imperfect reflections of the underlying parameters.
• However, if samples are representative, statistics are usually close to the corresponding parameter.
• So, we can estimate (with some, but not full, certainty) that the unknown underlying parameter is probably close to the corresponding statistic.
Definition: Sampling Distribution
• Consider all possible datasets of a certain sample size, n, produced by taking a representative snapshot (sample) from a process/phenomenon/population.
• Each one has its own value for a particular statistic (like the mean of a certain variable).
• A sampling distribution is the collection of values of all of these statistics (such as sample means).
• Note that this is a hypothetical/theoretical construction; we almost never actually have more than one dataset/sample/statistic.
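For a population small enough to enumerate, the definition can be carried out literally. The following sketch uses a made-up population of six flavor lives and a sample size of n = 2 (both are assumptions for illustration), and treats every subset of size n as an equally likely sample. It lists every possible dataset, computes the mean of each, and collects those means into the sampling distribution.

```python
from itertools import combinations
from statistics import mean

# Hypothetical tiny population of flavor lives (minutes); values are made up.
population = [61, 64, 66, 67, 70, 73]
n = 2  # sample size

# Every possible dataset of size n (sampling without replacement).
all_samples = list(combinations(population, n))

# The sampling distribution: one statistic (here, the mean) per possible dataset.
sampling_distribution = [mean(s) for s in all_samples]

print(f"number of possible samples: {len(all_samples)}")
print(f"population mean (parameter): {mean(population):.2f}")
print(f"mean of all sample means:    {mean(sampling_distribution):.2f}")
```

Individual sample means differ from the population mean, but in this setup their average over all possible samples matches it. With a real process, this full enumeration is not available, which is why the construction is hypothetical.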
Sample Distribution ≠ Sampling Distribution
• The cases in a sample are individual observations.
• The cases in a sampling distribution are statistics (such as means), each from a different potential dataset.
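The distinction between the two kinds of "cases" can be made concrete with a small simulation. This is a sketch under assumptions that are not in the slides: a normal process with mean 66.8 and standard deviation 4 minutes, samples of size 10, and 2,000 simulated datasets standing in for "all potential datasets". The helper `draw_sample` is hypothetical.

```python
import random
import statistics

rng = random.Random(113)  # fixed seed so the sketch is reproducible


def draw_sample(n=10, process_mean=66.8, sd=4.0):
    """One hypothetical dataset: n individual flavor lives (minutes)."""
    return [rng.gauss(process_mean, sd) for _ in range(n)]


# Sample distribution: the cases are individual gumballs in ONE dataset.
one_sample = draw_sample()

# Sampling distribution (approximated): the cases are means,
# one from each of many potential datasets.
sample_means = [statistics.mean(draw_sample()) for _ in range(2_000)]

print(f"cases in the sample: {len(one_sample)} observations")
print(f"cases in the sampling distribution: {len(sample_means)} means")
print(f"SD of individual observations: {statistics.stdev(one_sample):.2f}")
print(f"SD of sample means:            {statistics.stdev(sample_means):.2f}")
```

The sample means are centered near the process mean and typically vary less than the individual observations do, which is one practical reason the two distributions should not be confused.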
If the process produces a flavor-life distribution like this:
[Figure: histogram of flavor life (minutes), axis 55 to 80; Process Mean = 66.8]
which could yield any of the following data sets
[Figure: three samples of flavor life (minutes), axis 55 to 80; Sample Means = 65.7, 66.5, 66.5]
then each potential set of 10 gumballs has a mean flavor life. The sampling distribution of all such potential means might look like this:
[Figure: distribution of mean flavor life (minutes), axis 55 to 80]
Demo: StatKey
http://lock5stat.com/statkey