Stats for Engineers Lecture 7
![Page 1: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/1.jpg)
Stats for Engineers Lecture 7
![Page 2: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/2.jpg)
Confidence intervals
During component manufacture, a random sample of 500 are weighed each day, and each day a 95% confidence interval is calculated for the mean weight of the components. On the first day we obtain a confidence interval for the mean weight of Kg. Which of the following is correct on average if the (unknown) mean weight and (known) standard deviation remain constant on different days?
[Poll response chart for options 1 to 4: 74%, 13%, 13%, 0%]
1. On 95% of days, the mean weight is in the range Kg
2. 95% of the daily sample means lie in the range Kg
3. On 95% of days the calculated confidence interval contains $\mu$
4. 95% of the components have weights in the range Kg
![Page 3: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/3.jpg)
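Recap: Confidence Intervals for the mean
Random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ is known [or large data sample so can estimate it accurately, $\sigma^2 \approx s^2$] but $\mu$ is unknown.
We want a confidence interval for $\mu$.
Reminder:
With probability 0.95, a Normal random variable lies within 1.96 standard deviations of the mean.
[Figure: Normal density with P = 0.025 in each tail]
95% of the time expect $\bar{X}$ in $\mu \pm 1.96\sqrt{\sigma^2/n}$:
$$\bar{X} = \mu \pm 1.96\sqrt{\frac{\sigma^2}{n}} \;\Rightarrow\; \mu = \bar{X} \pm 1.96\sqrt{\frac{\sigma^2}{n}}$$
A 95% confidence interval for $\mu$ if we measure a sample mean $\bar{X}$ and already know $\sigma^2$ is $\bar{X} \pm 1.96\sqrt{\sigma^2/n}$.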
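As a minimal sketch of the recap formula, the function below computes $\bar{X} \pm z\sqrt{\sigma^2/n}$ for a known $\sigma$. The function name and the numbers passed to it are hypothetical, chosen only for illustration.

```python
import math


def z_interval(sample_mean, sigma, n, z=1.96):
    """Two-sided interval for the mean when sigma is known (z = 1.96 gives 95%)."""
    half_width = z * math.sqrt(sigma ** 2 / n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical inputs, not taken from the lecture
low, high = z_interval(sample_mean=10.0, sigma=2.0, n=400)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

The half-width shrinks like $1/\sqrt{n}$, so quadrupling the sample size halves the interval.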
![Page 4: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/4.jpg)
Confidence interval interpretation
During component manufacture, a random sample of 500 are weighed each day, and each day a 95% confidence interval is calculated for the mean weight of the components. On the first day we obtain a confidence interval for the mean weight of Kg. Which of the following are correct on average if the (unknown) mean weight and (known) standard deviation remain constant on different days?
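$\bar{X} \sim N(\mu, \sigma^2/n)$, so 95% of the time the sample mean $\bar{X}$ lies in $\mu \pm 1.96\sqrt{\sigma^2/n}$.
Every day we get a different sample, so a different $\bar{X}$.
Day 1, Day 2, Day 3, Day 4, …: each day gives its own sample mean and its own confidence interval.
[Figure: e.g. the confidence intervals obtained on successive days]
95% of the time, $\bar{X}$ is in $\mu \pm 1.96\sqrt{\sigma^2/n}$ $\Rightarrow$ 95% of the time, $\mu$ is in $\bar{X} \pm 1.96\sqrt{\sigma^2/n}$.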
![Page 5: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/5.jpg)
Other answers?
On 95% of days, the mean weight is in the range Kg
NO: the mean weight is assumed to be a constant. It is either in the range or it isn't; if true for one day it will be true for all days.
95% of the daily sample means lie in the range Kg
NO: the correct statement is that 95% of the time $\bar{X}$ lies in $\mu \pm 1.96\sqrt{\sigma^2/n}$. The statement would only be true if on the first day we got $\bar{X} = \mu$, which has negligible probability.
95% of the components have weights in the range Kg
NO: the confidence interval we calculated is for the mean weight, not individual weights (and so the mid-point is incorrect).
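The interpretation "on 95% of days the calculated interval contains $\mu$" can be checked by simulation. This is a rough sketch under assumed values: the true mean, standard deviation and number of simulated days below are hypothetical, since the slide's numbers were not recovered.

```python
import math
import random


def coverage(mu, sigma, n, days, z=1.96, seed=0):
    """Fraction of simulated days whose interval X_bar +/- z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(days):
        # A fresh random sample each day gives a different sample mean
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(sample_mean - mu) <= half_width:
            hits += 1
    return hits / days


# Hypothetical true mean and standard deviation (in kg), 500 components per day
frac = coverage(mu=5.0, sigma=0.5, n=500, days=1000)
print(f"Fraction of days on which the interval contains mu: {frac:.3f}")
```

The fraction should land close to 0.95, with some scatter because only a finite number of days is simulated.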
![Page 6: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/6.jpg)
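A confidence interval for $\mu$ if we measure a sample mean $\bar{X}$ and already know $\sigma^2$ is $\bar{X} \pm z\sqrt{\sigma^2/n}$, where $Q(z) = 1 - p/2$ and $Q$ is the cumulative Normal distribution.
In general can use any confidence level, not just 95%.
95% confidence level has 5% in the tails, i.e. p=0.05 in the tails.
In general, a confidence level of $1-p$ leaves probability $p$ in the tails; for two tail, $p/2$ in each tail.
[Figure: Normal density with $p/2$ in each tail and cumulative probability $Q$ up to $z$]
E.g. for a 99% confidence interval, we would want $Q(z) = 0.995$.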
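Instead of Normal tables, the value of $z$ for a chosen confidence level can be looked up numerically. A minimal sketch using the Python standard library's `statistics.NormalDist`; the helper name `z_two_tail` is an assumption made for illustration.

```python
from statistics import NormalDist


def z_two_tail(confidence):
    """z such that Q(z) = 1 - p/2, where p = 1 - confidence is the total tail probability."""
    p = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - p / 2.0)


for level in (0.95, 0.99):
    print(f"{level:.0%} two-tail: z = {z_two_tail(level):.4f}")
```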
![Page 7: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/7.jpg)
Two tail versus one tail
Does the distribution have two small tails or one? Or are we only interested in upper or lower limits?
If the distribution is one sided, or we want upper or lower limits, a one tail interval may be more appropriate.
[Figure: 95% one tail (P = 0.05 in a single tail) and 95% two tail (P = 0.025 in each tail)]
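To illustrate the difference, the sketch below compares the critical value for a 95% one-tail limit with the 95% two-tail value. The helper `z_critical` and its `tails` argument are hypothetical names, not part of the lecture.

```python
from statistics import NormalDist


def z_critical(confidence, tails=2):
    """Normal critical value with total tail probability 1 - confidence split over 1 or 2 tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    p = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - p / tails)


z_one = z_critical(0.95, tails=1)  # all of P = 0.05 in a single tail
z_two = z_critical(0.95, tails=2)  # P = 0.025 in each tail
print(f"one tail: {z_one:.4f}, two tail: {z_two:.4f}")
```

The one-tail value is smaller because the whole 5% sits on one side.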
![Page 8: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/8.jpg)
Example
You are responsible for calculating average extra-urban fuel efficiency figures for new cars. You test a sample of 100 cars, and find a sample mean of $\bar{X} = 55.4$ mpg. The standard deviation is $\sigma = 1.2$ mpg. What is the 95% confidence interval for the average fuel efficiency?
Answer:
Sample size is $n = 100$ and the 95% confidence interval is $\bar{X} \pm 1.96\sqrt{\sigma^2/n}$.
$$\Rightarrow 55.4 \pm 1.96\sqrt{\frac{1.2^2}{100}} = 55.4 \pm 0.235$$
i.e. mean in 55.165 to 55.63 mpg at 95% confidence
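The arithmetic of this example can be reproduced in a few lines. A small sketch using the numbers from the slide (sample mean 55.4 mpg, standard deviation 1.2 mpg, 100 cars); the variable names are illustrative.

```python
import math

sample_mean = 55.4  # mpg
sigma = 1.2         # mpg, standard deviation of individual cars
n = 100
z = 1.96            # two-tailed 95%

half_width = z * math.sqrt(sigma ** 2 / n)
low, high = sample_mean - half_width, sample_mean + half_width
print(f"95% CI for the mean: {low:.3f} to {high:.3f} mpg (half-width {half_width:.3f})")
```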
![Page 9: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/9.jpg)
Confidence interval
Given the confidence interval just constructed, is it correct to say that approximately 95% of new cars will have efficiencies between 55.165 and 55.63 mpg?
Question from Derek Bruff
1. YES, high confidence
2. YES, low confidence
3. NO, high confidence
4. NO, low confidence
NO: the 1.2 mpg given in the question is the standard deviation of the individual car efficiencies, so individual new cars are expected to spread over a much wider range. The confidence interval we calculated is the range we expect the mean efficiency to lie in (much smaller range).
![Page 10: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/10.jpg)
Example: Polling
A sample of 1000 random voters were polled, with 350 saying they will vote for the Conservatives and 650 saying another party. What is the 95% confidence interval for the Conservative share of the vote?
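Answer: this is Binomial data, but $n$ is large so can approximate as Normal.
Random variable $X$ is the number voting Conservative, approximately $X \sim N(\mu, \sigma^2)$.
Take variance from the Binomial result with $p \approx 350/1000 = 0.35$:
$$\sigma^2 = np(1-p) \approx 1000 \times 0.35 \times (1 - 0.35) = 227.5 \;\Rightarrow\; \sigma = \sqrt{227.5} \approx 15.1$$
95% confidence interval for the total votes is $350 \pm 1.96 \times 15.1 = 350 \pm 29.6$.
95% confidence interval for the fraction of the votes is
$$\frac{350 \pm 29.6}{1000} \approx 0.35 \pm 0.03$$
i.e. the 95% confidence interval is about ±3 percentage points.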
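The polling calculation can be sketched as follows, using the Normal approximation to the Binomial with the numbers from the slide. Variable names are illustrative only.

```python
import math

n = 1000          # voters polled
votes = 350       # said they will vote Conservative
p_hat = votes / n

# Binomial variance n p (1 - p), with p replaced by its estimate
variance = n * p_hat * (1 - p_hat)
sigma = math.sqrt(variance)

half_votes = 1.96 * sigma
half_fraction = half_votes / n
print(f"Votes: {votes} +/- {half_votes:.1f}")
print(f"Share: {p_hat:.2f} +/- {half_fraction:.3f}")
```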
![Page 11: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/11.jpg)
Example – variance unknown
A large number of steel plates will be used to build a ship. A sample of ten are tested and found to have sample mean $\bar{X} = 2.13$ kg and sample variance $s^2 = 0.25^2$ kg$^2$. What is the 95% confidence interval for the mean weight $\mu$?
Reminder: Sample Variance: $s^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2$
![Page 12: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/12.jpg)
Normal data, variance unknown
Random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ and $\mu$ are both unknown.
Want a confidence interval for $\mu$, using observed sample mean and variance.
When we know the variance: use $z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}$, which is normally distributed
But don't know $\sigma^2$, so have to use sample estimate $s^2$
When we don't know the variance: use $t = \frac{\bar{X} - \mu}{\sqrt{s^2/n}}$, which has a t-distribution (with $n-1$ d.o.f)
Sometimes written more fully as "Student's t-distribution"
Remember:
![Page 13: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/13.jpg)
[Figure: t-distribution compared with the Normal for $\nu = n-1 = 1$, $5$ and $50$]
For large $n$ the t-distribution tends to the Normal; in general it has broader tails.
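The broader tails can be seen numerically by evaluating the density three standard units from the centre. This is a sketch using the standard closed-form density of Student's t-distribution; the function names are illustrative.

```python
import math


def t_pdf(t, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


def normal_pdf(z):
    """Density of the standard Normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


for nu in (1, 5, 50):
    print(f"nu = {nu:2d}: f(3) = {t_pdf(3.0, nu):.5f}")
print(f"Normal : f(3) = {normal_pdf(3.0):.5f}")
```

The tail density falls towards the Normal value as $\nu$ grows, but stays above it.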
![Page 14: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/14.jpg)
Confidence Intervals for the mean
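If $\sigma^2$ is known, confidence interval for $\mu$ is $\bar{X} - z\sqrt{\sigma^2/n}$ to $\bar{X} + z\sqrt{\sigma^2/n}$, where $z$ is obtained from Normal tables (z=1.96 for two-tailed 95% confidence limit).
If $\sigma^2$ is unknown, we need to make two changes:
(i) Estimate $\sigma^2$ by $s^2$, the sample variance;
(ii) replace z by $t_{n-1}$, the value obtained from t-tables.
The confidence interval for $\mu$ if we measure a sample mean $\bar{X}$ and sample variance $s^2$ is: $\bar{X} - t_{n-1}\sqrt{s^2/n}$ to $\bar{X} + t_{n-1}\sqrt{s^2/n}$.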
![Page 15: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/15.jpg)
t-tables
give $t_\nu$ for different values Q of the cumulative Student's t-distribution, and for different values of $\nu$
$$Q(t_\nu) = \int_{-\infty}^{t_\nu} f_\nu(t)\,dt$$
The parameter $\nu$ is called the number of degrees of freedom.
(when the mean and variance are unknown, there are $n-1$ degrees of freedom to estimate the variance)
![Page 16: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/16.jpg)
For a 95% confidence interval, we want the middle 95% region, so Q = 0.975 (0.05/2=0.025 in both tails).
Similarly, for a 99% confidence interval, we would want Q = 0.995.
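A table entry can be sanity-checked by integrating the t density numerically. The sketch below uses Simpson's rule and the symmetry of the distribution; it is an illustrative check, not how the tables are produced, and the function names are assumptions. The value 2.2622 for $\nu = 9$ is the table entry used later in the lecture.

```python
import math


def t_pdf(t, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


def t_cdf(t, nu, steps=2000):
    """Q(t): cumulative probability up to t, via Simpson's rule on [0, t] plus 0.5."""
    if t < 0:
        return 1.0 - t_cdf(-t, nu, steps)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3


q = t_cdf(2.2622, 9)
print(f"Q(2.2622) with 9 degrees of freedom = {q:.4f}")
```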
![Page 17: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/17.jpg)
t-distribution example:
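A large number of steel plates will be used to build a ship. A sample of ten are tested and found to have sample mean $\bar{X} = 2.13$ kg and sample variance $s^2 = 0.25^2$ kg$^2$. What is the 95% confidence interval for the mean weight $\mu$?
Answer:
From t-tables, $\nu = n - 1 = 9$; for Q = 0.975, $t_9 = 2.2622$.
95% confidence interval for $\mu$ is: $\bar{X} \pm t_{n-1}\sqrt{s^2/n}$
$$\Rightarrow \mu = 2.13 \pm 2.2622\sqrt{\frac{0.25^2}{10}} = (2.13 \pm 0.18)\ \text{kg}$$
i.e. 1.95 to 2.31 kg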
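The same calculation as a short sketch. The critical value 2.2622 is taken from the t-table entry quoted on the slide rather than computed, and the function name is hypothetical.

```python
import math


def t_interval(sample_mean, sample_var, n, t_value):
    """Interval X_bar +/- t * sqrt(s^2 / n); t_value must come from t-tables with n - 1 d.o.f."""
    half_width = t_value * math.sqrt(sample_var / n)
    return sample_mean - half_width, sample_mean + half_width


# Steel plate example: n = 10, so 9 degrees of freedom and t = 2.2622 at Q = 0.975
low, high = t_interval(sample_mean=2.13, sample_var=0.25 ** 2, n=10, t_value=2.2622)
print(f"95% CI for the mean weight: {low:.2f} to {high:.2f} kg")
```

Using $z = 1.96$ here instead of $t_9 = 2.2622$ would give an interval that is too narrow for such a small sample.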
![Page 18: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/18.jpg)
Confidence interval width
We constructed a 95% confidence interval for the mean using a random sample of size n = 10 with sample mean $\bar{X}$. Which of the following conditions would probably NOT lead to a narrower confidence interval?
Question adapted from Derek Bruff
1. If you decreased your confidence level
2. If you increased your sample size
3. If the sample mean was smaller
4. If the population standard deviation was smaller
[Poll response chart: 5%, 14%, 33%, 48%]
![Page 19: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/19.jpg)
Confidence interval width
95% confidence interval for $\mu$ is: $\bar{X} \pm t_{n-1}\sqrt{s^2/n}$; width is $2t_{n-1}\sqrt{s^2/n}$
Which of the following conditions would probably NOT lead to a narrower confidence interval?
- Decrease your confidence level? Larger tail $\Rightarrow$ smaller $t_{n-1}$ $\Rightarrow$ smaller confidence interval
- Increase your sample size? Larger $n$ $\Rightarrow$ smaller confidence interval ($t_{n-1}$ and $\sqrt{s^2/n}$ both likely to be smaller)
- Smaller sample mean? Smaller $\bar{X}$ just changes mid-point, not width
- Smaller population standard deviation? $s^2$ likely to be smaller $\Rightarrow$ smaller confidence interval
![Page 20: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/20.jpg)
Sample size
How many random samples do you need to reach desired level of precision?
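For example, for Normal data, confidence interval for $\mu$ is $\bar{X} \pm t_{n-1}\sqrt{s^2/n}$.
Suppose we want to estimate $\mu$ to within $\pm\delta$, where $\delta$ (and the degree of confidence) is given.
Want
$$\delta = t_{n-1}\sqrt{\frac{s^2}{n}} \;\Rightarrow\; n = \frac{t_{n-1}^2 s^2}{\delta^2}$$
Need:
- Estimate of $s^2$ (e.g. previous experiments)
- Estimate of $t_{n-1}$. This depends on n, but not very strongly, e.g. take $t_{n-1} = 2.1$ for 95% confidence.
Rule of thumb: for 95% confidence, choose $n \approx 2.1^2 s^2/\delta^2$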
![Page 21: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/21.jpg)
Example
A large number of steel plates will be used to build a ship. Ten are tested and found to have sample mean weight $\bar{X} = 2.13$ kg and sample variance $s^2 = 0.25^2$ kg$^2$. How many need to be tested to determine the mean weight with 95% confidence to within $\pm 0.1$ kg?
Answer:
Want $\delta = 0.1\ \text{kg} = t_{n-1}\sqrt{s^2/n}$. Take $t_{n-1} = 2.1$ for 95% confidence.
$$\Rightarrow n = \frac{t_{n-1}^2 s^2}{\delta^2} = \frac{2.1^2 \times 0.25^2}{0.1^2} = 27.6$$
i.e. need to test about 28
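The rule of thumb translates directly into code. A minimal sketch with the slide's numbers ($s^2 = 0.25^2$, $\delta = 0.1$ kg, $t \approx 2.1$); the function name is an assumption for illustration.

```python
import math


def required_sample_size(sample_var, delta, t_value=2.1):
    """n = t^2 s^2 / delta^2, returned unrounded; t = 2.1 is the 95% rule-of-thumb value."""
    return t_value ** 2 * sample_var / delta ** 2


n_raw = required_sample_size(sample_var=0.25 ** 2, delta=0.1)
n_needed = math.ceil(n_raw)  # round up to a whole number of plates
print(f"n = {n_raw:.1f}, so test about {n_needed} plates")
```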
![Page 22: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/22.jpg)
Number of samples
If you need 28 samples for the confidence interval to be approximately $\pm 0.1$ kg, how many samples would you need to get a more accurate answer with confidence interval $\pm 0.01$ kg?
1. 88.5
2. 280
3. 2800
4. 28000
[Poll response chart: 17%, 17%, 28%, 39%]
$\delta = t_{n-1}\sqrt{s^2/n} \propto 1/\sqrt{n}$, so need $10^2 = 100$ times more, i.e. 2800
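The scaling argument can be written as a one-line rule: since $\delta \propto 1/\sqrt{n}$ (treating $t_{n-1}$ and $s^2$ as roughly fixed), the required $n$ scales as $1/\delta^2$. A sketch, with a hypothetical helper name and assuming the two precisions are ±0.1 kg and ±0.01 kg:

```python
def rescale_sample_size(n_old, delta_old, delta_new):
    """New sample size when the target half-width changes, assuming n scales as 1/delta^2."""
    return n_old * (delta_old / delta_new) ** 2


n_new = round(rescale_sample_size(n_old=28, delta_old=0.1, delta_new=0.01))
print(f"Samples needed for the tighter interval: about {n_new}")
```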
![Page 23: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/23.jpg)
Linear regression
Linear regression: fitting a straight line to the mean value of $y$ as a function of $x$
[Figure: scatter plot of y (100 to 250) against x (0 to 40)]
We measure a response variable $y$ at various values of a controlled variable $x$
e.g. measure fuel efficiency at various values of an experimentally controlled external temperature
![Page 24: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/24.jpg)
[Figure: distribution of $y$ at each of $x = x_1, x_2, x_3$, with a straight line through the means]
Regression curve: fits the mean values of the distributions
![Page 25: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/25.jpg)
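[Figure: scatter plot of y (100 to 250) against x (0 to 40) with a candidate fitted line]
From a sample of values of $y$ at various $x$, we want to fit the regression curve, e.g. the line drawn in the figure.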
![Page 26: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/26.jpg)
[Figure: the same scatter plot with a different candidate line]
Or is it this line instead?
What do we mean by a line being a ‘good fit’?
![Page 27: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/27.jpg)
Straight line plots
Which graph is of the line ?
[Figure: four candidate plots of $y$ against $x$, options 1 to 4. Poll response chart: 56%, 0%, 33%, 11%]
![Page 28: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/28.jpg)
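Linear regression model
Equation of straight line is $y = a + bx$
Simple model for data:
$$y_i = a + b x_i + e_i$$
where $a + b x_i$ is the straight line and $e_i$ is the random error.
Simplest assumption: $e_i \sim N(0, \sigma^2)$ for all $i$, and $e_i$'s are independent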
![Page 29: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/29.jpg)
Maximum likelihood estimate = least-squares estimate
Model is $y_i = a + b x_i + e_i$
Want to estimate parameters a and b, using the data.
e.g. choose $a$ and $b$ to minimize the errors:
Minimize
$$E = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - a - b x_i)^2$$
where $y_i$ is the data point and $\hat{y}_i = a + b x_i$ is the straight-line prediction.
E is defined and can be minimized even when the errors are not Normal: least-squares is a simple general prescription for fitting a straight line
(but the statistical interpretation is in general less clear)
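To make the quantity being minimized concrete, the sketch below evaluates $E$ for two candidate lines on a small made-up data set. The data and the candidate parameters are hypothetical, chosen only to show that a better-fitting line gives a smaller $E$.

```python
def sum_squared_errors(a, b, xs, ys):
    """E = sum over i of (y_i - a - b x_i)^2 for the candidate line y = a + b x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data scattered about the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

E_good = sum_squared_errors(1.0, 2.0, xs, ys)  # close to the trend in the data
E_poor = sum_squared_errors(0.0, 3.0, xs, ys)  # wrong intercept and slope
print(f"E for y = 1 + 2x: {E_good:.2f}")
print(f"E for y = 3x:     {E_poor:.2f}")
```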
![Page 30: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/30.jpg)
The line has been proposed as a line of best fit for the following four sets of data. For which data set is this line the best fit (minimum $E$)?
Question from Derek Bruff
[Figure: four scatter plots with the proposed line, options 1 to 4; the slide's image labels mark option 3 as correct. Poll response chart: 57%, 0%, 24%, 19%]
![Page 31: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/31.jpg)
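How to find $a$ and $b$ that minimize $E$?
For minimum want $\partial E/\partial a = 0$ and $\partial E/\partial b = 0$, see notes for derivation
Solution is the least-squares estimates $\hat{a}$ and $\hat{b}$:
$$\hat{b} = \frac{S_{xy}}{S_{xx}} \quad\text{and}\quad \hat{a} = \bar{y} - \hat{b}\bar{x}$$
Where
$$S_{xx} = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n} = \sum_i (x_i - \bar{x})^2$$
$$S_{xy} = \sum_i x_i y_i - \frac{\sum_i x_i \sum_i y_i}{n} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
Sample means: $\bar{x} = \frac{1}{n}\sum_i x_i$, $\bar{y} = \frac{1}{n}\sum_i y_i$
Equation of the fitted line is $\hat{y} = \hat{a} + \hat{b}x$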
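The estimates above can be sketched as a short function built from $S_{xx}$ and $S_{xy}$. The function name and the test data (points lying exactly on $y = 1 + 2x$) are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a_hat, b_hat) for the fitted line y = a_hat + b_hat * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Hypothetical noise-free data on the line y = 1 + 2x
a_hat, b_hat = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

With noise-free data the fit recovers the intercept and slope exactly, which is a quick check that the formulas are wired up correctly.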
![Page 32: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/32.jpg)
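Note that since $\hat{a} = \bar{y} - \hat{b}\bar{x}$,
$$\hat{y} = \hat{a} + \hat{b}x \;\Rightarrow\; \hat{y} - \bar{y} = \hat{b}(x - \bar{x})$$
i.e. $(\bar{x}, \bar{y})$ is on the line
[Figure: scatter plot of y (100 to 250) against x (0 to 40) with the fitted line passing through $(\bar{x}, \bar{y})$]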
![Page 33: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/33.jpg)
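Example: The data y has been observed for various values of x, as follows:

| x | y |
|------|-----|
| 1.6 | 240 |
| 9.4 | 181 |
| 15.5 | 193 |
| 20.0 | 155 |
| 22.0 | 172 |
| 35.5 | 110 |
| 43.0 | 113 |
| 40.5 | 75 |
| 33.0 | 94 |

Fit the simple linear regression model using least squares.
Answer:
Want to fit $y = a + bx$, with $\hat{b} = S_{xy}/S_{xx}$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$, where
$$S_{xx} = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n}, \qquad S_{xy} = \sum_i x_i y_i - \frac{\sum_i x_i \sum_i y_i}{n}$$
$n = 9$, $\sum_i x_i = 220.5$, $\sum_i y_i = 1333.0$, $\sum_i x_i^2 = 7053.67$, $\sum_i x_i y_i = 26864.4$
$$S_{xx} = 7053.67 - \frac{220.5^2}{9} = 1651.42$$
$$S_{xy} = 26864.4 - \frac{220.5 \times 1333.0}{9} = -5794.1$$
$$\Rightarrow \hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{-5794.1}{1651.42} = -3.5086$$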
![Page 34: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/34.jpg)
Now just need $\hat{a}$:
$$\hat{a} = \bar{y} - \hat{b}\bar{x} = \frac{1333.0}{9} - (-3.5086) \times \frac{220.5}{9} = 234.1$$
So the fit is approximately $y = 234.1 - 3.509x$
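The worked example can be reproduced from the nine data points in the table. A sketch using the computational forms of $S_{xx}$ and $S_{xy}$; the function name is illustrative.

```python
def least_squares(xs, ys):
    """Return (a_hat, b_hat, s_xx, s_xy) using the sum-based forms of S_xx and S_xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b_hat = s_xy / s_xx
    a_hat = sum_y / n - b_hat * sum_x / n
    return a_hat, b_hat, s_xx, s_xy


# Data from the example
xs = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
ys = [240, 181, 193, 155, 172, 110, 113, 75, 94]

a_hat, b_hat, s_xx, s_xy = least_squares(xs, ys)
print(f"S_xx = {s_xx:.2f}, S_xy = {s_xy:.1f}")
print(f"Fitted line: y = {a_hat:.1f} + ({b_hat:.4f}) x")
```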
![Page 35: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/35.jpg)
Which of the following data are likely to be most appropriately modelled using a linear regression model?
[Figure: three scatter plots, options 1 to 3. Image labels: 1. Correct; 2. Errors change; 3. Not straight. Poll response chart: 55%, 25%, 20%]
![Page 36: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/36.jpg)
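Quantifying the goodness of the fit
Estimating $\sigma^2$: variance of y about the fitted line
Estimated error is: $\hat{e}_i = y_i - \hat{y}_i$, so the ordinary sample variance of the $\hat{e}_i$'s is $\frac{1}{n-1}\sum_i \hat{e}_i^2$
In fact, this is biased since two parameters, a and b, have been estimated. The unbiased estimate is:
$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_i \hat{e}_i^2 = \frac{S_{yy} - \hat{b}S_{xy}}{n-2}$$
[derivation in notes]
Here $\sum_i \hat{e}_i^2$ is the residual sum of squares, and $S_{yy} = \sum_i y_i^2 - (\sum_i y_i)^2/n$ by analogy with $S_{xx}$.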
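As an illustrative check, the sketch below computes the residual sum of squares two ways for the example data (directly from the residuals, and as $S_{yy} - \hat{b}S_{xy}$) and then divides by $n-2$. The definition of $S_{yy}$ by analogy with $S_{xx}$ is an assumption, and the function name is hypothetical.

```python
def residual_variance(xs, ys):
    """Return (rss_direct, rss_formula, sigma2_hat) for a least-squares straight-line fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    # Residual sum of squares directly from the fitted values
    rss_direct = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))
    # The same quantity from the shortcut formula
    rss_formula = s_yy - b_hat * s_xy
    return rss_direct, rss_formula, rss_formula / (n - 2)


xs = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
ys = [240, 181, 193, 155, 172, 110, 113, 75, 94]
rss_direct, rss_formula, sigma2_hat = residual_variance(xs, ys)
print(f"RSS = {rss_direct:.1f}, estimated variance about the line = {sigma2_hat:.1f}")
```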
![Page 37: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/37.jpg)
Which of the following plots would have the greatest residual sum of squares [variance of $y$ about the fitted line]?
Question from Derek Bruff
[Figure: three scatter plots with fitted lines, options 1 to 3; the slide's image labels mark option 3 as correct. Poll response chart: 11%, 72%, 17%]
![Page 38: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/38.jpg)
Confidence interval for the slope, b
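Reminder: Normal data with unknown variance, confidence interval for $\mu$ is:
$$\bar{X} - t_{n-1}\sqrt{\frac{s^2}{n}} \quad\text{to}\quad \bar{X} + t_{n-1}\sqrt{\frac{s^2}{n}}$$
$s^2/n$ is the estimate of $\sigma^2/n$, the variance of $\bar{X}$
It can be shown that $\mathrm{var}(\hat{b}) = \sigma^2/S_{xx}$, estimated by $\hat{\sigma}^2/S_{xx}$ ($n-2$ degrees of freedom).
Confidence interval for b is
$$\hat{b} - t_{n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \quad\text{to}\quad \hat{b} + t_{n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$
E.g. if you want to see if $b$ is significantly non-zero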
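Applied to the earlier example data, the slope interval can be sketched as below. The critical value 2.3646 (7 degrees of freedom, Q = 0.975) is a standard t-table entry and does not appear in the slides; the function name is hypothetical.

```python
import math

T_975_7DOF = 2.3646  # standard t-table value for Q = 0.975 with n - 2 = 7 d.o.f. (assumed)


def slope_interval(xs, ys, t_value):
    """Return (b_hat, low, high): b_hat +/- t * sqrt(sigma2_hat / S_xx)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b_hat = s_xy / s_xx
    sigma2_hat = (s_yy - b_hat * s_xy) / (n - 2)
    half_width = t_value * math.sqrt(sigma2_hat / s_xx)
    return b_hat, b_hat - half_width, b_hat + half_width


xs = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
ys = [240, 181, 193, 155, 172, 110, 113, 75, 94]
b_hat, low, high = slope_interval(xs, ys, T_975_7DOF)
print(f"b_hat = {b_hat:.3f}, 95% CI: {low:.3f} to {high:.3f}")
```

If the interval excludes zero, as it does for this data, the slope is significantly non-zero at the 95% level.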
![Page 39: Stats for Engineers Lecture 7](https://reader036.fdocuments.net/reader036/viewer/2022062301/56816172550346895dd0fcee/html5/thumbnails/39.jpg)
Predictions
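[Figure: scatter plot of y (100 to 250) against x (0 to 40) with the fitted line]
For given $x$ of interest, what is mean $y$?
Predicted mean value: $\hat{y} = \hat{a} + \hat{b}x$.
What is the error bar?
It can be shown that $\mathrm{var}(\hat{y}) = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right)$
Confidence interval for mean y at given x: $\hat{y} \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right)}$
Extrapolation: Often not reliable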
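The error bar on the predicted mean can be sketched with the standard textbook result $\mathrm{var}(\hat{y}) = \sigma^2\,(1/n + (x-\bar{x})^2/S_{xx})$. The slide's own formula was not recovered, so treat this as an assumption. The t value 2.3646 (7 degrees of freedom) is a standard table entry, and the function name is hypothetical.

```python
import math

T_975_7DOF = 2.3646  # standard t-table value for Q = 0.975 with 7 d.o.f. (assumed)


def predict_mean(x0, xs, ys, t_value):
    """Return (y_hat, half_width) for the mean of y at x0 from a straight-line fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    sigma2_hat = (s_yy - b_hat * s_xy) / (n - 2)
    y_hat = a_hat + b_hat * x0
    # The error bar grows with the distance of x0 from the sample mean of x
    half_width = t_value * math.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / s_xx))
    return y_hat, half_width


xs = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
ys = [240, 181, 193, 155, 172, 110, 113, 75, 94]

y_mid, hw_mid = predict_mean(24.5, xs, ys, T_975_7DOF)  # at the sample mean of x
y_far, hw_far = predict_mean(60.0, xs, ys, T_975_7DOF)  # extrapolating beyond the data
print(f"x = 24.5: {y_mid:.1f} +/- {hw_mid:.1f}")
print(f"x = 60.0: {y_far:.1f} +/- {hw_far:.1f}")
```

The interval is narrowest at $x = \bar{x}$ and widens away from it, which is one reason extrapolation is often not reliable. The formula also cannot detect a departure from linearity outside the observed range.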