More Powerpoint Slides for Least Squares Lines and Inference
Transcript of More Powerpoint Slides for Least Squares Lines and Inference
Least Squares Regression
Fitting a Line to Bivariate Data
Correlation tells us about strength (scatter) and direction of the linear relationship between two quantitative variables.
In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description.
But which line best describes our data?
Distances between the points and line are squared so all are positive values. This is done so that distances can be properly added (Pythagoras).
The regression lineThe least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible.
Properties
is the predicted y value (y hat) b1 is the slopeb0 is the y-intercept
ˆ y
“b0" is in units of y"b1" is in units of y / units of x
The least-squares regression line can be shown to have this equation:
0 1y b b x 1
y
x
sb r
swhere
0 1b y b x
1y
x
sb r
s
First we calculate the slope of the line, b; from statistics we already know:
r is the correlation.sy is the standard deviation of the response variable y.sx is the the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:
0 1b y b x where x and y are the sample means of the x and y variables
How to:
This means that we don't have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation.
But typically, we use a 2-var stats calculator or stats software.
BEWARE!!!Not all calculators and software use the same convention:
Some use instead:
bxay ˆ
ˆ y ax b
Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.
Software output
interceptslope
R2
absolute value of rR2
interceptslope
The equation completely describes the regression line.
NOTE: The regression line always passes through the point with coordinates (xbar, ybar).
The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get the wrong line. Recall that
Regression examines the distance of all points from the line in the y direction only.
Hubble telescope data about galaxies moving away from earth:
These two lines are the two regression lines calculated either correctly (x = distance, y = velocity, solid line) or incorrectly (x = velocity, y = distance, dotted line).
1y
x
sb r
s
In regression we examine the variation in the response variable (y) given change in the explanatory variable (x).
The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.
Correlation versus regression
Making predictions: interpolationThe equation of the least-squares regression allows to predict y for
any x within the range studied. This is called interpolating.
ˆ y 0.0144x 0.0008 Nobody in the study drank 6.5 beers, but by finding the value of from the regression line for x = 6.5 we would expect a blood alcohol content of 0.094 mg/ml.
mg/ml 0944.00008.0936.0ˆ0008.05.6*0144.0ˆ
yy
y
Year Powerboats Dead Manatees1977 447 131978 460 211979 481 241980 498 161981 513 241982 512 201983 526 151984 559 341985 585 331986 614 331987 645 391988 675 431989 711 501990 719 47
There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths.
(in 1000’s)
The least squares regression line has the equation:
1.214.415.62ˆ 4.41)500(125.0ˆ yy
Roughly 21 manatees.
Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?
ˆ y 0.125x 41.4
ˆ y 0.125x 41.4
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.
This can be a very stupid thing to do, as seen here.
Hei
ght i
n In
ches
Hei
ght i
n In
ches !!!
!!!
Extrapolation
Sometimes the y-intercept is not a realistic possibility. Here we have
negative blood alcohol content, which makes no sense…
y-intercept shows negative blood alcoholBut the negative value is
appropriate for the equation
of the regression line.
There is a lot of scatter in the
data, and the line is just an
estimate.
The y intercept
R-squared = r2; the proportion of y-variation explained by changes in x.
r2 represents the proportion of
the variation in y (vertical scatter
from the regression line) that can
be explained by changes in x. 1
y
x
sb r
s
r2, the coefficient of determination, is the square of the correlation coefficient.
r = -1r2 = 1
Changes in x explain 100% of the variations in y.
Y can be entirely predicted for any given value of x.
r = 0r2 = 0
Changes in x explain 0% of the variations in y.
The value(s) y takes is (are) entirely independent of what value x takes.
Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
r = 0.87r2 = 0.76
Example: SAT scores
SAT Mean per State vs % Seniors Taking Test
y = -2.2375x + 1023.4R2 = 0.7542
820
870
920
970
1020
1070
1120
0 10 20 30 40 50 60 70 80
% of Seniors Taking Test
Mea
n SA
T Sc
ore
SAT scores: calculations
33.882 24.103 947.549 62.1 .868
,
62.1slope .868 2.2363524.103
intercept 947.549 ( 2.236)33.882 1023.309ˆleast squares prediction line 1023.309 2.236
x y
y
x
x s y s r
sb r a y bx
s
b
ay x
SAT scores: result
SAT Mean per State vs % Seniors Taking Test
y = -2.2375x + 1023.4R2 = 0.7542
820
870
920
970
1020
1070
1120
0 10 20 30 40 50 60 70 80
% of Seniors Taking Test
Mea
n SA
T Sc
ore
r2 = (-.868)2 = .7534
About 75% of the variation in state mean SAT scores is explained by differences in the % of seniors that take the test.
If 57% of NC seniors take the SAT, the predicted mean score is
ˆ 1023.309 2.23635(57) 895.84y
There is a great deal of variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that was overlooked here.
In the first plot, number of beers only explains 49% of the variation in blood alcohol content.
But number of beers / weight explains 81% of the variation in blood alcohol content.
Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).
We changed number of beers to number of beers/weight of person in lb.
r =0.7r2 =0.49
r =0.9r2 =0.81
Grade performance
If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?
1. We need to make an assumption: attendance and grades are positively correlated. So r will be positive too.
2. r2 = 0.16, so r = +√0.16 = + 0.4
A weak correlation.
Transforming relationshipsA scatterplot might show a clear relationship between two quantitative variables, but issues of influential points or non linearity prevent us from using correlation and regression tools.
Transforming the data – changing the scale in which one or both of the variables are expressed – can make the shape of the relationship linear in some cases.
Example: Patterns of growth are often exponential, at least in their initial phase. Changing the response variable y into log(y) or ln(y) will transform the pattern from an upward-curved exponential to a straight line.
0
1000
2000
3000
4000
5000
0 30 60 90 120 150 180 210 240
Time (min)
Bac
teria
l cou
nt
0
1
2
3
4
0 30 60 90 120 150 180 210 240
Time (min)Lo
g of
bac
teria
l cou
nt
Exponential bacterial growth
log(2n) = n*log(2) ≈ 0.3n
Taking the log changes the growth pattern into a straight line.
In ideal environments, bacteria multiply through binary fission. The number of bacteria can double every 20 minutes in that way.
1 - 2 - 4 - 8 - 16 - 32 - 64 - …
Exponential growth 2n,not suitable for regression.
r = 0.86, but this is misleading.
The elephant is an influential point. Most mammals are very small in comparison.
Without this point, r = 0.50 only.
Body weight and brain weight in 96 mammal species
Now we plot the log of brain weight against the log of body weight.
The pattern is linear, with r = 0.96. The vertical scatter is homogenous → good for predictions of brain weight from body weight (in the log scale).
Inference for least squares linesInference for simple linear regression
Simple linear regression model
Conditions for inference
Confidence interval for regression parameters
Significance test for the slope
Confidence interval for E(y) for a given x
Prediction interval for y for a given x
The data in a scatterplot are a random
sample from a population that may
exhibit a linear relationship between x
and y. Different sample different plot.
ˆ y 0.125x 41.4
Now we want to describe the population mean response E(y) as a function of the explanatory
variable x: E(y)= + x.
And to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).
Simple linear regression modelIn the population, the linear regression equation is E(y) = 0 + 1x.
Sample data then fits the model:
Data = fit + residual yi = (0 + 1xi) + (i)
where the i are
independent and Normally distributed N(0,).
Linear regression assumes equal standard deviation of y ( is the same for all values of x).
E(y) = + x
The intercept , the slope , and the standard deviation of y are the
unknown parameters of the regression model We rely on the random
sample data to provide unbiased estimates of these parameters.
The value of ŷ from the least-squares regression line is really a prediction of
the mean value of y (E(y)) for a given value of x.
The least-squares regression line (ŷ = b0 + b1x) obtained from sample data
is the best estimate of the true population regression line (E(y) = + x).
ŷ unbiased estimate for mean response E(y)
b0 unbiased estimate for intercept 0
b1 unbiased estimate for slope
The regression standard error, s, for n sample data points is
calculated from the residuals (yi – ŷi):
se is an unbiased estimate of the regression standard deviation
The population standard deviation for y at any given value of x represents the spread of the normal distribution of
the i around the mean E(y) .
2 2ˆ( )
2 2i i
e
residual y ys
n n
Conditions for inference The observations are independent.
The relationship is indeed linear.
The standard deviation of y, σ, is the same for all values of x.
The response y varies normally around its mean.
Using residual plots to check for regression validityThe residuals (y − ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in
a residual plot:
If residuals are scattered randomly around 0 with uniform variation, it
indicates that the data fit a linear model, have normally distributed
residuals for each value of x, and constant standard deviation σ.
Residuals are randomly scattered good!
Curved pattern the relationship is not linear.
Change in variability across plot σ not equal for all values of x.
What is the relationship between the average speed a car is driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
Residual plot:
The spread of the residuals is
reasonably random—no clear pattern.
The relationship is indeed linear.
But we see one low residual (3.8, −4)
and one potentially influential point
(2.5, 0.5).
Normal quantile plot for residuals:
The plot is fairly straight, supporting
the assumption of normally distributed
residuals.
Data okay for inference.
Slide 1- 35
Standard Error for the Slope Three aspects of the scatterplot affect the standard error
of the regression slope: spread around the line, se
spread of x values, sx
sample size, n.
The formula for the standard error (which you will probably never have to calculate by hand) is:
1 1e
x
sSE bn s
Confidence interval for 1
Estimating the regression parameters 0, 1 is a case of one-sample
inference with unknown population standard deviation.
We rely on the t distribution, with n – 2 degrees of freedom.
A level C confidence interval for the slope, 1, is proportional to the
standard error of the least-squares slope:
b1 ± t* SE(b1)t* is the t critical for the t (n – 2) distribution with area C between –t* and +t*.
We estimate the standard error of b1 with
where
n is the sample size, sx is the ordinary standard deviation of the x values
1 1e
x
sSE bn s
2ˆ
2e
y ys
n
Confidence interval for 0
A level C confidence interval for the intercept, 0 , is proportional to
the standard error of the least-squares intercept:
b0 ± t* SEb0
The intercept usually isn’t interesting. Most hypothesis tests and confidence intervals for regression are about the slope.
Hypothesis test for the slopeWe may look for evidence of a significant relationship between
variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope
parameter β1 is equal to zero.
H0: β1 = 0 vs. Ha: β1 ≠ 0
Testing H0: β1 = 0 also allows to test the hypothesis of no
correlation between x and y in the population.
Note: A test of hypothesis for 0 is irrelevant (0 is often not even achievable).
1slope y
x
sb r
s
Hypothesis test for the slope (cont.)We usually test the hypothesis H0: β1 = 0 vs. Ha: β1 ≠ 0 but we can also
test H0: β1 = 0 vs. Ha: β1 < 0 or H0: β1 = 0 vs. Ha: β1 > 0
To do this we calculate the test statistic
Use the t dist. with n – 2
df to find the
P-value of the test.
Note: Software typically provides
two-sided p-values.
1
1
0btSE b
SPSS
Using technologyComputer software runs all the computations for regression analysis.
Here is some software output for the car speed/gas efficiency example.
SlopeIntercept
p-value for tests of significance
Confidence intervals
The t-test for regression slope is highly significant (p < 0.001). There is a
significant relationship between average car speed and gas efficiency.
Excel
SAS
“intercept”: intercept“logmph”: slope
P-value for tests of significance
confidence intervals
Confidence Intervals and Prediction Intervals for
Predicted Values Once we have a useful regression, how can we indulge our natural desire to predict, without being irresponsible?
Now we have standard errors—we can use those to construct a confidence interval for the predictions and to report our uncertainty honestly.
Slide 1- 43
An Example: Body Fat and Waist Size Consider an example that involves investigating the
relationship in adult males between % Body Fat and Waist size (in inches). Here is a scatterplot of the data for 250 adult males of various ages:
Confidence Intervals and Prediction Intervals for Predicted
Values(cont.) For our %body fat and waist size example, there are two questions we could ask:1. Do we want to know the mean %body fat for all men
with a waist size of, say, 38 inches?2. Do we want to estimate the %body fat for a particular
man with a 38-inch waist?
The predicted %body fat is the same in both questions, but we can predict the mean %body fat for all men whose waist size is 38 inches with a lot more precision than we can predict the %body fat of a particular individual whose waist size happens to be 38 inches.
Confidence Intervals and Prediction Intervals for Predicted
Values(cont.) We start with the same prediction in both cases. We are predicting for a new individual, one
that was not in the original data set. Call his x-value xν. The regression predicts %body fat as
0 1y b b x
Confidence Intervals and Prediction Intervals for Predicted
Values(cont.) Both intervals take the form
The SE’s will be different for the two questions we have posed.
2ˆ ny t SE
Confidence Intervals and Prediction Intervals for Predicted
Values(cont.)1. The standard error of the mean predicted value is:
2. Individuals vary more than means, so the standard error for a single predicted value is larger than the standard error for the mean:
2
221ˆ esSE SE b x x
n
2
22 21ˆ e
es
SE y SE b x x sn
Confidence Intervals and Prediction Intervals for Predicted
Values (cont.)Confidence interval for Prediction interval for y
2
2* 22 1ˆ e
n
sy t SE b x x
n 2
2* 2 2
2 1ˆ e
n e
sy t SE b x x s
n
Confidence Intervals for Predicted Values Here’s a look at the difference
between predicting for a mean and predicting for an individual.
The solid green lines near the regression line show the 95% confidence intervals for the mean predicted value, and the dashed red lines show the prediction intervals for individuals.
The solid green lines and the dashed red lines curve away from the least squares line as x moves farther away from xbar.
More on confidence intervals for
As seen on the preceding slides, we can calculate a confidence
interval for the population mean of all responses y when x takes the
value x (within the range of data tested): denote this expected value
E(y) by
This interval is centered on ŷ, the unbiased estimate of .
The true value of the population mean at a particular
value x, will indeed be within our confidence
interval in C% of all intervals calculated
from many different random samples.
The level C confidence interval for the mean response μ at a given
value x of x is centered on ŷ (unbiased estimate of μ):
A separate confidence interval is
calculated for μ along all the values
that x takes.
Graphically, the series of confidence
intervals is shown as a continuous
interval on either side of ŷ.
t* is the t critical for the t (n – 2) distribution with area C between –t* and +t*.
95% confidence bands for as
x varies over all x values
*2ˆ ˆ( )ny t SE
More on prediction intervals for yOne use of regression is for predicting the value of y, ŷ, for any value
of x within the range of data tested: ŷ = b0 + b1x.
But the regression equation depends on the particular sample drawn.
More reliable predictions require statistical inference:
To estimate an individual response y for a given value of x, we use a
prediction interval. If we randomly sampled many times, there
would be many different values of y
obtained for a particular x following
N(0, σ) around the mean response μ.
The level C prediction interval for a single observation on y when
x takes the value x is:
The prediction interval represents
mainly the error from the normal
distribution of the residuals i.
Graphically, the series confidence
intervals is shown as a continuous
interval on either side of ŷ.
95% prediction interval for ŷ as
x varies over all x values
t* is the t critical for the t (n – 2) distribution with area C between –t* and +t*.
*2ˆ ˆ( )ny t SE y
The confidence interval for μ contains with C% confidence the
population mean μ of all responses at a particular value x.
The prediction interval contains C% of all the individual values
taken by y at a particular value x.
95% prediction interval for y
95% confidence interval for
Estimating uses a smaller
confidence interval than estimating
an individual in the population
(sampling distribution narrower than
population
distribution).
1918 influenza epidemicDate # Cases # Deathsweek 1 36 0week 2 531 0week 3 4233 130week 4 8682 552week 5 7164 738week 6 2229 414week 7 600 198week 8 164 90week 9 57 56week 10 722 50week 11 1517 71week 12 1828 137week 13 1539 178week 14 2416 194week 15 3148 290week 16 3465 310week 17 1440 149
0100020003000400050006000700080009000
10000
0100200300400500600700800
1918 influenza epidemic
0100020003000400050006000700080009000
10000
# ca
ses
diag
nose
d
0100200300400500600700800
# de
aths
repo
rted
# Cases # Deaths
1918 flu epidemics
The line graph suggests that 7 to 9% of those diagnosed with the flu died within about a week of
diagnosis.
We look at the relationship between the number of deaths in a given week and the number of new
diagnosed cases one week earlier.
EXCEL Regression Statistics Multiple R 0.911 R Square 0.830 Adjusted R Square 0.82 Standard Error 85.07 Observations 16.00
Coefficients St. Error t Stat P-value Lower 95% Upper 95% Intercept 49.292 29.845 1.652 0.1209 (14.720) 113.304 FluCases0 0.072 0.009 8.263 0.0000 0.053 0.091
1918 flu epidemic: Relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.
1( )SE b
s
r = 0.91
P-value for H0: β1 = 0
b1
P-value very small reject H0 β1 significantly different from 0
There is a significant relationship between the number of flu
cases and the number of deaths from flu a week later.
SPSS
Least squares regression line95% prediction interval for y
95% confidence interval for y
CI for mean weekly death
count one week after x=
4000 flu cases are
diagnosed: µ within about
300–380.
Prediction interval for a
weekly death count one
week after x= 4000 flu
cases are diagnosed: y
within about 180–500
deaths.
What is this?
A 90% prediction interval
for the height (above) and
a 90% prediction interval for
the weight (below) of male
children, ages 3 to 18.