Section 12.1 The Simple Regression Modellonghai/teaching/2019/stat245... · 2019. 12. 18. ·...
Section 12.1 The Simple Regression Model
1/61
A Motivating Example
Visual and musculoskeletal problems associated with the use of video display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout” (Ergonomics, 1996: 877–884). The order in which observations were obtained was not given, so for convenience they are listed in increasing order of x values.
2/61
Original Data
3/61
Scatterplot of Data
4/61
Another Example
Forest growth and decline phenomena throughout the world have attracted considerable public and scientific interest. The article “Relationships Among Crown Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes” (Canad. J. Forest Res., 1995: 386–397) included a scatter plot of y = mean crown dieback (%), one indicator of growth retardation, and x = soil pH (higher pH corresponds to less acidic soil), from which the following observations were taken:
5/61
Scatterplot of Data
6/61
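A Linear Probabilistic Model
$Y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$
Or in another expression
$Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, independently for $i = 1, \dots, n$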
Graphically
7/61
Understanding the regression line and $\sigma^2$
1) The population regression line is the line of mean Y values given fixed x.
2) The second sequence of equalities tells us that the amount of variability in the distribution of Y is the same at any particular x value as it is at any other x value; this is the property of homogeneous variation about the population regression line.
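The two properties above can be checked by simulation. The following is a minimal sketch, assuming normal errors and made-up parameter values (`beta0`, `beta1`, `sigma` are hypothetical, not taken from any example in these slides): the sample mean of Y should track $\beta_0 + \beta_1 x$, while the sample standard deviation should stay near $\sigma$ at every x.

```python
import random
import statistics


def simulate_y(x, beta0, beta1, sigma, n, rng):
    """Draw n values of Y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma^2)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]


rng = random.Random(245)               # fixed seed so the run is repeatable
beta0, beta1, sigma = 1.0, 2.0, 0.5    # hypothetical parameter values

summary = {}
for x in (0.0, 1.0, 2.0):
    ys = simulate_y(x, beta0, beta1, sigma, 20000, rng)
    # Mean should be close to beta0 + beta1 * x; sd should be close to sigma.
    summary[x] = (statistics.fmean(ys), statistics.stdev(ys))
    print(f"x = {x}: mean of Y = {summary[x][0]:.3f}, sd of Y = {summary[x][1]:.3f}")
```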
8/61
Understanding the regression line and $\sigma^2$
9/61
Section 12.2 Estimating Model Parameters
10/61
Estimating Model Parameters
Intuition
11/61
PRINCIPLE OF LEAST SQUARES
12/61
Obtaining Least Squares Estimators
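As a rough illustration, the usual closed-form least squares estimates are $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. The sketch below computes them in plain Python; the function name `least_squares` and the data are made up for illustration and are not from the slides.

```python
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                        # S_xx
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # S_xy
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```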
13/61
14/61
Example
Global warming is a major issue, and CO2 emissions are an important part of the discussion. What is the effect of increased CO2 levels on the environment? In particular, what is the effect of these higher levels on the growth of plants and trees? The article “Effects of Atmospheric CO2 Enrichment on Biomass Accumulation and Distribution in Eldarica Pine Trees” (J. Experiment. Botany, 1994: 345–349) describes the results of growing pine trees with increasing levels of CO2 in the air. There were two trees at each of four levels of CO2 concentration, and the mass of each tree was measured after 11 months of the experiment. Here are the observations with x = atmospheric concentration of CO2 (μL/L, or ppm) and y = tree mass
(kg), along with and . The mass measurements were read from a graph in the article.
15/61
Data and Estimates of Coefficients
16/61
Scatterplot with Estimated Regression Line
17/61
Estimating $\sigma^2$
Understanding $\sigma^2$
18/61
Fitted Values and Residuals
19/61
SSE and Estimator for $\sigma^2$
A Short-cut Formula for Computing SSE
$SSE = \sum y_i^2 - \hat\beta_0 \sum y_i - \hat\beta_1 \sum x_i y_i$
Note: when using the above short-cut formula, $\hat\beta_0$ and $\hat\beta_1$ must be carried to many more digits than are wanted in SSE. Otherwise, large round-off error would occur.
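The round-off warning can be seen numerically. The sketch below assumes the standard short-cut $SSE = \sum y_i^2 - \hat\beta_0 \sum y_i - \hat\beta_1 \sum x_i y_i$ and compares it with the direct sum of squared residuals on made-up data; the helper names are hypothetical. With full-precision estimates the two agree, but rounding the slope to one decimal place makes the short-cut value negative, which is impossible for a sum of squares.

```python
def fit(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def sse_from_residuals(x, y, b0, b1):
    """Definition: sum of squared residuals."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def sse_shortcut(x, y, b0, b1):
    """Short-cut formula; valid only for the exact least squares b0, b1."""
    sum_y2 = sum(yi ** 2 for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return sum_y2 - b0 * sum(y) - b1 * sum_xy


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit(x, y)

exact = sse_from_residuals(x, y, b0, b1)
short = sse_shortcut(x, y, b0, b1)
short_rounded = sse_shortcut(x, y, b0, round(b1, 1))   # slope rounded to 2.0
print(exact, short, short_rounded)
```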
20/61
Example
The article “Promising Quantitative Nondestructive Evaluation Techniques for Composite Materials” (Materials Eval., 1985: 561–565) reports on a study to investigate how the propagation of an ultrasonic stress wave through a substance depends on the properties of the substance. The accompanying data on fracture strength (x, as a percentage of ultimate tensile strength) and attenuation (y, in neper/cm, the decrease in amplitude of the stress wave) in fiberglass-reinforced polyester composites was read from a graph that appeared in the article. The simple linear regression model is suggested by the substantial linear pattern in the scatter plot.
21/61
Estimating the Regression Line and $\sigma^2$
22/61
The Coefficient of Determination
Different Strength of Linear Effects
23/61
SSR = SST - SSE
24/61
The Coefficient of Determination
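A small sketch of the computation, assuming the usual definition $r^2 = 1 - SSE/SST = SSR/SST$ with $SST = \sum (y_i - \bar y)^2$. The function name and data are made up for illustration.

```python
def coefficient_of_determination(x, y):
    """Return (sst, sse, r2) for the least squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sst = sum((yi - ybar) ** 2 for yi in y)                           # total variation
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))     # unexplained
    return sst, sse, 1.0 - sse / sst


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sst, sse, r2 = coefficient_of_determination(x, y)
ssr = sst - sse   # variation explained by the regression
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}, r^2 = {r2:.4f}")
```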
25/61
Example
26/61
Reading Outputs of Statistical Programs
An example of MINITAB results. R produces similar output.
27/61
Section 12.3 Inferences About the Regression Coefficient
28/61
Simulated Estimates of $\beta_0$ and $\beta_1$
Look at the regression lines simulated in R.
Another one
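The idea behind such simulated lines can be sketched in Python as well (this is not the R code used for the slides). Assuming fixed x values and hypothetical true parameters, each simulated sample gives a different fitted slope; across many samples the slopes should centre on $\beta_1$ with standard deviation close to $\sigma/\sqrt{S_{xx}}$.

```python
import math
import random
import statistics


def fitted_slope(x, y):
    """Least squares slope S_xy / S_xx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(12)                   # fixed seed for repeatability
x = [float(i) for i in range(1, 11)]      # fixed design points 1, ..., 10
beta0, beta1, sigma = 1.0, 2.0, 1.0       # hypothetical true parameters

slopes = []
for _ in range(5000):
    # New errors each time, same x values: one simulated data set per pass.
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(fitted_slope(x, y))

xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)
mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
theory_sd = sigma / math.sqrt(sxx)
print(f"mean slope = {mean_slope:.4f}, sd = {sd_slope:.4f}, theory sd = {theory_sd:.4f}")
```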
29/61
Expressing $\hat\beta_1$ as a linear function of $Y_i$
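For reference, one standard way to write the slope estimator as a linear combination of the responses is sketched below. The weights $c_i$ are notation introduced here and may differ from what was used on the slide.

```latex
\hat\beta_1
  = \frac{S_{xy}}{S_{xx}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar x)(Y_i - \bar Y)}{S_{xx}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar x)\, Y_i}{S_{xx}}
  = \sum_{i=1}^{n} c_i Y_i,
\qquad
c_i = \frac{x_i - \bar x}{S_{xx}},
\quad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 .
```

The third equality uses $\sum_{i=1}^{n} (x_i - \bar x)\,\bar Y = 0$. Because the $c_i$ depend only on the fixed x values, $\hat\beta_1$ is a linear function of independent normal variables.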
30/61
Sampling Distribution of $\hat\beta_1$
Proof: on blackboard.
31/61
Sampling Distribution of T
Proof:
32/61
A Confidence Interval for $\beta_1$
An upper or lower confidence bound for $\beta_1$ is:
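A minimal sketch of the two-sided interval, assuming the standard form $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\, s/\sqrt{S_{xx}}$ with $s^2 = SSE/(n-2)$. The critical value is passed in from a t table rather than computed; the function name and data are made up for illustration. A one-sided bound would use $t_{\alpha,\,n-2}$ and only one of the two endpoints.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Two-sided CI for the slope: b1 +/- t_crit * s / sqrt(S_xx)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))            # estimate of sigma
    half_width = t_crit * s / math.sqrt(sxx)
    return b1 - half_width, b1 + half_width


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
# 3.182 is the tabulated t critical value for 95% confidence with n - 2 = 3 df.
lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```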
33/61
Example
Is it possible to predict graduation rates from freshman test scores? Based on the average SAT score of entering freshmen at a university, can we predict the percentage of those freshmen who will get a degree there within six years? We use a random sample of 20 universities from the 248 national universities listed in the 2005 edition of America’s Best Colleges, published by U.S. News & World Report.
34/61
Scatterplot
35/61
36/61
A Closer Look at the Dataset
37/61
Hypothesis-Testing Procedures
38/61
Example
In the previous SAT score example, we want to test:
vs
39/61
Regression and ANOVA
Testing: $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$
SSR = SST - SSE
When $H_0$ is true, $F = \dfrac{SSR/1}{SSE/(n-2)} \sim F_{1,\,n-2}$.
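In simple linear regression the ANOVA F statistic for $H_0: \beta_1 = 0$ equals the square of the t statistic for the slope. The sketch below checks this numerically on made-up data, assuming $F = SSR / (SSE/(n-2))$ and $t = \hat\beta_1 / (s/\sqrt{S_{xx}})$; the function name is hypothetical.

```python
import math


def slope_t_and_anova_f(x, y):
    """Return (t, f): the slope t statistic and the ANOVA F statistic."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sst = sum((yi - ybar) ** 2 for yi in y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sst - sse
    s2 = sse / (n - 2)                 # mean squared error
    t = b1 / math.sqrt(s2 / sxx)       # tests H0: beta1 = 0
    f = (ssr / 1) / s2                 # regression has 1 degree of freedom
    return t, f


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t, f = slope_t_and_anova_f(x, y)
print(f"t = {t:.3f}, t^2 = {t * t:.3f}, F = {f:.3f}")
```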
40/61
Reading SAS Outputs of Regression Analysis
For SAT score example:
41/61
Section 12.4: Inferences Concerning $\mu_{Y \cdot x^*}$ and the Prediction of Future Y Values
42/61
Sampling Distribution of Predicted Mean of Y
43/61
Proof:
More on blackboard.
44/61
Inferences of Mean of Y given $x^*$
45/61
Example
Refer to the SAT score example.
, .
Let’s now calculate a confidence interval, using a 95% confidence level, for the mean graduation rate for all universities having an average freshman SAT of 1200, that is, a confidence interval for $\mu_{Y \cdot 1200} = \beta_0 + \beta_1(1200)$.
The interval is centered at
46/61
Results of CI
47/61
A Prediction Interval for a Future Value of Y
Mean and Variance
48/61
Sampling Distribution
:
The interpretation of the prediction level is that if the above PI is used repeatedly, in the long run the resulting intervals will actually contain the observed y values $100(1-\alpha)\%$ of the time.
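To contrast the two intervals, here is a minimal sketch assuming the standard forms $\hat y \pm t\, s\sqrt{1/n + (x^* - \bar x)^2/S_{xx}}$ for the mean of Y at $x^*$ and $\hat y \pm t\, s\sqrt{1 + 1/n + (x^* - \bar x)^2/S_{xx}}$ for a single future Y. The function name and data are made up, and the critical value is supplied from a t table.

```python
import math


def mean_ci_and_prediction_interval(x, y, x_star, t_crit):
    """Return (ci, pi) at x_star: CI for the mean of Y and PI for a future Y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_star
    h = 1.0 / n + (x_star - xbar) ** 2 / sxx      # variance factor of y_hat
    ci_half = t_crit * s * math.sqrt(h)           # uncertainty in the mean only
    pi_half = t_crit * s * math.sqrt(1.0 + h)     # adds the new observation's error
    return (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)


# Made-up illustrative data, not from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
# 3.182 is the tabulated t critical value for 95% confidence with 3 df.
ci, pi = mean_ci_and_prediction_interval(x, y, x_star=3.0, t_crit=3.182)
print("CI for mean of Y:", ci)
print("PI for future Y: ", pi)
```

Both intervals share the same centre; the prediction interval is always the wider of the two because of the extra 1 under the square root.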
49/61
Example
For the SAT example, let's calculate a 95% prediction interval for a graduation rate that would result from selecting a single university whose average SAT is 1200. Relevant quantities from that example are
The t critical value is 2.101. The 95% prediction interval is:
50/61
Section 12.5 Correlation
51/61
Definition of sample correlation coefficient
52/61
Example
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between yield of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/ha) for eight different types of soil.
53/61
Calculating r
54/61
Properties of r
55/61
Examples of Different r
56/61
Rules of Thumb To State Strength of Linear Relationships
A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” A reasonable rule of thumb is to say that the correlation is
• weak if $0 \le |r| \le 0.5$,
• strong if $0.8 \le |r| \le 1$, and
• moderate otherwise.
It may surprise you that r = 0.5 is considered weak, but $r^2 = 0.25$ implies that in a regression of y on x, only 25% of observed y variation would be explained by the model.
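The rule of thumb is easy to encode. The sketch below makes two assumptions that go beyond the bullets as written: it applies the cut-offs to the absolute value of r, so that negative correlations are classified too, and it treats the boundary values 0.5 and 0.8 as inclusive, consistent with the remark that r = 0.5 counts as weak. The function name is made up.

```python
def correlation_strength(r):
    """Classify a sample correlation as 'weak', 'moderate', or 'strong'."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)          # assumption: the sign of r does not affect strength
    if size <= 0.5:        # assumption: boundary 0.5 counts as weak
        return "weak"
    if size >= 0.8:        # assumption: boundary 0.8 counts as strong
        return "strong"
    return "moderate"


labels = [correlation_strength(r) for r in (0.5, 0.65, -0.9)]
print(labels)
```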
57/61
Correlation Coefficient
r is an estimate (observation) of the population parameter $\rho$. The random variable R is a function of both $X_i$ and $Y_i$.
58/61
Sampling Distribution of R
Assuming that $(X_i, Y_i)$ has a bivariate normal distribution, when $\rho = 0$,
$T = \dfrac{R\sqrt{n-2}}{\sqrt{1-R^2}} \sim t_{n-2}$
Proof: The T defined here is equivalent to the T defined in Slide 32, where we have shown that T|X has a $t_{n-2}$ distribution for each X. It follows that the marginal distribution of T is $t_{n-2}$ too.
59/61
Example
Neurotoxic effects of manganese are well known and are usually caused by high occupational exposure over long periods of time. In the fields of occupational hygiene and environmental hygiene, the relationship between lipid peroxidation, which is responsible for deterioration of foods and damage to live tissue, and occupational exposure had not been previously reported. The article “Lipid Peroxidation in Workers Exposed to Manganese” (Scand. J. Work Environ. Health, 1996: 381–386) gave data on x = manganese concentration in blood (ppb) and y = concentration (μmol/L) of malondialdehyde, which is a stable product of lipid peroxidation, both for a sample of 22 workers exposed to manganese and for a control sample of 45 individuals. The value of r for the control sample was 0.29, from which
$t = \dfrac{0.29\sqrt{45-2}}{\sqrt{1-0.29^2}} \approx 1.99$
The p-value for the two-tailed test is 0.052.
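The test statistic for $H_0: \rho = 0$ can be reproduced with a few lines, assuming the usual form $t = r\sqrt{n-2}/\sqrt{1-r^2}$ and assuming that the reported r = 0.29 belongs to the control sample of 45 individuals (the sample size consistent with the quoted p-value). The function name is made up; the p-value itself is not recomputed here because it needs a t distribution table or a statistics library.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing rho = 0, referred to a t distribution with n - 2 df."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# r = 0.29 from the example; n = 45 assumes the control sample was used.
t_stat = correlation_t_statistic(0.29, 45)
print(f"t = {t_stat:.3f} with {45 - 2} degrees of freedom")
```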
60/61
Further Courses for Regression Analysis:
STAT 344: Applied Regression Analysis
Talking about regression on multiple inputs, checking models, etc.
STAT 443: Linear Models
Talking about the sampling distributions in a rigorous manner.
61/61