Section 12.1 The Simple Regression Modellonghai/teaching/2019/stat245... · 2019. 12. 18. ·...

Section 12.1 The Simple Regression Model

1/61

A Motivating Example

Visual and musculoskeletal problems associated with the use of video display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout” (Ergonomics, 1996: 877–884). The order in which observations were obtained was not given, so for convenience they are listed in increasing order of x values.

2/61

Original Data

3/61

Scatterplot of Data

4/61

Another Example

Forest growth and decline phenomena throughout the world have attracted considerable public and scientific interest. The article “Relationships Among Crown Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes” (Canad. J. Forest Res., 1995: 386–397) included a scatter plot of y = mean crown dieback (%), one indicator of growth retardation, and x = soil pH (lower pH corresponds to more acidic soil), from which the following observations were taken:

5/61

Scatterplot of Data

6/61

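A Linear Probabilistic Model

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$

Or in another expression

$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, independently for $i = 1, \dots, n$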

Graphically

7/61

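Understanding the Regression Line and $\sigma^2$

$\mu_{Y \cdot x^*} = E(\beta_0 + \beta_1 x^* + \epsilon) = \beta_0 + \beta_1 x^*$

$\sigma^2_{Y \cdot x^*} = V(\beta_0 + \beta_1 x^* + \epsilon) = \sigma^2$

1) The population regression line $y = \beta_0 + \beta_1 x$ is the line of mean $Y$ values given fixed $x$.

2) The second sequence of equalities tells us that the amount of variability in the distribution of $Y$ is the same at any particular x value as it is at any other x value. This is the property of homogeneous variation about the population regression line.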

8/61

Understanding the Regression Line and $\sigma^2$

9/61

Section 12.2 Estimating Model Parameters

10/61

Estimating Model Parameters

Intuition

11/61

PRINCIPLE OF LEAST SQUARES

12/61

Obtaining Least Squares Estimators
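As a minimal sketch, the least squares estimates can be computed from the usual closed-form expressions $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$. The function name `least_squares_fit` and the small data set are made up for illustration and do not come from any of the cited articles.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of (y_i - b0 - b1 * x_i)^2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xy and S_xx are the centered sums of cross-products and squares.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x values must not all be equal")
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
b0_demo, b1_demo = least_squares_fit(x_demo, y_demo)
print(f"intercept = {b0_demo:.4f}, slope = {b1_demo:.4f}")
```

For these made-up values the slope comes out close to 2 and the intercept close to 0, which matches the roughly linear pattern in the data.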

13/61

Example

Global warming is a major issue, and CO2 emissions are an important part of the discussion. What is the effect of increased CO2 levels on the environment? In particular, what is the effect of these higher levels on the growth of plants and trees? The article “Effects of Atmospheric CO2 Enrichment on Biomass Accumulation and Distribution in Eldarica Pine Trees” (J. Experiment. Botany, 1994: 345–349) describes the results of growing pine trees with increasing levels of CO2 in the air. There were two trees at each of four levels of CO2 concentration, and the mass of each tree was measured after 11 months of the experiment. Here are the observations with x = atmospheric concentration of CO2 (μL/L, or ppm) and y = tree mass (kg), along with the squares and cross-products needed for the least squares computations. The mass measurements were read from a graph in the article.

15/61

Data and Estimates of Coefficients

16/61

Scatterplot with Estimated Regression Line

17/61

Estimating $\sigma^2$

Understanding $\sigma^2$

18/61

Fitted Values and Residuals

19/61

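SSE and Estimator for $\sigma^2$

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat\sigma^2 = s^2 = \frac{SSE}{n-2}$

A Short-cut Formula for Computing SSE

$SSE = \sum y_i^2 - \hat\beta_0 \sum y_i - \hat\beta_1 \sum x_i y_i$

Note: When using the short-cut formula above, $\hat\beta_0$ and $\hat\beta_1$ must be carried to many more digits than are needed in SSE. Otherwise, large round-off error can occur.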
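The round-off warning can be illustrated with a small sketch that computes SSE both from the residuals and from the short-cut expression $\sum y_i^2 - \hat\beta_0 \sum y_i - \hat\beta_1 \sum x_i y_i$. The data and function names are made up for illustration; the values 0.05 and 1.99 are the least squares estimates for this particular made-up data set.

```python
def sse_direct(x, y, b0, b1):
    """SSE as the sum of squared residuals y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def sse_shortcut(x, y, b0, b1):
    """SSE via sum(y^2) - b0 * sum(y) - b1 * sum(x * y)."""
    sum_y2 = sum(yi ** 2 for yi in y)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return sum_y2 - b0 * sum_y - b1 * sum_xy


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least squares estimates for this data, carried to full precision.
b0_full, b1_full = 0.05, 1.99
# The same estimates rounded to one decimal place.
b0_round, b1_round = 0.0, 2.0

sse_full_direct = sse_direct(x_demo, y_demo, b0_full, b1_full)
sse_full_short = sse_shortcut(x_demo, y_demo, b0_full, b1_full)
sse_round_direct = sse_direct(x_demo, y_demo, b0_round, b1_round)
sse_round_short = sse_shortcut(x_demo, y_demo, b0_round, b1_round)

# Estimate of sigma^2 uses n - 2 degrees of freedom.
sigma2_hat = sse_full_direct / (len(x_demo) - 2)

print(sse_full_direct, sse_full_short)    # the two formulas agree
print(sse_round_direct, sse_round_short)  # the short-cut is badly off
```

With full-precision estimates the two formulas agree. With rounded estimates the residual-based value changes only slightly, while the short-cut value is several times too large, because it subtracts large, nearly equal quantities.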

20/61

Example

The article “Promising Quantitative Nondestructive Evaluation Techniques for Composite Materials” (Materials Eval., 1985: 561–565) reports on a study to investigate how the propagation of an ultrasonic stress wave through a substance depends on the properties of the substance. The accompanying data on fracture strength (x, as a percentage of ultimate tensile strength) and attenuation (y, in neper/cm, the decrease in amplitude of the stress wave) in fiberglass-reinforced polyester composites was read from a graph that appeared in the article. The simple linear regression model is suggested by the substantial linear pattern in the scatter plot.

21/61

Estimating the Regression Line and $\sigma^2$

22/61

The Coefficient of Determination

Different Strength of Linear Effects

23/61

SSR = SST - SSE

24/61

The Coefficient of Determination
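A minimal sketch of the decomposition SSR = SST - SSE and of the coefficient of determination $r^2 = 1 - SSE/SST$, the proportion of observed y variation explained by the fitted line. The function name and the data are made up for illustration.

```python
def sums_of_squares(x, y):
    """Return (sse, sst, ssr, r2) for the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # SSE: variation left unexplained by the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: total variation of y about its mean.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sst - sse
    r2 = 1.0 - sse / sst
    return sse, sst, ssr, r2


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
sse_demo, sst_demo, ssr_demo, r2_demo = sums_of_squares(x_demo, y_demo)
print(f"SSE = {sse_demo:.4f}, SST = {sst_demo:.4f}, r^2 = {r2_demo:.4f}")
```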

25/61

Example

26/61

Reading Outputs of Statistical Programs

An example of MINITAB results. R output looks similar.

27/61

Section 12.3 Inferences About the Regression Coefficient

28/61

Simulated Estimates of $\beta_1$

Look at the regression lines simulated in R.

Another one
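The slides show this simulation in R. As a rough Python sketch of the same idea, the code below repeatedly generates data from the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with fixed x values, refits the line each time, and compares the spread of the fitted slopes with the theoretical value $\sigma/\sqrt{S_{xx}}$. All parameter values, names, and the seed are hypothetical choices made only for illustration.

```python
import random


def fit_slope(x, y):
    """Least squares slope b1 = S_xy / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def simulate_slopes(x, beta0, beta1, sigma, n_sims, seed=0):
    """Simulate n_sims data sets from the model and return the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(fit_slope(x, y))
    return slopes


# Hypothetical settings chosen only for illustration.
x_sim = [float(i) for i in range(10)]
slopes = simulate_slopes(x_sim, beta0=1.0, beta1=2.0, sigma=1.0, n_sims=2000)

mean_slope = sum(slopes) / len(slopes)
sd_slope = (sum((b - mean_slope) ** 2 for b in slopes) / (len(slopes) - 1)) ** 0.5

# Theoretical standard deviation of the slope estimator: sigma / sqrt(S_xx).
x_bar_sim = sum(x_sim) / len(x_sim)
s_xx_sim = sum((xi - x_bar_sim) ** 2 for xi in x_sim)
theoretical_sd = 1.0 / s_xx_sim ** 0.5

print(f"mean of slopes = {mean_slope:.3f}")
print(f"sd of slopes = {sd_slope:.3f}, theoretical = {theoretical_sd:.3f}")
```

The fitted slopes scatter around the true slope, and their standard deviation is close to the theoretical value, which is the behavior the sampling distribution results in this section describe.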

29/61

Expressing $\hat\beta_1$ as a Linear Function of $Y_1, \dots, Y_n$

30/61

Sampling Distribution of $\hat\beta_1$

Proof: on blackboard.

31/61

Sampling Distribution of $T$

Proof:

32/61

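A Confidence Interval for $\beta_1$

A $100(1-\alpha)\%$ confidence interval for $\beta_1$ is:

$\hat\beta_1 \pm t_{\alpha/2, n-2} \cdot s_{\hat\beta_1}$, where $s_{\hat\beta_1} = s / \sqrt{S_{xx}}$

An upper or lower confidence bound for $\beta_1$ is:

$\hat\beta_1 + t_{\alpha, n-2} \cdot s_{\hat\beta_1}$ or $\hat\beta_1 - t_{\alpha, n-2} \cdot s_{\hat\beta_1}$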
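A minimal sketch of the two-sided interval for the slope, $\hat\beta_1 \pm t_{\alpha/2, n-2} \cdot s / \sqrt{S_{xx}}$. To keep the example dependency-free, the t critical value is passed in by the caller (for example from a t table); the function name and the data are made up for illustration.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * s / sqrt(S_xx)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # estimate of sigma
    se_b1 = s / math.sqrt(s_xx)    # estimated standard error of the slope
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

# t critical value for a 95% interval with n - 2 = 3 degrees of freedom.
lower, upper = slope_confidence_interval(x_demo, y_demo, t_crit=3.182)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

A one-sided bound uses the same standard error with $t_{\alpha, n-2}$ in place of $t_{\alpha/2, n-2}$.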

33/61

Example

Is it possible to predict graduation rates from freshman test scores? Based on the average SAT score of entering freshmen at a university, can we predict the percentage of those freshmen who will get a degree there within six years? We use a random sample of 20 universities from the 248 national universities listed in the 2005 edition of America’s Best Colleges, published by U.S. News & World Report.

34/61

Scatterplot

35/61

A Closer Look at the Dataset

37/61

Hypothesis-Testing Procedures

38/61

Example

In the previous SAT score example, we want to test:

$H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$

39/61

Regression and ANOVA

Testing: $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$

SSR = SST - SSE

When $H_0$ is true, $F = \frac{SSR/1}{SSE/(n-2)} \sim F_{1, n-2}$.
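In simple linear regression the ANOVA F statistic for $H_0: \beta_1 = 0$ equals the square of the t statistic $\hat\beta_1 / s_{\hat\beta_1}$. The sketch below checks this numerically; the function name and the data are made up for illustration.

```python
import math


def anova_f_and_t(x, y):
    """Return (f_stat, t_stat) for testing that the slope is zero."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sst - sse
    mse = sse / (n - 2)                  # error mean square, n - 2 df
    f_stat = (ssr / 1) / mse             # regression mean square has 1 df
    t_stat = b1 / math.sqrt(mse / s_xx)  # slope divided by its standard error
    return f_stat, t_stat


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
f_demo, t_demo = anova_f_and_t(x_demo, y_demo)
print(f"F = {f_demo:.2f}, t^2 = {t_demo ** 2:.2f}")
```

Because the two statistics carry the same information, the F test and the two-sided t test for the slope give the same p-value.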

40/61

Reading SAS Outputs of Regression Analysis

For the SAT score example:

41/61

Section 12.4: Inferences Concerning $\mu_{Y \cdot x^*}$ and the Prediction of Future Y Values

42/61

Sampling Distribution of Predicted Mean of $Y$

43/61

Proof:

More on blackboard.

44/61

Inferences on the Mean of $Y$ Given $x^*$

45/61

Example

Refer to the SAT score example.

, .

Let’s now calculate a confidence interval, using a 95% confidence level, for the mean graduation rate for all universities having an average freshman SAT of 1200, that is, a confidence interval for

$\beta_0 + \beta_1(1200)$.

The interval is centered at $\hat{y} = \hat\beta_0 + \hat\beta_1(1200)$.

46/61

Results of CI

47/61

A Prediction Interval for a Future Value of $Y$

Mean and Variance

48/61

Sampling Distribution

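A $100(1-\alpha)\%$ PI for a future $Y$ observation to be made when $x = x^*$:

$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$

The interpretation of the prediction level is that if the above PI is used repeatedly, in the long run the resulting intervals will actually contain the observed y values $100(1-\alpha)\%$ of the time.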
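A minimal sketch contrasting the confidence interval for the mean of $Y$ at $x^*$ with the prediction interval for a single future $Y$ at $x^*$. The only difference is the extra 1 under the square root, which accounts for the variability of the new observation itself. The t critical value is supplied by the caller; the function name and the data are made up for illustration.

```python
import math


def mean_ci_and_prediction_interval(x, y, x_star, t_crit):
    """Return ((ci_low, ci_high), (pi_low, pi_high)) at x = x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_star
    leverage = 1.0 / n + (x_star - x_bar) ** 2 / s_xx
    se_mean = s * math.sqrt(leverage)        # for the mean of Y at x_star
    se_pred = s * math.sqrt(1.0 + leverage)  # for one future Y at x_star
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return ci, pi


# Made-up illustrative data (an assumption, not data from the slides).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

# t critical value for 95% with n - 2 = 3 degrees of freedom.
ci_demo, pi_demo = mean_ci_and_prediction_interval(
    x_demo, y_demo, x_star=3.0, t_crit=3.182
)
print("CI for the mean:", ci_demo)
print("PI for a future value:", pi_demo)
```

Both intervals are centered at the same fitted value, and the prediction interval is always the wider of the two.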

49/61

Example

For the SAT example, let's calculate a 95% prediction interval for a graduation rate that would result from selecting a single university whose average SAT is 1200. Relevant quantities from that example are

The t critical value is 2.101. The 95% prediction interval is:

50/61

Section 12.5 Correlation

51/61

Definition of Sample Correlation Coefficient

52/61

Example

An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between yield of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/ha) for eight different types of soil.

53/61

Calculating r

54/61

Properties of $r$

55/61

Examples of Different $r$

56/61

Rules of Thumb To State Strength of Linear Relationships

A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” A reasonable rule of thumb is to say that the correlation is

• weak if $0 \le |r| \le 0.5$,

• strong if $0.8 \le |r| \le 1$, and

• moderate otherwise.

It may surprise you that $r = 0.5$ is considered weak, but $r^2 = 0.25$ implies that in a regression of y on x, only 25% of observed y variation would be explained by the model.
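The rule of thumb can be written as a small helper. This sketch makes two assumptions that should be read as such: the rule is applied to the magnitude $|r|$ so that negative correlations are covered, and the boundary values 0.5 and 0.8 are counted as weak and strong respectively, so that $r = 0.5$ is weak as the text says. The function name is made up for illustration.

```python
def correlation_strength(r):
    """Classify a sample correlation as 'weak', 'moderate', or 'strong'."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)  # assumption: the rule applies to |r|
    if magnitude <= 0.5:
        return "weak"
    if magnitude >= 0.8:
        return "strong"
    return "moderate"


for value in (0.29, 0.5, 0.65, -0.9):
    # r^2 is the proportion of y variation explained in a regression of y on x.
    print(value, correlation_strength(value), f"r^2 = {value ** 2:.4f}")
```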

57/61

Correlation Coefficient

$r$ is an estimate (observation) of the population parameter $\rho$. The random variable $R$ is a function of both $X_i$ and $Y_i$.

58/61

Sampling Distribution of $R$

Assume that $(X_i, Y_i)$ has a bivariate normal distribution. When $\rho = 0$,

$T = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}} \sim t_{n-2}$

Proof: The $T$ defined here is equivalent to the $T$ defined in Slide 32, where we have shown that $T \mid X$ has a $t_{n-2}$ distribution for each $X$. It follows that the marginal distribution of $T$ is $t_{n-2}$ too.

59/61

Example

Neurotoxic effects of manganese are well known and are usually caused by high occupational exposure over long periods of time. In the fields of occupational hygiene and environmental hygiene, the relationship between lipid peroxidation, which is responsible for deterioration of foods and damage to live tissue, and occupational exposure had not been previously reported. The article “Lipid Peroxidation in Workers Exposed to Manganese” (Scand. J. Work Environ. Health, 1996: 381–386) gave data on x = manganese concentration in blood (ppb) and y = concentration (μmol/L) of malondialdehyde, which is a stable product of lipid peroxidation, both for a sample of 22 workers exposed to manganese and for a control sample of 45 individuals. The value of r for the control sample is 0.29, from which

$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.29\sqrt{45-2}}{\sqrt{1-0.29^2}} \approx 1.99$

The p-value for the two-tailed test is 0.052.
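A minimal sketch of the test statistic for $H_0: \rho = 0$, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, evaluated at the reported $r = 0.29$. It assumes the sample size is $n = 45$ (the control sample), which is the sample size consistent with the reported two-tailed p-value; the function name is made up for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


# r = 0.29 is reported in the example; n = 45 is an assumption (control sample).
t_stat = correlation_t_statistic(0.29, 45)
print(f"t = {t_stat:.3f} with {45 - 2} degrees of freedom")
```

The statistic is then compared with a t distribution with $n - 2$ degrees of freedom to obtain the p-value.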

60/61

Further Courses for Regression Analysis:

STAT 344: Applied Regression Analysis

Covers regression on multiple inputs, model checking, etc.

STAT 443: Linear Models

Covers the sampling distributions rigorously.

61/61
