
Lecture 18: Thurs., Nov. 6th

• Chapters 8.3.2, 8.4, 8.6.1

• Outliers and Influential Observations

• Transformations

• Interpretation of log transformations (8.4)

• R2 (8.6.1)

Outliers and Influential Observations

• An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, the y direction or in the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual.

• An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential.

• The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.

Outliers Example

• Does the age at which a child begins to talk predict the child's score on a test of mental ability taken at a later age?

• gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive Score (y), from an ability test taken at a later age.

• Child 18 is an outlier in the x direction and potentially influential. Child 19 is an outlier in the direction of the scatterplot.

• To assess whether a point is influential, fit the least squares line with and without the point (excluding the row to fit it without the point) and see how much of a difference it makes; a code sketch of this check follows below.

• Child 18 is highly influential; child 19 is not highly influential.
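
The drop-one refit described above can be automated. Below is a minimal sketch (not the JMP procedure used in class) that fits the least squares line with and without a chosen observation using numpy; the age and score arrays would be read from gesell.JMP, and the variable names are assumed for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

def influence_of_point(x, y, i):
    """Compare the fit using all observations with the fit excluding observation i."""
    full = fit_line(x, y)
    reduced = fit_line(np.delete(x, i), np.delete(y, i))
    return {"with point": full, "without point": reduced}

# Example (hypothetical arrays; values would come from gesell.JMP):
# age = np.array([...]); score = np.array([...])
# print(influence_of_point(age, score, 17))   # child 18 is row 18, i.e. index 17 (0-based)
```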

Bivariate Fit of Score By Age

[Scatterplot of Score vs. Age with least squares line]

Parameter Estimates
Term       Estimate    Std Error   Prob>|t|
Intercept  109.87384   5.067802    <.0001
Age        -1.126989   0.310172    0.0018

Bivariate Fit of Score By Age (child 18 excluded)

[Scatterplot of Score vs. Age with least squares line]

Parameter Estimates
Term       Estimate    Std Error   Prob>|t|
Intercept  105.62987   7.161928    <.0001
Age        -0.779221   0.516733    0.1489

Case Study 8.1.1

• Biologists are interested in the relationship between the area of islands (X) and the number of animal and plant species (Y) living on them.
  – Estimates of this relationship are useful in conservation biology for predicting species extinction rates due to diminishing habitat.

• Data in Display 8.1 are number of reptile and amphibian species and the island areas for seven islands in the West Indies.

Scatterplots for Species Data

• Regression function does not appear to be linear.

Bivariate Fit of SPECIES By AREA

[Scatterplot of SPECIES vs. AREA and plot of residuals vs. AREA]

Case Study 8.1.2

• In an industrial laboratory, batches of electrical insulating fluid were subjected to different voltages until the insulating property of the fluid broke down.

• Y = time to breakdown of the insulating fluid, X = voltage. The residual plot shows a "horn-shaped" pattern, indicating both nonlinearity and nonconstant variance.

[Scatterplot of TIME vs. VOLTAGE and plot of residuals vs. VOLTAGE]

Tukey’s Bulging Rule

• Draw a circle and divide it into four quadrants.
• Try transformations based on which quadrant the shape of the data falls in:
  – Upper left: √X, log X, 1/X, Y²
  – Upper right: X², Y²
  – Lower left: √X, log X, 1/X, √Y, log Y, 1/Y
  – Lower right: X², √Y, log Y, 1/Y
• Try different transformations, draw residual plots and see which works best. If no transformation works, polynomial regression (Ch. 9) must be used.
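
As a rough illustration of the "try transformations and compare residual plots" advice, here is a minimal numpy sketch (the candidate list and names are illustrative assumptions, not from the text) that computes residuals from a straight-line fit of each transformed response:

```python
import numpy as np

# Candidate response transformations from the bulging rule; which ones to try
# depends on the quadrant in which the data's curvature falls.
candidates = {
    "y": lambda y: y,
    "sqrt(y)": np.sqrt,
    "log(y)": np.log,
    "1/y": lambda y: 1.0 / y,
}

def residuals_after_transform(x, y, f):
    """Fit a least squares line of f(y) on x and return its residuals."""
    ty = f(y)
    slope, intercept = np.polyfit(x, ty, 1)
    return ty - (intercept + slope * x)

# For each candidate, plot residuals_after_transform(x, y, f) against x and keep
# the transformation whose residual plot looks most like a patternless band.
```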

Transformations for Voltage Data

[Scatterplot of Log Time vs. VOLTAGE and plot of residuals vs. VOLTAGE]

Bivariate Fit of Square Root of Time By VOLTAGE

[Scatterplot of Square Root of Time vs. VOLTAGE and plot of residuals vs. VOLTAGE]

Transformations for Species Data

[Scatterplot of SPECIES vs. Log Area and plot of residuals vs. Log Area]

[Scatterplot of Log Area vs. Log Species and plot of residuals vs. Log Species]

Prediction After Transformation

• To predict Y given X (or to estimate $\mu\{Y \mid X\}$) when Y has been transformed to f(Y) and X to g(X):

$\hat{\mu}\{Y \mid X = x\} = f^{-1}\left(\hat{\mu}\{f(Y) \mid g(X) = g(x)\}\right)$

• Species data: log-log transformation. Y transformed to log Y, X transformed to log X (see the worked sketch after the fitted equation below).
• Predicted number of species given area = 30000:
  – Predicted log species given log area = log(30000) = 10.31 equals 1.94 + 0.25*10.31 = 4.52.
  – Predicted number of species given area = 30000 equals exp(predicted log species given log area = log(30000)) = exp(4.52) = 91.84.

Linear Fit

Log Species = 1.9365081 + 0.2496799 Log Area
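
A minimal sketch of the back-transformation for the species prediction, assuming natural logs and the fitted coefficients shown above:

```python
import numpy as np

# Fitted log-log model from the output above:
# Log Species = 1.9365081 + 0.2496799 * Log Area
b0, b1 = 1.9365081, 0.2496799

def predict_species(area):
    """Predict the number of species by back-transforming the log-log fit."""
    log_species_hat = b0 + b1 * np.log(area)
    return np.exp(log_species_hat)

print(predict_species(30000))   # about 91; the slide's 91.84 uses rounded coefficients
```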

Second Prediction Example

• For the voltage data, using the square root transformation, to predict Y based on X:
• Predicted time for voltage = 30:
  – Predicted square root of time for voltage = 30 equals 61.78 - 1.70*30 = 10.78.
  – Predicted time for voltage = 30 equals 10.78² = 116.21.

Linear Fit

Square Root of Time = 61.784472 - 1.6958968 VOLTAGE
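
The same back-transformation idea for the square root fit, as a short sketch using the coefficients shown above:

```python
# Square Root of Time = 61.784472 - 1.6958968 * VOLTAGE (fit shown above)
b0, b1 = 61.784472, -1.6958968
voltage = 30
sqrt_time_hat = b0 + b1 * voltage   # about 10.91 with full-precision coefficients
time_hat = sqrt_time_hat ** 2       # about 119; the slide's 116.21 uses the rounded 61.78 and -1.70
print(time_hat)
```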

Testing whether Y is Associated with X

• To test whether Y is associated with X, we can test whether f(Y) is associated with g(X) by testing whether the slope is zero in the transformed model.

• Strong evidence that number of species is associated with area.

• Interpreting the slope and intercept is difficult except for log transformations.

Linear Fit

Log Area = -7.595201 + 3.9585895 Log Species

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  61.784472   7.776881    7.94      <.0001
VOLTAGE    -1.695897   0.233695    -7.26     <.0001
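
To carry out this slope test on the transformed scale in Python rather than JMP, one option is scipy.stats.linregress, which reports the two-sided p-value for the null hypothesis of zero slope. A minimal sketch (variable names assumed; the data would come from Display 8.1):

```python
import numpy as np
from scipy import stats

# area, species = ...   # arrays from Display 8.1 (not shown here)
# result = stats.linregress(np.log(area), np.log(species))
# print(result.slope, result.pvalue)   # a small p-value is strong evidence of association
```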

Interpreting log transformations

• Case I: Response is logged, explanatory variable is not logged.
• The model is $\mu\{\log(Y) \mid X\} = \beta_0 + \beta_1 X$, so Median$\{Y \mid X\} = \exp(\beta_0)\exp(\beta_1 X)$.
• Consequently, Median$\{Y \mid X+1\}$ / Median$\{Y \mid X\} = \exp(\beta_1)$.
• Interpretation:
  – If $\beta_1 > 0$, as X increases by 1, the median of Y increases by $(e^{\beta_1} - 1) \times 100\%$.
  – If $\beta_1 < 0$, as X increases by 1, the median of Y decreases by $(1 - e^{\beta_1}) \times 100\%$.

Interpretation in Voltage Study

• Interpretation: It is estimated that the median failure time decreases by $(1 - e^{-0.507}) \times 100\% \approx 40\%$ with each 1 kV increase in voltage.
• 95% CI: The median failure time decreases by between $(1 - e^{-0.393}) \times 100\% = 32.50\%$ and $(1 - e^{-0.6217}) \times 100\% = 46.29\%$ for each 1 kV increase in voltage.

Parameter Estimates
Term       Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept  18.955459   <.0001     15.149663   22.761254
VOLTAGE    -0.507365   <.0001     -0.621729   -0.393001

[Scatterplot of Log Time vs. VOLTAGE]
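
A quick numeric check of these percentages from the table above (a sketch; the estimate and confidence limits are taken directly from the JMP output):

```python
import numpy as np

# Case I back-transformation: percentage change in the median of Y per 1 kV increase in voltage.
slope, lower, upper = -0.507365, -0.621729, -0.393001

pct_decrease = (1 - np.exp(slope)) * 100                      # about 40%
ci = ((1 - np.exp(upper)) * 100, (1 - np.exp(lower)) * 100)   # about (32.5%, 46.3%)
print(pct_decrease, ci)
```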

Case II: Explanatory variable is logged

• The model is $\mu\{Y \mid \log(X)\} = \beta_0 + \beta_1 \log(X)$.
• This implies $\mu\{Y \mid \log(2X)\} - \mu\{Y \mid \log(X)\} = \beta_1 \log(2)$.
• Interpretation: A doubling of X is associated with a change of $\beta_1 \log(2)$ in the mean of Y.
• Species example: A doubling of area is associated with an increase in the mean number of species of 8.86*log(2) = 6.14. 95% CI = (4.52*log(2), 13.20*log(2)) = (3.13, 9.15).

Parameter Estimates
Term       Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept  -5.294043   0.6845     -36.88145   26.293365
Log Area   8.8605231   0.0033     4.5212408   13.199805
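
A numeric check of the Case II interpretation using the slope estimate and confidence limits above (natural logs assumed):

```python
import numpy as np

slope, lower, upper = 8.8605231, 4.5212408, 13.199805
# Change in mean species associated with a doubling of area.
print(slope * np.log(2), (lower * np.log(2), upper * np.log(2)))   # about 6.14 and (3.13, 9.15)
```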

Case III: Both response and explanatory variable logged

• The model is $\mu\{\log(Y) \mid \log(X)\} = \beta_0 + \beta_1 \log(X)$, so Median$\{Y \mid X\} = e^{\beta_0} X^{\beta_1}$.
• Interpretation:
  – A doubling of X is associated with a multiplicative change of $2^{\beta_1}$ in the median of Y.
  – A ten-fold increase in X is associated with a multiplicative change of $10^{\beta_1}$ in the median of Y.

Case III Example

• Species example:
• Since $2^{0.25} = 1.19$, "associated with each doubling of island area is a 19% increase in the median number of species." 95% CI for the multiplicative increase: $(2^{0.2186}, 2^{0.2808}) = (1.164, 1.215)$, i.e., (16.4%, 21.5%).

[Scatterplot of Log Species vs. Log Area with least squares line]

Parameter Estimates
Term       Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept  1.9365081   <.0001     1.7099593   2.1630569
Log Area   0.2496799   <.0001     0.218558    0.2808018
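
A numeric check of the Case III interpretation using the Log Area slope and confidence limits above:

```python
# Multiplicative change in the median number of species for a doubling of area.
slope, lower, upper = 0.2496799, 0.218558, 0.2808018

doubling_factor = 2 ** slope          # about 1.19, i.e., a 19% increase
ci = (2 ** lower, 2 ** upper)         # about (1.164, 1.215), i.e., 16.4% to 21.5%
print(doubling_factor, ci)
```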

R-Squared

• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.

• Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. This is the best sum of squared prediction error without using X.

$R^2 = 100\% \times \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
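
A minimal sketch of this computation (not JMP output), fitting the least squares line with numpy and applying the formula above:

```python
import numpy as np

def r_squared(x, y):
    """R^2 = 100% * (Total SS - Residual SS) / Total SS for the least squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residual_ss = np.sum((y - (intercept + slope * x)) ** 2)
    total_ss = np.sum((y - np.mean(y)) ** 2)
    return 100 * (total_ss - residual_ss) / total_ss

# Applied to the neuron activity data on the next slide, this should reproduce
# the reported RSquare of 0.866986 (as 86.69%).
```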

R-Squared example

• R2= 86.69%. Read as “86.69 percent of the variation in neuron activity was explained by linear regression on years played.”

Bivariate Fit of Neuron activity index By Years playing

[Scatterplot of Neuron activity index vs. Years playing with least squares line]

Linear Fit

Neuron activity index = 7.9715909 + 1.0268308 Years playing

Summary of Fit
RSquare                      0.866986
RSquare Adj                  0.855902
Root Mean Square Error       3.025101
Mean of Response             15.89286
Observations (or Sum Wgts)   14

Interpreting R2

• If the residuals are all zero (a perfect fit), then R2 is 100%. If the least squares line has slope 0, R2 will be 0%.

• R2 is useful as a unitless summary of the strength of linear association, but:
  – It is not useful for assessing model adequacy (e.g., linearity) or whether or not there is an association.
  – What counts as a good R2 depends on the context. In precise laboratory work, R2 values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R2 values of 50% may be considered remarkably good.

Coverage of Second Midterm

• Transformations of the data for the two-group problem (Ch. 3.5)
• Welch t-test (Ch. 4.3.2)
• Comparisons Among Several Samples (Ch. 5.1-5.3, 5.5.1)
• Multiple Comparisons (Ch. 6.3-6.4)
• Simple Linear Regression (Ch. 7.1-7.4, 7.5.3)
• Assumptions for Simple Linear Regression and Diagnostics (Ch. 8.1-8.4, 8.6.1, 8.6.3)