Math2010 - Statistical Methods I Chapter 2 - Simple Linear Regression (Notes 5) Stefanie Biedermann


  • Math2010 - Statistical Methods I

    Chapter 2 - Simple Linear Regression (Notes 5)

    Stefanie Biedermann

  • 2.16 Bodyfat Example

    Knowledge of the fat content of the human body is physiologically and medically important. The fat content may influence susceptibility to disease, the outcome of disease, the effectiveness of drugs (especially anaesthetics) and the ability to withstand adverse conditions including exposure to cold and starvation.

    In practice, fat content is difficult to measure directly: one way is by measuring body density, which requires subjects to be weighed underwater! For this reason, it is useful to try to relate simpler measures such as skinfold thicknesses (which are readily measured using calipers) to body fat content and then use these to estimate the body fat content.

    Dr R. Telford collected skinfold (the sum of four skinfold measurements) and percent body fat measurements on 102 elite male athletes training at the Australian Institute of Sport.

  • The scatterplot of the data shows that the percent body fat increases linearly with the skinfold measurement.

    From the plot we might also be concerned that the variation in percent body fat seems to increase with the skinfold measurement.

    To take a closer look we can fit a simple linear regression model and examine the diagnostic plots as well.

    From the scatterplot there do not seem to be any unusual data points and there are no other obvious patterns to note.
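
The fit mentioned above can be sketched in a few lines. This is a minimal illustration of ordinary least squares with one predictor, not the software used in the course; the function name `fit_line` and the toy numbers are made up for the example and are not the athletes' data.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sum of squares of x and corrected sum of products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus fitted value
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Toy values (hypothetical, NOT the athletes' data)
skinfold = [40.0, 50.0, 60.0, 80.0, 100.0]
fat = [7.0, 8.5, 10.0, 13.5, 16.0]
b0, b1, residuals = fit_line(skinfold, fat)
print(b0, b1)
```

Plotting `residuals` against the fitted values `b0 + b1 * x` gives the kind of residual plot used as a diagnostic.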

  • Figure: Scatterplot of bodyfat against skinfold (horizontal axis: Skinfold; vertical axis: Fat)

    We can see in the residual plot that a fan or funnel shape is evident, which adds weight to our concern about heteroscedasticity.

    The normal probability plot of the residuals also seems to indicate that the normality assumption is not appropriate in the tails of the distribution.

  • Figure: Anscombe plot (Residuals against Fitted values) and normal probability plot (Sample Quantiles against Theoretical Quantiles) from a linear regression of bodyfat on skinfold

    It is often the case that we can simplify the relationships exhibited by data by transforming one or more of the variables.

  • In the example, we want to preserve the simple linear relationship for the conditional mean but stabilise the variability (i.e. make it constant) so that it is easier to describe. We therefore try transforming both variables.

    There is no theoretical argument to suggest a transformation here but empirical experience suggests that we try the log transformation.

    The scatterplot on the log scale shows that the log transformation of both variables preserves the linear conditional location relationship and stabilises the conditional variability.

    After examining the diagnostics, we conclude that the linear regression model provides a good approximation to the underlying population.
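
One situation in which a log transformation of both variables behaves this way is a power-law relationship with multiplicative errors. The sketch below only illustrates that idea under made-up assumptions: the constants `a` and `b`, the grid of skinfold values and the alternating ±10% error factors are hypothetical and are not taken from the athletes' data.

```python
import math

# Hypothetical power-law model with multiplicative error: y = a * x**b * m
a, b = 0.3, 0.9                      # made-up constants, not fitted values
xs = [30, 40, 50, 60, 70, 80, 90, 100]
factors = [1.1, 1 / 1.1] * 4         # alternating errors of about +/-10%

ys = [a * x ** b * m for x, m in zip(xs, factors)]

# Absolute deviation from the error-free curve on the raw scale
raw_dev = [abs(y - a * x ** b) for x, y in zip(xs, ys)]

# Absolute deviation from the error-free straight line on the log scale
log_dev = [
    abs(math.log(y) - (math.log(a) + b * math.log(x)))
    for x, y in zip(xs, ys)
]

print("raw scale:", [round(d, 2) for d in raw_dev])
print("log scale:", [round(d, 3) for d in log_dev])
```

On the raw scale the deviations grow with `x`, which is the funnel shape; on the log scale every deviation has the same size, `log(1.1)`, and the relationship is a straight line with slope `b`.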

  • Figure: Scatterplot of log Fat against log Skinfold (top left), Anscombe plot of Residuals against Fitted values (top right) and normal probability plot of Sample Quantiles against Theoretical Quantiles (bottom) on the transformed scale

  • Let $Y_i$ denote the log percent body fat measurement and $X_i$ the log skinfold measurement on athlete $i$, where $i = 1, \ldots, 102$. Then we have

    $$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i \quad\text{and}\quad \mathrm{Var}(Y_i \mid x_i) = \sigma^2$$

    Fitting the model to the data gives estimates of the parameters, so that we can write:

    $$\hat{Y} = \underset{(0.097)}{-1.25} + \underset{(0.025)}{0.88}\,x$$

    where the numbers in parentheses are the standard errors of the intercept and slope.

    We also obtain a variance estimate $S^2 = 0.0067$ on 100 degrees of freedom.

  • Parameter Estimates

    Term        Estimate   Std Error   t Ratio   Prob>|t|
    Intercept     -1.250       0.097   -12.940      0.000
    logskin        0.882       0.024    35.590      0.000

    Analysis of Variance

    Source       DF       SS       MS        F   Prob>F
    logskin       1   8.5240   8.5240   1266.3   0.0000
    Residuals   100   0.6731   0.0067
    Total       101   9.1971

    A 95% confidence interval for the population slope parameter $\beta_1$ is (0.83, 0.93).
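
The quantities on this slide can be checked by hand from the printed output. The sketch below does the arithmetic using the rounded values from the tables; the critical value `T_975_100` is the 97.5% point of the t distribution on 100 degrees of freedom, taken from standard tables as an assumed input rather than computed. Because the inputs are rounded, the last digit of a result may differ slightly from the printed output.

```python
# Values copied from the Parameter Estimates and Analysis of Variance tables
slope, se_slope = 0.882, 0.024
ss_reg, ss_res = 8.5240, 0.6731
df_reg, df_res = 1, 100

# 97.5% point of the t distribution on 100 df (from standard tables)
T_975_100 = 1.984

# Residual mean square: the estimate S^2 of the error variance
s2 = ss_res / df_res

# F ratio: regression mean square divided by residual mean square
f_ratio = (ss_reg / df_reg) / s2

# 95% confidence interval for the slope: estimate +/- t * standard error
ci_slope = (slope - T_975_100 * se_slope, slope + T_975_100 * se_slope)

print(f"S^2 = {s2:.4f}")
print(f"F   = {f_ratio:.1f}")
print(f"95% CI for slope = ({ci_slope[0]:.2f}, {ci_slope[1]:.2f})")
```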

  • A common test of interest is the test of the hypothesis that $\beta_1 = 0$, i.e. that skinfold is not useful for predicting body fat.

    The 95% confidence interval for $\beta_1$ does not contain zero, so we can conclude that there is evidence against the hypothesis and that skinfold is a useful variable in the model for predicting body fat.

    A formal hypothesis test can be carried out by computing the t-ratio 35.59, which has a p-value of zero (to 4 decimal places). We conclude that $\beta_1$ is significantly different from zero.

    A test of the hypothesis $\beta_1 = 1$ is also of interest. This is because:
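
The same t-ratio calculation works for any hypothesised slope. One possible reading of the second hypothesis, not stated in the notes, is that on the log-log scale a slope of 1 would make body fat directly proportional to skinfold. The sketch below is a rough check under stated assumptions: the standard error is backed out from the reported slope and t-ratio, the critical value `T_975_100` comes from standard t tables, and the helper `t_statistic` is a name made up for this example.

```python
slope = 0.882             # reported slope estimate
t_ratio_zero = 35.59      # reported t-ratio for H0: beta_1 = 0
T_975_100 = 1.984         # two-sided 5% critical value on 100 df (standard tables)

# Standard error implied by the reported slope and t-ratio
se_slope = slope / t_ratio_zero


def t_statistic(estimate, hypothesised, se):
    """t-ratio for testing that the parameter equals the hypothesised value."""
    return (estimate - hypothesised) / se


t_zero = t_statistic(slope, 0.0, se_slope)   # reproduces the reported t-ratio
t_one = t_statistic(slope, 1.0, se_slope)    # test of H0: beta_1 = 1

# Reject at the 5% level when |t| exceeds the critical value
reject_zero = abs(t_zero) > T_975_100
reject_one = abs(t_one) > T_975_100
print(t_zero, t_one, reject_zero, reject_one)
```

Under these assumptions both hypotheses are rejected at the 5% level, which agrees with the 95% confidence interval (0.83, 0.93) containing neither 0 nor 1.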

  • A 95% confidence interval for the mean log body fat percentage of all individuals with a skinfold of 70 is (2.474, 2.522).

    A 95% prediction interval for the log body fat percentage of a single male athlete with a skinfold of 70 is (2.334, 2.663).

    Although in our body fat example we have worked on the simpler log scale, it may be useful to make predictions on the raw scale.
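
A simple way to move to the raw scale is to exponentiate the endpoints of the intervals quoted above. The sketch below assumes that "log" means the natural logarithm; `back_transform` is a helper name made up for this example. Note that the back-transformed confidence interval describes the median (geometric mean) body fat percentage at that skinfold rather than the arithmetic mean.

```python
import math

# Intervals on the log scale, as quoted for a skinfold of 70
ci_log = (2.474, 2.522)   # 95% CI for the mean log body fat percentage
pi_log = (2.334, 2.663)   # 95% PI for a single athlete


def back_transform(interval):
    """Exponentiate both endpoints of an interval on the natural-log scale."""
    lower, upper = interval
    return math.exp(lower), math.exp(upper)


ci_raw = back_transform(ci_log)
pi_raw = back_transform(pi_log)

# Both log-scale intervals are centred on the fitted value, so the
# midpoint of the prediction interval recovers the fitted log body fat.
fitted_raw = math.exp(sum(pi_log) / 2)

print("fitted percent body fat:", round(fitted_raw, 2))
print("interval from CI:", tuple(round(v, 2) for v in ci_raw))
print("interval from PI:", tuple(round(v, 2) for v in pi_raw))
```

The raw-scale intervals are no longer symmetric about the fitted value, because the exponential stretches the upper half more than the lower half.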

  • We can also plot confidence bands across the range of the skinfold measurements taken for the mean log body fat percentage at any given skinfold measurement.

    We can also plot prediction bands across the range of the skinfold measurements taken for the log body fat percentage of a single male athlete at any given skinfold measurement.

    The bands are widest at the extremes and narrowest at the centre.

    Also note that the prediction band is always wider than the confidence band.
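
Both observations follow from the usual interval formulas for simple linear regression. They are written here in standard textbook notation, which is an assumption since the slides do not spell it out: $x_0$ is the log skinfold value of interest, $\bar{x}$ the sample mean of the $x_i$, $S_{xx} = \sum_i (x_i - \bar{x})^2$, $S$ the square root of the variance estimate $S^2$, and $t_{n-2,\,0.975}$ the 97.5% point of the t distribution on $n - 2$ degrees of freedom.

```latex
\begin{align*}
\text{95\% CI for the mean at } x_0: \quad
  & \hat{\beta}_0 + \hat{\beta}_1 x_0 \;\pm\; t_{n-2,\,0.975}\, S
    \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \\[6pt]
\text{95\% PI for a single observation at } x_0: \quad
  & \hat{\beta}_0 + \hat{\beta}_1 x_0 \;\pm\; t_{n-2,\,0.975}\, S
    \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\end{align*}
```

The term $(x_0 - \bar{x})^2 / S_{xx}$ is zero at $x_0 = \bar{x}$ and grows as $x_0$ moves away from it, so both bands are narrowest at the centre of the data. The extra 1 under the square root makes the prediction band wider than the confidence band at every $x_0$.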

  • Figure: Scatterplot of log bodyfat (log Fat) against log skinfold (log Skinfold) with fitted regression line, 95% confidence bands for the mean and 95% prediction bands for a single observation

  • The validity of the inferences depends on the representativeness of the sample and the validity of the model.

    It is vital when collecting data to ensure that it is representative of the population of interest and, before making inferences, to ensure that the model is appropriate.

    Strictly speaking, model assessment is a form of informal inference and has an impact on other inferences, but it is not sensible to simply hope that a model is valid when empirical verification is available.