
Multiple, stepwise and non-linear regression models

Peter Shaw

RU

[Figure: exponential decay curve – activity vs. time, 0 to 100 s]

Y = 426 - 4.6X, r = 0.956
OR
Y = 487*exp(-0.022*X), r = 1.00

Introduction

Last week we met and ran bivariate correlation/regression analyses, which were probably simple and familiar to most of you.

This week we extend that concept, looking at variants: simple and stepwise multiple linear regression (MLR), and non-linear regressions.

The universals here are linear regression equations, with goodness of fit estimated as R². The bivariate r we met last week is not meaningful in many of these cases.

Multiple Linear Regression, MLR

MLR is just like bivariate regression, except that you have multiple independent (“explanatory”) variables.

Model: Y = A + B*X1 + C*X2 (+ D*X3 ...)

Very often the explanatory variables are co-correlated, hence the dislike of the term "independents".

The goodness of fit is indicated by the R² value, where R² = the proportion of the total variance in Y which is explained by the regression. This is just like bivariate regression, except that we don't use a signed (+/-) r value.

(Why no r? Because with 2 variables the line either goes up or down. With 3 variables the line can go +ve wrt X1 and -ve wrt X2.)

Height (mm)   1st leaf (mm)   No. flowers
108           12              18
111           11              10
147           23              22
218           21              14
240           37              12
223           30              25
242           28              45
480           77              40
290           40              50
263           55              30
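As a minimal sketch (Python, not part of the original lecture; the variable names are my own), we can fit flowers = A + B*height + C*leaf to the table above by least squares, and compute R² as the proportion of variance explained:

    import numpy as np

    # Plant data from the table above
    height  = np.array([108, 111, 147, 218, 240, 223, 242, 480, 290, 263], dtype=float)
    leaf    = np.array([ 12,  11,  23,  21,  37,  30,  28,  77,  40,  55], dtype=float)
    flowers = np.array([ 18,  10,  22,  14,  12,  25,  45,  40,  50,  30], dtype=float)

    # Design matrix: a column of 1s for the intercept A, then X1 and X2
    X = np.column_stack([np.ones_like(height), height, leaf])
    coef, *_ = np.linalg.lstsq(X, flowers, rcond=None)   # [A, B, C]

    fitted   = X @ coef
    ss_total = np.sum((flowers - flowers.mean())**2)
    ss_resid = np.sum((flowers - fitted)**2)
    r2 = 1 - ss_resid/ss_total   # proportion of total variance in Y explained
    print(coef, r2)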

[Figure: 3-D plots of height (mm), leaf 1 (mm) and no. flowers]

A three-dimensional data space.

Three variables define a three-dimensional data space, which can be visualised using three-dimensional graphs.

The 3-D graph of flower number, leaf length and plant height, seen from two rotations.

Figure 3.19. The 3-D graph relating crake density to mowing date and vegetation type. The width of the circles indicates corncrake number, ranging from 0 to the maximum of 10.

[Figure: erosion depth (mm) vs. age (years); key: city centre, city edge, rural site]

Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone.

Basic regression equation: Y = -0.03 + 0.027*X
[NB there's an argument for forcing A = 0, but I'll skip that.]

[Figure: erosion depth (mm) vs. age (years), with a separate regression line for each site; key: city centre, city edge, rural site]

Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone. A separate regression line has been fitted for each of the three churchyards.

erosion = 0.42 + 0.024*age - 0.13*distance, R² = 0.73 (97 df)

which can be compared with the bivariate equation:

erosion = -0.28 + 0.027*age, R² = 0.63 (98 df).

There are several philosophical catches here: Intercept? Parallel lines? This leads to ANCOVA.

Notice how adding a second explanatory variable increased the % variance explained – this is the whole point of MLR.

In fact adding an explanatory variable is ALMOST CERTAIN to increase the R² – if it does not, it must be perfectly correlated with the first explanatory variable (or have zero variance). The real question is whether it is worth adding that new variable.

Source   df   SS       MS       F
Age       1   58.995   58.995   167**
Error    98   34.533    0.352
Total    99   93.529

How does this relate to R²? Recall that R² = proportion of variance explained = 58.995/93.529 = 0.6308.

Source       df   SS       MS       F
Age & Dist    2   68.33    34.166   131**
Error        97   25.196    0.260
Total        99   93.529

H0: MS(A&D - A) = MS error

F(1,97) = ((68.33 - 58.995)/1) / 0.26 = 35.9**
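The same extra-sum-of-squares F test, sketched in Python using the ANOVA numbers above (a hand calculation, not SPSS output):

    # Does adding distance to a model that already contains age help?
    ss_age      = 58.995   # SS explained by age alone (1 df)
    ss_age_dist = 68.330   # SS explained by age + distance (2 df)
    ms_error    = 0.260    # error MS from the larger model (97 df)

    extra_df = 1
    F = ((ss_age_dist - ss_age) / extra_df) / ms_error
    print(F)   # about 35.9 on 1 and 97 df: highly significant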

This leads to Stepwise MLR

Given 1 dependent variable and lots of explanatory variables, start by finding the single highest correlation, and if significant, iterate:

1: Find the explanatory variable that causes the greatest reduction in error variance

2: Is this reduction significant?

If so add to the model and repeat.
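A minimal sketch of that loop in Python (my own illustration; real stepwise routines such as SPSS's add refinements like removal steps):

    import numpy as np
    from scipy.stats import f as f_dist

    def forward_stepwise(y, X_all, alpha=0.05):
        """Greedy forward selection: repeatedly add the predictor that most
        reduces the error SS, while the partial F test stays significant."""
        n = len(y)
        chosen, remaining = [], list(range(X_all.shape[1]))
        ss_resid = np.sum((y - y.mean())**2)    # intercept-only model
        while remaining:
            best = None
            for j in remaining:
                cols = [np.ones(n)] + [X_all[:, k] for k in chosen + [j]]
                Xm = np.column_stack(cols)
                coef, *_ = np.linalg.lstsq(Xm, y, rcond=None)
                ss = np.sum((y - Xm @ coef)**2)
                if best is None or ss < best[1]:
                    best = (j, ss)
            j, ss_new = best
            df_err = n - len(chosen) - 2        # df after adding predictor j
            F = (ss_resid - ss_new) / (ss_new / df_err)
            if f_dist.sf(F, 1, df_err) > alpha: # reduction not significant
                break
            chosen.append(j); remaining.remove(j); ss_resid = ss_new
        return chosen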

SPSS agrees! Both age and distance are predictors

Data dredging tip:

If you have (say) 50 species, and want to find out quickly which one shows the sharpest response to a 0-or-1 factor (M/F, logged vs unlogged, etc.), try this:

Set the factor as the dependent, then run stepwise MLR using all the species as predictors of that factor. Eccentric, but it works like a charm!

MLR is widely used in research, especially for “data dredging”.

It gives accurate R² values, and valid least-squares estimates of parameters – the intercept and the gradients wrt X1, X2 etc.

Sometimes its power can seem almost magical – I have a neat little dataset in which MLR can derive the parameters of the polynomial function

Y = A + B*X + C*X² + D*X³ ...

MLR is powerful and dangerous:

Empirically fitting polynomial functions by MLR: the triangular function.

1 + 2 + 3 + 4 + ... + 99 + 100 = ?

We can find the general relation between the number of points and the number of rows – empirically! We solve the polynomial by MLR, as sketched below.
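For instance (a sketch with made-up inputs, not the lecture's own dataset): feed the first few triangular numbers to MLR with n and n² as the "explanatory variables", and it recovers sum = 0.5*n + 0.5*n², whence 1+2+...+100 = 0.5*100 + 0.5*100² = 5050:

    import numpy as np

    n   = np.arange(1, 11, dtype=float)
    tri = np.cumsum(n)            # triangular numbers: 1, 3, 6, 10, ...

    # MLR with n and n squared as two explanatory variables
    X = np.column_stack([np.ones_like(n), n, n**2])
    coef, *_ = np.linalg.lstsq(X, tri, rcond=None)
    print(coef)                   # ~[0, 0.5, 0.5]: tri = 0.5*n + 0.5*n**2
    print(0.5*100 + 0.5*100**2)   # 5050.0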

The case of the mysterious tipburn on the CEGB’s Scots Pine

Tipburn on Pinus sylvestris in Liphook open air fumigation experiment 1985-1990

We had some homogeneous-aged pines exposed to SO2 in a sophisticated open-air fumigation system, max c. 200 ppb. The books told us that pine only got visible injury above 100 ppm – so we were well into the invisible damage region. Except that it wasn't invisible on a few plants! About 5% were sensitive and could show damage in our conditions. Apart from DNA, what was going on?

Data structure

35 rows of data = 7 plots * 5 years

Number of plants burnt (= dependent), then columns describing SO2:

Columns: the annual mean; 12 monthly means; 52 weekly means; 365 daily means; rather more hourly means...

Stepwise MLR picked up the mean SO2 conc. during the week of budburst as best predictor. Bingo!

There is a serious pitfall when the explanatory variables are co-correlated (which is frequent, almost routine).

In this case the gradients should be more-or-less ignored! They may be accurate, but are utterly misleading.

An illustration: Y correlates +vely with X1 and with X2. Enter Y, X1 and X2 into an MLR: Y = A + B*X1 + C*X2. But if B > 0, C may well be < 0! On the face of it this suggests that Y is -vely correlated with X2. In fact this correlation is with the residual information content in X2 after its correlation with X1 is removed.

At least one multivariate package (CANOCO) instructs that prior to analysis you must ensure all explanatory variables are uncorrelated, by excluding offenders from the analysis.

This caveat is widely ignored, indeed widely unknown.

But beware!
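A small simulated demonstration of the sign flip (my own sketch, with arbitrary numbers):

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = 0.95*x1 + 0.1*rng.normal(size=200)   # x2 co-correlated with x1
    y  = x1 + 0.5*rng.normal(size=200)        # y really depends only on x1

    print(np.corrcoef(y, x2)[0, 1])           # strong +ve bivariate correlation

    X = np.column_stack([np.ones(200), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)   # the gradient on x2 can easily come out negative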

Collinearity & variance inflation factors

To diagnose collinearity, compute the VIF for each variable:

VIF_j = 1/(1 - R²_j)

where R²_j is the multiple R² between variable j and all the others. If your maximum VIF > 10, worry! Exclude variables or run a PCA (coming up soon!)
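A sketch of the VIF calculation (the function name is mine):

    import numpy as np

    def vif(X):
        """VIF for each column of X: regress it on all the other
        columns and return 1/(1 - R^2)."""
        n, p = X.shape
        out = []
        for j in range(p):
            y = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, y, rcond=None)
            resid = y - others @ coef
            r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
            out.append(1.0/(1.0 - r2))
        return np.array(out)   # worry if the maximum exceeds 10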

Non-linear regressions

You should now be familiar with the notion of fitting a best-fit line to a scattergraph, and the use of an R² value to assess the significance of a relationship.

Be aware that linear regression implicitly fits a straight line - it would be possible to have a perfect relationship between dependent & independent variables, but r is non-significant!

[Figure: two scattergraphs – a straight-line relationship with r = 0.95**, and a perfect but curved relationship with r = 0, NS]

Sometimes straight lines just aren’t enough!

Are these data linear?

X   Y
0   0
1   1
2   4
3   9
4   16
5   25
6   36

r = 0.96, P < 0.05; Y = 6X - 5 – but is it?!

There are many cases where the line you want to fit is not a straight one!

Population dynamics are often exponential (up or down), as are many physical processes such as heating/cooling, radioactive decay etc.

Body size relationships (allometry) are almost always curvy.

Almost always it is still possible to use familiar, unsophisticated linear regression on these. You just have to apply a simple mathematical trick first.

[Figures: temperature vs. time (a cooling curve); BMR vs. body mass (an allometric curve)]

To illustrate the basic idea...

Take these clearly non-linear data. Y correlates well with X, but by calculating a new variable the correlation can be made perfect.

X   Y    Y#
0   0    0
1   1    1
2   4    2
3   9    3
4   16   4
5   25   5
6   36   6

Calculate Y# = square root(Y).

It is now evident that the correlation between X and Y# is perfect: Y# = 1*X + 0, r = 1.0, p < 0.01.
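Checking this numerically (a throwaway sketch):

    import numpy as np

    x = np.arange(7, dtype=float)
    y = x**2                                  # the clearly non-linear data above
    print(np.corrcoef(x, y)[0, 1])            # about 0.96
    print(np.corrcoef(x, np.sqrt(y))[0, 1])   # exactly 1.0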

Logarithms...

For better or worse, the tricks we need to straighten lines are based on the mathematical functions called logarithms.

When I were a lad… we used “logs” at school every day for long multiplications/divisions. Calculators stopped all that, and I find that I have to explain the basics here.

Forgive me if you already know this – someone sitting next to you won’t!

Let's start with a few powers:

N    2^N    3^N    10^N
0    1      1      1
1    2      3      10
2    4      9      100
3    8      27     1000
4    16     81     10000
5    32     243    100000
6    64     729    1000000

Given these we see that 2^6 = 64, 3^4 = 81 etc. An alternative way of expressing this is to say that the logarithm base 2 of 64 is 6, and the logarithm base 3 of 81 is 4. This is abbreviated: log2(64) = 6, log3(81) = 4; similarly log2(16) = 4, log3(9) = 2 etc.

In practice you are not offered log base 2 or 3 on calculators, though they are easy to work out.

What you do get is log10 or loge, where e = 2.7182818… Loge is usually written Ln.

The power of logs is that they translate multiplication into addition! This is easy to see, sticking for a moment with logs base 2:

8 * 4 = (2*2*2) * (2*2) [2^3 * 2^2] = (2*2*2*2*2) [2^5] = 32 [note 3 + 2 = 5]

Universally: logx(A) + logx(B) = logx(A*B), for virtually any values of x, A and B (excluding numbers <= 0).

Which log to use?

Answer: It generally makes no difference at all! PROVIDING you are consistent. Throughout this lecture I will use ln = loge, but I could equally have used log10.

In case you wanted to know: loge(X) = log10(X) * loge(10), etc.

The case where you need to be careful is fitting an explicitly exponential model, but if you calculate a half life you will get the same answer in any log base.

Given this we can deal with 2 distinct types of non-linear relationship, which cover >95% of biological/environmental situations:

1: Exponential curves. These come out of population dynamics, where growth and decline typically involve proportionate changes (i.e. 10% per year) and have a doubling time or a half life.

2: Power relationships, where doubling the independent variable multiplies the dependent by 2^B, where B is the power constant. This gives us the allometric relationships typical of size-scaling functions (body size/BMR, river diameter/discharge etc).

Exponential models

The model can be recognised by having a half life or doubling time, and is defined as follows:

Y = A*exp(B*X)

where A and B are constants, and exp(x) is the natural antilog, e^x = 2.718^x.

Transform by taking the log of Y: ln(Y) = ln(A) + B*X

B < 0 gives exponential decay. This never quite goes flat but heads towards an asymptote.

B > 0 gives exponential growth. This is explosive – it never persists long!

For exponential data:

Plot log(Y) vs X, and use regression to get the intercept and gradient: intercept = ln(A), gradient = B.

If you want the half life, use: T½ = ln(½)/B

Notice that the gradient will vary according to the log base, so the half life is safer, as it doesn't.

Count   Time (s)
500       0
400      10
320      20
240      30
200      40
160      50
120      60
100      70
 85      80
 70      90

[Figure: exponential decay curve – activity vs. time]

A basic linear fit: Y = 426 - 4.6X, r = 0.956

Log-transform the Y axis!

[Figure: exponential decay after log-transformation – log(Y) vs. time]

ln(Y) = 6.19 - 0.022*X, r = 1. Hence Y = 487*exp(-0.022*X).

From this we get: half life = ln(0.5)/(-0.022) = 31.5 sec.
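The whole fit in a few lines of Python (a sketch reproducing the hand calculation above):

    import numpy as np

    time  = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
    count = np.array([500, 400, 320, 240, 200, 160, 120, 100, 85, 70], dtype=float)

    # Regress ln(count) on time: intercept = ln(A), gradient = B
    B, lnA = np.polyfit(time, np.log(count), 1)
    A = np.exp(lnA)
    half_life = np.log(0.5)/B
    print(A, B, half_life)   # roughly 487, -0.022 and 31.5 s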

Power relationships (allometric models)

Here the model is: Y = A*X^B, where A and B are constants as before.

Linearise by log-transforming Y AND X:

ln(Y) = ln(A) + B*ln(X)

Plot ln(Y) vs ln(X): intercept = ln(A), gradient = B.

[Figure: BMR vs. body mass – a curvy allometric relationship]

[Figure: scattergraph of YVAR vs. XVAR on raw axes – a clear curve]

Y      X     LogY   LogX
3.77   29    0.58   1.46
1.21    3    0.08   0.48
3.43   24    0.54   1.38
6.57   88    0.82   1.94
6.82   95    0.83   1.98
6.45   85    0.81   1.93
4.90   49    0.69   1.69

[Figure: scattergraph of LOGY vs. LOGX – now a straight line]

logY = -0.155 + 0.5*log10(X). Hence Y = 0.7*square root(X).
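The same allometric fit sketched in Python (log base 10, matching the slide):

    import numpy as np

    Y = np.array([3.77, 1.21, 3.43, 6.57, 6.82, 6.45, 4.90])
    X = np.array([29, 3, 24, 88, 95, 85, 49], dtype=float)

    # Regress log10(Y) on log10(X): intercept = log10(A), gradient = B
    B, logA = np.polyfit(np.log10(X), np.log10(Y), 1)
    A = 10**logA
    print(A, B)   # roughly 0.7 and 0.5, i.e. Y = 0.7*sqrt(X)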

To summarise:

1: Plot and inspect the data.
2: Explore by getting the r value for:
   I: Y vs X [linear]
   II: ln(Y) vs X [exponential]
   III: ln(Y) vs ln(X) [allometric]

The best r value implies the best model, though the differences can be very minor.
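That exploration is easy to automate; a sketch (assuming Y > 0, and X > 0 for the allometric model):

    import numpy as np

    def best_model(x, y):
        """Compare r for the three candidate models."""
        r = lambda a, b: np.corrcoef(a, b)[0, 1]
        candidates = {
            "linear (Y vs X)":         r(x, y),
            "exponential (lnY vs X)":  r(x, np.log(y)),
            "allometric (lnY vs lnX)": r(np.log(x), np.log(y)),
        }
        for name, rv in candidates.items():
            print(name, round(rv, 3))
        return max(candidates, key=lambda k: abs(candidates[k]))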

This won't always work! Some models defy linear transformations. The commonest example is a quadratic relationship:

Y = A + B*X + C*X²

Here there is no alternative but to call up a very powerful, general procedure, the general linear model (GLM). SPSS offers this as well as quadratic regression: you get an r value + estimates of A, B and C. Handle this just as we solved the triangular function above.
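Outside SPSS, the quadratic can be fitted exactly as we solved the triangular function: enter X and X² as two explanatory variables (a sketch; np.polyfit does the same in one call):

    import numpy as np

    x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([0, 1, 4, 9, 16, 25, 36], dtype=float)

    C, B, A = np.polyfit(x, y, 2)   # polyfit returns the highest power first
    print(A, B, C)                  # least-squares estimates of A, B and C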

[Figures: examples of quadratic relationships – N flux from soil; insect density]