To centre or not to centre …or perhaps do it twice

Ian Jolliffe

Universities of Reading, Southampton, Aberdeen

i.t.jolliffe@reading.ac.uk

Outline of talk

• Introduction

• Covariance and correlation

• Principal component analysis (PCA - EOF analysis)

• Uncentred analyses

• Doubly-centred analyses

• Concluding remarks

Covariance

Given a data set $x_{ij}$, i = 1, 2, …, n; j = 1, 2, …, p, consisting of n observations on p variables, the covariance between the jth and kth variables is, with obvious notation (though divisor (n-1) instead of n might be more appropriate here):

$$ s_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) $$

Covariance and correlation

• The correlation between variables j and k is

$$ r_{jk} = \frac{s_{jk}}{[s_{jj} s_{kk}]^{1/2}} $$

• The covariance $s_{jk}$ is the (j,k)th element of the matrix $S_{CC} = X_{CC}^T X_{CC}/n$, where $X_{CC}$ is the matrix whose (i,j)th element is $(x_{ij} - \bar{x}_j)$
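
As a rough illustration of the covariance and correlation formulas, the sketch below builds the column-centred covariance matrix (divisor n) and the correlation matrix with NumPy. The data are synthetic random numbers, not the station data used later in the talk, and the names `X_cc`, `S_cc` and `R` are chosen here purely for illustration.

```python
import numpy as np

# Synthetic data, purely illustrative: n = 16 observations on p = 12 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 12))
n = X.shape[0]

# Column centering: subtract each variable's mean from its own column.
X_cc = X - X.mean(axis=0)

# Covariance matrix with divisor n (np.cov uses n - 1 unless bias=True).
S_cc = X_cc.T @ X_cc / n

# Correlation: r_jk = s_jk / sqrt(s_jj * s_kk).
sd = np.sqrt(np.diag(S_cc))
R = S_cc / np.outer(sd, sd)
```

The divisor cancels in the correlation, so `R` is the same whether n or n-1 is used.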

Centering

• The notation $X_{CC}$ indicates that X has been column-centred. There are several alternatives

– No centering (uncentred), giving $X_{UC}$

– Row centering, giving $X_{RC}$

– Double centering, giving $X_{DC}$
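
The four variants can be sketched as a single helper. The function name `centre` and its `mode` strings are hypothetical. The slides do not spell out the double-centering formula, so the usual definition (subtract row means and column means, then add back the grand mean) is assumed here.

```python
import numpy as np

def centre(X, mode):
    """Return X under one of four centerings (hypothetical helper).

    mode: "uncentred", "column", "row" or "double".
    """
    X = np.asarray(X, dtype=float)
    if mode == "uncentred":
        return X.copy()
    col_means = X.mean(axis=0, keepdims=True)   # one mean per variable
    row_means = X.mean(axis=1, keepdims=True)   # one mean per observation
    if mode == "column":
        return X - col_means
    if mode == "row":
        return X - row_means
    if mode == "double":
        # Assumed definition: x_ij - rowmean_i - colmean_j + grand mean.
        return X - row_means - col_means + X.mean()
    raise ValueError(f"unknown centering mode: {mode!r}")

# Small made-up matrix to show the effect of each mode.
X_demo = np.arange(12.0).reshape(3, 4) ** 2
X_dc = centre(X_demo, "double")   # rows and columns of X_dc both average to zero
```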

Other forms of covariance

• For each of the X matrices, we can calculate a matrix of ‘modified covariances’, as

$$ S = X^T X / n $$

For example, an ‘uncentred covariance matrix’ can be defined to have elements

$$ s_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} x_{ik} $$
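
Expanding the product in the column-centred formula shows that each uncentred element equals the ordinary covariance plus the product of the two column means. The sketch below checks this numerically on synthetic data. The variable names are illustrative, and the data are random numbers with a non-zero mean.

```python
import numpy as np

# Synthetic data with a clearly non-zero mean (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(16, 12))
n = X.shape[0]
xbar = X.mean(axis=0)

S_uc = X.T @ X / n                 # uncentred 'covariance'
X_cc = X - xbar
S_cc = X_cc.T @ X_cc / n           # column-centred covariance

# The two differ by the outer product of the mean vector with itself.
identity_holds = np.allclose(S_uc - S_cc, np.outer(xbar, xbar))
```

When the means are large relative to the spread, the outer-product term dominates the uncentred matrix.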

Other forms of correlation

• ‘Correlations’ can be defined corresponding to each of the modified covariances

• Hyvärinen et al. (2001, pp. 24, 25) define correlation as an uncentred version, but covariance with column centering!
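
The slides do not give a formula for the modified ‘correlations’. A minimal sketch, assuming they are formed like the ordinary correlation (each modified covariance divided by the square root of the product of the two matching diagonal elements), is shown below. The helper name `modified_correlation` is hypothetical and the data are synthetic. Under this assumption the uncentred version is the cosine of the angle between two data columns.

```python
import numpy as np

def modified_correlation(Xc):
    """Scale X^T X / n by its own diagonal (assumed definition)."""
    S = Xc.T @ Xc / Xc.shape[0]
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# Synthetic data with non-zero means (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(15, 12))

R_uc = modified_correlation(X)                     # uncentred 'correlation'
R_cc = modified_correlation(X - X.mean(axis=0))    # ordinary correlation

# Cosine of the angle between the first two columns of X.
cosine_01 = X[:, 0] @ X[:, 1] / (np.linalg.norm(X[:, 0]) * np.linalg.norm(X[:, 1]))
```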

PCA (EOF analysis) – some definitions, terminology

• If x is a vector of p variables, then the principal components (PCs) are linear combinations $a_1^T x, a_2^T x, \ldots, a_p^T x$

• In the kth PC, $a_k$, the vector of coefficients or loadings, is chosen so that the variance of $a_k^T x$ is maximised, subject to a normalisation constraint $a_k^T a_k = 1$, and subject to successive PCs being uncorrelated

PCA – more definitions, terminology

• The optimisation problem which defines PCs turns out, like many in multivariate analysis, to be an eigenvalue problem

• The variances of the PCs are eigenvalues of the covariance matrix of x, in descending order, and the vectors of loadings ak are the corresponding eigenvectors.

• If variables are replaced by standardised variables, obtained by dividing by respective standard deviations, then PCA finds eigenvalues and eigenvectors of the correlation matrix
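
The eigenvalue formulation can be sketched in a few lines of NumPy. The helper `pca` and the synthetic data are assumptions made for illustration. The same helper accepts any of the modified covariance matrices, since it only needs a symmetric matrix.

```python
import numpy as np

def pca(S):
    """Eigenvalues (descending) and matching eigenvectors of a symmetric matrix S."""
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Synthetic data with unequal column variances (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 12)) * np.arange(1, 13)
n = X.shape[0]

X_cc = X - X.mean(axis=0)
cov_vals, cov_vecs = pca(X_cc.T @ X_cc / n)    # covariance-matrix PCA

Z = X_cc / X_cc.std(axis=0)                    # standardised variables
cor_vals, cor_vecs = pca(Z.T @ Z / n)          # correlation-matrix PCA

scores = X_cc @ cov_vecs                       # PC scores, one column per PC
share = cov_vals / cov_vals.sum()              # proportion of variance per PC
```

The columns of `scores` are uncorrelated and their variances (divisor n) equal `cov_vals`, which is the property the two bullets above describe.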

Varieties of PCA

• As well as the covariance/correlation dichotomy, we can do corresponding analyses on the various modified versions

• All have been used somewhere in the literature, but it is not always obvious how to interpret what is being done, and what the results mean

Examples

• We illustrate the various analyses with two toy examples

– Monthly averages of maximum daily temperature for 16 UK stations (n=16; p=12) in 2002

– Monthly precipitation totals for 15 UK stations (n=15; p=12) in 2002

• For the first of these, analyses were done using both Celsius and Fahrenheit

[Figure: Loadings of correlation matrix PC1, PC2 (temperature data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: PC1, PC2.]

Temperature data – column centred

• Correlation matrix analysis has first PC as a measure of overall temperature at each station; second PC measures seasonal cycle

• Correlation matrix analysis is invariant to use of Celsius or Fahrenheit; so is covariance analysis because the transformation is the same for all variables

• Covariance analysis is similar except that loadings on the first PC are slightly more variable, reflecting different variance values; similar amounts of variation are accounted for by PC1 (73%, 74% for correlation, covariance respectively)
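
The invariance claim can be checked on made-up data. The sketch below uses synthetic "temperatures" (a seasonal cycle plus a station effect plus noise, not the 2002 UK data) and a hypothetical helper `first_pc`. After column centering, the Fahrenheit offset of 32 disappears and the common factor 1.8 rescales every eigenvalue equally, so loadings and variance shares should agree.

```python
import numpy as np

def first_pc(X, standardise=False):
    """PC1 loadings and share of variance from a column-centred analysis."""
    Xc = X - X.mean(axis=0)
    if standardise:
        Xc = Xc / Xc.std(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
    v = vecs[:, -1]                     # eigh sorts ascending, so the last column is PC1
    v = v if v.sum() >= 0 else -v       # fix the arbitrary sign of the eigenvector
    return v, vals[-1] / vals.sum()

# Synthetic monthly temperatures for 16 "stations" (illustrative only).
rng = np.random.default_rng(2)
months = np.arange(12)
seasonal = 12.0 - 7.0 * np.cos(2 * np.pi * months / 12)
station = rng.normal(scale=2.0, size=(16, 1))
celsius = seasonal + station + rng.normal(scale=0.5, size=(16, 12))
fahrenheit = 1.8 * celsius + 32.0

cov_c, cov_share_c = first_pc(celsius)
cov_f, cov_share_f = first_pc(fahrenheit)
cor_c, cor_share_c = first_pc(celsius, standardise=True)
cor_f, cor_share_f = first_pc(fahrenheit, standardise=True)
```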

[Figure: Loadings of covariance matrix PC1, PC2 (temperature data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: CovPC1, CovPC2.]

Precipitation data – column centred

• Because precipitation is less spatially structured than temperature, correlations are more variable and not all are positive

• Hence loadings on first PC are no longer nearly uniform. The correlation analysis has two main exceptions (July, December); the covariance analysis is much more variable, due to large differences in variances (Sep = 159, Feb = 4664)

[Figure: Loadings of correlation matrix PC1, PC2 (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: PC1, PC2.]

[Figure: Loadings of covariance matrix PC1, PC2 (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: CovPC1, CovPC2.]

Precipitation data – column centred II

• Also PC1 is less dominant than for temperature (65% covariance, 55% correlation)

• PC2 is dominated by months that are least correlated with the rest in both analyses, though details are different

Temperature data – uncentred ‘covariance’ analysis

• We are now looking at directions with the maximum variation with respect to the origin, rather than with respect to the mean. Hence the mean itself often determines the form of the first (frequently very dominant) ‘component’

• In this example, PC1 & PC2 have similar loadings to those in the column-centred analysis, but the first PC is a much more dominant source of variation and a seasonal cycle is now apparent in PC1 reflecting the annual cycle in the means

[Figure: Loadings of covariance PC1: column-centred, uncentred (Celsius), uncentred (Fahrenheit). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: UCCovCPC1, CovPC1, UCCovFPC1.]

Temperature data – uncentred ‘covariance’ analysis II

• Results are not invariant to choice of scale

• Because values for Fahrenheit are further from the origin than Celsius, PC1 is even more dominant (99.95% of ‘variation’ for °F; 99.73% for °C; 74.0% for column-centred)

• Also loadings in PC1 are less variable for °F than for °C in uncentred analysis

• It seems unwise to use uncentred analyses unless the origin is meaningful. Even then, it will be uninformative if all measurements are far from the origin
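
The dependence on the origin can be reproduced with synthetic data. This sketch does not use the station data, so its percentages will not match the ones quoted above. The helper `pc1_share` is hypothetical. It compares the share of ‘variation’ taken by PC1 in a column-centred analysis with the uncentred analyses in °C and °F.

```python
import numpy as np

def pc1_share(X, centred):
    """Share of total 'variation' accounted for by PC1 of X^T X / n."""
    Xc = X - X.mean(axis=0) if centred else X
    vals = np.linalg.eigvalsh(Xc.T @ Xc / X.shape[0])   # ascending eigenvalues
    return vals[-1] / vals.sum()

# Synthetic monthly temperatures for 16 "stations" (illustrative only).
rng = np.random.default_rng(2)
months = np.arange(12)
seasonal = 12.0 - 7.0 * np.cos(2 * np.pi * months / 12)
station = rng.normal(scale=2.0, size=(16, 1))
celsius = seasonal + station + rng.normal(scale=0.5, size=(16, 12))
fahrenheit = 1.8 * celsius + 32.0

share_cc = pc1_share(celsius, centred=True)        # same value for Fahrenheit
share_uc_c = pc1_share(celsius, centred=False)     # uncentred, Celsius
share_uc_f = pc1_share(fahrenheit, centred=False)  # uncentred, Fahrenheit

print(f"column-centred: {share_cc:.4f}")
print(f"uncentred (C):  {share_uc_c:.4f}")
print(f"uncentred (F):  {share_uc_f:.4f}")
```

Moving the data further from the origin inflates the first uncentred eigenvalue, which is the behaviour the bullets describe.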

[Figure: Loadings of covariance PC2: column-centred, uncentred (Celsius), uncentred (Fahrenheit). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: UCCovCPC2, CovPC2, UCCovFPC2.]

Temperature data – uncentred ‘correlation’ analysis

• Not invariant to choice of scale, but PC1 is very close to an equally weighted combination of all variables in both cases

• PC2 is also quite similar in both cases – seasonal cycle again

• Larger numbers for °F so more extreme behaviour (99.94% compared to 99.5% for PC1; greater uniformity of loadings in PC1)

Uncentred analyses and anomalies

• One case where uncentred analyses are appropriate is if we can assume that the population means of our variables are zero, although the sample means are not

• This is the case when the data are anomalies

Precipitation data – uncentred ‘covariance’ analysis

• PC1 again becomes more dominant than in the column-centred analysis (91.7% vs. 65.0%)

• All loadings on PC1 now have the same sign and are more similar in value; PC2 has little in common with PC2 for the column-centred analysis

[Figure: Loadings of covariance PC1: column-centred, uncentred (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: CovPC1, UCCovPC1.]

[Figure: Loadings of covariance PC2: column-centred, uncentred (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: CovPC2, UCCovPC2.]

Precipitation data – uncentred ‘correlation’ analysis

• PC1 is very, very close to an equally-weighted combination of all months – it accounts for 91.1% of ‘variation’

• PC2 contrasts the first 6 months with the last 6 months. Why? How can this be interpreted?

Temperature data – doubly-centred ‘covariance’ analysis

• This analysis is invariant to choice of °F or °C

• PC1 and PC2 have similar loadings to PC2, PC3 in column-centred analysis. This is because the double centering induces a constraint $x_1 + x_2 + \ldots + x_p = 0$. This implies that the first PC in the column-centred analysis now has near-zero variance – other PCs move up one, and the last PC is now given by the relationship above
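
The constraint can be seen directly: after double centering every row sums to zero, so the vector of ones is an eigenvector of the doubly-centred ‘covariance’ matrix with eigenvalue zero. A minimal sketch on synthetic data follows. The double-centering formula (row and column means removed, grand mean added back) is the usual definition and is assumed here.

```python
import numpy as np

# Synthetic data (illustrative only): n = 16 observations, p = 12 variables.
rng = np.random.default_rng(3)
X = rng.normal(loc=10.0, size=(16, 12))
n, p = X.shape

# Assumed double centering: remove row and column means, add back the grand mean.
X_dc = (X - X.mean(axis=0, keepdims=True)
          - X.mean(axis=1, keepdims=True)
          + X.mean())

row_sums = X_dc.sum(axis=1)             # the constraint x_1 + ... + x_p = 0, row by row

S_dc = X_dc.T @ X_dc / n
vals, vecs = np.linalg.eigh(S_dc)       # ascending eigenvalues
smallest_val = vals[0]                  # zero up to rounding error
last_loading = vecs[:, 0]               # proportional to a vector of ones
```

The remaining eigenvectors are orthogonal to the vector of ones, so their loadings sum to zero and no component with near-equal loadings can appear among them.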

Temperature data – doubly-centred ‘correlation’ analysis

• Again there is invariance to choice of scale

• PC1 accounts for less ‘variation’ than in ‘covariance’ analysis (63.4% vs. 77.8%) but structure of loadings in first two PCs is similar

Precipitation data – doubly-centred ‘covariance’ analysis

• The double centering again induces a constraint $x_1 + x_2 + \ldots + x_p = 0$, given by the last PC

• Because the first PC in the column-centred analysis is not particularly close to $x_1 + x_2 + \ldots + x_p$, PC2 & PC3 don’t look much like PC1 & PC2 of the column-centred analysis for these data

• PC1 accounts for only 40.5% of the (doubly-centred) variation

Precipitation data – doubly-centred ‘correlation’ analysis

• PC1 accounts for 33.6% of ‘variation’ and is similar to that for covariance (Jan, Feb vs. Jul, Aug, Sep, Dec – but how to interpret it?)

• PC2 is completely different in covariance and correlation analyses

[Figure: Loadings of doubly centred PC1: covariance and correlation (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: UCPC1, UCCovPC1.]

[Figure: Loadings of doubly centred PC2: covariance and correlation (precipitation data). Horizontal axis "Index", months 1 to 12; vertical axis "Data", the loading values; legend: UCPC2, UCCovPC2.]

When and why use double centering

• If an analysis is likely to be dominated by an uninteresting ‘size’ PC – all loadings of the same sign and roughly equal magnitude (size/shape analysis, species abundance data) – then double-centering removes it

• Can also be thought of as removing row and column effects from a data matrix and concentrating on interactions between rows and columns.

• Uncentred analysis accentuates size PCs rather than removing them

Row-centred analysis

• If column-centred analysis is S-mode analysis then row-centred analysis is T-mode

• It is sometimes suggested that T-mode is related to S-mode by simply transposing the data matrix, but this is not the case in general – different centerings are involved

• The relationship does hold if double-centering is used
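
One way to see the point about transposition, taking S-mode to mean column centering of X and T-mode to mean the same operation applied to the transposed matrix (i.e. row centering of X), is sketched below. The helper names and the synthetic data are illustrative assumptions. Column centering does not commute with transposition, whereas double centering does.

```python
import numpy as np

def column_centre(X):
    """Subtract each column's mean."""
    return X - X.mean(axis=0, keepdims=True)

def double_centre(X):
    """Assumed double centering: remove row and column means, add back the grand mean."""
    return (X - X.mean(axis=0, keepdims=True)
              - X.mean(axis=1, keepdims=True)
              + X.mean())

# Synthetic data (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(loc=10.0, size=(16, 12))

# Centre-then-transpose versus transpose-then-centre.
col_commutes = np.allclose(column_centre(X.T), column_centre(X).T)   # differs for generic data
dbl_commutes = np.allclose(double_centre(X.T), double_centre(X).T)   # agrees
```

With double centering the two modes therefore work from one matrix and its transpose, which share the same singular value decomposition with the left and right singular vectors swapped.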

Related ideas

• Double-centering uses a similar idea to correspondence analysis, but is different in how row and column effects are removed

• There are a number of varieties of correlation and anomaly correlation, corresponding to different choices for centering

• Empirical orthogonal teleconnections (van den Dool et al., 2000) use uncentred covariances in a regression context

• Takane and Shibayama (1991) decompose a data matrix into 4 terms. SVDs of sums of one or more of these terms give uncentred, column-centred, row-centred and doubly-centred PCAs

Final remarks

• Standard EOF analysis is (relatively) easy to understand – variance maximisation

• For other techniques it’s less clear what we are optimising, and how to interpret the results

• There may be reasons for using no centering or double centering, but potential users need to understand and explain what they are doing

References

• Hyvärinen, A., Karhunen, J. & Oja, E. (2001). Independent component analysis. Wiley.

• Takane, Y. & Shibayama, T. (1991). Principal component analysis with external information on both subjects and variables. Psychometrika, 56, 97-120.

• Van den Dool, H. M., Saha, S. & Johansson, Å. (2000). Empirical orthogonal teleconnections. J. Climate, 13, 1421-1435.