Ordinations
Peter Shaw
Introduction to ordination
“Ordination” is the term used for arranging multivariate data in a rational order, usually into a 2D space (i.e. a diagram on paper).
The underlying concept is that of taking a set of community data (or chemical, or physical…) and organising it into a picture that gives you insight into its organisation.
[Figure: a direct ordination, projecting communities onto time – the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958). The sequence runs from bare sand stabilised by Marram grass through cottonwood (Populus deltoides), pines (Pinus spp.) and black oak (Quercus velutina), plotted against years stable (0 to 10,000), moisture (wet / muddy / dry) and elevation (177–193 m).]
[Figure: vegetation types (annual grassland, valley-foothill hardwood, chaparral, montane hardwood, Ponderosa pine, mixed conifers, Jeffrey pine, pinyon, red fir, lodgepole pine, subalpine conifer, alpine dwarf scrub, wet meadows) laid out against elevation (1000–4000 m) and a moist-to-dry gradient.]
A 2-D direct ordination, also called a mosaic diagram, in this case showing the distribution of vegetation types in relation to elevation and moisture in Sequoia National Park. This is an example of a direct ordination, laying out communities in relation to two well-understood axes of variation. Redrawn from Vankat (1982) with kind permission of the California Botanical Society.
Bray-Curtis ordination
This technique is good for introducing the concept of ordination, but is almost never used nowadays. It dates back to the late 1950s, when computers were unavailable, and had the advantage that it could be run by hand.
3 steps:
1: Convert raw data to a matrix of distances between samples.
2: Identify end points.
3: Plot each sample on a graph in relation to the end points.
Sample data - a succession:

Year    A    B    C
  1   100    0    0
  2    90   10    0
  3    80   20    5
  4    60   35   10
  5    50   50   20
  6    40   60   30
  7    20   30   40
  8     5   20   60
  9     0   10   75
 10     0    0   90
Choose a measure of distance between years. The usual one is the Bray-Curtis index (the Czekanowski index).
Between years 1 & 2:

            A    B    C
Y1        100    0    0
Y2         90   10    0
Minimum    90    0    0   total =  90
Sum       190   10    0   total = 200

distance = 1 - 2*90/200 = 0.1
        1    2    3    4    5    6    7    8    9   10
 1   0.00
 2   0.10 0.00
 3   0.29 0.12 0.00
 4   0.41 0.28 0.15 0.00
 5   0.54 0.45 0.33 0.16 0.00
 6   0.65 0.50 0.45 0.28 0.12 0.00
 7   0.79 0.68 0.54 0.38 0.33 0.27 0.00
 8   0.95 0.84 0.68 0.63 0.56 0.49 0.26 0.00
 9   1.00 0.89 0.74 0.79 0.71 0.63 0.43 0.18 0.00
10   1.00 1.00 0.95 0.90 0.81 0.73 0.56 0.33 0.14 0.00
The matrix of B-C distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown.
PS: I did this by hand. PPS: it took about 30 minutes!
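The same calculation is easy to script. A minimal sketch in Python (not part of the original handout; the function name `bray_curtis` is my own). Since the matrix above was done by hand, a few of its entries may differ slightly from the scripted values:

```python
# Bray-Curtis distance: d = 1 - 2*sum(min(x_i, y_i)) / (sum(x) + sum(y))
def bray_curtis(x, y):
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))

# The successional sample data (years 1-10; species A, B, C)
years = [
    [100, 0, 0], [90, 10, 0], [80, 20, 5], [60, 35, 10], [50, 50, 20],
    [40, 60, 30], [20, 30, 40], [5, 20, 60], [0, 10, 75], [0, 0, 90],
]

# Print the lower triangle of the distance matrix
for i in range(len(years)):
    print(i + 1, [round(bray_curtis(years[i], years[j]), 2) for j in range(i + 1)])
```

Note that years with no species in common (e.g. year 1 vs year 10) come out at the maximum distance of 1.0, exactly as in the hand-worked matrix.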
Now establish the end points - this is a subjective choice; I chose years 1 and 10 as being extreme values.
Now draw a line with 1 at one end, 10 at the other. The length of this line = a distance of 1.0, based on the matrix above.
[Figure: years 1 and 10 placed at the ends of a line of length 1.0. Year 2 lies 0.1 units from year 1 and 1.0 units from year 10, so it is located by drawing two circles (radius 0.1 around year 1, radius 1.0 around year 10) and placing it at their intersection. Repeating this for each year gives the final ordination of the points (approx.), with years 2–9 arching between the two end points.]
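The two-circles construction can also be done algebraically: put the end points on a baseline and intersect the circles. A short sketch (the helper `locate` is my own, not part of the handout):

```python
import math

# Place year 1 at (0, 0) and year 10 at (L, 0). A sample at distance d1 from
# year 1 and d10 from year 10 sits where the two circles intersect.
def locate(d1, d10, L=1.0):
    x = (d1**2 - d10**2 + L**2) / (2 * L)
    y_sq = d1**2 - x**2
    y = math.sqrt(max(y_sq, 0.0))  # clamp: hand-measured distances may not quite meet
    return x, y

x, y = locate(0.1, 1.0)  # year 2: 0.1 from year 1, 1.0 from year 10
```

The returned point is exactly 0.1 from one end and 1.0 from the other, just as the compasses would place it.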
Principal Components Analysis (PCA)
This is our first true multivariate technique, and is one of my favourites. It is fairly close to multiple linear regression, with one big conceptual difference (and a hugely different output).
The difference lies in the fitting of residuals: MLR fits them vertically to one special variable (Y, the dependent); PCA fits them orthogonally - all variables are equally important, and there is no dependent.
[Diagram: MLR fits residuals parallel to the Y axis in (Y, X1, X2) space; PCA fits them perpendicular to the fitted line in (V1, V2, V3) space.]
Conceptually the intentions are very different: MLR seeks to find the best hyper-plane through the data. PCA [can be thought of as] starting off by fitting one best LINE through the cloud of datapoints, like shining a laser beam through an array of tethered balloons. This line passes through the middle of the dataset (defined as the mean of each variable), and runs along the axis which explains the greatest amount of variation within the data.
This first line is known as the first principal axis – it is the most useful single line that can be fitted through the data (formally, the linear combination of variables with the greatest variance).
[Figure: the 1st principal axis of a dataset – shown as a laser in a room of tethered balloons, passing through the overall mean of the dataset.]
Often the first axis of a dataset will be an obvious simple source of variation: overall size (body/catchment size etc) for allometric data, experimental treatment in a well designed experiment.
A good indicator of the importance of a principal axis is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with 10-20 variables you expect c. 30% of variation on the 1st principal axis.
Having fitted one principal axis, you can fit a 2nd. It will explain less variance than the 1st axis.
This 2nd axis explains the maximum possible variation, subject to 2 constraints:
1: it is orthogonal to (at 90 degrees to) the 1st principal axis
2: it runs through the mean of the dataset.
[Figure: the 2nd principal axis of a dataset, shown in blue, also passing through the overall mean. The diagram shows a 3D dataspace (axes V1, V2 and V3) for the data below.]

V1: 1.0  0.3  0.5  1.3  1.5  2.5  2.5  3.4
V2: 0.1  1.0  2.0  0.8  1.5  2.2  2.8  3.4
V3: 0.3  2.1  2.0  1.0  1.2  2.5  2.5  3.5
We can now cast a shadow..
The first 2 principal axes define a 2D plane, on which the positions of all the datapoints can be projected.
Note that this is exactly the same as casting a shadow.
Thus PCA allows us to examine the shadow of a high-dimensional object, say a 30-dimensional dataspace (defined by 30 variables such as species density).
Such a projection is a basic ordination diagram.
I always use such diagrams when exploring data – the scattergraph of the 1st 2 principal axes of a dataset is the most genuinely informative description of the entire data.
More axes:
A PCA can generate as many principal axes as there are axes in the original dataspace, but each one is progressively less important than the one preceding it.
In my experience the first axis is usually intelligible (often blindingly obvious), the second often useful, the third rarely useful; 4th and above always seem to be random noise.
To understand a PCA output.. Inspecting the scattergraph is useful, but you need to know the importance of each variable with respect to each axis. This is the gradient of the axis with respect to that variable.
This is given as standard in PCA output, but you need to know how the numbers are arrived at to know what the gibberish means!
How it’s done..
1: derive the matrix of all correlation coefficients – the correlation matrix. Note the similarity to Bray-Curtis ordination: we start with N columns of data, then derive an N*N matrix informing us about the relationship between each pair of columns.
2: Derive the eigenvectors and eigenvalues of this matrix. It turns out that MOST multivariate techniques involve eigenvector analysis, so you may as well get used to the term!
Eigenvectors 1: These involve you knowing a little about matrix multiplication. Matrix multiplication is essentially the same as solving a shopping bill!
I have 2 apples, 1 banana and 3 oranges. You have 1 apple, 2 bananas and 3 oranges. Costs: 10p per apple, 15p per banana, 20p per orange.
I pay 2*10 + 1*15 + 3*20 = 95p; you pay 1*10 + 2*15 + 3*20 = 100p.

    [ 2 1 3 ]   [ 10 ]   [  95 ]    my amounts (top row), your amounts (bottom row),
    [ 1 2 3 ] x [ 15 ] = [ 100 ]    giving my total (top) and your total (bottom)
                [ 20 ]

Such a calculation is called a linear combination.
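The shopping bill can be written out directly as code (a minimal sketch, not part of the handout):

```python
# The shopping bill as a matrix product: a 2x3 matrix of amounts times a
# 3x1 vector of prices gives a 2x1 vector of totals (linear combinations).
amounts = [[2, 1, 3],   # my basket: apples, bananas, oranges
           [1, 2, 3]]   # your basket
prices = [10, 15, 20]   # pence per apple, banana, orange

totals = [sum(a * p for a, p in zip(row, prices)) for row in amounts]
print(totals)  # [95, 100]
```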
Eigenvectors 2: These are intimately linked with matrix multiplication. You don't have to know this bit, but it would help!
Take an [N*N] matrix M, and use it to multiply an [N*1] vector of 1's. This gives you a new [N*1] vector of different numbers; call this V1. Multiply V1 by M to get V2; multiply V2 by M to get V3, etc. After infinite repetitions the elements of V will settle down to a steady pattern – this is the dominant eigenvector of the matrix M, E1.
Each time V is multiplied by M it grows by a constant multiple, which is the first eigenvalue of M, λ1.
[Diagram: M * [1 1 1]' = V1; M * V1 = V2; M * V2 = V3 … After a while, each successive multiplication preserves the shape (the eigenvector) while increasing the values by a constant multiple (the eigenvalue).]
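This repeated multiplication is the classic "power iteration". A short sketch (my own illustration, with rescaling added so the numbers do not blow up during the infinite repetitions):

```python
# Power iteration: start from a vector of 1's and multiply repeatedly by M.
# The direction settles to the dominant eigenvector; the per-step growth
# factor settles to the dominant eigenvalue.
def power_iteration(M, steps=50):
    v = [1.0] * len(M)
    growth = 0.0
    for _ in range(steps):
        w = [sum(m * x for m, x in zip(row, v)) for row in M]
        growth = max(abs(x) for x in w)   # growth per multiplication
        v = [x / growth for x in w]       # rescale so values stay finite
    return growth, v

# Example: this matrix has eigenvalues 3 and 1, so the iteration finds 3
value, vector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```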
The projections are made by multiplying the source data by the corresponding eigenvector elements, then adding these together. Thus the projection of site A on the first principal axis is based on the calculation:

(spp1A*V11 + spp2A*V21 + spp3A*V31 …)

where spp1A = number of species 1 at site A, etc., and V21 = 1st eigenvector element for species 2.
Luckily the PC does this for you. There is one added complication: you do not usually use raw data in the above calculation. It is possible to do so, but the resulting scores are heavily dominated by the commonest species.
Instead all species data are first converted to Z scores, so that mean = 0.00 and sd = 1.00. This means that principal axes, which always run through the mean, are always centred on the origin 0,0,0,0,… It also means that typically half the numbers in a PCA output are negative, and all are apparently unintelligible!
Formally:
1: Derive the eigenvectors of the correlation matrix. Call the 1st one E1.
2: Convert all raw data into Z scores (mean = 0, sd = 1.0).
3: 1st axis scores = Z * E1 (where Z is an R*N matrix of R observations on N variables, and E1 is an N*1 vector). 2nd axis scores = Z*E2, etc.
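The three formal steps translate almost line-for-line into code. A sketch assuming numpy is available (`pca_scores` is my own name; real packages differ in sign conventions and scaling, and this assumes no constant columns):

```python
import numpy as np

def pca_scores(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # step 2: convert to Z scores
    R = np.corrcoef(X, rowvar=False)           # step 1: correlation matrix
    evals, evecs = np.linalg.eigh(R)           # eigenvalues/vectors (ascending)
    order = np.argsort(evals)[::-1]            # re-order: largest variance first
    evals, evecs = evals[order], evecs[:, order]
    scores = Z @ evecs                         # step 3: axis scores = Z * E
    pct_var = 100 * evals / evals.sum()        # % variance on each axis
    return scores, evecs, pct_var
```

The sign of each eigenvector (and hence each axis of scores) is arbitrary, so different packages can produce mirror-image diagrams of the same data.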
The important part of a PCA is interpreting the output! DON'T PANIC!!
Stage 1: look at the variance on the 1st axis. Values above 50% are hopeful; values in the 20s suggest a poor or noisy dataset. There is no significance test, but referral to a table of predicted values from the broken stick distribution is often helpful.
Stage 2: look at the eigenvector elements on the first axis – which species / variables have large loadings (positive or negative)? Do some variables have opposed loadings (one very +ve, one very -ve), suggesting a gradient between the two extremes?
The actual values of the eigenvector elements are meaningless – it is their pattern which matters. In fact the sign of the elements will sometimes differ between packages if they use different algorithms. It's OK! The diagrams will look the same; it's just that the pattern of the points will be reversed.
Inspect the diagrams
Stage 3: plot the scattergraph of the ordination scores and have a look.
This is the shadow of the dataset. Look for clusters, outliers, gradients.
Each point is one observation, so you can identify odd points and check the values in them. (Good way to pick up typing errors).
Overlay the graph with various markers – often this picks out important trends.
Data on Collembola of industrial waste sites.
The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.
[Figure: PCA ordination of Collembola succession on PFA sites – first principal axis against second principal axis, points marked by habitat: scrub woods vs open lagoon.]
[Figure: PCA ordination re-plotted to highlight succession in the Tilbury trials – first principal axis against second principal axis, points marked by site age in years (4.00–7.00).]
Biplots
Since there is an intimate connection between eigenvector loadings and axis scores, it is helpful to inspect them together.
There is an elegant solution here, known as a biplot.
You plot the site scores as points on a graph, and put eigenvector elements on the same graph (usually as arrows).
[Figure: pond community handout data – site scores plotted by SPSS (REGR factor score 1 against REGR factor score 2), points marked by WATER (1.00 / .00). These are the new variables which appear after PCA: FAC1 and FAC2. Note aquatic sites at the left-hand side.]
[Figure: factor scores (= eigenvector elements) for each species in the pond dataset, AX1 against AX2: potentil, potamoge, epilob, ranuscle, phragaus. Note aquatic species at the left-hand side.]
[Figure: pond community handout data – the biplot: site scores (marked by WATER) overlaid with species arrows (Potamog., Phrag., Epilobium, Ranunc., Potent.). Note aquatic sites and species at the left-hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated.]
Other ordination techniques: Correspondence Analysis (CA), also known as Reciprocal Averaging (RA)
This technique is widely used, and relates to the notion of the weighted mean.
1: Take an N*R matrix of sites*species data, and calculate the weighted mean score for each site, giving each species an arbitrary weighting. Now use the site scores to get a weighted mean score for each species.
2: Repeat stage 1 until the scores stabilise.
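The two-step averaging loop can be sketched in a few lines. This is a stripped-down illustration of my own (it assumes non-negative abundances with no empty rows or columns; real CA also centres the scores and yields eigenvalues):

```python
# Reciprocal averaging: alternate weighted averaging of species scores and
# site scores, rescaling each pass, until the pattern stabilises.
def reciprocal_averaging(A, iters=100):
    # A[i][j] = abundance of species j at site i
    n_sites, n_spp = len(A), len(A[0])
    spp = [float(j) for j in range(n_spp)]        # arbitrary starting weights
    site = [0.0] * n_sites
    for _ in range(iters):
        # each site score = abundance-weighted mean of its species' scores
        site = [sum(A[i][j] * spp[j] for j in range(n_spp)) / sum(A[i])
                for i in range(n_sites)]
        # each species score = abundance-weighted mean of its sites' scores
        spp = [sum(A[i][j] * site[i] for i in range(n_sites)) /
               sum(A[i][j] for i in range(n_sites)) for j in range(n_spp)]
        lo, hi = min(spp), max(spp)
        spp = [(s - lo) / (hi - lo) for s in spp]  # rescale to the range 0-1
    return site, spp
```

Run on the idealised successional data from the Bray-Curtis section, the site scores come out in successional order, with early and late species at opposite ends of the species scale.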
CA has a useful feature:
Namely that it derives scores for species and sites at the same time. These can be displayed in a biplot, just as with PCA. This is conceptually simpler than a PCA biplot due to the simultaneous derivation of scores.
The formal algorithm involved here is almost the same as PCA – except that (in theory) you extract eigenvectors of the matrix of chi-squared distances between samples or species, instead of the correlation coefficients.
PCA and CA have an odd feature. This concerns the ordination diagram produced when analysing a succession, or a community which changes steadily along a transect in space.
[Diagram: Years 1 to 6 laid out in a straight line along Axis 1, with Axis 2 unused.]
You would expect this pattern – and would be wrong!!
[Figure: the idealised successional data (used in the Bray-Curtis ordination last week) – mean abundances of SPA, SPB and SPC plotted against years 1.00–10.00.]
[Figure: PCA ordination of the idealised successional data – the Horseshoe effect (arch distortion). Years 1.00–10.00 trace an arch through the plane of the first two axes (REGR factor scores 1 and 2) rather than a straight line.]
This horseshoe effect was a puzzle when 1st discovered..
Ideas were that it represented a bug in the techniques, or a hitherto undiscovered fundamental truth about ecological successions.
Neither is the case. The algorithm simply tells the truth! It’s just that humans are no good at visualising high-dimensional spaces. If they were they would know that successions must be arched in a space where distance = difference between sites.
Think about the difference between an early-successional site and a mid-successional one. It is likely to be absolute – no species in common. How about an early vs a late successional site? The same, not more.
So in a dataspace where separation = difference between sites:
[Diagram: sites 1–4 placed on an arc – adjacent sites (sites 2 and 3) are not far apart, while site 1 sits at the maximum separation from site 3 and from site 4 alike.]
The only way to ensure that the early-mid distance = the early-late distance is for the succession to be portrayed as a curve. It is!
[Diagram: Early, mid and Late stages lying on an arch in the Axis 1 × Axis 2 plane.]
This arch effect caught me out too..
In analysing the communities of fungi in pine forests (which roughly corresponded to a succession), I found that the 2nd axis of the PCA correlated significantly with diversity (Shannon index).
Getting my head round why the 2D projection of a 10D space should correlate with the information content of the community was not exactly easy! Well done to my supervisor for pointing out that the arch distortion and community diversity both peaked mid-succession.
[Diagram: the Ax1 × Ax2 ordination arches through Young, Medium and Old sites. HENCE: Ax2 peaks at intermediate site age. ALSO: diversity peaks at intermediate site age. HENCE: Ax2 correlates with diversity.]
DECORANA (or rather DCA)
This arch effect was a well-known irritant to ecologists by the mid 1970s, and was “sorted out” by a piece of software written by Mark Hill (then and still of ITE).
The program was called DECORANA, for DEtrended CORrespondence ANAlysis. The algorithm is properly called Detrended Correspondence Analysis or DCA - it is a common minor misnomer to confuse the algorithm with Mark Hill's FORTRAN code which implements it.
DCA contd.
In fact a minor bug has recently been found in the source code - it could invalidate a few analyses, but in practice seems of minor importance (it concerns the way eigenvalues are sorted when values are tied). Post-1997 releases of DCA should be safe. I still use an older version, with a reasonably clean conscience!
It is not supplied in any standard package, but is widely available in ecological packages - we have a copy in the dept.
DCA basics
The algorithm uses CA rather than PCA (faster extraction, plus simultaneous extraction of species and site scores). It is iterative: the 1st pass is simply a 2D CA ordination.
The 1st axis is then chopped up into N segments (default = 26), and each one is rescaled to have a common mean. The axis is also stretched at each end. This is to ensure that the extremes get equal representation (the technical term is “a constant turnover rate”).
[Diagram: before – the arch in the Ax1 × Ax2 plane is divided into segments 1–7; after – detrending and rescaling leave segments 1–7 spaced evenly along a single axis.]
DCA repeats this procedure until the iterations converge. This gives the 1st DCA axis. It has an eigenvalue, species scores and site scores just like CA - but don't ask what the numbers actually mean!!
The procedure is then repeated to get a 2nd axis - the same procedure, but the 1st axis is removed by subtraction at each stage.
DECORANA also gives a 3rd axis by the same algorithm, then stops. Note that the DCA algorithm could give N axes; as I said, 4th axes and above tend to be merely noise.
DECORANA output is presented as biplots, although the scores can be subjected to ANOVA etc. There is no hypothesis inherent in DCA, so no significance test is inherent either.
It is excellent for dealing with successions etc, and is many ecologists' ordination of choice.
I try to avoid it only because the input data format required (Cornell Condensed Format) is an unmitigated headache! (Unless you are happy with fixed columns and FORTRAN format statements.)
Cluster Analysis
"The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other statistical innovation (with the possible exception of multiple regression techniques)." Cormack (1971) A review of classification. Journal of the Royal Statistical Society A 134, 321-367.
Here the aim is to aggregate individuals into clusters, based on an objective estimate of the distance between them. The aim is to produce a dendrogram:
[Figure: nearest-neighbour cluster analysis of the Liphook fungi data – a dendrogram of OBS 01–OBS 35, scaled by distance (objective function, 3.1E+01 to 8.3E+05) and information remaining (100% down to 0%).]
This involves 2 choices, both allowing many options: 1: how to measure the distance? 2: what rules to build the dendrogram by?
In practice there are c. 3 standard distance measures and c. 5 standard sets of rules to build the dendrogram, giving 15 different algorithms for cluster analysis. In addition to these 15 different STANDARD ways of making a dendrogram, there are other options. One package (called CLUSTAN) offers a total of around 180 different algorithms.
But each different algorithm can generate a different shape of the dendrogram, a different pattern of relationships – and you have no way of knowing which one is most useful.
For a book I explored a small number of datasets in painful detail using various multivariate techniques, and they all gave the same basic story, identified the same extreme values and clusters.Except cluster analysis, which told me several different stories, all garbage!!
Worse, if you do happen to find a dendrogram sequence that makes sense – be careful, it may be lying to you! Dendrograms can be re-arranged around any joint (or node), and quite distantly related points can end up side-by-side.
[Diagram: drawn as D C B A, the dendrogram puts C and B next to each other (but connected only by a high-level node); re-drawn as B A D C, C and B are far apart – yet it is the same dendrogram. The 8 different ways of presenting the dendrogram of the quarry floor data (Usher 1975), using nearest neighbour analysis on a matrix of Euclidean distances: A B C D, B A C D, C D A B, D C A B, A B D C, B A D C, C D B A, D C B A.]
These patterns can re-arrange around any node, like a child's mobile.
The number of permutations of a dendrogram with n points is 2^(n-1)!!
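You can verify the 2^(n-1) count by brute force: enumerate every ordering reachable by flipping nodes. A small sketch of my own (representing a dendrogram as nested pairs):

```python
# Every internal node of a dendrogram can be flipped without changing the
# tree, so a dendrogram with n leaves (n-1 internal nodes) has 2**(n-1)
# valid left-to-right leaf orderings.
from itertools import product

def leaf_orders(tree):
    # tree is a leaf label, or a (left, right) pair of subtrees
    if not isinstance(tree, tuple):
        yield (tree,)
        return
    left, right = tree
    for l, r in product(leaf_orders(left), leaf_orders(right)):
        yield l + r   # node as drawn
        yield r + l   # node flipped

# ((A,B),(C,D)) has 4 leaves and 3 internal nodes: 2**3 = 8 orderings
orders = set(leaf_orders((("A", "B"), ("C", "D"))))
print(len(orders))  # 8
```

The 8 orderings produced match the 8 presentations of the quarry floor dendrogram above.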
This conjunction of S4 (top) and S1, S2, S3 below it was mentioned as showing the Lake Superior samples clustered together.
I have cut&pasted this dendrogram to show an alternative, valid arrangement, in which the Lake Superior samples are widely scattered.
The lesson here – ignore the ordering of objects along a cluster analysis axis! (Unlike all ordinations, in which order matters and is preserved under transformations.)
I know of at least one case of published work using a dendrogram to classify lake communities, where one stated discovery was that samples from one particular lake clustered together. It is true that they were next to each other on the published dendrogram, but there was a fault line down the middle of the cluster, which could have been re-arranged to show it as 2 widely spaced clusters!
My summary about cluster analysis – DON’T!! Ordinate it, or use TWINSPAN. I do have concern about the over-reliance of DNA researchers on dendrograms, although they seem to operate with fewer choices (3 distance measures, 6 ways to build up a tree) and usually publish topologically valid trees!
"The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis". Byron Morgan, personal communication.
TWINSPAN
TWINSPAN was written by Mark Hill (of ITE, now CEH, author of DECORANA) in 1979, and has DCA at its core. The aim is to produce a 2-way ordered classification of the data. This produces 2 linked dendrograms (one of species, one of sites), but these are fixed – they may not rotate, and in my experience they make a great deal of sense. It also shows you which species are most associated with which areas, and identifies indicator species for different parts of the dendrograms. Like DCA, this analysis has so far been confined to ecologists but deserves to be used far more widely.
Its technical details are a lecture in themselves – read it up well if you choose to use this.
TWO-WAY ORDERED TABLE
An ordered list of species against an ordered list of observations, with a dendrogram to classify species (codes at the right) and a dendrogram to classify observations (codes below the table):

                  1223331111222222 22333111 1 1
    52912516781345788906034049143235672 2
 2 CORTSEMI 455344---1-----1----2-------------- 000
 4 INOCLACE 413-----1-------------------------- 000
 6 LACTRUFU --5411-------------223------------- 000
 9 SUILLUTE 24423--1-3--11--------------------- 000
 1 BOLEFERR --1-2-22232224-2-22-1-2-----1------ 001
10 SUILVARI 555-2--4-44525552-33335------------ 001
 3 GOMPROSE 3325555555555555152444342---------- 01
 8 SUILBOVI 55555555555555553555555555--------- 01
 5 LACCPROX 54555455555535535555555555555542555 1
 7 PAXIINVO --221232--22212-1514244554122553345 1

   00000000000000000000000111111111111
   00000011111111111111111000111111111
   00011100000000001111111 000111111
CANOCO
CANOCO is a package, but just like DECORANA it has become associated with the algorithm it implements. To be correct, CANOCO implements several ordination algorithms, but the best known and most used is CCA = Canonical Correspondence Analysis.
CCA deals with the common ecological situation where you wish to relate community data to environmental data. Which spp are associated with which chemicals? Is the pattern random?
[Diagram: a sites × species abundance matrix analysed side-by-side with a matching sites × environment matrix (pH, Na, K, OM).]
Details of CCA are nightmarish, unless you enjoy advanced matrix algebra! The output is not much better! Luckily it can be converted into 2 user-friendly facets: a tri-plot, and a significance value.
[Figure: CCA triplot of Collembola succession on PFA in relation to soil conditions. Environmental arrows: LOI, P, pH, conductivity. Sites run with increasing age of PFA from fresh wet PFA through young dry PFA to woodland-stage PFA (Barking woods, Thurrock woods, Tilbury plots TilbZ, TilbO5, TilbC), with Thurrock's saline lagoon clearly separated; species are plotted as short codes (Ct, Bp, Em, Fc, Hv, Im, Ip, Ll, Lc, Se, Sv, Tm, …).]
The significance test:
Takes H0 = no association between community and environment.
It is tested by obtaining an eigenvalue corresponding to the linkage in your actual data, then randomly shuffling the environmental data and re-calculating this eigenvalue. Repeat 200-1000 times, and see how your true value ranks among the randomly shuffled values.
This is a Monte-Carlo test - the way forward for inferential testing (IMHO).
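The shuffling logic is the same whatever the statistic. A generic sketch (my own illustration: `permutation_p` and the simple cross-product statistic below are stand-ins for CCA's eigenvalue-based test):

```python
import random

# Monte-Carlo (permutation) test: shuffle the environmental data many times,
# recompute the test statistic, and see where the true value ranks.
def permutation_p(statistic, community, environment, n=999, seed=1):
    rng = random.Random(seed)            # fixed seed for repeatability
    observed = statistic(community, environment)
    env = list(environment)
    hits = 0
    for _ in range(n):
        rng.shuffle(env)                 # destroy any real association
        if statistic(community, env) >= observed:
            hits += 1
    return (hits + 1) / (n + 1)          # rank of the true value among shuffles
```

With, say, a cross-product statistic and a genuine association, the observed value outranks nearly all shuffled values and the p-value is small; with no association it sits in the middle of the shuffled distribution.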