Ordinations
Peter Shaw
Introduction to ordination
“Ordination” is the term used for arranging multivariate data in a rational order, usually into a 2D space (i.e. a diagram on paper).
The underlying concept is that of taking a set of community data (or chemical, or physical…) and organising it into a picture that gives you insight into its organisation.
[Figure: a direct ordination, projecting communities onto time – the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958). The sequence runs from bare sand stabilised by Marram grass through cottonwood (Populus deltoides), pines (Pinus spp.) and black oak (Quercus velutina), plotted against years stable (0 to 10,000), moisture (wet / muddy / dry) and elevation (177–193 m).]
[Figure: vegetation types (annual grassland, valley-foothill hardwood, chaparral, montane hardwood, Ponderosa pine, mixed conifers, Jeffrey pine, pinyon, red fir, lodgepole pine, subalpine conifer, alpine dwarf scrub, wet meadows) laid out against elevation (1000–4000 m) and a moist-to-dry gradient.]
A 2-D direct ordination, also called a mosaic diagram, in this case showing the distribution of vegetation types in relation to elevation and moisture in Sequoia National Park. This is an example of a direct ordination, laying out communities in relation to two well-understood axes of variation. Redrawn from Vankat (1982) with kind permission of the California Botanical Society.
Bray-Curtis ordination
This technique is good for introducing the concept of ordination, but is almost never used nowadays. It dates back to the late 1950s, when computers were unavailable, and had the advantage that it could be run by hand.
3 steps:
1: Convert raw data to a matrix of distances between samples.
2: Identify end points.
3: Plot each sample on a graph in relation to the end points.
Sample data - a succession:

Year    A    B    C
  1   100    0    0
  2    90   10    0
  3    80   20    5
  4    60   35   10
  5    50   50   20
  6    40   60   30
  7    20   30   40
  8     5   20   60
  9     0   10   75
 10     0    0   90
Choose a measure of distance between years. The usual one is the Bray-Curtis index (the Czekanowski index).
Between years 1 & 2:

            A    B    C
Y1        100    0    0
Y2         90   10    0
Minimum    90    0    0   total =  90
Sum       190   10    0   total = 200

distance = 1 - 2*90/200 = 0.1
        1    2    3    4    5    6    7    8    9   10
 1   0.00
 2   0.10 0.00
 3   0.29 0.12 0.00
 4   0.41 0.28 0.15 0.00
 5   0.54 0.45 0.33 0.16 0.00
 6   0.65 0.50 0.45 0.28 0.12 0.00
 7   0.79 0.68 0.54 0.38 0.33 0.27 0.00
 8   0.95 0.84 0.68 0.63 0.56 0.49 0.26 0.00
 9   1.00 0.89 0.74 0.79 0.71 0.63 0.43 0.18 0.00
10   1.00 1.00 0.95 0.90 0.81 0.73 0.56 0.33 0.14 0.00
The matrix of B-C distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown.
PS: I did this by hand. PPS: it took about 30 minutes!
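The same calculation is easy to script. A minimal sketch in Python (not part of the original handout; the function name `bray_curtis` is my own). Since the matrix above was done by hand, a few of its entries may differ slightly from the scripted values:

```python
# Bray-Curtis distance: d = 1 - 2*sum(min(x_i, y_i)) / (sum(x) + sum(y))
def bray_curtis(x, y):
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))

# The successional sample data (years 1-10; species A, B, C)
years = [
    [100, 0, 0], [90, 10, 0], [80, 20, 5], [60, 35, 10], [50, 50, 20],
    [40, 60, 30], [20, 30, 40], [5, 20, 60], [0, 10, 75], [0, 0, 90],
]

# Print the lower triangle of the distance matrix
for i in range(len(years)):
    print(i + 1, [round(bray_curtis(years[i], years[j]), 2) for j in range(i + 1)])
```

Note that years with no species in common (e.g. year 1 vs year 10) come out at the maximum distance of 1.0, exactly as in the hand-worked matrix.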
Now establish the end points - this is a subjective choice; I chose years 1 and 10 as being extreme values.
Now draw a line with 1 at one end, 10 at the other. The length of this line = a distance of 1.0, based on the matrix above.
[Figure: years 1 and 10 placed at the ends of a line of length 1.0. Year 2 lies 0.1 units from year 1 and 1.0 units from year 10, so it is located by drawing two circles (radius 0.1 around year 1, radius 1.0 around year 10) and placing it at their intersection. Repeating this for each year gives the final ordination of the points (approx.), with years 2–9 arching between the two end points.]
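The two-circles construction can also be done algebraically: put the end points on a baseline and intersect the circles. A short sketch (the helper `locate` is my own, not part of the handout):

```python
import math

# Place year 1 at (0, 0) and year 10 at (L, 0). A sample at distance d1 from
# year 1 and d10 from year 10 sits where the two circles intersect.
def locate(d1, d10, L=1.0):
    x = (d1**2 - d10**2 + L**2) / (2 * L)
    y_sq = d1**2 - x**2
    y = math.sqrt(max(y_sq, 0.0))  # clamp: hand-measured distances may not quite meet
    return x, y

x, y = locate(0.1, 1.0)  # year 2: 0.1 from year 1, 1.0 from year 10
```

The returned point is exactly 0.1 from one end and 1.0 from the other, just as the compasses would place it.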
Principal Components Analysis (PCA)
This is our first true multivariate technique, and is one of my favourites. It is fairly close to multiple linear regression, with one big conceptual difference (and a hugely different output).
The difference lies in the fitting of residuals: MLR fits them vertically to one special variable (Y, the dependent); PCA fits them orthogonally - all variables are equally important, and there is no dependent.
[Diagram: MLR fits residuals parallel to the Y axis in (Y, X1, X2) space; PCA fits them perpendicular to the fitted line in (V1, V2, V3) space.]
Conceptually the intentions are very different: MLR seeks to find the best hyper-plane through the data. PCA [can be thought of as] starting off by fitting one best LINE through the cloud of datapoints, like shining a laser beam through an array of tethered balloons. This line passes through the middle of the dataset (defined as the mean of each variable), and runs along the axis which explains the greatest amount of variation within the data.
This first line is known as the first principal axis – it is the most useful single line that can be fitted through the data (formally, the linear combination of variables with the greatest variance).
[Figure: the 1st principal axis of a dataset – shown as a laser in a room of tethered balloons, passing through the overall mean of the dataset.]
Often the first axis of a dataset will be an obvious simple source of variation: overall size (body/catchment size etc) for allometric data, experimental treatment in a well designed experiment.
A good indicator of the importance of a principal axis is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with 10-20 variables you expect c. 30% of variation on the 1st principal axis.
Having fitted one principal axis, you can fit a 2nd. It will explain less variance than the 1st axis.
This 2nd axis explains the maximum possible variation, subject to 2 constraints:
1: it is orthogonal to (at 90 degrees to) the 1st principal axis
2: it runs through the mean of the dataset.
[Figure: the 2nd principal axis of a dataset, shown in blue, also passing through the overall mean. The diagram shows a 3D dataspace (axes V1, V2 and V3) for the data below.]

V1: 1.0  0.3  0.5  1.3  1.5  2.5  2.5  3.4
V2: 0.1  1.0  2.0  0.8  1.5  2.2  2.8  3.4
V3: 0.3  2.1  2.0  1.0  1.2  2.5  2.5  3.5
We can now cast a shadow..
The first 2 principal axes define a 2D plane, on which the positions of all the datapoints can be projected.
Note that this is exactly the same as casting a shadow.
Thus PCA allows us to examine the shadow of a high-dimensional object, say a 30-dimensional dataspace (defined by 30 variables such as species density).
Such a projection is a basic ordination diagram.
I always use such diagrams when exploring data – the scattergraph of the 1st 2 principal axes of a dataset is the most genuinely informative description of the entire data.
More axes:
A PCA can generate as many principal axes as there are axes in the original dataspace, but each one is progressively less important than the one preceding it.
In my experience the first axis is usually intelligible (often blindingly obvious), the second often useful, the third rarely useful; 4th and above always seem to be random noise.
To understand a PCA output.. Inspecting the scattergraph is useful, but you need to know the importance of each variable with respect to each axis. This is the gradient of the axis with respect to that variable.
This is given as standard in PCA output, but you need to know how the numbers are arrived at to know what the gibberish means!
How it’s done..
1: derive the matrix of all correlation coefficients – the correlation matrix. Note the similarity to Bray-Curtis ordination: we start with N columns of data, then derive an N*N matrix informing us about the relationship between each pair of columns.
2: Derive the eigenvectors and eigenvalues of this matrix. It turns out that MOST multivariate techniques involve eigenvector analysis, so you may as well get used to the term!
Eigenvectors 1: These involve you knowing a little about matrix multiplication. Matrix multiplication is essentially the same as solving a shopping bill!
I have 2 apples, 1 banana and 3 oranges. You have 1 apple, 2 bananas and 3 oranges. Costs: 10p per apple, 15p per banana, 20p per orange.
I pay 2*10 + 1*15 + 3*20 = 95p; you pay 1*10 + 2*15 + 3*20 = 100p.

    [ 2 1 3 ]   [ 10 ]   [  95 ]    my amounts (top row), your amounts (bottom row),
    [ 1 2 3 ] x [ 15 ] = [ 100 ]    giving my total (top) and your total (bottom)
                [ 20 ]

Such a calculation is called a linear combination.
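The shopping bill can be written out directly as code (a minimal sketch, not part of the handout):

```python
# The shopping bill as a matrix product: a 2x3 matrix of amounts times a
# 3x1 vector of prices gives a 2x1 vector of totals (linear combinations).
amounts = [[2, 1, 3],   # my basket: apples, bananas, oranges
           [1, 2, 3]]   # your basket
prices = [10, 15, 20]   # pence per apple, banana, orange

totals = [sum(a * p for a, p in zip(row, prices)) for row in amounts]
print(totals)  # [95, 100]
```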
Eigenvectors 2: These are intimately linked with matrix multiplication. You don't have to know this bit, but it would help!
Take an [N*N] matrix M, and use it to multiply an [N*1] vector of 1's. This gives you a new [N*1] vector of different numbers; call this V1. Multiply V1 by M to get V2; multiply V2 by M to get V3, etc. After infinite repetitions the elements of V will settle down to a steady pattern – this is the dominant eigenvector of the matrix M, E1.
Each time V is multiplied by M it grows by a constant multiple, which is the first eigenvalue of M, λ1.
[Diagram: M * [1 1 1]' = V1; M * V1 = V2; M * V2 = V3 … After a while, each successive multiplication preserves the shape (the eigenvector) while increasing the values by a constant multiple (the eigenvalue).]
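This repeated multiplication is the classic "power iteration". A short sketch (my own illustration, with rescaling added so the numbers do not blow up during the infinite repetitions):

```python
# Power iteration: start from a vector of 1's and multiply repeatedly by M.
# The direction settles to the dominant eigenvector; the per-step growth
# factor settles to the dominant eigenvalue.
def power_iteration(M, steps=50):
    v = [1.0] * len(M)
    growth = 0.0
    for _ in range(steps):
        w = [sum(m * x for m, x in zip(row, v)) for row in M]
        growth = max(abs(x) for x in w)   # growth per multiplication
        v = [x / growth for x in w]       # rescale so values stay finite
    return growth, v

# Example: this matrix has eigenvalues 3 and 1, so the iteration finds 3
value, vector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```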
The projections are made by multiplying the source data by the corresponding eigenvector elements, then adding these together. Thus the projection of site A on the first principal axis is based on the calculation:

(spp1A*V11 + spp2A*V21 + spp3A*V31 …)

where spp1A = number of species 1 at site A, etc., and V21 = 1st eigenvector element for species 2.
Luckily the PC does this for you. There is one added complication: you do not usually use raw data in the above calculation. It is possible to do so, but the resulting scores are heavily dominated by the commonest species.
Instead all species data are first converted to Z scores, so that mean = 0.00 and sd = 1.00. This means that principal axes, which always run through the mean, are always centred on the origin 0,0,0,0,… It also means that typically half the numbers in a PCA output are negative, and all are apparently unintelligible!
Formally:
1: Derive the eigenvectors of the correlation matrix. Call the 1st one E1.
2: Convert all raw data into Z scores (mean = 0, sd = 1.0).
3: 1st axis scores = Z * E1 (where Z is an R*N matrix of R observations on N variables, and E1 is an N*1 vector). 2nd axis scores = Z*E2, etc.
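The three formal steps translate almost line-for-line into code. A sketch assuming numpy is available (`pca_scores` is my own name; real packages differ in sign conventions and scaling, and this assumes no constant columns):

```python
import numpy as np

def pca_scores(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # step 2: convert to Z scores
    R = np.corrcoef(X, rowvar=False)           # step 1: correlation matrix
    evals, evecs = np.linalg.eigh(R)           # eigenvalues/vectors (ascending)
    order = np.argsort(evals)[::-1]            # re-order: largest variance first
    evals, evecs = evals[order], evecs[:, order]
    scores = Z @ evecs                         # step 3: axis scores = Z * E
    pct_var = 100 * evals / evals.sum()        # % variance on each axis
    return scores, evecs, pct_var
```

The sign of each eigenvector (and hence each axis of scores) is arbitrary, so different packages can produce mirror-image diagrams of the same data.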
The important part of a PCA is interpreting the output! DON'T PANIC!!
Stage 1: look at the variance on the 1st axis. Values above 50% are hopeful; values in the 20s suggest a poor or noisy dataset. There is no significance test, but referral to a table of predicted values from the broken stick distribution is often helpful.
Stage 2: look at the eigenvector elements on the first axis – which species / variables have large loadings (positive or negative)? Do some variables have opposed loadings (one very +ve, one very -ve), suggesting a gradient between the two extremes?
The actual values of the eigenvector elements are meaningless – it is their pattern which matters. In fact the sign of the elements will sometimes differ between packages if they use different algorithms. It's OK! The diagrams will look the same; it's just that the pattern of the points will be reversed.
Inspect the diagrams
Stage 3: plot the scattergraph of the ordination scores and have a look.
This is the shadow of the dataset. Look for clusters, outliers, gradients.
Each point is one observation, so you can identify odd points and check the values in them. (Good way to pick up typing errors).
Overlay the graph with various markers – often this picks out important trends.
Data on Collembola of industrial waste sites.
The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.
[Figure: PCA ordination of Collembola succession on PFA sites – first principal axis against second principal axis, points marked by habitat: scrub woods vs open lagoon.]
[Figure: PCA ordination re-plotted to highlight succession in the Tilbury trials – first principal axis against second principal axis, points marked by site age in years (4.00–7.00).]
Biplots
Since there is an intimate connection between eigenvector loadings and axis scores, it is helpful to inspect them together.
There is an elegant solution here, known as a biplot.
You plot the site scores as points on a graph, and put eigenvector elements on the same graph (usually as arrows).
[Figure: pond community handout data – site scores plotted by SPSS (REGR factor score 1 against REGR factor score 2), points marked by WATER (1.00 / .00). These are the new variables which appear after PCA: FAC1 and FAC2. Note aquatic sites at the left-hand side.]
[Figure: factor scores (= eigenvector elements) for each species in the pond dataset, AX1 against AX2: potentil, potamoge, epilob, ranuscle, phragaus. Note aquatic species at the left-hand side.]
[Figure: pond community handout data – the biplot: site scores (marked by WATER) overlaid with species arrows (Potamog., Phrag., Epilobium, Ranunc., Potent.). Note aquatic sites and species at the left-hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated.]
Other ordination techniques: Correspondence Analysis (CA), also known as Reciprocal Averaging (RA)
This technique is widely used, and relates to the notion of the weighted mean.
1: Take an N*R matrix of sites*species data, and calculate the weighted mean score for each site, giving each species an arbitrary weighting. Now use the site scores to get a weighted mean score for each species.
2: Repeat stage 1 until the scores stabilise.
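The two-step averaging loop can be sketched in a few lines. This is a stripped-down illustration of my own (it assumes non-negative abundances with no empty rows or columns; real CA also centres the scores and yields eigenvalues):

```python
# Reciprocal averaging: alternate weighted averaging of species scores and
# site scores, rescaling each pass, until the pattern stabilises.
def reciprocal_averaging(A, iters=100):
    # A[i][j] = abundance of species j at site i
    n_sites, n_spp = len(A), len(A[0])
    spp = [float(j) for j in range(n_spp)]        # arbitrary starting weights
    site = [0.0] * n_sites
    for _ in range(iters):
        # each site score = abundance-weighted mean of its species' scores
        site = [sum(A[i][j] * spp[j] for j in range(n_spp)) / sum(A[i])
                for i in range(n_sites)]
        # each species score = abundance-weighted mean of its sites' scores
        spp = [sum(A[i][j] * site[i] for i in range(n_sites)) /
               sum(A[i][j] for i in range(n_sites)) for j in range(n_spp)]
        lo, hi = min(spp), max(spp)
        spp = [(s - lo) / (hi - lo) for s in spp]  # rescale to the range 0-1
    return site, spp
```

Run on the idealised successional data from the Bray-Curtis section, the site scores come out in successional order, with early and late species at opposite ends of the species scale.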
CA has a useful feature:
Namely that it derives scores for species and sites at the same time. These can be displayed in a biplot, just as with PCA. This is conceptually simpler than a PCA biplot due to the simultaneous derivation of scores.
The formal algorithm involved here is almost the same as PCA – except that (in theory) you extract eigenvectors of the matrix of chi-squared distances between samples or species, instead of the correlation coefficients.
PCA and CA have an odd feature. This concerns the ordination diagram produced when analysing a succession, or a community which changes steadily along a transect in space.
[Diagram: Years 1 to 6 laid out in a straight line along Axis 1, with Axis 2 unused.]
You would expect this pattern – and would be wrong!!
[Figure: the idealised successional data (used in the Bray-Curtis ordination last week) – mean abundances of SPA, SPB and SPC plotted against years 1.00–10.00.]
[Figure: PCA ordination of the idealised successional data – the Horseshoe effect (arch distortion). Years 1.00–10.00 trace an arch through the plane of the first two axes (REGR factor scores 1 and 2) rather than a straight line.]
This horseshoe effect was a puzzle when 1st discovered..
Ideas were that it represented a bug in the techniques, or a hitherto undiscovered fundamental truth about ecological successions.
Neither is the case. The algorithm simply tells the truth! It’s just that humans are no good at visualising high-dimensional spaces. If they were they would know that successions must be arched in a space where distance = difference between sites.
Think about the difference between an early-successional site and a mid-successional one. It is likely to be absolute – no species in common. How about an early vs a late successional site? The same, not more.
So in a dataspace where separation = difference between sites:
[Diagram: sites 1–4 placed on an arc – adjacent sites (sites 2 and 3) are not far apart, while site 1 sits at the maximum separation from site 3 and from site 4 alike.]
The only way to ensure that the early-mid distance = the early-late distance is for the succession to be portrayed as a curve. It is!
[Diagram: Early, mid and Late stages lying on an arch in the Axis 1 × Axis 2 plane.]
This arch effect caught me out too..
In analysing the communities of fungi in pine forests (which roughly corresponded to a succession), I found that the 2nd axis of the PCA correlated significantly with diversity (Shannon index).
Getting my head round why the 2D projection of a 10D space should correlate with the information content of the community was not exactly easy! Well done to my supervisor for pointing out that the arch distortion and community diversity both peaked mid-succession.
[Diagram: the Ax1 × Ax2 ordination arches through Young, Medium and Old sites. HENCE: Ax2 peaks at intermediate site age. ALSO: diversity peaks at intermediate site age. HENCE: Ax2 correlates with diversity.]
DECORANA (or rather DCA)
This arch effect was a well-known irritant to ecologists by the mid 1970s, and was “sorted out” by a piece of software written by Mark Hill (then and still of ITE).
The program was called DECORANA, for DEtrended CORrespondence ANAlysis. The algorithm is properly called Detrended Correspondence Analysis or DCA - it is a common minor misnomer to confuse the algorithm with Mark Hill's FORTRAN code which implements it.
DCA contd.
In fact a minor bug has recently been found in the source code - it could invalidate a few analyses, but in practice seems of minor importance (it concerns the way eigenvalues are sorted when values are tied). Post-1997 releases of DCA should be safe. I still use an older version, with a reasonably clean conscience!
It is not supplied in any standard package, but is widely available in ecological packages - we have a copy in the dept.
DCA basics
The algorithm uses CA rather than PCA (faster extraction, plus simultaneous extraction of species and site scores). It is iterative: the 1st pass is simply a 2D CA ordination.
The 1st axis is then chopped up into N segments (default = 26), and each one is rescaled to have a common mean. The axis is also stretched at each end. This is to ensure that the extremes get equal representation (the technical term is “a constant turnover rate”).
[Diagram: before – the arch in the Ax1 × Ax2 plane is divided into segments 1–7; after – detrending and rescaling leave segments 1–7 spaced evenly along a single axis.]
DCA repeats this procedure until the iterations converge. This gives the 1st DCA axis. It has an eigenvalue, species scores and site scores just like CA - but don't ask what the numbers actually mean!!
The procedure is then repeated to get a 2nd axis - the same procedure, but the 1st axis is removed by subtraction at each stage.
DECORANA also gives a 3rd axis by the same algorithm, then stops. Note that the DCA algorithm could give N axes; as I said, 4th axes and above tend to be merely noise.
DECORANA output is presented as biplots, although the scores can be subjected to ANOVA etc. There is no hypothesis inherent in DCA, so no significance test is inherent either.
It is excellent for dealing with successions etc, and is many ecologists' ordination of choice.
I try to avoid it only because the input data format required (Cornell Condensed Format) is an unmitigated headache! (Unless you are happy with fixed columns and FORTRAN format statements.)
Cluster Analysis
"The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other statistical innovation (with the possible exception of multiple regression techniques)." Cormack (1971) A review of classification. Journal of the Royal Statistical Society A 134, 321-367.
Here the aim is to aggregate individuals into clusters, based on an objective estimate of the distance between them. The aim is to produce a dendrogram:
[Figure: nearest-neighbour cluster analysis of the Liphook fungi data – a dendrogram of OBS 01–OBS 35, scaled by distance (objective function, 3.1E+01 to 8.3E+05) and information remaining (100% down to 0%).]
This involves 2 choices, both allowing many options: 1: how to measure the distance? 2: what rules to build the dendrogram by?
In practice there are c. 3 standard distance measures and c. 5 standard sets of rules to build the dendrogram, giving 15 different algorithms for cluster analysis. In addition to these 15 different STANDARD ways of making a dendrogram, there are other options. One package (called CLUSTAN) offers a total of around 180 different algorithms.
But each different algorithm can generate a different shape of the dendrogram, a different pattern of relationships – and you have no way of knowing which one is most useful.
For a book I explored a small number of datasets in painful detail using various multivariate techniques, and they all gave the same basic story, identified the same extreme values and clusters.Except cluster analysis, which told me several different stories, all garbage!!
Worse, if you do happen to find a dendrogram sequence that makes sense – be careful, it may be lying to you! Dendrograms can be re-arranged around any joint (or node), and quite distantly related points can end up side-by-side.
[Diagram: drawn as D C B A, the dendrogram puts C and B next to each other (but connected only by a high-level node); re-drawn as B A D C, C and B are far apart – yet it is the same dendrogram. The 8 different ways of presenting the dendrogram of the quarry floor data (Usher 1975), using nearest neighbour analysis on a matrix of Euclidean distances: A B C D, B A C D, C D A B, D C A B, A B D C, B A D C, C D B A, D C B A.]
These patterns can re-arrange around any node, like a child's mobile.
The number of permutations of a dendrogram with n points is 2^(n-1)!!
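You can verify the 2^(n-1) count by brute force: enumerate every ordering reachable by flipping nodes. A small sketch of my own (representing a dendrogram as nested pairs):

```python
# Every internal node of a dendrogram can be flipped without changing the
# tree, so a dendrogram with n leaves (n-1 internal nodes) has 2**(n-1)
# valid left-to-right leaf orderings.
from itertools import product

def leaf_orders(tree):
    # tree is a leaf label, or a (left, right) pair of subtrees
    if not isinstance(tree, tuple):
        yield (tree,)
        return
    left, right = tree
    for l, r in product(leaf_orders(left), leaf_orders(right)):
        yield l + r   # node as drawn
        yield r + l   # node flipped

# ((A,B),(C,D)) has 4 leaves and 3 internal nodes: 2**3 = 8 orderings
orders = set(leaf_orders((("A", "B"), ("C", "D"))))
print(len(orders))  # 8
```

The 8 orderings produced match the 8 presentations of the quarry floor dendrogram above.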
This conjunction of S4 (top) and S1, S2, S3 below it was mentioned as showing the Lake Superior samples clustered together.
I have cut&pasted this dendrogram to show an alternative, valid arrangement, in which the Lake Superior samples are widely scattered.
The lesson here – ignore the ordering of objects along a cluster analysis axis! (Unlike all ordinations, in which order matters and is preserved under transformations.)
I know of at least one case of published work using a dendrogram to classify lake communities, where one stated discovery was that samples from one particular lake clustered together. It is true that they were next to each other on the published dendrogram, but there was a fault line down the middle of the cluster, which could have been re-arranged to show it as 2 widely spaced clusters!
My summary about cluster analysis – DON’T!! Ordinate it, or use TWINSPAN. I do have concern about the over-reliance of DNA researchers on dendrograms, although they seem to operate with fewer choices (3 distance measures, 6 ways to build up a tree) and usually publish topologically valid trees!
"The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis". Byron Morgan, personal communication.
TWINSPAN
TWINSPAN was written by Mark Hill (of ITE, now CEH, author of DECORANA) in 1979, and has DCA at its core. The aim is to produce a 2-way ordered classification of the data. This produces 2 linked dendrograms (one of species, one of sites), but these are fixed – they may not rotate, and in my experience they make a great deal of sense. It also shows you which species are most associated with which areas, and identifies indicator species for different parts of the dendrograms. Like DCA, this analysis has so far been confined to ecologists but deserves to be used far more widely.
Its technical details are a lecture in themselves – read it up well if you choose to use this.
TWO-WAY ORDERED TABLE
An ordered list of species against an ordered list of observations, with a dendrogram to classify species (codes at the right) and a dendrogram to classify observations (codes below the table):

                  1223331111222222 22333111 1 1
    52912516781345788906034049143235672 2
 2 CORTSEMI 455344---1-----1----2-------------- 000
 4 INOCLACE 413-----1-------------------------- 000
 6 LACTRUFU --5411-------------223------------- 000
 9 SUILLUTE 24423--1-3--11--------------------- 000
 1 BOLEFERR --1-2-22232224-2-22-1-2-----1------ 001
10 SUILVARI 555-2--4-44525552-33335------------ 001
 3 GOMPROSE 3325555555555555152444342---------- 01
 8 SUILBOVI 55555555555555553555555555--------- 01
 5 LACCPROX 54555455555535535555555555555542555 1
 7 PAXIINVO --221232--22212-1514244554122553345 1

   00000000000000000000000111111111111
   00000011111111111111111000111111111
   00011100000000001111111 000111111
CANOCO
CANOCO is a package, but just like DECORANA it has become associated with the algorithm it implements. To be correct, CANOCO implements several ordination algorithms, but the best known and most used is CCA = Canonical Correspondence Analysis.
CCA deals with the common ecological situation where you wish to relate community data to environmental data. Which spp are associated with which chemicals? Is the pattern random?
[Diagram: a sites × species abundance matrix analysed side-by-side with a matching sites × environment matrix (pH, Na, K, OM).]
Details of CCA are nightmarish, unless you enjoy advanced matrix algebra! The output is not much better! Luckily it can be converted into 2 user-friendly facets: a tri-plot, and a significance value.
[Figure: CCA triplot of Collembola succession on PFA in relation to soil conditions. Environmental arrows: LOI, P, pH, conductivity. Sites run with increasing age of PFA from fresh wet PFA through young dry PFA to woodland-stage PFA (Barking woods, Thurrock woods, Tilbury plots TilbZ, TilbO5, TilbC), with Thurrock's saline lagoon clearly separated; species are plotted as short codes (Ct, Bp, Em, Fc, Hv, Im, Ip, Ll, Lc, Se, Sv, Tm, …).]
The significance test:
Takes H0 = no association between community and environment.
It is tested by obtaining an eigenvalue corresponding to the linkage in your actual data, then randomly shuffling the environmental data and re-calculating this eigenvalue. Repeat 200-1000 times, and see how your true value ranks among the randomly shuffled values.
This is a Monte-Carlo test - the way forward for inferential testing (IMHO).
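The shuffling logic is the same whatever the statistic. A generic sketch (my own illustration: `permutation_p` and the simple cross-product statistic below are stand-ins for CCA's eigenvalue-based test):

```python
import random

# Monte-Carlo (permutation) test: shuffle the environmental data many times,
# recompute the test statistic, and see where the true value ranks.
def permutation_p(statistic, community, environment, n=999, seed=1):
    rng = random.Random(seed)            # fixed seed for repeatability
    observed = statistic(community, environment)
    env = list(environment)
    hits = 0
    for _ in range(n):
        rng.shuffle(env)                 # destroy any real association
        if statistic(community, env) >= observed:
            hits += 1
    return (hits + 1) / (n + 1)          # rank of the true value among shuffles
```

With, say, a cross-product statistic and a genuine association, the observed value outranks nearly all shuffled values and the p-value is small; with no association it sits in the middle of the shuffled distribution.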