Multivariate Ordination Macquarie University R users group workshop November 2015 Michael Strack and...
Multivariate Ordination
Macquarie University R users group workshop, November 2015
Michael Strack and Drew Allen
Workshop materials at (tiny link)
Disclaimer
• I am not an expert at any of this - please correct me if I say or do something stupid, or stop me if something isn’t clear
• We will skim over some details, but still, expect information overload. That’s OK. Don’t worry too much about the fine details of the code. Focus more on stats.
• I’ll spend more time on what these can do than what they shouldn’t do. Investigate the appropriateness of these techniques further!
• You may not be ready to apply these techniques immediately (you’ll need to read further), but this workshop should help a little
Structure
1. Introduction to ordination
2. Descriptive techniques
• PCA
• A quick overview of variations: CA, PCoA, NMDS
3. Canonical (constrained) techniques
• Conceptual introduction
• Example: CCA
• A quick overview of variations: RDA, CCA, dbRDA
4. (if time) Bonus round: Who’s afraid of linear algebra?
What is multivariate ordination?
• Multivariate = multiple variables (biological data almost always)
• Each environmental condition = 1 variable
• Each measured trait = 1 variable
• Abundance of each species = 1 variable
• Conceptually, each variable = 1 axis
• May have thousands of variables & axes… hard to analyse!
• Ordination = relative spatial positions
• Multivariate data = multidimensional data
• Spatial relationships between observations can highlight major patterns
• Spatial reasoning is intuitive!
• Crucial metric = distance
Level 1: univariate ordination
• One variable – or “axis”, or “dimension” – called X
• Two observations (a = x1, b = x2)
• What is the distance?
• Easy, since there’s only one dimension (univariate data)
• distance = x2 – x1 (or vice-versa)
[Figure: axis for variable X, with points x1 and x2]
Level 2: bivariate ordination
• Two variables: X, Y
• Two observations: a = (x1, y1), b = (x2, y2)
• What is the distance?
• Could use Pythagoras: distance = √((x2 – x1)² + (y2 – y1)²)
• OR draw a line & measure it
• We define this line as a new axis, because we can
• Variation in the data is conserved
• Dimensionality is reduced
• Analysis is simplified
• Real-world units are lost: interpretation is tricky!
Synthetic axis = Principal Components axis 1 (PCA1)
New coordinates use this system
[Figure: observations a and b plotted on the new axes]
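The rotation sketched above can be reproduced in a few lines of R with base prcomp() (a toy sketch with made-up data, not the workshop dataset; we’ll use {vegan}’s rda() later):

```r
# Two correlated variables: most of the variation lies along one
# diagonal "line" through the point cloud
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)

# PCA rotates the data onto new, orthogonal synthetic axes
pca <- prcomp(cbind(x, y), scale. = TRUE)

summary(pca)   # PC1 captures most of the variance, PC2 the rest
head(pca$x)    # the observations' coordinates on the new axes
```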
Principal Components Analysis (PCA)
• Congratulations, you just did your first (trivial) multivariate analysis: “principal components analysis” (PCA)
• We’ll focus on this for a while – many concepts are general to ordination
• ACTIVITY (10 minutes)
• Form small groups
• Play with this site: http://setosa.io/ev/principal-component-analysis/
• Observe how well PCA simplifies various bivariate relationships (linear, blob, parabola etc.)
• Similarly, observe which 3D alignment of the PCAs performs “best” – try to maximize variation on PCA1 and minimize that on PCA2
• Observe how the arch in the data limits your ability to simplify the variation
• Discuss
Some PCA Observations
• Axes are computed in order of most-to-least variance captured
• The first PC axis can have any “direction” through the data, but each subsequent axis is orthogonal (i.e. at right angles) to all other axes in multidimensional space
• PC axes are therefore “linearly independent” (like North-South/East-West)
• PCA is in a family of purely descriptive techniques
• No statistical hypotheses – that means no p-values, tests etc.
• However, PCA is most effective (in terms of accurately representing the greatest possible variance on the fewest possible axes) on multinormal & multilinear data
PCA – huh! – what is it good for?
• Data exploration, simplification, pattern picking
• PCA focuses on Euclidean (i.e. geometric) distance
• Therefore it will best capture linear relationships
• Think about your data!
• Community data (e.g. species counts) generally not suitable
• Environmental variables often work
• Imprecise (semi-quantitative) data work OK – a bit like Spearman’s correlation
• Even binary data apparently, though I have no experience with this
• Think about your data!
Finally, some R!
• Today’s program essentially lifted wholesale from Numerical Ecology with R
• Community ecology focus – but methods useful for many data types
• Q: Michael, how did you learn all this stuff anyway?
• A:
1. Pestering David Nipperess
2. Repeatedly bashing this book against my thick skull
3. Laborious websearching
4. Trial-and-error coding
• Happily, all the code and data may be found at http://adn.biol.umontreal.ca/~numericalecology/numecolR/ and in the directory I supplied
• If you want to learn more, I can highly recommend this as quite well-commented example code
Code time
1. Open the “multi_ord” script for your OS
2. Set the working directory to the source file location
3. Install & mount needed packages if you haven’t already
4. Import & prep data – Doubs River fish/environment dataset (?doubs)
5. Plot a map of this site for context
6. Run summary of environmental data for more context
7. Run PCA using the confusingly-named {vegan} function “rda”
• PCA variables must be dimensionally homogeneous or dimensionless • Argument scale = TRUE scales variables to unit variance (ASSUMES NORMALITY)
8. Plot results (both plots)
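Steps 4–8 boil down to something like the following sketch (assuming the Doubs data are loaded from {ade4} rather than the supplied directory; site 8 is dropped as in the book because no fish were caught there):

```r
library(vegan)
library(ade4)

data(doubs)              # Doubs River dataset: fish, env, spatial coords
env <- doubs$env[-8, ]   # drop site 8 (no fish recorded there)

# PCA via the confusingly-named rda(); scale = TRUE standardises
# each variable to unit variance (the variables have different units)
env.pca <- rda(env, scale = TRUE)

# the two plots, one per scaling
par(mfrow = c(1, 2))
biplot(env.pca, scaling = 1, main = "PCA - scaling 1 (distance)")
biplot(env.pca, scaling = 2, main = "PCA - scaling 2 (correlation)")
```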
What am I looking at?
• NEED TO KNOW: “scaling” (not the scale argument!)
• Scaling 1: “distance”, focuses on distances between objects
• Scaling 2: “correlation”, focuses on correlations between descriptors (variables), represented by the angles of the vectors
• Can’t have your cake and eat it, unfortunately.
• Note the slight curve in the positions of the sites – an artefact called the “horseshoe effect”, arising from long gradients
• This one is not particularly serious. If you see a stronger shape, stop your analysis and consult further.
• More info: http://ordination.okstate.edu/PCA.htm
So what?
• Let’s interpret the plots (ideally we should check some boring things first, but let’s live dangerously for now)
Interpretation rules!
• Projecting an object at right angles (90°) onto a descriptor vector approximates its value along that descriptor
• Scaling = 1 (distance)
• Eigenvectors are scaled to unit length
• Distances among objects are approximately conserved
• Angles among descriptor variables are meaningless
• Scaling = 2 (correlation)
• Eigenvectors are scaled to √λ (the square root of their eigenvalue). Longer vectors = more contribution
• Distances among objects are not well conserved!
• Angles between descriptors reflect their correlations
Quality control (ideally precedes inference!)
• Run the first “summary” function (scaling = 2 i.e. correlation)
• The {vegan} idiom:
• “Inertia” = variance. Note that since we scaled all variables to unit variance, the inertia in this case is just the number of variables (11)
• Unconstrained vs. constrained: descriptive vs. tested statistical techniques (relevant later). PCA is a descriptive, unconstrained technique.
• Eigenvalues (λj): measure of inertia captured by each axis
• “Species scores”: {vegan} refers to all response variables as “species” for historical reasons. These scores are the coordinates of the variable vector arrowheads (refer to your plots)
• “Site scores”: coordinates of objects/observations/samples.
How many PCA axes to interpret?
• source() the script “evplot.R” and run it on the PCA object
• Two criteria:
• Kaiser-Guttman: based on the mean eigenvalue of the axes
• Broken stick: based on randomly dividing a “stick” of unit length
• Specific choice probably not that important (but I prefer broken stick)
• Also implemented in function “PCAsignificance” of package {BiodiversityR}
• Bottom line:
• Higher eigenvalues in fewer axes means your PCA is working well, in the sense of the “simplification/variance captured” trade-off
• Don’t interpret axes with small eigenvalues
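If you don’t have evplot.R to hand, both criteria are easy to compute by hand from the eigenvalues (a sketch, assuming the Doubs environmental PCA from {ade4} data as before):

```r
library(vegan)
library(ade4)

data(doubs)
env.pca <- rda(doubs$env[-8, ], scale = TRUE)

# eigenvalues of an unconstrained {vegan} ordination live in $CA$eig
ev <- env.pca$CA$eig

# Kaiser-Guttman: interpret axes whose eigenvalue exceeds the mean
names(ev)[ev > mean(ev)]

# Broken stick: expected proportion of inertia per axis if a stick
# of unit length were broken at random into length(ev) pieces
n <- length(ev)
bsm <- rev(cumsum(1 / (n:1))) / n

# interpret axes whose observed proportion beats the broken-stick model
names(ev)[ev / sum(ev) > bsm]
```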
Questions?
It doesn’t have to be Euclidean
• PCA is a member of a family of unconstrained ordination techniques
• Conceptually, Euclidean geometry is not the only way to define “distance”. Several variations:
• Correspondence analysis (CA): uses χ2 distance, works on count or frequency-like data e.g. species data tables
• Principal coordinate analysis or “classic” multidimensional scaling (PCoA or MDS): works on any distance measure e.g. similarity indices (Bray-Curtis, Jaccard etc.)
• Non-metric multidimensional scaling (NMDS): rather different! Tries to preserve the rank-order relationships between observations when represented in 2D – therefore best for cluster-type analyses
• Multiple correspondence analysis (MCA) is like PCA but explicitly for categorical variables, implemented as function MCA() in package {FactoMineR}
Correspondence analysis (CA)
• Preserves χ2 distance
• Use-case:
• Frequency or frequency-like data
• Dimensionally homogeneous
• Non-negative
• All of the above apply to e.g. species counts or presence-absence data
• Think about your data
• Absolute inertia works differently, but the proportion per axis is still the key metric for QC/interpretation
• I haven’t really got any experience with this…
• But let’s give it a go
Code time
1. Run CA on the Doubs fish species data (not environmental variables!) using {vegan} function cca()
• {ade4} has an identically-named function, but we’ll use {vegan}
2. Examine the summary using both scalings. You’ll recall:
• Scaling = 1 is object-focused (here sites)
• Scaling = 2 is correlation-focused (here species)
• More later
3. Run function evplot() to see which axes to interpret
4. Note the extreme dominance of the first axis – in CA, eigenvalues > 0.6 indicate strong gradients
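Step 1 looks roughly like this (a sketch; the fish table is doubs$fish in {ade4}, and site 8 must be dropped first or cca() will fail on its all-zero row):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]   # fish abundances; site 8 has no fish

# CA of the species data with {vegan}'s cca(); a single matrix
# (no constraints) gives an unconstrained CA
spe.ca <- cca(spe)

summary(spe.ca)               # default scaling = 2 (species-focused)
summary(spe.ca, scaling = 1)  # scaling = 1 (site-focused)
```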
Plotting & interpretation
1. Plot the CA (note that several plotting functions can be used for ordinations) with both scalings
2. Interpretation – river sites are numbered 1:30 upstream:downstream
3. CA can produce an “arch effect”
• Similar to (but generally not as extreme as) the “horseshoe effect” from PCA
• Arises from unimodal species abundances on gradients and observing long gradients
• Corrections (“detrending” and detrended correspondence analysis or DCA) exist but are apparently problematic?
Interpretation rules!
• Scaling = 1
• χ2 distance among objects (sites)
• Close objects = likely to have similar relative frequencies of species
• Objects near a species are likely to contain a high contribution of that species (or a “1” if data are presence/absence)
• Scaling = 2
• χ2 distance among species
• Close species = likely to have similar relative frequencies in objects
• A species near an object is more likely to be relatively abundant (or present if 1/0 data) in that object
Principal Coordinates Analysis (PCoA)
• Provides a Euclidean (i.e. geometry as we know it) representation of non-Euclidean distance measures
• Any measure of similarity/difference (difference = 1 – similarity & vice versa)
• Commonly used for community similarity indices (Bray-Curtis for instance)
• Another application: sample differences can be computed using dimensionally heterogeneous variables and Gower’s index, which is pretty cool
• If you use a Euclidean distance measure, PCoA is largely the same as PCA!
• Think about your data
• When comparing community or species data, the decision of when to use CA (χ2 distance) vs. PCoA (Bray-Curtis distance or other) is a subtle one and general recommendations are difficult
Code time – PCoA is a bit different
1. Compute Bray-Curtis dissimilarities for the fish data using vegdist()
2. Compute PCoA of the dissimilarities using cmdscale()
• Observe the argument k =
• Unlike other methods so far, cmdscale() lets you specify how many dimensions (k) you want the solution returned in (default = 2)
• An exact representation requires at most n – 1 dimensions, where n = number of observations (points)
• Argument eig = TRUE ensures that the function actually returns eigenvalues for the axes, which it omits by default
• Observe the warning: PCoA can return negative eigenvalues due to non-Euclidean geometry
• Generally these can be ignored, unless their magnitude is large (i.e. larger than the largest positive eigenvalue)
• Corrections exist, e.g. argument add = TRUE to function cmdscale()
• Plotting species or variables requires them to be projected onto the PCoA plots a posteriori using various methods (weighted averages, correlations with axes…)
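Put together, the PCoA steps look something like this (a sketch, assuming the Doubs fish data from {ade4} with site 8 dropped; wascores() is the {vegan} helper for the weighted-averages projection):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]

# 1. Bray-Curtis dissimilarities
spe.bray <- vegdist(spe, method = "bray")

# 2. PCoA; ask for the maximum n - 1 dimensions and keep eigenvalues
spe.pcoa <- cmdscale(spe.bray, k = nrow(spe) - 1, eig = TRUE)

# proportion of variation on the first two axes (positive eigenvalues)
spe.pcoa$eig[1:2] / sum(spe.pcoa$eig[spe.pcoa$eig > 0])

# plot sites, then project species a posteriori as weighted averages
ordiplot(spe.pcoa$points[, 1:2], type = "t")
sp.wa <- wascores(spe.pcoa$points[, 1:2], spe)
text(sp.wa, rownames(sp.wa), col = "red", cex = 0.7)
```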
Comparing Bray-Curtis PCoA and CA
1. Run the plotting code comparing the two side-by-side
2. Interpretation proceeds identically
3. In this case, which method seems more informative?
Nonmetric multidimensional scaling (NMDS)
• Fundamentally different: not eigenvalue-based but rather iterative, permutative shuffling to find an optimal ordination
• Focus on rank-order similarity rather than some measure of absolute distance
• Any distance measure
• Choose m dimensions for representation (obviously 2 is generally most useful)
• Can handle a few missing distance values
• Accuracy of representation measured by a “stress value” (0 to 1)
• Like PCoA, species or other variables are plotted a posteriori using weighted averages (wa)
Code time
1. Run NMDS on the Doubs fish data using metaMDS()
2. Examine the object and the “stress” component (stress should always be reported in publications)
3. Plot the ordination side-by-side with the Bray-Curtis PCoA
• Interpretation proceeds identically
• Compare – does one seem more informative than the other?
• NMDS axes are dimensionless
4. There are further diagnostics (Shepard and goodness-of-fit plots) which we won’t go into
5. Also some vagaries of the fact that it’s an iterative method… it can converge on local minima rather than the global minimum etc. More reading!
6. ALSO also, since it’s good for clustering, you can add clusters to the plot a posteriori (data not shown)
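Steps 1–3 in outline (a sketch under the same Doubs assumptions; metaMDS() computes the Bray-Curtis distances itself and runs multiple random starts to dodge local minima):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]

# NMDS in k = 2 dimensions on Bray-Curtis distances
spe.nmds <- metaMDS(spe, distance = "bray", k = 2, trace = FALSE)

spe.nmds$stress   # always report the stress value

plot(spe.nmds, type = "t",
     main = paste("NMDS, stress =", round(spe.nmds$stress, 3)))
```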
End of unconstrained ordination!
• Think about your data
• Lots of choice
• Subtle differences can be important
• Confusingly, some ecological distance measures are fully Euclidean (e.g. Jaccard, Sorensen) – PCA on the distance matrix is appropriate!
• There’s a fairly nuanced and comprehensive discussion under “Additional reading” at http://econ.upf.edu/~michael/stanford/
Let’s take a break!
Constrained (or canonical) ordination
• Serious business: formal statistical test of the relationship between two data matrices e.g. species/environmental conditions
• p-values, R2, variation partitioning etc.
• Another large family of techniques – only time for a couple today!
• Asymmetrical methods
• Redundancy analysis (RDA) and distance-based redundancy analysis (dbRDA)
• Canonical correspondence analysis (CCA)
• Linear discriminant analysis (LDA)
• Symmetrical methods
• Canonical correlation analysis (CCorA)
• Co-inertia analysis (CoIA)
• Multiple factor analysis (MFA)
Redundancy analysis (RDA)
• Essentially this is multivariate linear regression (multiple explanatory & multiple response variables), achieved with a PCA
• Ordination of the response variables is constrained by the matrix of explanatory variables
• Let’s see how this works…
Don’t worry if you don’t fully understand this – I don’t, either…
As you can see!
OK, maybe we’ll just learn by doing…
1. Run the prep code (don’t worry too much about it)
2. Create the redundancy analysis using function rda()
3. Examine the summary (default scaling = 2, correlations)
• Note that the total inertia is now partitioned:
• Constrained inertia = variance in the response variables modelled by variance in the explanatory variables. Like an R2, and biased like the R2 of multiple regression. Correction later.
• Unconstrained = variance represented by unconstrained PCA (i.e. residuals)
• Note the “proportion explained” for each axis (constrained axes = RDA1:12, unconstrained = PCA1:16)
• We included factors (yes, you can use categorical variables!) which, rather than a vector, have a “centroid” to describe their relationship with sites/species
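The core of the RDA looks roughly like this (a sketch: the Hellinger transformation of the abundances is the prep step used in Numerical Ecology with R, and the plain doubs$env table stands in for the workshop’s prepped explanatory matrix with its recoded slope factor):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]
env <- doubs$env[-8, ]

# Hellinger-transform the abundances (makes species data suitable
# for a Euclidean-distance method like RDA)
spe.hel <- decostand(spe, method = "hellinger")

# RDA of the species data on ALL environmental variables
spe.rda <- rda(spe.hel ~ ., data = env)

summary(spe.rda)   # note the constrained vs. unconstrained inertia
```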
More summary() info…
• Note the two possible coordinate systems for plotting the sites:
• “weighted sums of species scores”
• “linear combinations of constraining variables”
• There’s apparently no universal best answer… we’ll use the weighted sums (confusingly abbreviated “wa”) and not worry too much about it
• We should now proceed to QC
• Significance testing
• Unbiased R2
• But let’s be naughty again and look at a pretty picture
Very good, but what does that look like?
1. Run the “triplot” code for each scaling (remember, 1 = distance, 2 = correlations!)
2. Note the factor variables (penlow:penvery_steep) don’t have vector arrows: they are centroids.
3. Interpretation similar to PCA, but additional rules…
Interpretation rules! (additional to PCA)
• Centroids for an explanatory variable (e.g. environmental variable) may be projected at right angles onto a response variable vector (e.g. a species) to ascertain their relationship
• Scaling = 1 (distance)
• Angles between response and explanatory variables reflect correlations – but not angles among response variables
• Distances among centroids, and between centroids and objects, approximate their Euclidean distances
• Scaling = 2 (correlations)
• All angles among and between response and explanatory variables reflect their correlations
• Distances among and between centroids and objects do not approximate their Euclidean distances
OK, better check how good the model is
1. In the QC code, check the adjusted R2 – does this seem OK for noisy ecological data?
2. Global significance test. It’s not really an ANOVA – it’s a permutation test (of an F-statistic)
3. Axis-wise significance test. What if there are more than 2 interesting/significant axes?
4. Variance inflation factors (VIF): values higher than 10 indicate substantial collinearity. What should we do?
• Parsimony will make this easier to interpret. Let’s learn that with CCA
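In {vegan} the QC steps map onto these calls (a sketch, recreating the Hellinger-transformed Doubs RDA under the same {ade4} assumptions as before):

```r
library(vegan)
library(ade4)

data(doubs)
spe.hel <- decostand(doubs$fish[-8, ], method = "hellinger")
env <- doubs$env[-8, ]
spe.rda <- rda(spe.hel ~ ., data = env)

# 1. adjusted (unbiased) R2
RsquareAdj(spe.rda)

# 2. global test ("ANOVA-like", but permutational)
anova(spe.rda, permutations = 999)

# 3. axis-wise permutation tests
anova(spe.rda, by = "axis", permutations = 999)

# 4. variance inflation factors; values > 10 flag collinearity
vif.cca(spe.rda)
```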
Canonical correspondence analysis (CCA)
• Similar to RDA, but using χ2 distance and CA instead of Euclidean distance and PCA
• Let’s dive right into the code
1. Create the CCA object
2. Examine the summary (default scaling = 2)
• Variation is expressed as the “mean squared contingency coefficient” instead of inertia. Biased upwards but cannot be easily adjusted
3. Run the first 2 plots
4. Whoa! Let’s use the display = argument to remove some stuff
5. Run the next 2 plots
Interpretation rules
• Scaling 1 (distance)
• Projecting an object at a right angle onto a quantitative explanatory variable approximates the value of that object
• An object found near the point representing the centroid of a qualitative explanatory variable is more likely to possess state “1” for that variable
• Distances among centroids of qualitative explanatory variables, and between centroids and individual objects, approximate χ2 distances
• Scaling 2 (correlation)
• The optimum of a species along a quantitative environmental variable can be obtained by projecting the species at a right angle onto the variable
• A species found near the centroid of a qualitative environmental variable is likely to be found frequently (or in larger abundances) in the sites possessing state “1” for that variable
• Distances among centroids, and between centroids and individual objects, do not approximate χ2 distances
Let’s do some QC
1. Run the permutation tests of model and axis significance
• We have a good model but interpretation is a little tricky because there are four significant axes. Pairwise plotting is not really feasible.
• Let’s find a parsimonious model using ordistep() automated stepwise parameter selection
2. Run the parameter selection code
3. Run the significance testing
4. Compare the VIFs
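The selection steps look roughly like this (a sketch under the same Doubs assumptions; ordistep() starts from an intercept-only model and adds/drops terms based on permutation tests):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]
env <- doubs$env[-8, ]

spe.cca <- cca(spe ~ ., data = env)   # full model

# stepwise selection towards a parsimonious model
mod0 <- cca(spe ~ 1, data = env)      # intercept-only starting point
spe.cca.pars <- ordistep(mod0, scope = formula(spe.cca),
                         direction = "both")

# re-test significance and compare VIFs
anova(spe.cca.pars, permutations = 999)
vif.cca(spe.cca.pars)
```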
The joy of parsimony
• These are powerful techniques and there’s often too much they can say about the data; it’s hard to know where to start or stop
• A parsimonious model finds a manageable number of variables which pick out the main patterns in the data
• It’s a balance between statistical and inferential power
• As always, depends on the question/hypothesis – ideally a priori
• Beware stepwise selection, the data mining & multiple comparisons problem, and selective reporting (though many of us are guilty of it…)
• With great statistical power comes great responsibility
Other techniques (not shown)
• Partial ordination: include a matrix of “covariables” whose influence on the dependent variables will be “partialled out”, leaving the “pure” or independent effect of the explanatory matrix
• RDA only:
• Variation partitioning: for up to four explanatory matrices, how much does each one contribute to the explained variance of the dependent matrix?
• Multivariate analysis of variance (MANOVA)
• Distance-based redundancy analysis (dbRDA) basically adapts RDA to use any distance measure (e.g. Bray-Curtis) via an intermediate PCoA in place of PCA
• Linear discriminant analysis (LDA) tests how much of a pre-defined grouping is explained by a matrix of independent variables
• Symmetrical techniques: no “explanatory” or “dependent” variables, only correlations…
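For reference, dbRDA is a one-liner in {vegan} via capscale() (a sketch under the same Doubs assumptions; using all environmental variables here is purely for illustration):

```r
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]
env <- doubs$env[-8, ]

# dbRDA: constrained ordination of Bray-Curtis distances, via an
# intermediate PCoA (capscale handles this internally)
spe.dbrda <- capscale(spe ~ ., data = env, distance = "bray")

anova(spe.dbrda, permutations = 999)
```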
Bonus round: who’s afraid of linear algebra?
• Play with this: http://setosa.io/ev/eigenvectors-and-eigenvalues/
• Clear as mud?
Who wants to do Hacky Hour at the pub?