PLS - Discriminant Analysis (PLS-DA) and Sparse PLS-DA
Transcript of PLS - Discriminant Analysis (PLS-DA) and Sparse PLS-DA
B. Liquet1,2,3
1 LMAP, Université de Pau et des Pays de l'Adour
2 UPPA-E2S
3 ACEMS, QUT, Australia
Introduction
PLSDA overview
[Overview diagram: PLS-DA links a quantitative data matrix X to a qualitative response Y.]
Biological question
I am analysing a single data set (e.g. transcriptomics data) and I would like to classify my samples into known groups and predict the class of new samples. In addition, I am interested in identifying the key variables that drive such discrimination.
The srbct study
The data are directly available in a processed and normalised format from the mixOmics package. The Small Round Blue Cell Tumours (SRBCT) dataset includes the expression levels of 2,308 genes measured on 63 samples.
- The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
gene: a data frame with 63 rows and 2,308 columns. The expression levels of 2,308 genes in 63 subjects.
class: a class vector containing the tumour class of each individual (4 classes in total).
gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.
To illustrate PLS-DA, we will analyse the gene expression levels of srbct$gene to discriminate the 4 groups of tumours.
Principle of sparse PLS-DA
Although Partial Least Squares was not originally designed for classification and discrimination problems, it has often been used for that purpose.
The response matrix `Y` is qualitative and is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each of the response categories is coded via an indicator variable.
The PLS regression (now PLS-DA) is then run as if `Y` were a continuous matrix.
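As a minimal base-R sketch of this recoding (not the mixOmics internal code; the toy class labels are ours), the dummy block matrix can be built with model.matrix:

```r
# Recode a qualitative response as a dummy (indicator) block matrix.
# The toy class labels below are illustrative, not the srbct data.
Y.toy <- factor(c("EWS", "BL", "NB", "BL"))
Y.dummy <- model.matrix(~ Y.toy - 1)   # one indicator column per class
colnames(Y.dummy) <- levels(Y.toy)
Y.dummy
# each row contains a single 1 marking the class membership of that sample
```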
Sparse PLS-DA performs variable selection and classification in a one-step procedure. sPLS-DA is a special case of the sparse PLS described previously, where the ℓ1 penalisation is applied on the loading vectors associated with the X data set.
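The sparsity-inducing effect of the ℓ1 penalty can be illustrated with soft-thresholding, a minimal base-R sketch (the loading values and penalty level below are ours, not taken from the sPLS-DA algorithm output):

```r
# Soft-thresholding: how an l1 penalty shrinks a loading vector.
# Values and penalty level are illustrative.
soft_threshold <- function(a, lambda) {
  sign(a) * pmax(abs(a) - lambda, 0)
}

a <- c(0.9, -0.05, 0.3, 0.01, -0.6)   # a dense loading vector
soft_threshold(a, lambda = 0.1)
# small loadings are set exactly to zero: the corresponding
# variables drop out of the component, i.e. are not selected
```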
library(mixOmics)
data(srbct)
X <- srbct$gene
Y <- srbct$class
summary(Y)

EWS  BL  NB RMS
 23   8  12  20
dim(X); length(Y)
[1] 63 2308
[1] 63
Quick start
For a quick start we arbitrarily set the number of variables to select to 50 on each of the two components of PLS-DA.
# 1 Run the method
MyResult.splsda <- splsda(X, Y, keepX = c(50, 50))
# 2 Plot the samples
plotIndiv(MyResult.splsda)
# 3 Plot the variables
plotVar(MyResult.splsda)
# Selected variables on component 1
selectVar(MyResult.splsda, comp = 1)$name
Plot samples
[Sample plot (plotIndiv): the 63 samples projected onto the first two sPLS-DA components; X-variate 1: 6% expl. var, X-variate 2: 6% expl. var.]
Plot variables
[Correlation circle plot (plotVar): the selected genes displayed on Component 1 and Component 2.]
Comments
- We can observe a clear discrimination between the BL samples and the others on the first component (x-axis), and EWS vs the others on the second component (y-axis).
- Remember that this discrimination spanned by the first two PLS-DA components is obtained based on a subset of 100 variables (50 selected on each component).
- On the plotIndiv output, the axis labels indicate the amount of variation explained per component.
- Note that the interpretation of this amount is not the same as in PCA. In PLS-DA, the aim is to maximise the covariance between X and Y, not only the variance of X as is the case in PCA!
PLS-DA
PLS-DA without variable selection can be performed as:
MyResult.plsda <- plsda(X, Y)  # 1 Run the method
plotIndiv(MyResult.plsda)      # 2 Plot the samples
plotVar(MyResult.plsda)        # 3 Plot the variables
Customize sample plots
plotIndiv(MyResult.splsda, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, star = TRUE, title = 'sPLS-DA on SRBCT',
          X.label = 'PLS-DA 1', Y.label = 'PLS-DA 2')
[Sample plot 'sPLS-DA on SRBCT': samples on PLS-DA 1 vs PLS-DA 2, with confidence ellipses and a legend for the four classes (EWS, BL, NB, RMS).]
Customize variable plots
plotVar(MyResult.splsda, var.names = FALSE)
[Correlation circle plot: the selected variables shown without names on Component 1 and Component 2.]
Customize variable plots
plotVar(MyResult.plsda, cutoff = 0.7)
[Correlation circle plot: only the variables with a correlation above 0.7 are displayed.]
In this particular case, no variable selection was performed. Only the display was altered to show a subset of variables.
Other useful plots
ROC
As PLS-DA acts as a classifier, we can plot a ROC curve to complement the sPLS-DA classification performance results detailed later. The AUC is calculated from training cross-validation sets and averaged.
auc.plsda <- auroc(MyResult.splsda)
[ROC curve, Component 1: sensitivity (%) vs 100 − specificity (%). AUC per outcome: BL vs Other(s): 1; EWS vs Other(s): 0.5576; NB vs Other(s): 0.518; RMS vs Other(s): 0.6814.]
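These one-vs-rest AUC values have a simple interpretation: the probability that a randomly chosen sample of the class receives a higher prediction score than a randomly chosen sample of the other classes. A minimal base-R sketch (toy scores of ours, not the srbct results) computes it via the Wilcoxon rank-sum formula:

```r
# One-vs-rest AUC via the Wilcoxon rank-sum formula.
auc_ovr <- function(score, is.positive) {
  r <- rank(score)                       # ranks, ties averaged
  n.pos <- sum(is.positive)
  n.neg <- sum(!is.positive)
  (sum(r[is.positive]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy predicted scores for 6 samples, the first 3 in the positive class
score <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.1)
is.positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_ovr(score, is.positive)
# 8 of the 9 positive/negative pairs are correctly ordered, so AUC = 8/9
```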
Variable selection outputs
First, note that the number of variables to select does not need to be identical on each component, for example:
MyResult.splsda2 <- splsda(X,Y, ncomp=3, keepX=c(15,10,5))
Selected variables are listed in the selectVar function:
selectVar(MyResult.splsda2, comp=1)$value
        value.var
g123   0.53516982
g846   0.41271455
g335   0.30309695
g1606  0.30194141
g836   0.29365241
g783   0.26329876
g758   0.25826903
g1386  0.23702577
g1158  0.15283961
g585   0.13838913
g589   0.12738682
g1387  0.12202390
g1884  0.08458869
g1295  0.03150351
g1036  0.00224886
plotLoadings
Selected variables can be visualised in plotLoadings with the argument contrib = 'max', which assigns to each variable bar the colour of the sample group for which the mean (method = 'mean') is maximum. See example(plotLoadings) for other options (e.g. min, median).
plotLoadings(MyResult.splsda2, contrib = 'max', method = 'mean')
[Loadings plot, Contribution on comp 1: bars for the 15 selected genes, coloured by outcome (EWS, BL, NB, RMS).]
Comments
- Interestingly, from this plot we can see that all selected variables on component 1 are highly expressed in the BL (orange) class.
- Setting contrib = 'min' would highlight that those variables are lowly expressed in the NB (grey) class, which makes sense when we look at the sample plot.
- Since 4 classes are being discriminated here, sample plots in 3D may help the interpretation:
plotIndiv(MyResult.splsda2, style="3d")
Tuning parameters and numerical outputs
For this set of methods, three parameters need to be chosen:
1 - The number of components to retain, ncomp. The rule of thumb is usually K − 1, where K is the number of classes, but it is worth testing a few extra components.
2 - The number of variables, keepX, to select on each component for sparse PLS-DA,
3 - The prediction distance to evaluate the classification and prediction performance of PLS-DA.
- For item 1, the perf function evaluates the performance of PLS-DA for a large number of components, using repeated k-fold cross-validation.
- For example, here we use 3-fold CV repeated 10 times (note that we advise to use at least 50 repeats, and to choose a number of folds that is appropriate for the sample size of the data set):
MyResult.plsda2 <- plsda(X, Y, ncomp = 10)
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
MyPerf.plsda <- perf(MyResult.plsda2, validation = "Mfold", folds = 3,
                     progressBar = FALSE, nrepeat = 10) # we suggest nrepeat = 50

# type attributes(MyPerf.plsda) to see the different outputs
# slight bug in the output function currently, see the quick fix below
# plot(MyPerf.plsda, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

# quick fix
matplot(MyPerf.plsda$error.rate$BER, type = 'l', lty = 1,
        col = color.mixo(5:7),
        main = 'Balanced Error rate')
legend('topright',
       c('max.dist', 'centroids.dist', 'mahalanobis.dist'),
       lty = 1, col = color.mixo(5:7))
[Balanced error rate vs number of components (1 to 10), for max.dist, centroids.dist and mahalanobis.dist.]
Comments
- The plot outputs the classification error rate (or the balanced classification error rate when the number of samples per group is unbalanced) with its standard deviation, for each of the three prediction distances.
- Here we can see that for the BER and the maximum distance, the best performance (i.e. low error rate) seems to be achieved for ncomp = 3.
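The BER itself is easy to compute by hand: it is the mean of the per-class error rates, so a large class cannot dominate the overall error. A minimal base-R sketch (toy confusion counts of ours, not the srbct results):

```r
# Balanced error rate (BER) from a confusion matrix (rows = true class).
# The toy counts below are illustrative.
ber <- function(conf) {
  mean(1 - diag(conf) / rowSums(conf))
}

conf <- matrix(c(90, 10,    # class A: 90 correct, 10 misclassified
                  5,  5),   # class B:  5 correct,  5 misclassified
               nrow = 2, byrow = TRUE)
1 - sum(diag(conf)) / sum(conf)   # overall error rate, dominated by class A
ber(conf)                         # BER = mean(0.10, 0.50) = 0.30
```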
Prediction performance
In addition (item 3 for PLS-DA), the numerical outputs listed here can be reported as performance measures:
MyPerf.plsda
Call:
 perf.plsda(object = MyResult.plsda2, validation = "Mfold", folds = 3, nrepeat = 10, progressBar = FALSE)

Main numerical outputs:
--------------------
Error rate (overall or BER) for each component and for each distance: see object$error.rate
Error rate per class, for each component and for each distance: see object$error.rate.class
Prediction values for each component: see object$predict
Classification of each sample, for each component and for each distance: see object$class
AUC values: see object$auc if auc = TRUE

Visualisation Functions:
--------------------
plot
The number of variables
- Regarding item 2, we now use tune.splsda to assess the optimal number of variables to select on each component.
- We first set up a grid of keepX values that will be assessed on each component, one component at a time.
- Similar to above, we run 3-fold CV repeated 10 times, with the maximum distance prediction defined as above.
list.keepX <- c(5:10, seq(20, 100, 10))
list.keepX # to output the grid of values tested
[1] 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
set.seed(30) # for reproducibility
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3,
                                 validation = 'Mfold', folds = 3,
                                 dist = 'max.dist', progressBar = FALSE,
                                 measure = "BER", test.keepX = list.keepX,
                                 nrepeat = 10) # we suggest nrepeat = 50
Comments
We can then extract the classification error rate averaged across all folds and repeats for each tested keepX value, the optimal number of components (see ?tune.splsda for more details), and the optimal number of variables to select per component, which is summarised in a plot where the diamond indicates the optimal keepX value:
error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
# optimal number of components based on t-tests on the error rate
ncomp
[1] 3
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]
# optimal number of variables to select
select.keepX

comp1 comp2 comp3
   50    50    70
Tune
plot(tune.splsda.srbct, col = color.jet(ncomp))
[Balanced error rate vs number of selected features (5 to 100), for comp1, comp1 to comp2, and comp1 to comp3; the diamonds mark the optimal keepX values.]
Final model
Based on those tuning results, we can run our final and tuned sPLS-DA model:
MyResult.splsda.final <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
plotIndiv(MyResult.splsda.final, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, title = "sPLS-DA, Final result")
Final comment
- Additionally, we can run `perf` for the final performance of the sPLS-DA model. Also note that `perf` will output `features`, which lists the frequency of selection of the variables across the different folds and different repeats.
- This is a useful output to assess the confidence of your final variable selection, see a more [detailed example here](http://mixomics.org/case-studies/splsda-srbct/).
Take Home Message: PLS-DA and sPLS-DA
- Dimension reduction approach
- Supervised method for a qualitative response variable
- Similar to discriminant analysis
- The sparse version enables variable selection