PLS - Discriminant Analysis (PLS-DA) and Sparse PLS-DA
Transcript of PLS - Discriminant Analysis (PLS-DA) and Sparse PLS-DA
B. Liquet1,2,3
1 LMAP, Université de Pau et des Pays de l'Adour
2 UPPA-E2S
3 ACEMS, QUT, Australia
Introduction
PLSDA overview
[Overview diagram: PLS-DA links a quantitative data matrix X to a qualitative response Y.]
Biological question
I am analysing a single data set (e.g. transcriptomics data) and I would like to classify my samples into known groups and predict the class of new samples. In addition, I am interested in identifying the key variables that drive such discrimination.
The srbct study
The data are directly available in a processed and normalised format from the mixOmics package. The Small Round Blue Cell Tumours (SRBCT) dataset includes the expression levels of 2,308 genes measured on 63 samples.
- The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
gene: a data frame with 63 rows and 2,308 columns. The expression levels of 2,308 genes in 63 subjects.
class: a class vector containing the tumour class of each individual (4 classes in total).
gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.
To illustrate PLS-DA, we will analyse the gene expression levels of srbct$gene to discriminate the 4 groups of tumours.
Principle of sparse PLS-DA
Although Partial Least Squares was not originally designed for classification and discrimination problems, it has often been used for that purpose.
The response matrix `Y` is qualitative and is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each of the response categories is coded via an indicator variable.
The PLS regression (now PLS-DA) is then run as if `Y` were a continuous matrix.
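As a minimal base-R sketch of this recoding (not the mixOmics internal code; the toy class labels are ours), the dummy block matrix can be built with model.matrix:

```r
# Recode a qualitative response as a dummy (indicator) block matrix.
# The toy class labels below are illustrative, not the srbct data.
Y.toy <- factor(c("EWS", "BL", "NB", "BL"))
Y.dummy <- model.matrix(~ Y.toy - 1)   # one indicator column per class
colnames(Y.dummy) <- levels(Y.toy)
Y.dummy
# each row contains a single 1 marking the class membership of that sample
```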
Sparse PLS-DA performs variable selection and classification in a one-step procedure. sPLS-DA is a special case of the sparse PLS described previously, where the ℓ1 penalisation is applied on the loading vectors associated with the X data set.
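The sparsity-inducing effect of the ℓ1 penalty can be illustrated with soft-thresholding, a minimal base-R sketch (the loading values and penalty level below are ours, not taken from the sPLS-DA algorithm output):

```r
# Soft-thresholding: how an l1 penalty shrinks a loading vector.
# Values and penalty level are illustrative.
soft_threshold <- function(a, lambda) {
  sign(a) * pmax(abs(a) - lambda, 0)
}

a <- c(0.9, -0.05, 0.3, 0.01, -0.6)   # a dense loading vector
soft_threshold(a, lambda = 0.1)
# small loadings are set exactly to zero: the corresponding
# variables drop out of the component, i.e. are not selected
```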
library(mixOmics)
data(srbct)
X <- srbct$gene
Y <- srbct$class
summary(Y)

EWS  BL  NB RMS
 23   8  12  20
dim(X); length(Y)
[1] 63 2308
[1] 63
Quick start
For a quick start we arbitrarily set the number of variables to select to 50 on each of the two components of PLS-DA.
# 1 Run the method
MyResult.splsda <- splsda(X, Y, keepX = c(50, 50))
# 2 Plot the samples
plotIndiv(MyResult.splsda)
# 3 Plot the variables
plotVar(MyResult.splsda)
# Selected variables on component 1
selectVar(MyResult.splsda, comp = 1)$name
Plot samples
[Sample plot (plotIndiv): the 63 samples projected onto the first two sPLS-DA components; X-variate 1: 6% expl. var, X-variate 2: 6% expl. var.]
Plot variables
[Correlation circle plot (plotVar): the selected genes displayed on Component 1 and Component 2.]
Comments
- We can observe a clear discrimination between the BL samples and the others on the first component (x-axis), and EWS vs the others on the second component (y-axis).
- Remember that this discrimination spanned by the first two PLS-DA components is obtained based on a subset of 100 variables (50 selected on each component).
- On the plotIndiv output, the axis labels indicate the amount of variation explained per component.
- Note that the interpretation of this amount is not the same as in PCA. In PLS-DA, the aim is to maximise the covariance between X and Y, not only the variance of X as is the case in PCA!
PLS-DA
PLS-DA without variable selection can be performed as:
MyResult.plsda <- plsda(X, Y)  # 1 Run the method
plotIndiv(MyResult.plsda)      # 2 Plot the samples
plotVar(MyResult.plsda)        # 3 Plot the variables
Customize sample plots
plotIndiv(MyResult.splsda, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, star = TRUE, title = 'sPLS-DA on SRBCT',
          X.label = 'PLS-DA 1', Y.label = 'PLS-DA 2')
[Sample plot 'sPLS-DA on SRBCT': samples on PLS-DA 1 vs PLS-DA 2, with confidence ellipses and a legend for the four classes (EWS, BL, NB, RMS).]
Customize variable plots
plotVar(MyResult.splsda, var.names = FALSE)
[Correlation circle plot: the selected variables shown without names on Component 1 and Component 2.]
Customize variable plots
plotVar(MyResult.plsda, cutoff = 0.7)
[Correlation circle plot: only the variables with a correlation above 0.7 are displayed.]
In this particular case, no variable selection was performed. Only the display was altered to show a subset of variables.
Other useful plots
ROC
As PLS-DA acts as a classifier, we can plot a ROC curve to complement the sPLS-DA classification performance results detailed later. The AUC is calculated from training cross-validation sets and averaged.
auc.plsda <- auroc(MyResult.splsda)
[ROC curve, Component 1: sensitivity (%) vs 100 − specificity (%). AUC per outcome: BL vs Other(s): 1; EWS vs Other(s): 0.5576; NB vs Other(s): 0.518; RMS vs Other(s): 0.6814.]
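These one-vs-rest AUC values have a simple interpretation: the probability that a randomly chosen sample of the class receives a higher prediction score than a randomly chosen sample of the other classes. A minimal base-R sketch (toy scores of ours, not the srbct results) computes it via the Wilcoxon rank-sum formula:

```r
# One-vs-rest AUC via the Wilcoxon rank-sum formula.
auc_ovr <- function(score, is.positive) {
  r <- rank(score)                       # ranks, ties averaged
  n.pos <- sum(is.positive)
  n.neg <- sum(!is.positive)
  (sum(r[is.positive]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy predicted scores for 6 samples, the first 3 in the positive class
score <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.1)
is.positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_ovr(score, is.positive)
# 8 of the 9 positive/negative pairs are correctly ordered, so AUC = 8/9
```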
Variable selection outputs
First, note that the number of variables to select does not need to be identical on each component, for example:
MyResult.splsda2 <- splsda(X,Y, ncomp=3, keepX=c(15,10,5))
Selected variables are listed in the selectVar function:
selectVar(MyResult.splsda2, comp=1)$value
        value.var
g123   0.53516982
g846   0.41271455
g335   0.30309695
g1606  0.30194141
g836   0.29365241
g783   0.26329876
g758   0.25826903
g1386  0.23702577
g1158  0.15283961
g585   0.13838913
g589   0.12738682
g1387  0.12202390
g1884  0.08458869
g1295  0.03150351
g1036  0.00224886
plotLoadings
Selected variables can be visualised in plotLoadings with the argument contrib = 'max', which assigns to each variable bar the colour of the sample group for which the mean (method = 'mean') is maximum. See example(plotLoadings) for other options (e.g. min, median).
plotLoadings(MyResult.splsda2, contrib = 'max', method = 'mean')
[Loadings plot, Contribution on comp 1: bars for the 15 selected genes, coloured by outcome (EWS, BL, NB, RMS).]
Comments
- Interestingly, from this plot we can see that all selected variables on component 1 are highly expressed in the BL (orange) class.
- Setting contrib = 'min' would highlight that those variables are lowly expressed in the NB (grey) class, which makes sense when we look at the sample plot.
- Since 4 classes are being discriminated here, sample plots in 3D may help the interpretation:
plotIndiv(MyResult.splsda2, style="3d")
Tuning parameters and numerical outputs
For this set of methods, three parameters need to be chosen:
1 - The number of components to retain, ncomp. The rule of thumb is usually K − 1, where K is the number of classes, but it is worth testing a few extra components.
2 - The number of variables, keepX, to select on each component for sparse PLS-DA,
3 - The prediction distance to evaluate the classification and prediction performance of PLS-DA.
- For item 1, the perf function evaluates the performance of PLS-DA for a large number of components, using repeated k-fold cross-validation.
- For example, here we use 3-fold CV repeated 10 times (note that we advise to use at least 50 repeats, and to choose a number of folds that is appropriate for the sample size of the data set):
MyResult.plsda2 <- plsda(X, Y, ncomp = 10)
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
MyPerf.plsda <- perf(MyResult.plsda2, validation = "Mfold", folds = 3,
                     progressBar = FALSE, nrepeat = 10) # we suggest nrepeat = 50

# type attributes(MyPerf.plsda) to see the different outputs
# slight bug in the output function currently, see the quick fix below
# plot(MyPerf.plsda, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

# quick fix
matplot(MyPerf.plsda$error.rate$BER, type = 'l', lty = 1,
        col = color.mixo(5:7),
        main = 'Balanced Error rate')
legend('topright',
       c('max.dist', 'centroids.dist', 'mahalanobis.dist'),
       lty = 1, col = color.mixo(5:7))
[Balanced error rate vs number of components (1 to 10), for max.dist, centroids.dist and mahalanobis.dist.]
Comments
- The plot outputs the classification error rate (or the balanced classification error rate when the number of samples per group is unbalanced) with its standard deviation, for each of the three prediction distances.
- Here we can see that for the BER and the maximum distance, the best performance (i.e. low error rate) seems to be achieved for ncomp = 3.
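The BER itself is easy to compute by hand: it is the mean of the per-class error rates, so a large class cannot dominate the overall error. A minimal base-R sketch (toy confusion counts of ours, not the srbct results):

```r
# Balanced error rate (BER) from a confusion matrix (rows = true class).
# The toy counts below are illustrative.
ber <- function(conf) {
  mean(1 - diag(conf) / rowSums(conf))
}

conf <- matrix(c(90, 10,    # class A: 90 correct, 10 misclassified
                  5,  5),   # class B:  5 correct,  5 misclassified
               nrow = 2, byrow = TRUE)
1 - sum(diag(conf)) / sum(conf)   # overall error rate, dominated by class A
ber(conf)                         # BER = mean(0.10, 0.50) = 0.30
```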
Prediction performance
In addition (item 3 for PLS-DA), the numerical outputs listed here can be reported as performance measures:
MyPerf.plsda
Call:
 perf.plsda(object = MyResult.plsda2, validation = "Mfold", folds = 3, nrepeat = 10, progressBar = FALSE)

Main numerical outputs:
--------------------
Error rate (overall or BER) for each component and for each distance: see object$error.rate
Error rate per class, for each component and for each distance: see object$error.rate.class
Prediction values for each component: see object$predict
Classification of each sample, for each component and for each distance: see object$class
AUC values: see object$auc if auc = TRUE

Visualisation Functions:
--------------------
plot
The number of variables
- Regarding item 2, we now use tune.splsda to assess the optimal number of variables to select on each component.
- We first set up a grid of keepX values that will be assessed on each component, one component at a time.
- Similar to above, we run 3-fold CV repeated 10 times, with the maximum distance prediction defined as above.
list.keepX <- c(5:10, seq(20, 100, 10))
list.keepX # to output the grid of values tested
[1] 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
set.seed(30) # for reproducibility
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3,
                                 validation = 'Mfold', folds = 3,
                                 dist = 'max.dist', progressBar = FALSE,
                                 measure = "BER", test.keepX = list.keepX,
                                 nrepeat = 10) # we suggest nrepeat = 50
Comments
We can then extract the classification error rate averaged across all folds and repeats for each tested keepX value, the optimal number of components (see ?tune.splsda for more details), and the optimal number of variables to select per component, which is summarised in a plot where the diamond indicates the optimal keepX value:
error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
# optimal number of components based on t-tests on the error rate
ncomp
[1] 3
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]
# optimal number of variables to select
select.keepX

comp1 comp2 comp3
   50    50    70
Tune
plot(tune.splsda.srbct, col = color.jet(ncomp))
[Balanced error rate vs number of selected features (5 to 100), for comp1, comp1 to comp2, and comp1 to comp3; the diamonds mark the optimal keepX values.]
Final model
Based on those tuning results, we can run our final and tuned sPLS-DA model:
MyResult.splsda.final <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
plotIndiv(MyResult.splsda.final, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, title = "sPLS-DA, Final result")
Final comment
- Additionally, we can run `perf` for the final performance of the sPLS-DA model. Also note that `perf` will output `features`, which lists the frequency of selection of the variables across the different folds and different repeats.
- This is a useful output to assess the confidence of your final variable selection, see a more [detailed example here](http://mixomics.org/case-studies/splsda-srbct/).
Take Home Message: PLS-DA and sPLS-DA
- Dimension reduction approach
- Supervised method for a qualitative response variable
- Similar to discriminant analysis
- The sparse version enables variable selection