mixOmics: an R package for ‘omics feature selection and ... · Integrative Multivariate ‘omics...

Integrative Multivariate ‘omics Analysis with mixOmics • in draft •

mixOmics: an R package for ‘omics feature selection andmultiple data integration

Florian Rohart1, Benoît Gautier1, Amrit Singh2,3, and Kim-Anh Lê Cao∗1

1The University of Queensland Diamantina Institute, Translational ResearchInstitute, Brisbane, QLD 4102, Australia,

2UBC James Hogg Research Centre for Heart Lung Innovation, St. Paul’sHospital, University of British Columbia, Vancouver, BC, Canada.

3Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, BC,Canada.

Abstract

The advent of high throughput technologies has led to a wealth of publicly available biological datacoming from different sources, the so-called ‘omics data (transcriptomics for the study of transcripts,proteomics for proteins, metabolomics for metabolites, etc). Combining such large-scale biological datasets can lead to the discovery of important biological insights, provided that relevant information can beextracted in a holistic manner. Current statistical approaches have been focusing on identifying smallsubsets of molecules (a ‘molecular signature’) that explains or predicts biological conditions, but mainlyfor the analysis of a single data set. In addition, commonly used methods are univariate and considereach biological feature independently. In contrast, linear multivariate methods adopt a system biologyapproach by statistically integrating several data sets at once and offer an unprecedented opportunity toprobe relationships between heterogeneous data sets measured at multiple functional levels.

mixOmics is an R package which provides a wide range of linear multivariate methods for dataexploration, integration, dimension reduction and visualisation of biological data sets. The methodswe have developed extend Projection to Latent Structure (PLS) models for discriminant analysis anddata integration and include `1 penalisations to identify molecular signatures. Here we introduce themixOmics methods specifically developed to integrate large data sets, either at the N-level, where the sameindividuals are profiled using different ‘omics platforms (same N), or at the P-level, where independentstudies including different individuals are generated under similar biological conditions using the same‘omics platform (same P). In both cases, the main challenge to face is data heterogeneity, due to inherentplatform-specific artefacts (N-integration), or systematic differences arising from experiments assayedat different geographical sites or different times (P-integration). We present and illustrate those novelmultivariate methods on existing ‘omics data available from the package.

I. Introduction

The advent of novel ‘omics technologies (e.g. transcriptomics, proteomics, metabololomics, etc)has enabled new opportunities for biological and medical research discoveries. Commonly, each

∗Corresponding author [email protected]

1

.CC-BY-NC-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted February 14, 2017. . https://doi.org/10.1101/108597doi: bioRxiv preprint

mailto:[email protected]

https://doi.org/10.1101/108597

http://creativecommons.org/licenses/by-nc-nd/4.0/


feature from each technology (transcripts, proteins, metabolites, etc) is analysed independentlythrough univariate statistical methods such as ANOVA, linear model or t-tests. Such analysisignores relationships between the different features and may miss crucial biological information.Indeed, biological features act in concert to modulate and influence biological systems andsignalling pathways. Multivariate approaches, which model features as a set, can therefore providea more insightful picture of a biological system, and complement the results obtained fromunivariate methods. In mixOmics we considered multivariate projection-based methodologies for‘omics data analysis Meng et al. (2016) because of several appealing properties. Firstly, they arecomputationally efficient to handle large datasets, where the number of biological features (usuallyin the thousands) is much larger than the number of samples (usually less than 50). Secondly, theyperform dimension reduction by projecting the data into a smaller subspace while capturing andhighlighting the largest sources of variation from the data, resulting in powerful visualisation ofof the system under study. Lastly, they are highly flexible to answer various biological questions(Boulesteix and Strimmer, 2007): mixOmics multivariate methods have been successfully applied inseveral recent studies to identify biomarkers in ‘omics studies ranging from metabolomics, brainimaging to microbiome and statistically integrate data sets generated from difference biologicalsources (Labus et al., 2015; Cook et al., 2016; Guidi et al., 2016; Mahana et al., 2016; Ramanan et al.,2016; Rollero et al., 2016).

In this paper, we introduce the mixOmics multivariate methods developed for supervised analysis,where the aims are to classify or discriminate sample groups, to identify the most discriminantsubset of biological features, and to predict the class of new samples. In particular, our twonovel frameworks were implemented for the integration of multiple data sets. DIABLO enablesthe integration of the same biological N samples measured on different ‘omics platforms with(N-integration, Singh et al. 2016), MINT enables the integration of several independent data sets orstudies measured on the same P predictors (P-integration, Rohart et al. 2016a). One of the mainchallenges in N- and P-integration is to overcome the technical variance among ’omics platforms- either between different types of ‘omics, or within the same type of ‘omics but generatedfrom several laboratories, to extract common information. To date, very few statistical methodscan perform N- and P-integration in a supervised context. For instance, N-integration is oftenperformed by concatenating all the different ’omics datasets (Liu et al., 2013), thus ignoring theheterogeneity between ‘omics platforms, or by combining the molecular signatures identified fromseparate analyses of each ‘omics platform (Günther et al., 2012), thus disregarding the relationshipsbetween the different ‘omics functional levels. With P-integration, statistical methods are oftensequentially combined to accommodate for technical differences among studies or platformsbefore classifying samples. Such sequential approach is not appropriate for the prediction of newsamples as they are prone to overfitting (Rohart et al., 2016a). Our two promising frameworks havethe high potential to lead to new discoveries by either modelling relationships between differenttypes of ‘omics data (N-integration) or by enabling the integrative analysis of independent ‘omicsstudies and increasing sample size and statistical power (P-integration).

The present article introduces the main functionalities in mixOmics, presents our multivariateframeworks for the identification of molecular signatures in one and several data sets and illustrateseach framework in a case study available in the package.

II. The mixOmics R package.

mixOmics is a user-friendly R package dedicated to data exploration, mining, integration andvisualisation. It provides a wide range of innovative multivariate methods for the analysis andintegration of large data sets in several settings (sparse PLS-DA, DIABLO and MINT, Fig. 1) with

2


https://doi.org/10.1101/108597



appealing outputs such as (i) insightful visualisations of the data whose dimension has beenreduced with the use of latent components, (ii) identification of molecular signatures and (iii)improved usage with common calls to all visualisation and performance assessment methods (seea list of those S3 functions in Suppl. S1).

Multivariate projection-based methods. The multivariate dimension reduction techniques im-plemented in mixOmics perform unsupervised analyses such as Principal Component Analysis(using NonLinear Iterative Partial Least Squares, Wold 1975), Independent Component Analysis(Yao et al., 2012), Partial Least Squares regression (PLS, also known as Projection to Latent Struc-tures, Wold 1966), regularised Canonical Correlation Analysis (rCCA, González et al. 2008) andGeneralised Canonical Correlation Analysis (rGCCA, based on a PLS algorithm Tenenhaus andTenenhaus 2011), multi-group PLS (Eslami et al., 2013a) as well as supervised analyses such asPLS-Discriminant Analysis (PLS-DA, Nguyen and Rocke 2002b,a; Boulesteix 2004), and recentlyGCC-DA (Singh et al., 2016) and multi-group PLS-DA (Rohart et al., 2016a).

While each multivariate method aims at answering a specific biological question, the uniquenessof the mixOmics package is to provide novel sparse variants to enable the identification of keypredictors (e.g. genes, proteins, metabolites) in large biological data sets. Feature selection isperformed via `1 regularisation (LASSO, Tibshirani 1996), which is implemented directly intothe optimisation of the statistical criterion specific to each method. Such criterion include themaximisation of the most important source of variation in the data, of the covariance or correlationbetween different ‘omics sets, or of the segregation of a categorical outcome of interest. Solving theoptimisation criterion enables to seek for latent components and loading vectors. Latent componentsare linear combinations of the original predictors, where each predictor is assigned a coefficientindicated in the loading vectors. The ourrefore, linear multivariate methods reduce the dimensionof the data into a space spanned by a few components, by projecting the samples into a smaller,interpretable space.

In mixOmics methods, the parameters to choose include the total number of components, alsocalled dimensions H, and the `1 penalty on each dimension for all sparse methods. Contraryto other R packages implementing `1 penalisation methods (e.g. glmnet, Friedman et al. 2010,PMA, Witten et al. 2013), and it order to improve usability of the methods, the `1 parameter issolved via soft-thresholding and equivalently replaced by the number of features to select on eachdimension. In our multivariate models, the tuning of the number of features to select is performedvia repeated cross-validation. The result is a selection of a subset of correlated features that bestdiscriminate the outcome and constitute a molecular signature.

Historically, our first methods were dedicated to the integration of two ‘omics data sets(González et al., 2008; Lê Cao et al., 2008, 2009b,a), or the discriminant analysis of a single ‘omicsdata set (Lê Cao et al., 2011). The integrative methods presented in this manuscript focus onthe integration of multiple biological data sets to address cutting-edge biological and biomedicalquestions.

Implementation. mixOmics is fully implemented in the R language and exports more than 30functions for either performing a statistical analysis, tuning its parameters or visualising itsresults. mixOmics mainly depends on the R base packages (parallel, methods, grDevices, graphics,stats, utils) and recommended packages (MASS, lattice), but also imports functions from a limitednumber of other R packages (igraph, rgl, ellipse, corpcor, RColorBrewer, plyr, dplyr, tidyr, reshape2,ggplot2). In mixOmics, we provide generic R/S3 functions to assess the performance of themethods (predict, plot, print, perf, auroc, etc) and visualisation the results (plotIndiv,plotArrow, plotVar, plotLoadings, etc) as described in the next paragraph.

3


https://doi.org/10.1101/108597



DATAIN

PUT

METHO

DGR

APHICS

sparsePLS-DA DIABLO MINT

Sample

plots

Varia

ble

plots

SINGLE‘OMICS INTEGRATIVE‘OMICS

X

(N xP)

Y11231

X1

(N xP1)

X2

(NxP2)

XQ

(NxPQ)…

Y11231

(N1 xP)

(N2 xP)

XM (NM xP)

…

Y1

Y2

YM

112323122

2133

…

X2

X1

Figure 1: Overview of the mixOmics multivariate methods for single (sparse PLS-DA) and integrative (DIABLOand MINT) ‘omics supervised analyses. X denote a predictor dataset, and Y a categorical outcome response.Integrative analyses include N-integration (across studies generated on the same N samples and differenttypes of predictor features), and P-integration (the same P predictors are measured on independent studies).See also Suppl. S1 for a summary of the different method call and plot functions.

Currently, seventeen methods are implemented in mixOmics to integrate large biologicaldatasets, amongst which twelve have similar names (mint).(block).(s)pls(da) (see Table 1) asthey are wrappers of a single main hidden function of mixOmics. The wrapper functions checkand shape the input parameters before passing them to the hidden function that extends theSGCCA algorithm (Tenenhaus et al., 2014) to perform either N- or P-integration. The remainingfour statistical methods are PCA, sparse PCA, IPCA, rCCA and rGCCA. Each statistical methodimplemented in mixOmics returns a list of essential outputs which are used in our S3 visualisationfunctions.

Graphical outputs to visualise multivariate analysis results. mixOmics aims to provide insight-ful and user-friendly graphical outputs to interpret the statistical and biological results, someof which were introduced in González et al. 2012. Thanks to R/S3 functions as listed in S1, thefunction calls are identical for all multivariate methods implemented in the mixOmics package,as we illustrate in the next sections. We provide various visualisations, including sample plotsand feature plots, which are based on the component scores and the loading vectors, respectively.

4


https://doi.org/10.1101/108597



Framework sparse Function name

Single ’omicsunsupervised

- pca- ipcaX spca

supervised- plsdaX splsda

Two ’omics unsupervised- rcca- plsX spls

P-integrationunsupervised

- mint.plsX mint.spls

supervised- mint.plsdaX mint.splsda

N-integrationunsupervised

- wrapper.rgcca- block.plsX block.spls

supervised- block.plsdaX block.splsda (DIABLO)

Table 1: Seventeen statistical methods available in mixOmics

Here we list the main important visualisation functions in mixOmics.

• plotIndiv (Sample plot): represents samples by plotting the component scores. Such plotvisualises similarities between samples in the small subspace spanned by the components.For the integrative methods described in Sections V and IV, samples from each data set, oreach study are represented on separate plots, allowing to visualise the agreement betweenthe data sets at the sample level. Confidence ellipse plots for each class can be displayed.

• plotArrow (Arrow representation): plots the components scores associated to either X data(start of the arrow) or Y outcome (tip of the arrow). As such, short arrows indicate a gooddiscrimination of the classes. In the case of N-integration, the start of the arrow indicates thecentroid between all data sets for a given sample and the tips of the arrows the location ofthat sample in each data set. In that specific case, short arrows indicate a strong agreementbetween the matching data sets, long arrows a disagreement between the matching data sets.

• plotVar (Correlation circle plots): displays features selected by the multivariate method.Each feature coordinate is defined as the Pearson correlation between the original data andthe loading vector for each dimension (see González et al. (2012) for a detailed description).Correlation circle plots are particularly useful to visualise the contribution of each feature todefine each component (feature close to the large circle of radius 1), as well as the correlationstructure between features (clusters of features). The cosine angle between any two pointsrepresent the correlation (negative positive, null) between two features.

Both plotIndiv and plotVar offer usual plot arguments to display symbols, colours andlegend. Graphic styles include default ggplot2, graphics, lattice and 3D plots.

• cim (Clustered Image Maps): heatmap plots to visualise the distances between features withrespect to each sample. By default we use Euclidian distance and complete linkage method.

5


https://doi.org/10.1101/108597



For the specific case of N-integration, the function cimDiablo represents the selected featuresfrom the different data sets.

• network (Relevance networks): represents the correlation structure between features ofdifferent types. A similarity matrix representing the association between pairs of featuresacross all components is calculated as the sum of the correlations between the originalfeatures and the loading vector across all dimensions of interest h = 1, . . . , H (see Gonzálezet al. (2012) for more details). Those networks are bipartite and thus only a link between twofeatures of different types is represented.

• plotLoadings: represents the loading coefficient of each feature selected on each dimensionof the multivariate model. Features are ranked according to their contribution to thecomponent (bottom to top), colors indicate the class for which the mean (or median)expression value is the highest (or the lowest) for each feature. Such graphical output enablesmore insight into the molecular signature, especially when interpreted in conjunction withthe sample plot.

Other graphical outputs are available in mixOmics to represent classification performance ofmultivariate models using the generic function plot. The listing of the functions for eachframework presented here are summarised in Suppl. S1.

General notations. We assume each data set has been normalised using appropriate techniquesspecific for the type of ‘omics platform. Let X denote a data matrix of size N observations (rows)× P predictors (e.g. expression levels of P genes, in columns). The categorical outcome y isexpressed as a dummy matrix Y in which each column represents one outcome category andeach row indicates the class membership of each sample. Y is of size N observations (rows) ×K categories outcome (columns). We denote for all a ∈ Rn its `1 norm ||a||1 = ∑

p1 |aj| and its `2

norm ||a||2 = (∑p1 a2

j )1/2. For any matrix we denote by > its transpose.

III. Multivariate analysis of one data set

Linear Discriminant Analysis (LDA) and Projection to Latent Structure (PLS, Wold 1966) arepopular multivariate methods for supervised analyses. In mixOmics we mainly focus on PLSmethods for their flexibility to solve a variety of analytical problems (Boulesteix and Strimmer,2007). PLS regression (Wold, 1966) was originally developed for unsupervised analysis to integratetwo continuous data sets measured on the same observations. We introduce here a supervisedversion, called PLS-Discriminant Analysis (PLS-DA, (Nguyen and Rocke, 2002a; Barker and Rayens,2003), a natural extension that substitutes one of the data set for a dummy matrix Y. PLS-DA fitsa classifier multivariate model that assigns samples into known classes, with the ultimate aim topredict the classes of external test samples where the outcome is often unknown.

6


https://doi.org/10.1101/108597



componentst1 =X1 a1t2 =X2 a2

max cov(t,Y)t1 t2 t3

112312

y1….……………….……….P1....N

X(NxP)

associatedloadingvectors a2

a2

a3

……

…

Figure 2: Example of data matrix decomposition for single ‘omics analysis. The predictor matrix X is decomposedinto a set of components (t1, . . . , tH) and associated loading vectors (a1, . . . , aH), Y is the outcome coded asa dummy matrix and combined linearly (see exact formula in Equation (1)). Xh is the deflated (residual)matrix starting with X1 = X, for h = 1 . . . H the dimension of the model (number of components).

PLS-DA. Briefly, PLS-DA is an iterative method that constructs H successive artificial (latent)components th = Xhah and uh = Yhbh for h = 1, .., H, where the hth component th (respectively uh)is a linear combination of the X (Y) features. H denotes the dimension of the PLS-DA model. Theweight coefficient vector ah (bh) is the loading vector that indicates the importance of each featureto define the component. For each dimension h = 1, . . . , H PLS-DA seeks to maximize

max(ah ,bh)

cov(Xhah, Yhbh), s.t. ||ah||2 = ||bh||2 = 1 (1)

where Xh, Yh are the residual (deflated) matrices extracted from each iterative linear regression(see Lê Cao et al. 2011 for more details). The PLS-DA model assigns to each sample i a pairof H scores (ti

h, uih) which effectively represents the projection of that sample into the X- or Y-

space spanned by those PLS components. As H << P, the projection space is small, allowing fordimension reduction as well as insightful sample plot representation. Note that the projection intothe Y-space is of no use for a Discriminant Analysis as PLS-DA.

Feature selection with sparse PLS-DA. We developed a sparse version of PLS-DA (Lê Cao et al.,2011) which includes an `1 penalisation (Tibshirani, 1996) on the loading vector ah to shrink somecoefficients to zero. Thus, for each dimension h = 1, .., H, sPLS-DA solves:

max(ah ,bh)

cov(Xhah, Yhbh), s.t. ||ah||2 = ||bh||2 = 1 and ||ah||1 ≤ λh (2)

where λh is a non negative parameter that controls the amount of shrinkage in ah. The componentscores th = Xhah are now defined on a small subset of features with non-zero coefficients, leading

7


https://doi.org/10.1101/108597



to feature selection that aims to optimally maximise the discrimination between the K outcomeclasses in Y.

Prediction. Once fitted, the (sparse) PLS-DA model can be applied on an external test set X ofsize (Ntest × P) to predict the class of new samples (see Lê Cao et al. 2011 for more details). Thepredict function outputs the predicted scores for each test sample. Since the predicted scores areexpressed as continuous values, a prediction distance must be applied to obtain the final predictedclass membership. Distances such as ‘maximum distance’, ‘Malhanobis distance’ and ‘Centroiddistance’ are provided (see Figure 3B1).

Choice of parameters. One important parameter to choose in PLS-DA and sPLS-DA is thenumber of components, or dimension of the model, called ncomp in mixOmics. While this parameterhas mostly been ignored from the PLS-DA literature it plays a crucial role to ensure maximalprediction accuracy. Our experience when analysing a large number of ‘omics data sets has shownthat ncomp = K-1 was usually sufficient to summarise most of the discriminatory informationfrom the data (Lê Cao et al., 2011; Shah et al., 2016). The second parameter to choose pertains tothe λh penalisation parameters for sPLS-DA, which was replaced by the number of features toselect on each component with the argument keepX.

The tune function performs repeated cross-validation (CV) for a user-input grid of keepX valuesto assess. The keepX parameter that leads to the best prediction accuracy of the model is reportedfor each component. Prediction accuracy is evaluated according to the overall classificationerror rate, or the Balanced Error Rate (BER) for unbalanced number of samples per class. Bothmeasures are calculated on the left-out samples set during the CV procedure, and averaged acrossthe repeated CV runs. The number of folds in CV depends on the number of samples N andcan be specified in the function, with a sufficient number of run (e.g. nrepeat = 50-100). Inthe case of small N, leave-one-out validation is advised and nrepeat is set to 1. Additionaloutputs from the tune function include 1/the stability of the selected features across all CV runs,which represents a useful measure of reproducibility of the molecular signature and 2/receiveroperating characteristic (ROC) curves and Area Under the Curve (AUC) averaged using one-vs-allcomparison if K > 2. Note however that ROC and AUC criteria may not be particularly insightfulas the prediction threshold in our methods is based on a specified distance as described earlier.

An additional option that we developed and implemented for all tune function in mixOmicsis to tune and fit a constraint model. The process is as follows: once the optimal keepX value ischosen on one component, the model is fitted with the specific keepX, and the resulting featureselection is then fixed for the tuning of the following component. In other words, the tuning isperformed on the optimal list of selected features (keepX.constraint) instead of the number offeatures (keepX). Such strategy was implemented in the sister package bootPLS and successfullyapplied in our recent integrative study Rohart et al. (2016b). Our experience has shown that thecontraint tuning and models improve the performance of the methods. We illustrate an examplein Suppl. V.

The tuning step must be conducted with caution to avoid overfitting results, as widelydescribed in the literature (see for example Ambroise and McLachlan 2002). Our tune procedureperforms repeated CV, reports the frequency of selected features across all repeated CV foldsand the classification error rate for each keepX value. Once ncomp and keepX for each componentare chosen, the final PLS-DA or sPLS-DA model is fitted on the whole data set and the finalperformance can be obtained with the perf function that also performs repeated CV (see Suppl.V).

8


https://doi.org/10.1101/108597



Extensions of PLS-DA for repeated measurements and 16S microbiome data. PLS-DA andsPLS-DA were extended to account for repeated measurement designs, as described in Liquetet al. (2012) by specifying the argument multilevel in the plsda and splsda functions. Recentextensions in the package include sPLS-DA analysis to identify microbial communities for 16Sdata with an additional logratio argument to account for compositional data in microbiomeexperiment (Lê Cao et al. 2016, see also our mixMC framework in www.mixOmics.org/mixMC).

Usage in mixOmics. Figure 3 illustrates the different graphical outputs obtained when analysinga single data set from unsupervised to supervised analyses. The data set analysed is a microarraydata set available from the mixOmics package investigating Small Round Blue Cell Tumors (SRBCT,Khan et al. 2001) of 63 tumour samples with the expression levels of 2,308 genes. Samples areclassified into four classes: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma(NB), and 20 rhabdomyosarcoma (RMS). The aim of this analysis is to assess similarities betweentumour types, using Principal Component Analysis (Fig. 3A), to classify the different tumoursubtypes with PLS-DA (Fig. 3B) and to identify a molecular gene signature discriminating thetumour types with sPLS-DA (Fig. 3C). The full pipeline, results interpretation and associated Rcode is available in Electronic Suppl. V.

IV. Integration of heterogeneous data sets with DIABLO

The integration of multiple ‘omics datasets measured on the same N biological samples (Figure 1)is based on a variant of the multivariate methodology Generalised Canonical Correlation Analysis(GCCA, Tenenhaus and Tenenhaus 2011; Tenenhaus et al. 2014), which, contrary to what its namesuggests, generalises PLS for N-integration. Our recent development DIABLO further improvedthe implementation of GCCA to include feature selection in a supervised framework and in auser-friendly manner (Günther et al., 2014; Singh et al., 2016).

Method. We denote Q ‘omics data sets X(1)(N× P1), X(2)(N× P2), ..., X(Q)(N× PQ) measuringthe expression levels of Pq ‘omics features on the same N biological samples, q = 1, . . . , Q. GCCAsolves for each component h = 1, . . . , H:

maxa(1)h ,...,a(Q)

h

Q

∑q,j=1,q 6=j

cq,j cov(X(q)h a(q)h , X(j)

h a(j)h ), s.t. ||a(q)h ||2 = 1 and ||a(q)h ||1 ≤ λ(q) (3)

where λ(q) is the penalisation parameter, a(q)h is the loading vector on component h associated

to the residual matrix X(q)h of the data set X(q), and C = {cq,j}q,j is the design matrix. C is a

Q× Q matrix of zeros and ones which specifies whether datasets should be correlated; zeroswhen datasets are not connected and ones where datasets are connected. Thus, it is possibleto constraint the model to only take into account specific pairwise covariances by setting thedesign matrix (see Tenenhaus et al. (2014) for more details). Such design thus enables to model aparticular association between pairs of ‘omics data, as expected from prior biological knowledgeor experimental design. DIABLO Discriminant Analysis in mixOmics extends (3) to a supervisedframework by replacing one data matrix X(q) with the outcome dummy matrix Y.

Prediction. DIABLO includes several predictions strategies such as a majority vote and a weightedvote. Both are based on the predictions obtained from each ’omics dataset via the X(q)

h a(q)hcomponents. The majority vote consists in assigning to a sample the class that has received the

9


www.mixOmics.org/mixMC

https://doi.org/10.1101/108597



0.0

0.1

0.2

0.3

0.4

0.5

0.6

Component

Cla

ssifi

catio

n er

ror r

ate

1 2 3 4 5 6 7 8 9 10

overallBER

max.distcentroids.distmahalanobis.dist

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100100 − Specificity (%)

Se

nsi

tivity

(%

)

Outcome

BL vs Other(s): 1

EWS vs Other(s): 1

NB vs Other(s): 0.8627

RMS vs Other(s): 0.8244

ROC Curve Comp 2

−6.36 0 6.36

Color key

g820

g177

2g4

30g1

991

g170

8g1

049

g566

g368

g225

3g1

954

g131

9g1

645

g107

4g1

327

g336

g545

g138

9g2

46g1

980

g779 g52

g129

8g3

48 g29

g803

g167

1g2

117

g36

g133

0g1

917

g979

g469

g109

3g2

047

g483

g603 g2

g129

g372

g189

6g1

74g2

046

g910

g208

3g1

030

g338

g215

9g1

353

g100

3g1

955

g191

1g1

194

g111

0g2

29g1

799

g971

g188

8g2

050

g220

3g8

19g1

775

g126

3g7

19g1

091

g192

4g1

158

g758

g160

6g3

35g8

46g1

386

g123

g585

g783

g836

g143

4g2

000

g975

g422

g976

g823

g879

g742

g219

9g4

17g1

804

g215

7g1

601

g236

g176

4g2

55 g3g1

776

g119

8g1

698

g695

g136

9g1

862

g182

9g1

460

g134

8g1

730

g225

8g4

00g1

007

g956

g157

9g5

97g2

241

g162

4

EWS.T3EWS.T6EWS.T19EWS.T14EWS.T7EWS.T15EWS.T9EWS.T11EWS.C6EWS.T2EWS.C9EWS.C8EWS.T1EWS.T12EWS.T4EWS.C11EWS.C10EWS.C4EWS.C3EWS.C7EWS.C1EWS.C2EWS.T13RMS.C9RMS.T10RMS.C3RMS.T6RMS.T11RMS.T3RMS.T7RMS.T5RMS.T8RMS.T4RMS.T1RMS.T2RMS.C2RMS.C4RMS.C5RMS.C7RMS.C6RMS.C11RMS.C8RMS.C10BL.C6BL.C3BL.C4BL.C7BL.C8BL.C5BL.C2BL.C1NB.C1NB.C7NB.C2NB.C3NB.C9NB.C10NB.C4NB.C11NB.C6NB.C8NB.C12NB.C5

g1389g246g1954g545g1319g1074g1327g2050g1645g2117g1772g1888g348g2253g566g368g1980g336g1708g29g36

g1298g1330g979g971g469g820g1093g1194g1671g52

g1799g1917g1110g779g229g1991g803g1049g430

−0.2 −0.1 0.0 0.1

Loadings on comp 2

OutcomeEWSBLNBRMS

g123

g846

g335

g836

g1606

g783

g758

g1386

g1158

g585

0.00 0.05 0.10 0.15 0.20 0.25 0.30

Loadings on comp 1

OutcomeEWSBLNBRMS

−2 0 2 4 6 8

−10

−50

5

X

Y

LegendEWSBLNBRMS

sPLS−DA on SRBCT

−10

−5

0

5

0 4 8X−variate 1: 5% expl. var

X−va

riate

2: 6

% e

xpl.

var

LegendEWSBLNBRMS

0.0

0.1

0.2

0.3

0.4

0.5

Comp 1

Features

Stability

g123

g846

g836

g335

g1606

g1386

g758

g783

g1327

g1389

g1954

g246

g545

g1319

g1074

g1387

g1772

g585

g1158

g1884

g589

g1645

g1980

g2253

g2050

g1295

g1915

g165

g1916

g336 g85

g1116

g1036

g2117

g607

g820

g153

g1894

g1932

g2144

g255 g74

g742

g937

g976

g1730

g2157

g348

g368 g52

g823

g879

g1003

g1601

g1708

g187

g1955

g2046

g276

g571

g719

g910

g1099

g1207

g174

g182

g1896

g1974

g422

g566

g1536

g2276

g36

g509

g998 g1

g1007

g1030

g1066

g1067

g1142

g1194

g1279

g1347

g1353

g1453

g1518

g166

g1662

g1723

g1758

g1804

g1822

g1862

g1911

g1949

g1962

g2127

g2159

g2162

g236 g29

g417

g430

g444

g552

g603

g655

g975

0.0

0.1

0.2

0.3

0.4

Comp 2

Features

Stability

g1003

g1954

g1389

g246

g1319

g1327

g545

g1074

g1955

g1645

g2050

g509

g2117

g187

g1980

g1207

g1772

g1884

g1194

g1888

g1911

g2253

g336

g348

g1708

g2046

g174

g368

g910

g566

g1896

g1298

g165 g29

g1030

g2159

g36

g1353

g166

g1723

g603

g971 g2

g338

g1932

g820

g979 g74

g1799

g1330

g229

g469

g758

g2083

g719

g1093

g1671

g129

g2047

g433

g1263

g1917

g52

g1110

g1991

g483

g1372

g552

g230

g779

g430

g1049

g2203

g481

g1055

g1196

g819

g1738

g1613

g1862

g803

g823

g1292

g1730

g753

g532

g1579

g400

g1198

g1606

g828

g976

g2258

g937

g123

g153

g2049

g2146

g255

g335

g437 g85

g1036

g1386

g1387

g1601

g1706

g2144

g2157

g422 g50

g585

g589

g742

g783

g836

g846

g1453

g1764

g1775

g1894

g571

g1158

g1536

g1829

g1915

g1916

g879

g1090

g1099

g1283

g1776

g182

g188

g2162

g2230

g842

g849

g1295

g1662

g975

g1804

g2240

g373

g607

g1116

g1159

g372

g812

g867

g1089

g1220

g1795

g1887

g1914

g276

g1279

g1624

g1735

g54

g597

g956

g1007

g1021

g119

g1375

g780

g1022

g1347

g1839

g236

g546

g941

g1876

g799

g1067

g1770

g444

g554

g365

g1974

g417

g707

g1746

g190

g2199

g602

g655

g2156

g744

g796

g1084

g1224

g268

g666 g67

g1301

g1621

g2099

g244

g378

g484

g555

g903

g1088

g1142

g1201

g1518

g1649

g1773

g1962

g657

g970

g1493

g1634

g169

g1924

g1964

g2279

g415

g449

g998

g2192

g510 g89

g1066

g107

g137

g1443

g1517

g1784

g1857

g2104

g2175

g2300

g504

g575

g695

g1734

g1822

g2136

g2241

g289

g604

g688

g912

g1068

g1070

g1105

g1122

g1246

g1286

g1522

g1587

g159

g1853

g1942

g1944

g2022

g2127

g2217

g2301

g256

g424

g756

g901

g965

g973

g1016

g1315

g1317

g1348

g1460

g1686

g1765

g1928

g1992

g2227

g2276

g442

g632

g639 g76 g1

g1017

g1042

g1206

g1215

g1217

g1291

g1309

g1312

g1318

g1325

g1378

g1392

g1416

g1486

g1496

g1700

g1873

g1956

g2209 g3

g315

g606

g636

g865

g972

g1002

g1187

g1203

g1221

g1226

g1273

g1302

g1351

g1370

g1384

g1405

g1424

g1436

g1479

g1489

g1524

g1548

g1565

g1608

g1655

g1710

g1716

g1751

g1909

g191

g1968

g1979

g2106

g2116

g2186

g264

g288

g384

g477

g499

g573

g590

g612

g667

g694

g733

g761

g794

g801

g832

g837

g873

g881

g891 g90

g906

g951

g1005

g1008

g1023

g1151

g1228

g1262

g1308

g1324

g139

g141

g142

g1434

g1445

g1457

g1458

g1462

g1490

g151

g1535

g1577

g1599

g1626

g1656

g1661

g1697

g1714

g1761

g1800

g1803

g1823

g1837

g1863

g1929

g1930

g1949

g1977

g1994

g2000

g2016

g2039

g205

g2053

g2080

g2081

g2114

g214

g2151

g217

g2207

g2247

g2289

g2295

g251

g258

g279 g37

g381

g388

g397

g450

g465

g490

g497

g522

g533

g558

g594

g615

g642

g643

g665

g714

g715

g736

g745

g747

g798 g80

g800

g808

g817

g861

g875

g932

g933 g94

g948

g982

0.0

0.2

0.4

0.6

0.8

Comp 3

Features

Stability

g742

g879

g255

g1804

g2157

g1601

g2199

g236

g417

g1764

g1003

g174

g1896

g1955

g603

g910

g1434

g2046

g2159

g975

g976

g2000

g422

g2144

g799

g1862

g1030

g1353

g1924

g1579

g823

g1776

g1084

g1829

g2083

g2258

g597

g1911

g153

g2047

g1263

g575

g483

g1348

g1698

g2241

g1066

g1662

g695

g819

g1088

g1007

g972

g1347

g372

g129

g1158

g123

g1295

g1386 g2

g442

g783

g836

g846

g1116

g1387

g1460

g1606

g1915

g1916

g335

g585

g589

g956

g2203

g276

g1196

g1914

g188

g849

g1198

g2156 g3

g1207

g842

g1091

g1536

g901

g1369

g1655

g325

g1730

g338

g509

g780

g1723

g187

g1624

g1944

g433

g1055

g1735

g400

g190

g2049

g2230

g758

g1036

g1309

g951 g85

g1389

g246

g545

g1327

g1974

g1980

g424

g1273

g1436

g229

g655

g1954

g554

g555

g1319

g1074

g1194

g1645

g1706

g1799

g1372

g1006

g1370

g2136

g54

g1283

g1375

g1775

g373

g1021

g336

g971

g719

g982

g380

g214

g1649

g1887

g2050

g1032

g1105

g1738

g1962

g1022

g1110

g1704

g571

g1392

g1484

g828

g1302

g1664

g2146

g532

g707

g2104

g1090

g1741

g606

g632

g1308

g1659

g2175

g566

g1330

g2117

g493 g50

g941

g1599

g1894

g36

g970

g1462

g1888

g1992

g586

g602

g1023

g1220

g25

g67

g753

g1151

g1576

g1932

g1979

g2131

g348

g469

g1286

g1656

g1909

g2228

g801

g840

g1017

g1768

g2192

g639

g1453

g1496

g845

g1217

g151

g1581

g1708

g2227

g230

g499

g1093

g1315

g142

g1538

g26

g289

g642

g666

g747

g1099

g1201

g1206

g1250

g166

g2042

g2162

g2303

g903

g933

g965

g988

g1215

g1493

g2040

g650

g733

g841

g881

g998

g1343

g1427

g1577

g1626

g1682

g191

g744

g1132

g1291

g141

g1531

g1597

g1634

g165

g169

g1714

g1884

g209

g2097

g2118

g2207

g2240

g453

g462

g501

g796 g1

g1008

g1067

g1279

g1292

g1384

g1414

g1443

g1497

g1548

g1621

g1770

g1772

g1839

g1881

g1917

g1930

g1942

g2127

g2186

g558

g667 g74

g795

g859

g867

g912

g937

g1016

g1076

g1089

g1422

g1673

g1746

g1784

g1823

g1882

g2116

g2209

g2231

g2276

g248 g46

g527

g1012

g1013

g105

g1054

g1072

g1298

g1488

g1535

g1594

g1628

g1671

g1710

g1803

g182

g1853

g2088

g2163

g2266

g262

g368

g444

g449

g474

g582

g607

g714

g731

g891

g922

g1004

g1301

g1335

g1498

g1694

g1759

g1762

g1795

g1850

g1868

g1883

g1945

g1957

g1994

g2039

g2086

g2089

g2114

g2213

g222

g2253

g2301

g294

g437

g481

g543

g544

g590

g761

g820

g873

g883

g905

g1002

g1042

g1061

g107

g108

g1112

g1157

g1187

g1224

g1261

g1264

g1266

g1268

g1287

g1318

g1324

g1341

g137

g1377

g1405

g1417

g1426

g1445

g1489

g1518

g1522

g1529

g1587

g1608

g1613

g1697

g1699

g17

g1702

g1734

g1742

g1797

g1867

g1901

g1920

g1929

g1931

g1935

g1949

g1991

g2009

g205

g2075

g2080

g2167

g2198

g2208

g2235

g2279

g2285

g2289

g242

g251

g252

g268 g29

g291

g378

g419

g430

g443 g45

g450

g455

g463

g536

g552

g615

g635

g645

g647

g653

g657

g665

g702 g77

g771

g788

g789

g794

g797

g812

g814

g857

g875 g90

PCA on SRBCT

−20

0

20

−20 −10 0 10 20PC1: 11% expl. var

PC2:

11%

exp

l. va

r

LegendEWSBLNBRMS

PLSDA on SRBCT

−20

−10

0

10

20

−20 0 20X−variate 1: 10% expl. var

X−va

riate

2: 6

% e

xpl.

var

LegendEWSBLNBRMS

1 2 3 4 5 6 7 8 9 10

Principal Components

Expl

aine

d Va

rianc

e

0.00

0.02

0.04

0.06

0.08

0.10

A1 A2 B1 B2

C3

C2

C6C1

C5

C4

Figure 3: Illustration of PLS-DA and sPLS-DA in mixOmics. A) Unsupervised preliminary analysis with PCA,A1: percentage of explained variance per component, A2: PCA sample plot. B) Supervised analysiswith PLS-DA, B1: PLS-DA sample plot with confidence ellipse plots, B2: classification performance percomponent (overall or BER) for three prediction distances using 50 * 5-fold cross-validation. C) Supervisedanalysis and feature selection with sparse PLS-DA, C1: sPLS-DA sample plot with confidence ellipseplots, C2: arrow plot representing each sample pointing towards its outcome category, C3: coefficient weightof the features selected on component 1 and component 2, with colour indicating the class with maximalmean expression value for each feature, C4: feature stability when evaluating the performance of a sPLS-DAmodel with 10, 40 and 60 features on the first three components (50 * 5-fold cross-validation), C5: ClusteredImage Map (Euclidian Distance, Complete linkage). Samples are represented in rows, selected features incolumns (10, 40 and 60 genes selected on each component respectively), C6: receiver operating characteristic(ROC) curve and Area Under the Curve (AUC) averaged using one-vs-all comparisons.

highest number of predictions over all ‘omics data set, while the weighted vote combines thepredictions of all ’omics after weighting each by the correlation between the component X(q)

h a(q)hand the outcome. In both strategies, a prediction distance is to be specified to obtain a predictedclass, as described in Section III. Ties are indicated as ‘NA’ in the predicted classes.

10


https://doi.org/10.1101/108597



Specific outputs to visualise multiple ‘omics data sets integration. Several types of graphicaloutputs are available to support interpretation of the statistical results. To represent samples,plotIndiv displays component scores from each ‘omics data set individually. Such type of plotenable to visualise the agreement between all data sets at the sample level. The plotArrow functionalso enables similar visualisation (see Section II). The function plotDiablo is a matrix scatterplotof the components from each data set for a given dimension; it enables to check whether thepairwise correlation between two ‘omics has been modelled according to the design. The functioncircosPlot shows pairwise correlations among the selected features over all data sets. Featuresare represented on the side of the circos plot, with colours indicating the type of data, and external(optional) lines display expression levels for each outcome category. It is an extension of themethod used in plotVar, cim and network (see González et al. 2012).

Parameters tuning. The first parameter to choose in DIABLO is the design matrix, which canbe specified based on either prior biological knowledge, or by using a preliminary multivariatemethod integrating two data sets at a time (e.g. PLS) to assess the potential common informationbetween data sets in an unsupervised analysis. In addition, the function plotDIABLO run on allfeatures (non sparse model) can further confirm the suitability of the design to maximise thecorrelations between data sets. By default the design links each data set to the outcome Y.Similar to PLS-DA, the number of components ncomp needs to be specified. We usually found thatK− 1 components were sufficient to discriminate the sample classes but this should be furtherassessed with the model performance and graphical outputs (see our example 4 and Suppl. V).Finally and most importantly, the number of features to select per data set and per componentneeds to be specified with the list argument keepX. The tune function evaluates the performanceof the model over a grid of different keepX parameters using repeated cross-validation, based onthe (balanced) classification error rate, with a parallelisation option (argument cpus). Note thatthis tuning step might become cumbersome as there might be numerous combinations to evaluate.A constraint tuning is also available, see Section III. Our experience shows that a minimal errorrate could be attained with a rather small number of features per component and data set (<20,Singh et al. 2016). However, the user can enlarge the search grid to ensure a sufficiently largenumber of selected features when the focus is on the downstream biological interpretation (e.g.enrichment analyses).

Usage in mixOmics. Figure 4 displays some of the graphical outputs when performing N-integration. The multi-‘omics breast cancer study analysed include mRNA (P1 = 200), miRNA(P2 = 184) and proteomics (P3 = 142) data that were normalised and drastically filtered forillustrative purpose in this manuscript. The data were divided into a training set composed ofN = 150 samples and an external test set of Ntest = 70 samples where the proteomics data aremissing (see details in Singh et al. 2016). The aim of N-integration is to identify a highly correlatedmulti-‘omics signature discriminating the breast cancer subgroups Luminal A, Her2 and Basal.Figure 4A displays the matrix design and the sample correlation between each component fromeach data set, B the sample plots for each data set, C our different feature plots and D a clusteredimage map of the multi-‘omics signature. The full pipeline, results interpretation and associated Rcode is available in Electronic Suppl. V.

11


https://doi.org/10.1101/108597



−0.87 −0.48 0.87

Color key

ZEB1

TAGLNARL4C

ASPM

KDM4B

ZNF552

FLI1

COL15A1

FUT8

PRNP

PRKCDBP

TCF4

AHR

COL6A1RAB3IL1

APBB1IP

IL1R1

CNN2

CCDC80

CXCL12

CCNA2

CTSK

JAM3

hsa−let−7c

hsa−mir−100

hsa−mir−127

hsa−mir−130b

hsa−mir−17

hsa−mir−199a−1hsa−mir−199a−2hsa−mir−199b

hsa−mir−20a

hsa−mir−379

hsa−mir−381

hsa−mir−505hsa−mir−590

hsa−mir−99a

AR

ASNS

Annexin_I

B−Raf

Bak

Caveolin−1

Claudin−7

Collagen_VI

Cyclin_B1

Cyclin_E1

E−Cadherin

ER−alpha

ER−alpha_pS118

GATA3

GSK3−alpha−beta

K−RasKu80

PKC−alpha

PKC−alpha_pS657

S6

Smad4

mTOR

Block: mrna Block: mirna

Block: prot

−5

0

5

10

−5.0

−2.5

0.0

2.5

5.0

7.5

−6

−3

0

3

6

−2 0 2 −2 0 2 4

−2 0 2 4variate 1

varia

te 2

LegendBasalHer2LumA

DIABLOA1 B

C1 C2

A2

mRNA

miRNA

protein

D

C3 C4

Index

mrna

X.label

Y.label

−3 −2 −1 0 1 2 3

X.label

Y.label

−3 −2 −1 0 1 2 3

−20

24

Index

0.83

Index

1 mirna

X.label

Y.label

−20

24

Index

0.9Index

1 0.76

Index

1 prot

Basal Her2 LumA

−2 0 1 2

Color key

Cycli

n_E1

hsa−

mir−

20a

hsa−

mir−

17Cy

clin_

B1CC

NA2

ASPM

ASNS

hsa−

mir−

505

hsa−

mir−

590

hsa−

mir−

130b

GAT

A3ER

−alp

haKD

M4B

ZNF5

52 ARFU

T8

A0ESA0EWA0EXA06PA133A0DSA0RHA0ROA0CSA18SA0JFA146A0AZA0I8A1ALA12AA0BQA0EIA0B0A0DVA0EAA0T6A0BMA0SHA0T7A0YLA1B1A0IKA0J5A0H7A0BPA0DPA0E1A0EUA0TZA0CDA12XA0ASA07ZA04AA0FDA12HA0CTA15RA0BSA140A0RMA12BA0SUA0W5A0XSA0X0A0IOA08TA0FSA0W4A1BDA18NA08AA1AVA09AA0XNA08ZA086A1APA1AKA0I9A0XWA18PA18FA1ATA12YA0RVA0RGA08OA0DKA03LA15EA1AUA12TA07IA15LA12PA04WA128A13ZA09GA12QA135A09XA0RXA152A0D1A0T1A0A7A12DA08XA0TXA1B0A0WXA0I2A12LA137A18RA0DAA150A131A0G0A0EQA0ARA0T2A0FLA04PA08LA0AVA0JLA12VA0ALA0RTA094A0EEA14PA1B6A14XA0U4A1AIA13EA0B3A124A0XUA0T0A07RA0E0A147A04UA0B9A0D2A0FJA04DA04TA0YMA1AZA0CMA0D0A0ATA143A0SXA1AYA0SKA0CE

RowsBasalHer2LumA

ColumnsmrnamirnaprotZN

F552

FUT8

KDM

4B

ZEB1

PLCD

3

CCNA

2

JAM3

ASPM

CTSK

TCF4

CXCL12

CCDC80

COL6A1

PRKCDBP

RAB3IL1

BOC

ARL4C

PRNP

COL15A1

IL1R1

FLI1

TAGLN

AHR

CNN2

APBB1IP

mrna

AR

GATA3

ER−alphaASNS

Cyclin_B1HER2Cyclin_E1

Caveolin−1

ER−alpha_pS118

Smad4

Pea−15

4E−BP1_pS65

ACC1S6

p27_

pT15

7

SCD1

LckBa

k

Colla

gen_

VI

BaxACC_p

S79PKC−alp

haSTAT3_pY705

Fibronectin

B−Raf

Annexin_IGSK3−alpha−beta

E−Cadherin

Ku80

mTOR

Src_pY527

Claudin−7

NF2

PKC−alpha_pS657

K−Ras

prot

hsa−mir−590

hsa−mir−130b

hsa−mir−17

hsa−mir−505

hsa−mir−20a

hsa−mir−379hsa−mir−127

hsa−mir−100hsa−mir−199a−1

hsa−mir−381

hsa−mir−199a−2

hsa−let−7c

hsa−mir−199b

hsa−mir−99amirna

CorrelationsPositive CorrelationNegative Correlation

Correlation cut−offr=0.7

Comp 1−2

CCDC80CXCL12PRNPBOC

TAGLNARL4CCOL6A1CNN2PLCD3CTSK

RAB3IL1TCF4AHRFLI1ZEB1IL1R1JAM3

PRKCDBPAPBB1IPCOL15A1

−0.4 −0.3 −0.2 −0.1 0.0

Contribution on comp 2Block 'mrna'


hsa−mir−199b


hsa−let−7c

hsa−mir−381

hsa−mir−100

hsa−mir−99a

hsa−mir−379

hsa−mir−127

−0.4 −0.3 −0.2 −0.1 0.0

Contribution on comp 2Block 'mirna'

Claudin−7Annexin_I

PKC−alpha_pS657Caveolin−1PKC−alpha

mTORNF2

ER−alpha_pS118STAT3_pY7054E−BP1_pS65E−Cadherin

ACC1Lck

Pea−15Collagen_VI

Smad4Ku80S6

HER2Src_pY527

BakGSK3−alpha−beta

B−RafK−Ras

FibronectinACC_pS79

SCD1Bax

p27_pT157Cyclin_B1

−0.4 −0.2 0.0 0.2 0.4

Contribution on comp 2Block 'prot'

−1.0 −0.5 0.0 0.5 1.0

−1.0

−0.5

0.0

0.5

1.0

Component 1

Com

pone

nt 2

Correlation Circle Plots

Figure 4: Illustration of DIABLO analysis in mixOmics. A1, design and A2, sample scatterplot from plotDiablodisplaying the first component in each data set (upper plot) and Pearson correlation between each component(lower plot). B: sample plot per data set (block), C) feature outputs, C1: Correlation Circle plot representingeach type of selected features, C2: relevance network visualisation of the selected features, C3: Circos plotshows the positive (negative) correlation (r > 0.7) between selected features as indicated by the brown (black)links, feature names appear in the quadrants, C4: coefficient weight of the features selected on component1 in each data set, with color indicating the class with a maximal mean expression value for each feature.D Clustered Image Map (Euclidian distance, Complete linkage) of the multi-omics signature. Samples arerepresented in rows, selected features on the first component in columns.

V. P-integration across independent data sets with MINT

The integration of independent data sets measured on the same common P features under similarconditions or treatments (Figure 1) is a useful approach to increase sample size and gain statisticalpower. In this context, the challenge is to accommodate for systematic differences that arise due todifferences between protocols, geographical sites or the use of different technological platforms togenerate the same type of ‘omics data (e.g. transcriptomics). The systematic unwanted variation,also called ‘batch-effect’, often acts as a strong confounder in the statistical analysis and may leadto spurious results and conclusions if it is not accounted for in the statistical model.

12


https://doi.org/10.1101/108597



Method. MINT (Rohart et al., 2016a) is an extension of the multi-group PLS framework (mg-PLS, Eslami et al. 2013b, 2014), where ‘groups’ represent independent studies, to a supervisedframework with feature selection. MINT seeks to identify a common projection space for allstudies that is defined on a small subset of discriminative features and that display an analogousdiscrimination of the samples across studies.

We combine M datasets denoted X(1)(N1 × P), X(2)(N2 × P), ..., X(M)(NM × P) measured onthe same P predictors but from independent studies, with N = ∑M

m=1 Nm. Each data set X(m),m = 1, . . . , M, has an associated dummy outcome Y(m) in which all K classes are represented. Wedenote X (N × P) and Y (N × K) the concatenation of all X(m) and Y(m) respectively. In the MINTparticular framework, each feature of the datasets X(m) and Y(m) is centered and scaled. For eachcomponent h, MINT solves :

maxah ,bh

M

∑m=1

Nm cov(X(m)h ah, Y(m)

h bh), s.t. ||ah||2 = 1 and ||ah||1 ≤ λ (4)

where ah and bh are the global loadings vectors common to all studies, t(m)h = X(m)

h ah and

u(m)h = Y(m)

h bh are the partial PLS-components that are study specific. Residual (deflated) matricesare calculated for each iteration of the algorithm based on the global components and loadingvectors (see Rohart et al. 2016a). Thus the MINT algorithm models the study structure during theintegration process. The penalisation parameter λ controls the amount of shrinkage and thus thenumber of non zero weights in the global loading vector a. Similarly to sPLS-DA (Section III) MINTselects a combination of features on each PLS-component.

Specific graphical outputs. The set of partial components t(m)h , h = 1, ..., H provides study-

specific outputs in plotIndiv. These graphics can act as a quality control step to detect studiesthat cluster outcome classes differently to other studies (i.e. ‘outlier’ studies). The functionplotLoadings displays the coefficients weights of the features globally selected by the model butrepresented individually in each study. Visualisation of the global loading vectors is also available.Note the projection into the Y-space is of not useful in MINT.

Parameters tuning. We take advantage of the independence between studies to evaluate the per-formance based on a novel CV technique called ‘Leave-One-Group-Out Cross-Validation’ (Rohartet al., 2016a). LOGOCV performs CV where each group m is left out once. The aim is to reflecta realistic prediction of independent external studies. The tune function implements LOGOCVto choose the optimal number of features keepX or the optimal set of features keepX.constraintto select in X, as described in the earlier Sections. Note that LOGOCV cannot be repeated (nonrepeat argument) as this type of cross-validation is not random.

Usage in mixOmics. Figure 5 displays some of the graphical outputs when performing P-integration with mixOmics. We combined four independent transcriptomics stem cell studiesthat measure the expression levels of 400 genes across 125 samples (cells). The data were nor-malised and drastically filtered for illustrative purpose in this manuscript. The cells were classifiedinto Fibroblasts, hESC and hiPSC. The aim of this P− integration analysis is to identify a robustmolecular signature across all studies to discriminate the three different cell types. After apply-ing MINT via the mint.plsda and mint.splsda functions, generic visualisations functions of themixOmics R-package like plotIndiv, cim and plotLoadings can be used. Figure 5 displays some

13


https://doi.org/10.1101/108597



outputs easily obtained by calls to those functions. Figure 5A displays a MINT PLS-DA sampleplot, Figure 5B the tuning and performance evaluation of the MINT sPLS-DA analysis and Figure5C the different sample and feature with MINT sPLS-DA. The full pipeline, results interpretationand associated R code is available in Electronic Suppl. V.

Global

−5

0

5

−2 0 2 4 6X−variate 1: 25% expl. var

X−va

riate

2: 5

% e

xpl.

var

LegendFibroblasthESChiPS

Study1234

MINT sPLS−DAStudy 1 Study 2

Study 3 Study 4

−4

−2

0

2

4

6

−5.0

−2.5

0.0

2.5

−2

0

2

−2

−1

0

1

2

−2 0 2 4 −2 0 2

0 2 4 6 −2 0 2 4X−variate 1

X−va

riate

2

MINT sPLS−DAstem cell study

−10

−5

0

5

10

−10 0 10 20X−variate 1: 26% expl. var

X−va

riate

2: 5

% e

xpl.

var

LegendFibroblasthESChiPS

Study1234

MINT PLS−DA

1 2 5 10 20 50 100

0.32

0.34

0.36

0.38

0.40

Number of selected genes

Bala

nced

erro

r rat

e

comp1comp1 to comp 2

A C1

B1C3

ENSG00000181449

ENSG00000123080

ENSG00000110721

ENSG00000176485

ENSG00000184697

ENSG00000102935

−20 0 20

Study 1

ENSG00000181449

ENSG00000123080

ENSG00000110721

ENSG00000176485

ENSG00000184697

ENSG00000102935

−60 −20 0 20 40 60

Study 2

ENSG00000181449

ENSG00000123080

ENSG00000110721

ENSG00000176485

ENSG00000184697

ENSG00000102935

−20 −10 0 10

Study 3

ENSG00000181449

ENSG00000123080

ENSG00000110721

ENSG00000176485

ENSG00000184697

ENSG00000102935

−15 −5 0 5 10

Study 4

Contribution on comp 1

BER Fib hESC hiPS

Comp1 0.50 0.00 0.92 0.57

Comp2 0.15 0.03 0.22 0.19

B2

C2

C4

−3.28 0 3.28

Color key

ENSG

0000

0102

935

ENSG

0000

0184

697

ENSG

0000

0110

721

ENSG

0000

0181

449

ENSG

0000

0176

485

ENSG

0000

0123

080

hiPShiPShiPShiPShESChiPShiPShiPShiPShiPShiPShESChESChiPShESChESChiPShESChESChiPShiPShiPShiPShESChESChiPShESChESChiPShESChESChESChiPShiPShiPShESChiPShiPShESChiPShiPShESChESChESChiPShiPShiPShiPShiPShiPShESChiPShiPShESChiPShiPShiPShiPShiPShiPShiPShiPShiPShESChiPShiPShiPShiPShiPShESChiPShESChiPShiPShESChESChESChESChESChESChiPShiPShESChESChESChiPShESChiPShiPShESChESChiPShiPShESChiPSFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblastFibroblast

Figure 5: Illustration of MINT analysis in mixOmics. A: Preliminary analysis with MINT PLS-DA (no featureselection), sample plot displays the sample cell types. B Parameter tuning and performance with MINTsPLS-DA, B1: BER (y-axis) with respect to number of selected features (x-axis) when 1 and 2 componentsare successively added in the model. Full diamond represents the optimal number of features to select on eachcomponent using Leave-One-Group-Out cross-validation and the maximum distance, B2: Final performanceof the MINT sPLS-DA model for a selection of 6 and 55 transcripts on each component: overall BER and errorrate per cell type with the maximum distance, C) MINT sPLS-DA graphical outputs using plotIndiv,cim and plotLoadings. C1: Global sample plot with confidence ellipse plots. C2: Study specific sampleplot. C3: Clustered Image Map (Euclidian Distance, Complete linkage). Samples are represented in rows,selected features on the first component in columns. C4: Coefficient weight of the features selected oncomponent 1 in each study, with color indicating the class with a maximal mean expression value for eachtranscript.

14


https://doi.org/10.1101/108597



Conclusions

The technological race in high-throughput biology lead to increasingly complex biological problemswhich require innovative statistical and analytical tools. Our package mixOmics focuses on dataexploration and data mining, that are crucial steps for a first understanding of large data sets. Inthis article we presented our latest methods to answer cutting-edge integrative and multivariatequestions in biology. In particular, our supervised frameworks DIABLO and MINT substantiallyextend the key PLS-DA method to perform N- and P- integration of multiple data sets, classificationand class prediction of external studies. Combined together, those two framework bear the promiseof NP-integration (combine multiple studies that each have several type of data). The sparseversion of our methods are particularly insightful to identify molecular signatures across thosemultiple data sets.

Feature selection resulting from our methods help to refine biological hypotheses, suggestdownstream analyses including statistical inference analyses, and may propose biological ex-perimental validations. Indeed, multivariate methods include appealing properties to mine andanalyse large and complex biological data, as they allow more relaxed assumptions about datadistribution, data size (n << p) and data range than univariate methods, and provide insightfulvisualisations. In addition, the identification of a combination of discriminative features meetbiological assumptions that cannot be addressed with univariate methods. Nonetheless, we believethat combining different types of statistical methods (univariate, multivariate, machine learning)is the key to answer complex biological questions. However, such questions must be well stated,in order for those exploratory integrative methods to provide meaningful results, and especiallyfor the non trivial case of multiple data integration.

We illustrated our different frameworks on classical ‘omics data, however, mixOmics methodscan also be applied to data beyond the realm of ‘omics as long as they are expressed as continuousvalues. Our future work will include extensive development for other types of data, such asgenotypic as well as time course biological data. Finally, while our manuscript focused mainly onsupervised methodologies, the package also include their unsupervised counterparts to investigaterelationships and associations between features with no prior phenotypic or response information.

Availability and requirements

The R package mixOmics is available from the CRAN (R Core Team, 2016), with tutorials andnewsletter updates available from our website www.mixOmics.org.

Conflict of Interest

The authors declare that they have no competing interests.

Availability of supporting data

The data sets supporting the results of this article are available from the mixOmics R package in aprocessed format. R scripts, full tutorials and reports to reproduce the results from the proposedframework are available as Sweave code from our website www.mixOmics.org.

15


www.mixOmics.org

www.mixOmics.org

https://doi.org/10.1101/108597



Author’s contributions

FR implemented the MINT method, FR, BG and AS implemented the DIABLO method, FR wasthe main developer of the mixOmics package from v6.0.0. KALC manages and supervises themixOmics project. FR and KALC edited the manuscript.

Acknowledgements

FR was supported, in part, by the Australian Cancer Research Foundation (ACRF) for the Dia-mantina Individualised Oncology Care Centre at The University of Queensland DiamantinaInstitute. KALC was supported, in part, by and the National Health and Medical Research Council(NHMRC) Career Development fellowship (APP1087415). The authors would like to thank thenumerous mixOmics users to continuously helping us improving the package.

References

Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis ofmicroarray gene-expression data. Proceedings of the national academy of sciences, 99(10):6562–6566.

Barker, M. and Rayens, W. (2003). Partial least squares for discrimination. Journal of chemometrics,17(3):166–173.

Boulesteix, A.-L. (2004). Pls dimension reduction for classification with microarray data. Statisticalapplications in genetics and molecular biology, 3(1):1–30.

Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: a versatile tool for the analysis ofhigh-dimensional genomic data. Brief. Bioinform., 8(1):32–44.

Cook, J. A., Chandramouli, G. V., Anver, M. R., Sowers, A. L., Thetford, A., Krausz, K. W., Gonzalez,F. J., Mitchell, J. B., and Patterson, A. D. (2016). Mass spectrometry–based metabolomics identifieslongitudinal urinary metabolite profiles predictive of radiation-induced cancer. Cancer research,76(6):1569–1577.

Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2013a). Multi-group pls regression:Application to epidemiology. In New Perspectives in Partial Least Squares and Related Methods,pages 243–255. Springer.

Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2013b). Multi-group PLS Regression:Application to Epidemiology. In New Perspectives in Partial Least Squares and Related Methods,pages 243–255. Springer.

Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2014). Algorithms for multi-group PLS. J.Chemometrics, 28(3):192–201.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software, 33(1):1.

González, I., Déjean, S., Martin, P. G., Baccini, A., et al. (2008). CCA: An R package to extendcanonical correlation analysis. Journal of Statistical Software, 23(12):1–14.

González, I., Lê Cao, K.-A., Davis, M. J., Déjean, S., et al. (2012). Visualising associations betweenpaired ’omics’ data sets. BioData mining, 5(1):19.

16


https://doi.org/10.1101/108597



Guidi, L., Chaffron, S., Bittner, L., Eveillard, D., Larhlimi, A., Roux, S., Darzi, Y., Audic, S., Berline,L., Brum, J. R., et al. (2016). Plankton networks driving carbon export in the oligotrophic ocean.Nature.

Günther, O. P., Chen, V., Freue, G. C., Balshaw, R. F., Tebbutt, S. J., Hollander, Z., Takhar, M.,McMaster, W. R., McManus, B. M., Keown, P. A., et al. (2012). A computational pipeline for thedevelopment of multi-marker bio-signature panels and ensemble classifiers. BMC bioinformatics,13(1):326.

Günther, O. P., Shin, H., Ng, R. T., McMaster, W. R., McManus, B. M., Keown, P. A., Tebbutt,S. J., and Lê Cao, K.-A. (2014). Novel multivariate methods for integration of genomics andproteomics data: applications in a kidney transplant rejection study. Omics: a journal of integrativebiology, 18(11):682–695.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M.,Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancersusing gene expression profiling and artificial neural networks. Nature medicine, 7(6):673–679.

Labus, J. S., Van Horn, J. D., Gupta, A., Alaverdyan, M., Torgerson, C., Ashe-McNalley, C.,Irimia, A., Hong, J.-Y., Naliboff, B., Tillisch, K., et al. (2015). Multivariate morphological brainsignatures predict patients with chronic abdominal pain from healthy control subjects. Pain,156(8):1545–1554.

Lê Cao, K., Rossouw, D., Robert-Granié, C., Besse, P., et al. (2008). A sparse PLS for variableselection when integrating omics data. Statistical applications in genetics and molecular biology,7:Article–35.

Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse PLS Discriminant Analysis: biologicallyrelevant feature selection and graphical displays for multiclass problems. BMC bioinformatics,12(1):253.

Lê Cao, K.-A., González, I., and Déjean, S. (2009a). integrOmics: an R package to unravelrelationships between two omics datasets. Bioinformatics, 25(21):2855–2856.

Lê Cao, K.-A., Lakis, V. A., Bartolo, F., Costello, M.-E., Chua, X.-Y., Brazeilles, R., and Rondeau, P.(2016). Mixmc: Multivariate insights into microbial communities. PloS one, 11(8):e0160169.

Lê Cao, K.-A., Martin, P. G., Robert-Granié, C., and Besse, P. (2009b). Sparse canonical methods forbiological data integration: application to a cross-platform study. BMC bioinformatics, 10(1):34.

Liquet, B., Lê Cao, K.-A., Hocini, H., and Thiébaut, R. (2012). A novel approach for biomarker selec-tion and the integration of repeated measures experiments from two assays. BMC bioinformatics,13:325.

Liu, Y., Devescovi, V., Chen, S., and Nardini, C. (2013). Multilevel omic data integration in cancercell lines: advanced annotation and emergent properties. BMC systems biology, 7(1):14.

Mahana, D., Trent, C. M., Kurtz, Z. D., Bokulich, N. A., Battaglia, T., Chung, J., Müller, C. L., Li, H.,Bonneau, R. A., and Blaser, M. J. (2016). Antibiotic perturbation of the murine gut microbiomeenhances the adiposity, insulin resistance, and liver disease associated with high-fat diet. Genomemedicine, 8(1):1.

17


https://doi.org/10.1101/108597



Meng, C., Zeleznik, O. A., Thallinger, G. G., Kuster, B., Gholami, A. M., and Culhane, A. C. (2016).Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings inbioinformatics, page bbv108.

Nguyen, D. V. and Rocke, D. M. (2002a). Multi-class cancer classification via partial least squareswith gene expression profiles. Bioinformatics, 18(9):1216–1226.

Nguyen, D. V. and Rocke, D. M. (2002b). Tumor classification by partial least squares usingmicroarray gene expression data. Bioinformatics, 18(1):39–50.

R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation forStatistical Computing, Vienna, Austria.

Ramanan, D., Bowcutt, R., Lee, S. C., San Tang, M., Kurtz, Z. D., Ding, Y., Honda, K., Gause, W. C.,Blaser, M. J., Bonneau, R. A., et al. (2016). Helminth infection promotes colonization resistancevia type 2 immunity. Science, 352(6285):608–612.

Rohart, F., Eslami, A., Matigian, N., Bougeard, S., and Le Cao, K.-A. (2016a). Mint: A multi-variate integrative method to identify reproducible molecular signatures across independentexperiments and platforms. bioRxiv, page 070813.

Rohart, F., Mason, E. A., Matigian, N., Mosbergen, R., Korn, O., Chen, T., Butcher, S., Patel, J.,Atkinson, K., Khosrotehrani, K., Fisk, N. M., Lê Cao, K., and Wells, C. A. (2016b). A molecularclassification of human mesenchymal stromal cells. PeerJ, 4:e1845.

Rollero, S., Mouret, J.-R., Sanchez, I., Camarasa, C., Ortiz-Julien, A., Sablayrolles, J.-M., and Dequin,S. (2016). Key role of lipid management in nitrogen and aroma metabolism in an evolved wineyeast strain. Microbial cell factories, 15(1):1.

Shah, A. K., Lê Cao, K.-A., Choi, E., Chen, D., Gautier, B., Nancarrow, D., Whiteman, D. C.,Baker, P. R., Clauser, K. R., Chalkley, R. J., et al. (2016). Glyco-centric lectin magnetic bead array(lemba)- proteomics dataset of human serum samples from healthy, barrettŒs s esophagus andesophageal adenocarcinoma individuals. Data in Brief, 7:1058–1062.

Singh, A., Gautier, B., Shannon, C. P., Vacher, M., Rohart, F., Tebutt, S. J., and Le Cao, K.-A. (2016).Diablo-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv,page 067611.

Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.-A., Grill, J., and Frouin, V. (2014). Variableselection for generalized canonical correlation analysis. Biostatistics, page kxu001.

Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis.Psychometrika, 76(2):257–284.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal StatisticalSociety. Series B (Methodological), pages 267–288.

Witten, D., Tibshirani, R., Gross, S., and Narasimhan, B. (2013). PMA: Penalized MultivariateAnalysis. R package version 1.0.9.

Wold, H. (1966). Estimation of principal components and related models by iterative least squares.J. Multivar. Anal., pages 391–420.

Wold, H. (1975). Path models with latent variables: The NIPALS approach. Acad. Press.

18


https://doi.org/10.1101/108597



Yao, F., Coquery, J., and Lê Cao, K.-A. (2012). Independent Principal Component Analysis forbiologically meaningful dimension reduction of large biological data sets. BMC bioinformatics,13(1):24.

Supplementary Material

functions PLS-DA sPLS-DA DIABLO sparseDIABLO MINT sparseMINT

functioncall plsda splsda block.plsda block.splsda mint.plsda mint.splsda

parameters ncomp ncompkeepX

designncomp

designncompkeepX

ncomp ncompkeepX

performance

tune, plot.tune

✓ ✓ ✓

perf, plot.perf

✓ ✓ ✓ ✓ ✓ ✓

auroc ✓ ✓ ✓ ✓ ✓ ✓

sample plot

plotIndiv ✓ ✓ ✓ ✓ ✓ ✓

plotArrow ✓ ✓ ✓ ✓ ✓ ✓

plotDiablo ✓ ✓

variableplot

plotVar ✓ ✓ ✓ ✓ ✓ ✓

plotLoadings ✓ ✓ ✓ ✓ ✓ ✓

circosPlot ✓ ✓

cim ✓ ✓ ✓ ✓ ✓ ✓

network ✓ ✓ ✓ ✓ ✓ ✓

variablelist selectVar ✓ ✓ ✓ ✓ ✓ ✓

Figure S1: List of the main mixOmics functions for supervised analyses.

Electronic Supporting Information

Sweave and R code for PLS-DA analysis are available on our website at this link.

19


https://doi.org/10.1101/108597



Sweave and R code for DIABLO analysis are available on our website at this link.Sweave and R code for MINT analysis are available on our website at this link.

20


https://doi.org/10.1101/108597


mixOmics: an R package for ‘omics feature selection and ... · Integrative Multivariate ‘omics...

Documents

Transcript of mixOmics: an R package for ‘omics feature selection and ... · Integrative Multivariate ‘omics...