Kim-Anh Lê Cao: Multivariate models for dimension reduction and biomarker selection in omics data

Intro PCA: the workhorse Variable selection for single ... and multiple ’omics Conclu
Multivariate models for dimension reduction and biomarker selection in omics data

Kim-Anh Lê Cao
The University of Queensland Diamantina Institute, Brisbane, Australia

2014 Winter School in Mathematical & Computational Biology
Challenges
Close interaction between statisticians, bioinformaticians and molecular biologists:
- Understand the biological problem
- Try to answer complex biological questions
- Irrelevant (noisy) variables
- n << p and n very small → limited statistical validation
- Biological validation needed
- Keep up with new technologies
- Anticipate computational issues
Multivariate analysis
Linear multivariate approaches
Linear multivariate approaches use latent variables (i.e. variables that are not directly observed) to reduce the dimensionality of the data. A large number of observable variables are aggregated in a model to represent an underlying concept, making it easier to understand the data.

Dimension reduction → project the data into a smaller subspace:
- To handle multicollinear, irrelevant or missing variables
- To capture experimental and biological variation
Outline of this talk and research questions
1. Principal Component Analysis: the workhorse for linear multivariate statistical analysis; detection and correction of batch effects; variable selection
2. One data set: looking for a linear combination of multiple biomarkers from a single experiment; improving the prediction of the model
3. Integration of multiple data sets: looking for a linear combination of multiple 'omics biomarkers from multiple 'omics experiments; improving the prediction of the model
Principal Component Analysis
PCA, the workhorse for linear multivariate statistical analysis, is an (almost) compulsory first step in exploratory data analysis to:
- Understand the underlying data structure
- Identify bias, experimental errors, batch effects

Insightful graphical outputs:
- to visualise the samples in a smaller subspace (component 'scores')
- to visualise the relationships between variables ... and samples (correlation circles)
Maximize the variance

In PCA, we think of the variance as the 'information' contained in the data. We replace the original variables by artificial variables (principal components) which explain as much information as possible from the original data. Toy example with data (n = 3 × p = 3):
Cov(data) =
    [ 1.3437  0.1601  0.1864
      0.1601  0.6192  0.1266
      0.1864  0.1266  1.4855 ]

Total variance: 1.3437 + 0.6192 + 1.4855 = 3.4484

Replace the data by the principal components, which are orthogonal to each other (covariance = 0):

Cov(pc1, pc2, pc3) =
    [ 1.6513  0       0
      0       1.2202  0
      0       0       0.5769 ]

Total variance: 1.6513 + 1.2202 + 0.5769 = 3.4484
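The trace equality in this toy example (PCA preserves the total variance) can be checked numerically. A minimal sketch; the matrix values are transcribed from the slide, with the off-diagonal entries symmetrized:

```python
import numpy as np

# covariance matrix of the toy data (off-diagonals symmetrized)
S = np.array([[1.3437, 0.1601, 0.1864],
              [0.1601, 0.6192, 0.1266],
              [0.1864, 0.1266, 1.4855]])

eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]  # PC variances, decreasing
total_before = np.trace(S)                      # sum of the original variances
total_after = eigvals.sum()                     # sum of the PC variances
# rotating onto the PCs redistributes the variance but preserves the total
```

Rotating onto the principal axes only redistributes the variance across components; the trace (3.4484) is invariant.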
PCA as an iterative algorithm
1. Seek the first PC for which the variance is maximised
2. Project the data points onto the first PC
3. Seek the second PC that is orthogonal to the first one
4. Project the residual data points onto the second PC
5. And so on until we explain enough variance in the data
[Figure: two scatter plots, 'Seek max. variance' (axes x1, x2) and 'Project into new subspace' (axes z1, z2)]
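The iterative procedure above can be sketched as a power iteration on the centred data (a minimal sketch of the first component only, not the full NIPALS algorithm; data and names are mine):

```python
import numpy as np

def first_pc_direction(X, n_iter=200):
    """Power iteration for the leading loading vector (first PC direction)."""
    Xc = X - X.mean(axis=0)               # centre the variables
    v = np.ones(Xc.shape[1])
    for _ in range(n_iter):
        v = Xc.T @ (Xc @ v)               # apply X'X repeatedly
        v /= np.linalg.norm(v)            # renormalise at each step
    return v

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 6)) * np.array([3, 1, 1, 1, 1, 1])  # one dominant axis
v1 = first_pc_direction(X)

# agrees with the leading SVD loading vector, up to sign
_, _, Vt = np.linalg.svd(X - X.mean(axis=0))
```

Subsequent components are obtained the same way after deflation (subtracting the part of X explained by the previous component).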
A linear combination of variables

Seek the best directions in the data that account for most of the variability. Objective function:

max_{||v|| = 1} var(Xv)

Each principal component u is a linear combination of the original variables:

u = v1 x1 + v2 x2 + · · · + vp xp

- X is an n × p data matrix with {x1, . . . , xp} the p variable profiles
- u is the first principal component, with maximal variance
- {v1, . . . , vp} are the weights in the linear combination
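This objective can be checked empirically: the leading SVD loading vector achieves a higher variance than any random unit-norm weight vector (a sketch on random data; names are mine):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(25, 4))
Xc = X - X.mean(axis=0)                 # centre the variables

_, _, Vt = np.linalg.svd(Xc)
v1 = Vt[0]                              # optimal weights {v_1, ..., v_p}
best = (Xc @ v1).var(ddof=1)            # variance of u = Xv, the first PC

rand_vars = []                          # variances along random unit vectors
for _ in range(500):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    rand_vars.append((Xc @ w).var(ddof=1))
# max(rand_vars) never exceeds best: v1 maximises var(Xv) over ||v|| = 1
```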
The data are projected into a smaller subspace
The principal components are orthogonal to each other to ensure that no redundant information is extracted. The new PCs form a smaller subspace of dimension << p.

Each value in a principal component corresponds to a score for a sample:
→ this is how we project each sample into the new subspace spanned by the principal components
→ approximate representation of the data points in a lower dimensional space
→ summarizes the information related to the variance
Computing PCA
Several ways of solving PCA:
- Eigenvalue decomposition: the old way, does not work well in high dimension. Sv = λv; u = Xv, where S is the variance-covariance matrix (or the correlation matrix if X is scaled)
- NIPALS algorithm: slow to compute, but handles missing values and enables a least squares regression framework (useful for variable selection)
- Singular Value Decomposition (SVD): the easiest and fastest, implemented in most software
PCA is simply a matrix decomposition
Singular Value Decomposition: X = U Λ V^T
- Λ: diagonal matrix with entries √λ_h
- U Λ contains the principal components u_h
- V contains the loading vectors v_h
- h = 1, . . . , H

The variance of principal component u_1 equals its associated eigenvalue λ_1, etc. The eigenvalues λ_h are decreasing.
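The decomposition maps directly onto a numerical SVD (a sketch with numpy on random data; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
Xc = X - X.mean(axis=0)                 # PCA operates on centred columns

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                          # principal components u_h (= U Lambda)
loadings = Vt.T                         # loading vectors v_h as columns
eigvals = s**2 / (X.shape[0] - 1)       # eigenvalues lambda_h, decreasing
# the variance of the h-th PC equals its eigenvalue lambda_h
```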
Tuning PCA

How many principal components should we choose to summarize most of the information? Note: we can have as many components as the rank of the matrix X.

- Look at the proportion of explained variance
- Look at the screeplot of eigenvalues. Any elbow?
- Look at the sample plot. Does it make sense?
- Some statistical tests exist to estimate the 'intrinsic' dimension

Proportion of explained variance for the first 8 principal components:

PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
0.5876  0.1587  0.0996  0.0848  0.0159  0.0098  0.0052  0.0051
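The explained-variance rule can be automated (a sketch with scikit-learn on synthetic data; the 90% cut-off is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# synthetic data with columns of decreasing variance
X = rng.normal(size=(30, 8)) * np.array([5, 3, 2, 1, 0.5, 0.4, 0.3, 0.2])

pca = PCA().fit(X)
prop = pca.explained_variance_ratio_          # proportion per component
cum = np.cumsum(prop)
n_keep = int(np.searchsorted(cum, 0.90)) + 1  # smallest H reaching 90%
```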
PCA is useful to visualise batch effects
'Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study.

For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments, or if two different lots of reagents, chips or instruments were used.'

(Leek et al. (2010), Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet.)
Cell lines case study (www.stemformatics.org, AIBN)

Five microarray data set studies including n = 82 samples, p = 9K genes and K = 3 groups of cell lines: 12 fibroblasts, 22 hPSC and 48 hiPSC. This is a complex case of cross-platform comparison where we expect to see a batch effect (the study).
PCA in action

What we would really like to see is a separation between the different types of cell lines. PCA is a good exploratory method to highlight the sources of variation in the data and to reveal batch effects if present.
PCA to visualise batch effect correction
Remove known batch effects:
- ComBat: uses an empirical Bayes framework
- linear (mixed) models

Remove unknown batch effects:
- sva: estimates surrogate variables for unknown sources of variation
- YuGene: works directly on the distribution of each sample

ComBat: Johnson WE. et al. (2007), Biostatistics; sva: Leek JT and Storey JD (2007), PLoS Genetics; YuGene: Lê Cao K-A., Rohart F. et al. (2014), Genomics.
Summary on PCA
- PCA is a matrix decomposition technique that allows dimension reduction
- Run a PCA first to understand the sources of variation in your data
- Note: I illustrated the batch effect on the very topical issue of meta-analysis, but batch effects happen very often! → What did Alec say? Experimental design is important!
- If there is a clear batch effect: try to remove or accommodate it
- If there is no clear separation between your groups:
  → the biological question may not be related to the highest variance in the data; try ICA instead
  → often due to a large number of noisy variables that need to be removed
Dealing with high dimensional data
The curse of dimensionality

Fact: in high-throughput experiments, many variables are noisy or irrelevant. They make a PCA analysis difficult to interpret.

→ clearer signal if some of the variable weights {v1, . . . , vp} are set to 0 for the 'irrelevant' variables:

u = 0 * x1 + v2 x2 + · · · + 0 * xp

Important weights = important contribution in defining the principal components.
sparse PCA
The curse of dimensionality
Since the '90s, many statisticians have been trying to address the issue of high (and now 'ultra high') dimension.

Let's face it, classical statistics cannot cope!

- Univariate analysis: multiple testing to address the number of false positives
- Machine learning approaches
- Parsimonious statistical models to make valid inferences → variable selection
sparse Principal Component Analysis

Aim: sparse loading vectors that remove the irrelevant variables which determine the principal components. The idea is to apply LASSO (least absolute shrinkage and selection operator, Tibshirani, R. (1996)) or some soft-thresholding approach in a PCA framework (a least squares regression framework exploiting the close link between SVD and low rank matrix approximations).

→ obtain sparse loading vectors, with very few non-zero elements
→ perform internal variable selection

Shen, H., Huang, J.Z. (2008), Sparse principal component analysis via regularized low rank matrix approximation, J. Multivariate Analysis.
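A minimal sketch of the idea with scikit-learn's SparsePCA, which penalizes the loadings with an L1 term. This is a different implementation from the Shen & Huang approach cited above and is used only to illustrate sparse loadings; the data and the penalty alpha are my choices:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(5)
signal = rng.normal(size=(50, 1))
X = np.hstack([signal + 0.1 * rng.normal(size=(50, 5)),   # 5 informative variables
               rng.normal(size=(50, 45))])                # 45 irrelevant variables

spca = SparsePCA(n_components=1, alpha=1, random_state=0).fit(X)
v = spca.components_[0]                   # sparse loading vector
n_nonzero = np.count_nonzero(v)           # the L1 penalty zeroes many weights
```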
Parameters to tune
1. The number of principal components (as seen earlier)
2. The number of variables to select → tricky!

At this point we can already sense that separating hPSC vs hiPSC is going to be a challenge.
Correlation circles help to understand the relationships:
- between variables (correlation structure)
- between variables and samples

[Figure: sample plot and correlation circle of the variables]
Linear Discriminant analysis
Definition
Linear Discriminant Analysis seeks a linear combination of features which characterizes or separates two or more classes of objects or events.

The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction prior to classification.
Fisher’s Linear Discriminant
We can decompose the total variation into V = B + W
(W the Within-groups and B the Between-groups variance)

The problem to solve is:

max_a (a' B a) / (a' W a)

→ maximise the Between-groups variance
→ minimise the Within-groups variance

[Image source: http://pvermees.andropov.org/]

Number of components to choose: ≤ min(p, K − 1), where K = number of classes.

Prediction of a new observation x is based on the discriminant score.
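A quick sketch of LDA and its at-most K − 1 components with scikit-learn (toy data; the class means are my choices):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
K = 3                                     # number of classes
X = np.vstack([rng.normal(loc=m, size=(20, 5)) for m in (0.0, 2.0, 4.0)])
y = np.repeat(np.arange(K), 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)    # discriminant scores: min(p, K - 1) = 2 axes here
acc = lda.score(X, y)        # prediction is based on the discriminant score
```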
Pros and Cons of LDA
- LDA has often been shown to produce very good classification results
- Assumptions: equal variance in each group; independent variables normally distributed; independent variables linearly related
- But with too many correlated predictors, LDA does not work so well

→ regularize or introduce sparsity, using a cousin of LDA called Partial Least Squares Discriminant Analysis
PLS Discriminant Analysis

Partial Least Squares Discriminant Analysis seeks the PLS components from X which best explain the outcome Y.

Objective function, based on the general PLS:

max_{||u|| = 1, ||v|| = 1} cov(Xu, Yv)

Y is the qualitative response matrix (dummy block matrix).
A linear combination of biomarkers
sparse PLS-DA

Include soft-thresholding or LASSO penalization on the loading vectors (v1, . . . , vp), similar to sPCA.

Parameters to tune:
- Number of components, often equal to K − 1
- Number of variables to select on each dimension → classification error, sensitivity, specificity

Lê Cao K-A., Boitard S. and Besse P. (2011), Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems, BMC Bioinformatics, 12:253.
What is k-fold cross-validation?
→ estimate how accurately a predictive model will perform in practice

- Partition the data into k equal-size subsets
- Train a classifier on k − 1 subsets
- Test the classifier on the kth remaining subset
- Estimate the classification error rate
- Average across the k folds
- Repeat the process many times

Be careful of selection bias when performing variable selection during the process!
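The selection-bias warning can be made concrete: feature selection must sit inside the cross-validation loop, e.g. in a pipeline, so that each fold selects its own variables (a sketch with scikit-learn on pure-noise data; LDA and k = 20 are my choices):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 500))            # pure noise, p >> n
y = np.repeat([0, 1], 20)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# CORRECT: selection happens inside each training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LinearDiscriminantAnalysis())])
acc = cross_val_score(pipe, X, y, cv=cv).mean()   # near chance level on noise

# BIASED: selecting on the full data first leaks the test labels,
# which typically inflates the estimated accuracy on this same noise
X_leak = SelectKBest(f_classif, k=20).fit_transform(X, y)
acc_biased = cross_val_score(LinearDiscriminantAnalysis(),
                             X_leak, y, cv=cv).mean()
```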
Example: transcriptomics data
Kidney transplant study (PROOF Centre, UBC)
Genomics assay (p = 27K) of 40 patients with a kidney transplant, with acute rejection (AR, n1 = 20) or no rejection (NR, n2 = 20) of the transplant.

Training step:
- 5-fold cross-validation (repeated 100 times) on a training set of n_train = 26
- Selection of 90 probe sets
Testing step:
[Figure: sPLS-DA sample plot (Comp 1 vs Comp 2, Genomics) showing training and test samples for the AR and NR groups]
- Training performed: model with 90 probes
- Genes mostly related to inflammation and innate immune responses
- External test set of n_test = 14
- Project the test samples onto the training set subspace of sPLS-DA

Classifier   # probes    Error rate   Sensitivity   Specificity   AUC
sPLS-DA      90          0.14         0.71          1             0.90
PLS-DA       27K (all)   0.21         0.57          1             0.82
Oesophageal cancer study (UQDI)

Proteomics assay (p = 129) of 40 patients with benign Barrett's oesophagus (n1 = 20) or oesophageal adenocarcinoma (n2 = 20). Aim: develop blood tests for detection and personalised treatment.

- sPLS-DA model and selection of 9 proteins
- Individual protein biomarkers give a poor performance (AUC 0.46-0.79)
- A combined signature from the sPLS-DA model improves the AUC up to 0.94
[Figure: ROC curves (sensitivity vs specificity) for the combined markers and the individual proteins P1-P8]
Acknowledgements: Benoit Gautier
Biomarker discovery when integrating multiple data sets
- Model relationships between different data sets
- Select relevant biological entities which are correlated across the different data sets
- Use multivariate integrative approaches: Partial Least Squares regression (PLS) and Canonical Correlation Analysis (CCA) variants
- Same idea as before: maximize the covariance between latent components associated to 2 data sets at a time
sparse Generalised Canonical Correlation Analysis
sGCCA generalizes sPLS to more than 2 data sets and maximizes the sum of pairwise covariances between two components at a time:

max_{a_1, ..., a_J}  Σ_{j,k = 1, j ≠ k}^J  c_jk g(Cov(X_j a_j, X_k a_k))

s.t. τ_j ||a_j||² + (1 − τ_j) Var(X_j a_j) = 1, for j = 1, . . . , J,

where the a_j are the loading vectors associated to each block j, and g(x) = x. In sGCCA, variable selection is performed on each data set.

Tenenhaus, A., Philippe, C., Guillemot, V., Lê Cao K-A., Grill, J., Frouin, V. (2014), Variable selection for generalized canonical correlation analysis, Biostatistics.
Example: asthma study
Asthma study (UBC, Vancouver)

Multiple assay study on asthma patients pre (n1 = 14) and post (n2 = 14) challenge. Three data sets: complete blood count (p1 = 9), mRNA (p2 = 10,863) and metabolites (p3 = 146).

Aims:
- Identify a multi-omic biomarker signature to differentiate the asthmatic response
- Can we improve the prediction with biomarkers of different types?
Acknowledgements: Amrit Singh
Methods: the sGCCA component scores predict the class of the samples per data set.
→ Use the average score to get an 'Ensemble' prediction

Results: the 'Ensemble' prediction outperforms each individual data set's prediction.
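The 'average score' ensemble can be sketched generically: fit one classifier per data set, average the class scores, then take the largest (a sketch only; per-block LDA stands in for the sGCCA-based prediction, and the blocks are synthetic):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)
y = np.repeat([0, 1], 14)                 # pre / post challenge
# synthetic stand-ins for the cells / mRNA / metabolites blocks
blocks = [np.vstack([rng.normal(0.0, 1.0, size=(14, p)),
                     rng.normal(1.0, 1.0, size=(14, p))]) for p in (9, 60, 30)]

# one class-probability score per block, then average across blocks
probs = [LinearDiscriminantAnalysis().fit(Xb, y).predict_proba(Xb)
         for Xb in blocks]
ensemble = np.mean(probs, axis=0).argmax(axis=1)   # 'Ensemble' prediction
acc = (ensemble == y).mean()
```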
Results: selection of 4 cell types, 30 genes and 16 metabolites, some of which are very relevant to asthma.

Cells: relative neutrophils; relative eosinophils; T reg; T cells
Genes: [list shown on slide]
It is all about mixOmics
mixOmics is an R package implementing multivariate methods such as PCA, ICA, PLS, PLS-DA, CCA (work in progress), the sparse variants we have developed (and much more).

[Screenshots: website with tutorial, and web interface]
Conclusions
To put it in a nutshell
Multivariate linear methods enable us to answer a wide range of biological questions:
- data exploration
- classification
- integration of multiple data sets
- least squares framework for variable selection

Future of mixOmics:
- Time course modelling
- Meta-analysis / multi-group analysis
- Several workshops coming up! (Toulouse 2014, Brisbane 2015, Auckland 2015)
Acknowledgements
mixOmics development:
- Sébastien Déjean, Univ. Toulouse
- Ignacio González, Univ. Toulouse
- Xin Yi Chua, QFAB
- Benoit Gautier, UQDI

Data providers:
- Oliver Günther, Kidney PROOF, Vancouver
- Michelle Hill, EAC, UQDI
- Alok Shah, EAC, UQDI
- Christine Wells, Stemformatics, AIBN / Univ. Glasgow

Methods development:
- Amrit Singh, UBC, Vancouver
- Florian Rohart, AIBN, UQ
- Jasmin Straube, QFAB
- Kathy Ruggiero, Univ. Auckland
- Christèle Robert, INRA Toulouse
Questions?
Statistics for frightened bioresearchers
A short course organised by UQDI and IMB

You are invited to a short course in biostatistics. Four presentations over four months will cover:
1. Summarizing and presenting data
2. Hypothesis testing and statistical inference
3. Comparing two groups
4. Comparing more than two groups

With a focus on real world examples, notes and resources you can take home, and quick exercises to test your understanding of the material, this short course will cover many of the essential elements of biostatistics and problems that commonly arise in analysis in the lab.

10.30am-12pm, Mondays: 28 July, 1 September, 29 September, 3 November
> Room 2007, Translational Research Institute, 37 Kent Street, Princess Alexandra Hospital, Woolloongabba, QLD 4102 *
> Large Seminar Room, Institute for Molecular Bioscience, Queensland Biosciences Precinct, The University of Queensland, St Lucia Campus *
* Presentations will be given simultaneously live at UQDI-TRI and remotely at IMB

Further information and free registration: www.di.uq.edu.au/statistics
Please forward on to colleagues who may be interested in attending.

Enquiries:
Dr Kim-Anh Lê Cao (DI), Tel: 3343 7069, Email: [email protected]
Dr Nick Hamilton (IMB), Tel: 3346 2033, Email: [email protected]

IMB Career Advantage 2014

k.l[email protected]
http://www.math.univ-toulouse.fr/~biostat/mixOmics