Cox model in high dimensional
and low sample size settings
Philippe BASTIEN
L’Oréal Recherche, Aulnay, France ([email protected])
Introduction
• 2001 : The PLS Cox regression was proposed at the “Second International Symposium on PLS” as an alternative to the Cox regression model for censored data in the presence of highly-correlated covariates.
• Since 2001, numerous developments have appeared in the analysis of censored survival data with high-dimensional covariates, mainly due to the growing interest of researchers in pharmacogenomics in developing, based on gene expression in cancer tissues, molecular predictors of a patient's time to clinical outcome, such as time to cancer recurrence or death after treatment.
• It allows the discovery of new markers and new therapeutic targets and opens the way to more subject-specific treatments with greater efficacy and safety.
• This lecture aims to present an overview of the latest developments in this domain.
Bastien P. and Tenenhaus M., 2001. Proceedings of the PLS'01 International Symposium, 131-140.
The Cox proportional hazard model
• The model assumes the following hazard function for the occurrence of an event at time t in the presence of censoring:

  λ(t) = λ₀(t) exp(β′X)

• Cox's partial likelihood can be written as:

  PL(β) = ∏_{k∈D} [ exp(β′x_k) / Σ_{j∈R_k} exp(β′x_j) ]

  where D is the set of event times and R_k is the set of individuals still at risk at the kth event time.

• When p > n, there is no unique β maximizing this partial likelihood.
• Even when p ≤ n, covariates can be highly correlated, and regularization may still be required in order to reduce the variance of the estimates and to improve the prediction performance.

Cox D.R., 1972. Journal of the Royal Statistical Society B, 34:187-220.
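For concreteness, the partial likelihood above can be evaluated directly. The following is a minimal numpy sketch, assuming no tied event times; the function and variable names are illustrative, not part of any library:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative log of Cox's partial likelihood (assumes no tied event
    times); the risk set R_k contains all subjects with time >= t_k."""
    eta = X @ np.asarray(beta, float)       # linear predictors beta'x_i
    order = np.argsort(time)
    eta, d = eta[order], np.asarray(event)[order]
    # log of sum_{j in R_k} exp(eta_j), via reversed cumulative sums
    log_risk = np.log(np.cumsum(np.exp(eta)[::-1])[::-1])
    # sum over event (death) times only
    return -np.sum((eta - log_risk)[d == 1])
```

With β = 0 every subject in a risk set of size m contributes a factor 1/m, so the value reduces to the log of the product of the risk-set sizes at the event times.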
Regularization with the Cox model
• A common remedy is to regularize, through a bias-variance trade-off, in order
to stabilize the estimates.
• There are two main approaches when dealing with regularization:
– Dimension reduction
– Penalization
Dimension reduction
• Prediction in high-dimensional and low sample size settings already arose in chemistry in the eighties. The PLS regression was developed as a chemometric tool in an attempt to find reliable predictive models with spectral data.
• Nowadays the difficulties encountered with the use of transcriptomic data for classification or prediction, using very large matrices, are of a comparable nature.
• Gene expression data have a similar structure: covariates are highly collinear and the number of covariates far exceeds the number of observations.
• It was thus natural to use PLS regression principles in this new context.
Wold S., Martens H., and Wold H., 1983, Lecture notes in Mathematics, Springer Verlag 286-293.
PLS-Cox regression
• Computation of the first PLS component t₁:

  a₁ᵢ = Coxfit(xᵢ), i = 1,…,p
  w₁ = a₁ / ‖a₁‖
  t₁ = Xw₁ / (w₁′w₁)
• Computation of the second PLS component t₂:

  x₁ᵢ = linfit(xᵢ / t₁), i = 1,…,p   (the residual of the linear regression of xᵢ on t₁)
  a₂ᵢ = Coxfit(xᵢ, t₁), i = 1,…,p
  w₂ = a₂ / ‖a₂‖
  t₂ = X₁w₂ / (w₂′w₂)
PLS-Cox regression
• Computation of the hth PLS component t_h:

  x_{h−1,i} = linfit(xᵢ / t₁, t₂, …, t_{h−1}), i = 1,…,p
  a_{h,i} = Coxfit(xᵢ, t₁, t₂, …, t_{h−1}), i = 1,…,p
  w_h = a_h / ‖a_h‖
  t_h = X_{h−1}w_h / (w_h′w_h)
Bastien P., Esposito Vinzi V., and Tenenhaus M., 2005. CSDA, 48:17-46.
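As a rough illustration of the algorithm above, the first component can be sketched in a few lines of numpy. The `univariate_cox_coef` function below is a minimal Newton-Raphson stand-in for the Coxfit step (it assumes no tied event times and no step-size control, so it can diverge on separable data); all names and data are illustrative:

```python
import numpy as np

def univariate_cox_coef(x, time, event, n_iter=20):
    """Newton-Raphson for the coefficient of a single covariate in a Cox
    model -- a minimal stand-in for the Coxfit step (assumes no ties)."""
    order = np.argsort(time)
    x, d = np.asarray(x, float)[order], np.asarray(event)[order]
    beta = 0.0
    for _ in range(n_iter):
        r = np.exp(beta * x)
        s0 = np.cumsum(r[::-1])[::-1]             # sums over risk sets R_k
        s1 = np.cumsum((r * x)[::-1])[::-1]
        s2 = np.cumsum((r * x ** 2)[::-1])[::-1]
        xbar = s1 / s0                            # risk-set weighted means
        grad = np.sum(d * (x - xbar))             # score U(beta)
        info = np.sum(d * (s2 / s0 - xbar ** 2))  # information I(beta)
        beta += grad / info
    return beta

def first_pls_cox_component(X, time, event):
    """a_1i = Coxfit(x_i), then w1 = a1/||a1||, t1 = X w1 / (w1'w1)."""
    a1 = np.array([univariate_cox_coef(X[:, j], time, event)
                   for j in range(X.shape[1])])
    w1 = a1 / np.linalg.norm(a1)
    return X @ w1 / (w1 @ w1)
```

Subsequent components would then deflate X on t₁ (the linfit step) and refit the Cox model on (xᵢ, t₁), which this sketch does not cover.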
Other alternatives to the Cox model
in the PLS regression setting
• Park et al. (2002)
• Nguyen and Rocke (2002)
• Li and Gui (2004)
Park P., Tian L., and Kohane I., 2002. Bioinformatics, 18:120-127.
Nguyen D.V. and Rocke D., 2002. Bioinformatics, 18:1625-1632.
Li H. and Gui J., 2004. Bioinformatics, 20:208-215.
PC-PLS and PC-PLS Cox for high dimensional setting
• For matrices with an excessively large number of variables, as is the case for example with gene expression data where the number of genes can reach several tens of thousands, the PLS-Cox algorithm becomes computer-intensive and technical problems may arise.
• PLS, as a dot-product algorithm, is invariant under orthogonal transformations of the X and/or Y variables.
• Thus, PLS based on the X principal components Z is equivalent to PLS based on the original descriptor matrix X. This provides a very efficient solution for very large data sets, PC-PLS Cox, by dramatically speeding up the calculations.
  t₁ ∝ XX′y = ZZ′y
de Jong and Ter Braak, 1994. Journal of Chemometrics; 8:169-174
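The invariance can be checked numerically: with Z the matrix of principal component scores of X, ZZ′ = XX′, so the first PLS component computed from the n×n score matrix matches the one computed from the full n×p matrix. A small sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 500))        # n << p, e.g. gene expression
y = rng.standard_normal(10)

# Principal component scores: X = U S V' gives Z = U S, at most n columns
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * S                                  # 10 x 10 instead of 10 x 500

# ZZ' = U S^2 U' = XX', so the first PLS component is unchanged
t1_X = X @ (X.T @ y)
t1_Z = Z @ (Z.T @ y)
print(np.allclose(t1_X, t1_Z))             # True
```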
Linear kernel PLS algorithms
• Computational problems posed by very large matrices already arose in
Chemometrics and solutions were proposed in the nineties using linear
kernel variants of PLS algorithms.
• The objective of these methods was to obtain PLS components by
working on a condensed kernel matrix of considerably smaller size than
the original matrices X and/or Y.
• The term kernel is used here to refer to reducing the computational complexity in the input space rather than to a nonlinear mapping into a high-dimensional feature space, as is the case with nonlinear kernel PLS.
Lindgren F., Geladi P., and Wold S., 1993. Journal of chemometrics; 7:45-59.
Rännar S., Geladi P., Lindgren F., Wold S., 1994. Journal of Chemometrics; 9:459-470.
A simple linear kernel PLS algorithm
Let X = UL^{1/2}V′ be the singular value decomposition of X. The PLS components can then be computed in the canonical basis:

  t₁ ∝ ULU′y = Σᵢ λᵢ (uᵢ′y) uᵢ

  t_h ∝ U_{h−1}LU_{h−1}′y = Σᵢ λᵢ (u_{h−1,i}′y) u_{h−1,i}

  with U_{h−1} = (I − t_{h−1}t_{h−1}′) U_{h−2}

  β = VL^{−1/2}U′TT′y

• Performing PLS regression in the canonical space also brings a better understanding of the PLS regression and of the trade-off realised by PLS between fit and stability.
Sample-based PLS
• Bush and Nachbar (1993) were certainly the first to propose a linear kernel PLS algorithm to speed up the calculations in the case p >> n.
• They showed that PLS depends only on the distances between observations rather than on the individual descriptor values:

  D²ᵢⱼ = Σₖ (Xᵢₖ − Xⱼₖ)²  ⇒  Cᵢⱼ = −½ (D²ᵢⱼ − D̄²ᵢ. − D̄².ⱼ + D̄²..)

• They thus proposed, with their SAMPLS algorithm, to perform PLS on the variance-covariance matrix C.

Bush B.L. and Nachbar R.B., 1993. J. Comput.-Aided Mol. Design, 7:587-619.
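The conversion from squared distances to the matrix C used by SAMPLS is the classical double-centering identity, which can be verified numerically (an illustrative sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))
Xc = X - X.mean(axis=0)                    # column-centered descriptors

# Squared Euclidean distances between all pairs of samples
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Double centering of the squared distances recovers C = Xc Xc'
C = -0.5 * (sq - sq.mean(axis=0) - sq.mean(axis=1)[:, None] + sq.mean())
print(np.allclose(C, Xc @ Xc.T))           # True
```

Only the 8×8 distance matrix is needed, never the 8×50 descriptor matrix itself.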
Sample-based PLS
• There is a duality between sample-based and property-based methods, similar to the one that exists between the analyses of the rows and of the columns of a matrix, the latent variables being the link between these two paradigms.
• The sample-based approach is worthwhile when the samples are difficult or even impossible to characterize, but the distances between samples are available.
• This view of PLS regression allowed Bush and Nachbar to see the limits of linear PLS regression, especially in biology, and paved the way towards future nonlinear developments.
Duality sample-based / property-based
Lewi P., 1995. Chemometrics and Intelligent Laboratory Systems, 28 (1):23-33
• “Sample-based analysis can be considered as a descriptive (or Aristotelian) view of the world which produces taxonomies of its elements. Property-based analysis considers fundamental properties of the elements, that are often hidden from direct observation. It constitutes an explanatory (or Platonic) interpretation of the world which unveils simple structures that lie behind it.”
Nonlinear kernel methods
• The underlying philosophy of kernel methods is the capacity to express a nonlinear, complex problem using simple and efficient tools of linear algebra.
• A nonlinear problem in the input space can become linear after a mapping into a space of higher, possibly infinite, dimension.
• What makes these methods useful in practice is that it is not necessary to characterize the images in the feature space, but only to be able to compute their dot products.
• A remarkable property of the reproducing kernel Hilbert space, called the "kernel trick", allows this computation to be performed using simple functions defined on pairs of samples in the input space.
• This is a general method which allows the formulation of a nonlinear version of any algorithm that can be expressed in terms of dot products, such as PCA, PLS regression, or SVM.
Nonlinear kernel PLS
• Assuming a nonlinear transformation of the input variables into a feature space F, i.e. a nonlinear mapping:

  Φ : xᵢ ∈ Rᴺ → Φ(xᵢ) ∈ F,  with Φ(xᵢ)′Φ(xⱼ) = K(xᵢ, xⱼ),

  the goal is to construct a linear PLS regression in F.

• It amounts to replacing, in the expression of the PLS components, the product XX′ by Φ(X)Φ(X)′ using the so-called « kernel trick », which permits the computation of dot products in the high-dimensional feature space using simple functions defined on pairs of input patterns.
• Using the kernel functions corresponding to the canonical dot product in the feature space allows avoiding nonlinear optimization and relies on simple linear algebra. It is the core of the KPLS algorithm of Rosipal and Trejo (2001), but it cannot be extended to the PLS-Cox regression.
Rosipal R. and Trejo L., 2001. Journal of Machine Learning Research, 2:97-123.
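As a sketch of the kernel trick, the first kernel PLS component is obtained by replacing XX′y with Ky; the Gaussian kernel below is one common choice (illustrative code with simulated data; the bandwidth `gamma` is arbitrary):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """K_ij = exp(-gamma ||x_i - x_j||^2) = Phi(x_i)'Phi(x_j): the dot
    product in feature space, without ever computing Phi explicitly."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 4))
y = rng.standard_normal(12)

# Kernel trick: in t1 ∝ XX'y, replace XX' by the kernel matrix K
K = rbf_kernel(X)
t1 = K @ y
t1 /= np.linalg.norm(t1)                   # normalized first component
```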
Kernel PLS Cox regression
• Bennett and Embrechts (2003), with their DKPLS algorithm, proposed to perform PLS regression on the kernel matrix K instead of Φ(x). This corresponds to a low-rank approximation of the kernel matrix.
• Moreover, A. Tenenhaus et al. demonstrated that for a one-dimensional output response, PLS on Φ(x) (KPLS) is equivalent to PLS on K^{1/2} (DKPLS).
• Following Bennett and Embrechts, A. Tenenhaus et al. (2006) proposed KLPLS, a kernelized version of generalized PLS regression in the framework of logistic regression, as an extension of PLS-logistic regression to nonlinear settings.
• Based on the previous results, it is thus straightforward to derive a (nonlinear) kernel PLS-Cox algorithm by replacing the X matrix with the kernel matrix K in the PLS-Cox algorithm.

Bennett K.P. and Embrechts M.J., 2003. Advances in Learning Theory, NATO Science Series, 227-249.
Tenenhaus A., Giron A., Viennet E., Béra M., Saporta G., and Fertil B., 2006. CSDA.
L1 penalized Cox regression
• In the context of censored data, Tibshirani (1997) extended the LASSO procedure to variable selection with the Cox proportional hazards model:

  β̂ = argmax l(β) subject to Σⱼ |βⱼ| ≤ s

• Tibshirani proposed an iterative procedure which requires the minimization of a one-term Taylor series expansion of the log partial likelihood:

  β̂ = argmin (z − Xβ)′A(z − Xβ) subject to Σⱼ |βⱼ| ≤ s

  with η = Xβ, u = ∂l/∂η, A = −∂²l/∂η∂η′, and z = η + A⁻¹u.

• However, the quadratic programming procedure involved in the minimization cannot be applied in high-dimensional, low sample size settings.

Tibshirani R., 1997. Statistics in Medicine, 16:385-395.
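The penalized quadratic problem above can be attacked by coordinate descent instead of quadratic programming. The sketch below is not Tibshirani's procedure: it is a generic coordinate-descent solver for the Lasso-penalized form of the criterion, with A replaced by its diagonal (a common simplification); all names are illustrative:

```python
import numpy as np

def lasso_quadratic(z, X, a_diag, lam, n_iter=200):
    """Coordinate descent for
        min_beta (z - X beta)' A (z - X beta) + lam * sum_j |beta_j|
    with A approximated by its diagonal a_diag (illustrative only)."""
    beta = np.zeros(X.shape[1])
    r = z.astype(float).copy()                 # current residual z - X beta
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            xj = X[:, j]
            rho = xj @ (a_diag * (r + xj * beta[j]))   # fit to partial residual
            denom = xj @ (a_diag * xj)
            # soft-thresholding step for the j-th coordinate
            b_new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / denom
            r += xj * (beta[j] - b_new)        # keep the residual up to date
            beta[j] = b_new
    return beta
```

With an orthogonal design this reduces to coordinate-wise soft thresholding, which makes the sparsity of the solution explicit.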
LARS procedure
Efron B., Johnstone I., Hastie T., and Tibshirani R., 2004. Annals of Statistics, 32:407-499.
• Efron et al. (2004) proposed a highly efficient procedure, called Least Angle Regression (LARS), which can be used to perform variable selection with very large matrices.
• This procedure starts with all the coefficients equal to zero and finds the predictor most correlated with the response. It then takes the largest step in the direction of this predictor until another predictor becomes as correlated with the current residual. LARS then proceeds in a direction equiangular between the two predictors until a third variable in turn becomes as correlated with the current residual as the previously selected ones.
LARS-Cox procedure
• LARS can be modified to provide the solution of the Lasso procedure.
• Using the connection between LARS and Lasso, Gui and Li (2005) proposed LARS-Cox for gene selection in high-dimension and low sample size settings.
• Using a Cholesky factorization A = T′T, they transform the minimization

  β̂ = argmin (z − Xβ)′A(z − Xβ) subject to Σⱼ |βⱼ| ≤ s

  into a constrained version of OLS,

  β̂ = argmin (ỹ − X̃β)′(ỹ − X̃β) subject to Σⱼ |βⱼ| ≤ s,  where ỹ = Tz and X̃ = TX,

  which can be solved by the LARS-Lasso procedure.

Gui J. and Li H., 2005. Bioinformatics, 21(13):3001-3008.
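The Cholesky reduction to OLS form is a one-line identity that can be verified numerically; a small sketch with a simulated positive definite weight matrix A:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 4
X = rng.standard_normal((n, p))
z = rng.standard_normal(n)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)             # a positive definite weight matrix

# Cholesky factorization A = T'T turns the weighted criterion into OLS form
T = np.linalg.cholesky(A).T             # upper triangular, so T.T @ T == A
z_t, X_t = T @ z, T @ X

beta = rng.standard_normal(p)
lhs = (z - X @ beta) @ A @ (z - X @ beta)
rhs = (z_t - X_t @ beta) @ (z_t - X_t @ beta)
print(np.allclose(lhs, rhs))            # True
```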
Cox-Lasso procedure on deviance residuals
Segal M.R., 2006. Biostatistics, 7(2):268-285.
• However, the IRWLS iterations performed in the LARS-Cox procedure counterbalance the efficiency of the LARS-Lasso algorithm and render the Gui and Li algorithm computationally costly.
• Segal (2005) showed that the criterion minimized in the Cox-Lasso procedure can be approximated, to first order in a Taylor expansion, by the deviance residual sum of squares:

  (z − Xβ)′A(z − Xβ) ≈ RSS(D̂)

• The deviance residual is a measure of excess deaths and can therefore be interpreted as a measure of hazard.
• Segal thus proposed to speed up the calculations by replacing the survival times by the deviance residuals, a normalized version of the martingale residuals that result from fitting a null (intercept-only) Cox regression model.
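The deviance residuals of the null model can be computed directly from the Nelson-Aalen estimator of the cumulative hazard. A minimal numpy sketch, assuming no tied event times (names illustrative): the martingale residual is Mᵢ = δᵢ − Ĥ(tᵢ), and the deviance residual is its usual normalizing transform.

```python
import numpy as np

def null_deviance_residuals(time, event):
    """Deviance residuals from a null (intercept-only) Cox model, using
    the Nelson-Aalen cumulative hazard (assumes no tied event times)."""
    time, event = np.asarray(time, float), np.asarray(event, float)
    order = np.argsort(time)
    n_at_risk = np.arange(len(time), 0, -1)        # risk-set sizes, sorted
    H_sorted = np.cumsum(event[order] / n_at_risk) # Nelson-Aalen increments
    H = np.empty_like(H_sorted)
    H[order] = H_sorted                            # H(t_i), original order
    M = event - H                                  # martingale residuals
    with np.errstate(divide="ignore"):
        # delta_i * log(delta_i - M_i) reduces to log(H_i) when delta_i = 1
        log_term = np.where(event == 1, np.log(np.maximum(H, 1e-300)), 0.0)
    return np.sign(M) * np.sqrt(-2.0 * (M + event * log_term))
```

Segal's idea then amounts to running any least-squares machinery (here, PLS) on these residuals in place of the censored survival times.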
PLS on deviance residuals
• We propose to use the same idea in the setting of Partial Least Squares.
• An alternative formulation of the PLS-Cox model could be derived by
fitting the deviance residuals with a simple univariate PLS regression.
• Both models have been compared on published data (Rosenwald 2002) based on the evaluation of their predictive performance using time-dependent ROC curves, a predictive accuracy criterion for binary data which has been extended to censored data (Heagerty et al. 2000).
• A larger AUC at time t indicates better predictability of time to event at time t, as measured by sensitivity and specificity evaluated at time t.
• Both methods showed quite similar predictive performance. This confirms the close agreement shown by Segal between the LARS-Lasso Cox procedure of Gui and Li and his method based on deviance residuals.
Time-dependent AUC for the comparison between
PLS-Cox and PLS based on deviance residuals
( Rosenwald dataset)
Rosenwald A. et al., 2002. The New England Journal of Medicine, 346:1937-1947.
Heagerty P.J., Lumley T., and Pepe M. 2000. Biometrics 56,337-344
Limits of the methods
• One limitation of the LARS-Lasso procedure is that the number of genes
selected cannot be greater than the sample size. Moreover, if there is a group
of highly correlated variables, as with genes sharing the same biological
“pathway”, the LARS-Lasso procedure tends to select only one variable
(gene) from the group while an ideal gene selection method should perform a
grouped selection.
• Conversely, the PLS-Cox regression takes all the genes into account, which can reduce the predictive performance of the model if one expects that only a small subset of the genes is relevant to predict survival.
• The solution could lie in between.
• The solution could be in between.
Threshold Gradient Descent
• Friedman and Popescu (2004) proposed a stepwise optimization method called Threshold Gradient Descent (TGD) which approximates the estimates of
  – PLS when using small threshold values,
  – LARS-Lasso when using large threshold values.
• Gui and Li (2005) extended the TGD method to the Cox regression
model for censored survival data.
Friedman J.H. and Popescu B.E., 2004. Technical Report, Statistics Department, Stanford.
Gui J. and Li H., 2005. Pacific Symposium on Biocomputing, 10:272-283.
Threshold PLS Cox regression
• A similar approach could be used with the PLS-Cox regression by selecting, in the construction of the hth PLS component, only the variables xᵢ with a significant Wald test in the multivariate Cox regression on xᵢ and t₁,…,t_{h−1}.
• The significance threshold α of the testing procedure could be used to limit the number of selected predictor variables, with 1−α acting as the threshold τ in the TGD procedure.
• There are some arguments to motivate the use of such a shrinkage procedure when dealing with transcriptomic data.
• From a biological point of view, one should expect that only a small subset of the genes is relevant to predict survival. Many of the thousands of genes are not informative with regard to the hazard function and only degrade the predictive performance by introducing noise into the data.
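The screening step can be sketched with a score test at β = 0, which is cheap to compute in closed form and asymptotically equivalent to the Wald test under the null. This is a hypothetical illustration for the first component only, assuming no tied event times; `screen_variables` and its α handling are not part of any published algorithm:

```python
import math
import numpy as np

def cox_score_test_z(x, time, event):
    """Score statistic (asymptotically N(0,1)) for one covariate in a Cox
    model at beta = 0 -- a cheap analogue of the Wald test (no ties)."""
    order = np.argsort(time)
    x, d = np.asarray(x, float)[order], np.asarray(event)[order]
    n_risk = np.arange(len(x), 0, -1)
    xbar = np.cumsum(x[::-1])[::-1] / n_risk           # risk-set means
    x2bar = np.cumsum((x ** 2)[::-1])[::-1] / n_risk
    u = np.sum(d * (x - xbar))                         # score U(0)
    info = np.sum(d * (x2bar - xbar ** 2))             # information I(0)
    return u / math.sqrt(info)

def screen_variables(X, time, event, alpha=0.05):
    """Indices of the columns whose two-sided p-value falls below alpha."""
    keep = []
    for j in range(X.shape[1]):
        z = cox_score_test_z(X[:, j], time, event)
        pval = 1.0 - math.erf(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
        if pval < alpha:
            keep.append(j)
    return keep
```

Only the retained columns would then enter the construction of the PLS component, with α (equivalently 1−α) playing the role of the threshold τ.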