
Cox model in high dimensional

and low sample size settings

Philippe BASTIEN

L’Oréal Recherche, Aulnay, France

Introduction

• 2001: PLS-Cox regression was proposed at the “Second International Symposium on PLS” as an alternative to the Cox regression model for censored data in the presence of highly correlated covariates.

• Since 2001, numerous developments have appeared in the analysis of censored survival data with high-dimensional covariates, mainly due to the growing interest of researchers in pharmacogenomics in developing, from gene expression in cancer tissues, molecular predictors of a patient’s time to clinical outcome such as time to cancer recurrence or death after treatment.

• It allows the discovery of new markers and new therapeutic targets and opens the way to more subject-specific treatments with greater efficacy and safety.

• This lecture aims to present an overview of the latest developments in this domain.

Bastien P., Tenenhaus M., 2001. Proceedings of the PLS’01 International Symposium, 131-140.

The Cox proportional hazard model

• The model assumes the following hazard function for the occurrence of an event at time t in the presence of censoring:

• The Cox’s partial likelihood can be written as :

• When p > n, there is no unique β that maximizes this partial likelihood.

• Even when p ≤ n, covariates may be highly correlated and regularization may still be required in order to reduce the variance of the estimates and to improve the prediction performance.

Cox D.R., 1972. Journal of the Royal Statistical Society B, 34:187-220.

λ(t) = λ_0(t) exp(βX)

PL(β) = ∏_{k ∈ D} [ exp(β′x_k) / Σ_{j ∈ R_k} exp(β′x_j) ]

where D denotes the set of indices of the observed events and R_k the risk set at the kth event time.
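As a concrete illustration of the quantity being maximized, here is a minimal numpy sketch of the negative log partial likelihood (ties are ignored for simplicity; all names are hypothetical):

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of the Cox model (ties ignored).

    X     : (n, p) covariate matrix
    time  : (n,) observed follow-up times
    event : (n,) 1 if the event was observed, 0 if censored
    """
    eta = X @ beta                              # linear predictors beta'x_i
    order = np.argsort(-time)                   # sort subjects by decreasing time
    eta_s, event_s = eta[order], event[order]
    # running log-sum-exp gives the log of the risk-set denominator for each subject
    log_risk = np.logaddexp.accumulate(eta_s)
    return -np.sum(event_s * (eta_s - log_risk))
```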

Regularization with the Cox model

• A common remedy is to regularize, through a bias-variance trade-off, in order

to stabilize the estimates.

• There are two main approaches when dealing with regularization

– Dimension reduction

– Penalization

Dimension reduction

• Prediction in high dimensional and low sample size settings already arose in chemistry in the eighties. PLS regression was developed as a chemometric tool in an attempt to find reliable predictive models with spectral data.

• Nowadays the difficulties encountered with the use of transcriptomic data for classification or prediction, using very large matrices, are of a comparable nature.

• Gene expression data have a similar structure: covariates are highly collinear and the number of covariates far exceeds the number of observations.

• It was thus natural to use PLS regression principles in this new context.

Wold S., Martens H., and Wold H., 1983, Lecture notes in Mathematics, Springer Verlag 286-293.

PLS-Cox regression

• Computation of the first PLS component t1

a_{1i} = Coxfit(x_i), i = 1, …, p

w_1 = a_1 / ‖a_1‖

t_1 = X w_1 / (w_1′ w_1)
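A minimal sketch of this first step, assuming the lifelines package (CoxPHFitter) and hypothetical array names; each a_{1i} is the coefficient of a univariate Cox fit on x_i:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumed available

def first_pls_cox_component(X, time, event):
    """First PLS-Cox component: univariate Cox coefficients -> weight vector -> score."""
    n, p = X.shape
    a1 = np.empty(p)
    for i in range(p):
        df = pd.DataFrame({"x": X[:, i], "time": time, "event": event})
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        a1[i] = cph.params_["x"]              # a_{1i}: univariate Cox coefficient of x_i
    w1 = a1 / np.linalg.norm(a1)              # w_1 = a_1 / ||a_1||
    t1 = X @ w1 / (w1 @ w1)                   # t_1 = X w_1 / (w_1' w_1)
    return t1, w1
```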

• Computation of the second PLS component t2

x_{1i} = linfit(x_i / t_1), i = 1, …, p   (residuals of the linear regression of x_i on t_1)

a_{2i} = Coxfit(x_i, t_1), i = 1, …, p   (coefficient of x_i in the Cox model on x_i and t_1)

w_2 = a_2 / ‖a_2‖

t_2 = X_1 w_2 / (w_2′ w_2)

PLS-Cox regression

• Computation of the hth PLS component th

x_{h-1,i} = linfit(x_i / t_1, t_2, …, t_{h-1}), i = 1, …, p

a_{hi} = Coxfit(x_i, t_1, t_2, …, t_{h-1}), i = 1, …, p

w_h = a_h / ‖a_h‖

t_h = X_{h-1} w_h / (w_h′ w_h)

Bastien P., Esposito Vinzi V., and Tenenhaus M., 2005. Computational Statistics & Data Analysis, 48:17-46.
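Continuing the sketch given after the first component, a hedged illustration of the generic hth step (same lifelines-based assumptions, hypothetical names): each a_{hi} comes from a Cox model on x_i adjusted for the previous components, and X_{h-1} holds the residuals of the x_i on t_1, …, t_{h-1}:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumed available

def next_pls_cox_component(X, time, event, T_prev):
    """h-th PLS-Cox component given T_prev = [t_1, ..., t_{h-1}] as an (n, h-1) array."""
    n, p = X.shape
    # X_{h-1}: residuals of the regression of each x_i on the previous components
    coef, *_ = np.linalg.lstsq(T_prev, X, rcond=None)
    X_res = X - T_prev @ coef
    a_h = np.empty(p)
    for i in range(p):
        df = pd.DataFrame({"x": X[:, i], "time": time, "event": event})
        for k in range(T_prev.shape[1]):
            df[f"t{k + 1}"] = T_prev[:, k]
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        a_h[i] = cph.params_["x"]             # coefficient of x_i adjusted for t_1, ..., t_{h-1}
    w_h = a_h / np.linalg.norm(a_h)           # w_h = a_h / ||a_h||
    t_h = X_res @ w_h / (w_h @ w_h)           # t_h = X_{h-1} w_h / (w_h' w_h)
    return t_h, w_h
```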

Other alternatives to the Cox model

in the PLS regression setting

• Park et al. (2002)

• Nguyen and Rocke (2002)

• Li and Gui (2004)

Park P., Tian L., and Kohane I., 2002. Bioinformatics, 18:120-127.

Nguyen D.V. and Rocke D., 2002. Bioinformatics, 18:1625-1632.

Li H. and Gui J., 2004. Bioinformatics, 20:208-215.

PC-PLS and PC-PLS Cox for high dimensional setting

• For matrices with an excessively large number of variables, as is the case for example with gene expression data, where the number of genes can reach several tens of thousands, the PLS-Cox algorithm becomes computer-intensive and technical problems may arise.

• PLS, as a dot product algorithm, is invariant under orthogonal

transformation of the X and/or Y variables.

• Thus, PLS based on the X principal components Z is equivalent to PLS

based on the original descriptor matrix X. This provides a very efficient

solution for very large data sets, PC-PLS Cox, by dramatically speeding up the calculations.

t_1 ∝ XX′y = ZZ′y

de Jong and Ter Braak, 1994. Journal of Chemometrics; 8:169-174
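A quick numerical check of this invariance (a sketch on synthetic data; Z holds the principal component scores of X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 500))            # n = 20 samples, p = 500 descriptors
y = rng.standard_normal(20)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * S                                      # principal component scores of X

t1_from_X = X @ X.T @ y                        # first PLS direction, up to normalization
t1_from_Z = Z @ Z.T @ y                        # the same computation on the PC scores
print(np.allclose(t1_from_X, t1_from_Z))       # True: XX'y = ZZ'y
```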

Linear kernel PLS algorithms

• Computational problems posed by very large matrices already arose in

Chemometrics and solutions were proposed in the nineties using linear

kernel variants of PLS algorithms.

• The objective of these methods was to obtain PLS components by

working on a condensed kernel matrix of considerably smaller size than

the original matrices X and/or Y.

• The term kernel is used here more in the sense of reducing the computational complexity in the input space than of a nonlinear mapping into a high-dimensional feature space, as is the case with nonlinear kernel PLS.

Lindgren F., Geladi P., and Wold S., 1993. Journal of chemometrics; 7:45-59.

Rännar S., Geladi P., Lindgren F., Wold S., 1994. Journal of Chemometrics; 9:459-470.

A simple linear kernel PLS algorithm

Let X = U L^{1/2} V′ be the singular value decomposition of X, with L^{1/2} = diag(λ_1, …, λ_n). Then

t_1 ∝ U L U′ y = Σ_i λ_i² (u_i′ y) u_i

• Performing PLS regression in the canonical space also brings a better understanding of the PLS regression and of the trade-off realised by PLS between fit and stability.

t_h ∝ U_{h-1} L U_{h-1}′ y = Σ_i λ_i² (u_{h-1,i}′ y) u_{h-1,i},  with U_{h-1} = (I − t_{h-1} t_{h-1}′) U_{h-2}

β = V L^{-1/2} U′ T T′ y
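A numpy sketch of these recursions (the components are normalized to unit length so that I − t t′ is a projector; the function name is illustrative, not from the original references):

```python
import numpy as np

def canonical_space_pls(X, y, n_components):
    """PLS1 components computed in the canonical (sample) space, following the recursions above."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U L^{1/2} V', with L^{1/2} = diag(s)
    r = int(np.sum(s > 1e-10))                         # keep the nonzero singular values
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    L = np.diag(s ** 2)
    n = X.shape[0]
    T, Uh = [], U.copy()
    for _ in range(n_components):
        t = Uh @ L @ Uh.T @ y                          # t_h proportional to U_{h-1} L U_{h-1}' y
        t /= np.linalg.norm(t)
        T.append(t)
        Uh = (np.eye(n) - np.outer(t, t)) @ Uh         # U_h = (I - t_h t_h') U_{h-1}
    T = np.column_stack(T)
    beta = Vt.T @ np.diag(1.0 / s) @ U.T @ T @ T.T @ y # beta = V L^{-1/2} U' T T' y
    return T, beta
```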

Sample-based PLS

• Bush R. and Nachbar B. (1993) were certainly the first to propose a linear kernel PLS algorithm to speed up the calculations in the case p >> n.

• They showed that PLS depends only on the distances between observations rather than on individual descriptor values:

D²_ij = Σ_k (X_ik − X_jk)²   ⇒   C_ij = −½ (D²_ij − D²_i. − D²_.j + D²_..)

• They thus proposed, with their SAMPLS algorithm, to perform PLS on the variance-covariance matrix C.

Bush R. and Nachbar B., 1993. J. Comput.-Aided Mol. Design, 7:587.
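A numpy sketch of this double-centering step, which recovers the matrix C on which SAMPLS operates from the squared distances alone (the final check assumes column-centered descriptors; the helper name is illustrative):

```python
import numpy as np

def centered_gram_from_squared_distances(D2):
    """C_ij = -1/2 (D2_ij - D2_i. - D2_.j + D2_..), the double-centering above."""
    row_mean = D2.mean(axis=1, keepdims=True)
    col_mean = D2.mean(axis=0, keepdims=True)
    return -0.5 * (D2 - row_mean - col_mean + D2.mean())

# sanity check: with column-centered X, the recovered C equals XX'
X = np.random.default_rng(1).standard_normal((10, 4))
X -= X.mean(axis=0)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(centered_gram_from_squared_distances(D2), X @ X.T))   # True
```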

Sample-based PLS

• There is a duality between sample-based and property-based methods similar to the one that exists between the analyses of rows and columns of a matrix, the latent variables being the link between these two paradigms.

• The sample-based approach is worthwhile when the samples are difficult or even impossible to characterize, but the distances between samples are available.

• This view of PLS regression allowed Bush R. and Nachbar B. to see the limits of linear PLS regression, especially in biology, and paved the way towards future nonlinear developments.

Duality sample-based / property-based

Lewi P., 1995. Chemometrics and Intelligent Laboratory Systems, 28 (1):23-33

• “Sample-based analysis can be considered as a descriptive (or Aristotelian) view of the world which produces taxonomies of its elements. Property-based analysis considers fundamental properties of the elements, that are often hidden from the direct observation. It constitutes an explanatory (or Platonic) interpretation of the world which unveils simple structures that lie behind it.”

Nonlinear kernel methods

• The underlying philosophy of the kernel methods is the capacity to express a complex nonlinear problem using simple and efficient tools of linear algebra.

• A nonlinear problem in the input space can become linear after mapping into a space of higher, possibly infinite, dimension.

• What makes these methods useful in practice is that it is not necessary to explicitly characterize the images in the feature space, but only to be able to compute their dot products.

• A remarkable property of the reproducing kernel Hilbert space, called the "kernel trick", allows this calculation to be performed using simple functions defined on pairs of samples in the input space.

• This is a general method which allows the formulation of nonlinear versions of any algorithm that can be expressed in terms of dot products, such as PCA, PLS regression, or SVM.

Nonlinear kernel PLS

• Assuming a nonlinear transformation of the input variables into a feature space F, i.e. a nonlinear mapping:

Φ : x_i ∈ R^N → Φ(x_i) ∈ F

Φ(x_i) Φ(x_j)′ = K(x_i, x_j)

the goal is to construct a PLS linear regression in F

• It amounts to replacing, in the expression of the PLS components, the product XX′ by Φ(X)Φ(X)′ using the so-called "kernel trick", which permits the computation of dot products in the high-dimensional feature space using simple functions defined on pairs of input patterns.

• Using kernel functions corresponding to the canonical dot product in the feature space avoids nonlinear optimization and allows the use of simple linear algebra. This is the core of the KPLS algorithm of Rosipal and Trejo (2001), but it cannot be extended to the PLS-Cox regression.

Rosipal R. and Trejo L., 2001. Journal of Machine Learning Research, 2:97-123.

Kernel PLS Cox regression

• Bennett and Embrechts (2003), with their DKPLS algorithm, proposed to perform PLS regression on the kernel matrix K instead of Φ(X). This corresponds to a low rank approximation of the kernel matrix.


• Moreover, A. Tenenhaus et al. demonstrated that, for a one-dimensional output response, PLS on Φ(X) (KPLS) is equivalent to PLS on K^{1/2} (DKPLS).

• Following Bennett and Embrechts, A. Tenenhaus et al. (2006) proposed KLPLS, a kernelized version of generalized PLS regression in the framework of logistic regression, as an extension of PLS-logistic regression to nonlinear settings.

• Based on the previous results, it is thus straightforward to derive a (nonlinear) kernel PLS-Cox algorithm by replacing, in the PLS-Cox algorithm, the matrix X by the kernel matrix K (see the sketch below).


Bennett K.P. and Embrechts M.J. 2003. Advances in learning theory, NATO sciences series 227-249.

Tenenhaus A., Giron A., Viennet E., Béra M., Saporta G., and Fertil B., 2006. Computational Statistics & Data Analysis.
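A hedged sketch of that last point: build an n × n kernel matrix (a Gaussian kernel is used here purely as an example) and feed it to the PLS-Cox steps sketched earlier in place of X; first_pls_cox_component refers to the illustrative helper defined above, not to a published implementation.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X_demo = np.random.default_rng(0).standard_normal((30, 200))
K = gaussian_kernel(X_demo, sigma=5.0)          # n x n kernel matrix replacing X
print(K.shape)                                   # (30, 30)
# kernel PLS-Cox: t1, w1 = first_pls_cox_component(K, time, event)
```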

L1 penalized Cox regression

• In the context of censored data, Tibshirani (1997) extended the LASSO

procedure to variable selection with the Cox proportional hazards model.

Tibshirani R. 1997. Statistics in Medicine, 16:385-395

β̂ = argmax l(β),  subject to Σ_{j=1}^p |β_j| ≤ s

• Tibshirani proposed an iterative procedure which requires the minimization of a one-term Taylor series expansion of the log partial likelihood.

β̂ = argmin (z − Xβ)′ A (z − Xβ),  subject to Σ_{j=1}^p |β_j| ≤ s

with z = η + A⁻¹μ,  η = Xβ,  μ = ∂l/∂η,  A = −∂²l/∂η∂η′

However, the quadratic programming procedure involved in the minimization cannot be applied in high dimensional and low sample size settings.
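To make the notation concrete, a pure-numpy sketch of the working quantities of one IRWLS step under the definitions above (η = Xβ, μ = ∂l/∂η, A = −∂²l/∂η∂η′, z = η + A⁻¹μ); ties are ignored and a pseudo-inverse is used because A is only positive semidefinite:

```python
import numpy as np

def irwls_working_quantities(X, beta, time, event):
    """Working response z and weight matrix A of Tibshirani's iterative procedure (no ties)."""
    eta = X @ beta
    exp_eta = np.exp(eta)
    n = len(time)
    R = (time[None, :] >= time[:, None]).astype(float)   # R[i, j] = 1 if j is in the risk set of i
    S = R @ exp_eta                                      # risk-set denominators
    P = R * exp_eta[None, :] / S[:, None]                # P[i, j]: conditional event probabilities
    mu = event - P.T @ event                             # mu = dl / d eta
    A = np.zeros((n, n))
    for i in np.flatnonzero(event):
        A += np.diag(P[i]) - np.outer(P[i], P[i])        # A = -d2l / d eta d eta'
    z = eta + np.linalg.pinv(A) @ mu                     # working response z = eta + A^{-1} mu
    return z, A
```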

LARS procedure

Efron B., Johnstone I., Hastie T., and Tibshirani R., 2004. Annals of Statistics, 32:407-499.

• Efron et al. (2004) proposed a highly efficient procedure, called Least Angle Regression (LARS), which can be used to perform variable selection with very large matrices.

• This procedure starts with all the coefficients equal to zero and finds the predictor most correlated with the response. It then takes the largest possible step in the direction of this predictor until another predictor becomes as correlated with the current residual. LARS then proceeds in the direction equiangular between the two predictors until a third variable in turn becomes as correlated with the current residual as the previously selected ones.
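For reference, the LARS/Lasso path is available off the shelf in scikit-learn; a minimal usage sketch on synthetic p ≫ n data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))                  # p >> n
y = X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(50)

# full Lasso solution path computed with the LARS algorithm
alphas, active, coefs = lars_path(X, y, method="lasso")
print(active[:5])                                     # indices of the first predictors to enter
```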

LARS-Cox procedure

β̂ = argmin (ỹ − X̃β)′ (ỹ − X̃β),  subject to Σ_{j=1}^p |β_j| ≤ s

where ỹ = Tz,  X̃ = TX,  and A = T′T

Gui J. and Li H., 2005. Bioinformatics, (21) 13:3001-3008

• LARS can be modified to provide solutions for the Lasso procedure.

• Using the connection between LARS and Lasso, Gui and Li (2005) proposed LARS-Cox for gene selection in high dimensional and low sample size settings.

• Using a Cholesky factorization, they transform the minimization into a constrained version of OLS which can be solved by the LARS-Lasso procedure.

β̂ = argmin (z − Xβ)′ A (z − Xβ),  subject to Σ_{j=1}^p |β_j| ≤ s
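A sketch of that transformation, assuming z and A come from the IRWLS sketch given earlier; a small ridge term is added before the Cholesky factorization because A is only positive semidefinite:

```python
import numpy as np
from sklearn.linear_model import lars_path

def lars_cox_step(X, z, A, eps=1e-8):
    """One LARS-Cox step: factor A = T'T, then solve the Lasso on (TX, Tz) with LARS."""
    n = A.shape[0]
    T = np.linalg.cholesky(A + eps * np.eye(n)).T    # upper-triangular T with A (+ jitter) = T'T
    y_tilde, X_tilde = T @ z, T @ X                  # y~ = Tz, X~ = TX
    return lars_path(X_tilde, y_tilde, method="lasso")
```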

Cox-Lasso procedure on deviance residuals

Segal M.R., 2006. Biostatistics, 7(2):268-285.

• However, the IRWLS iterations performed in the LARS-Cox procedure counterbalance the efficiency of the LARS-Lasso algorithm and render the Gui and Li algorithm computationally costly.

• Segal (2005) showed that the quantity to be minimized in the Cox-Lasso procedure can be approximated, to a first-order Taylor approximation, by the deviance residual sum of squares:

(z − Xβ)′ A (z − Xβ) ≈ RSS(D̂)

• The deviance residual is a measure of excess deaths and can therefore be interpreted as a measure of hazard.

• Segal thus proposed to speed up the calculations by replacing the survival times by the deviance residuals, a normalized version of the martingale residuals that result from fitting a null (intercept-only) Cox regression model.
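A numpy sketch of these null-model residuals (Nelson-Aalen cumulative hazard, ties handled naively; the helper name is illustrative):

```python
import numpy as np

def null_model_deviance_residuals(time, event):
    """Deviance residuals from a null (intercept-only) Cox model.

    martingale: m_i = delta_i - Lambda_hat(t_i)        (Nelson-Aalen cumulative hazard)
    deviance:   d_i = sign(m_i) * sqrt(-2 [m_i + delta_i log(delta_i - m_i)])
    """
    time, event = np.asarray(time, float), np.asarray(event, float)
    n = len(time)
    order = np.argsort(time)
    at_risk = n - np.arange(n)                         # risk-set sizes in time order
    cumhaz = np.empty(n)
    cumhaz[order] = np.cumsum(event[order] / at_risk)  # Lambda_hat(t_i), back in original order
    m = event - cumhaz                                 # martingale residuals
    log_term = np.where(event > 0, np.log(np.maximum(event - m, 1e-12)), 0.0)
    return np.sign(m) * np.sqrt(-2.0 * (m + event * log_term))
```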

PLS on deviance residual

• We propose to use the same idea in the setting of Partial Least Squares.

• An alternative formulation of the PLS-Cox model could be derived by fitting the deviance residuals with a simple univariate PLS regression (see the sketch below).

• Both models have been compared on published data (Rosenwald 2002) based on the evaluation of their predictive performance using time-dependent ROC curves, a predictive accuracy criterion for binary data which has been extended to censored data (Heagerty et al. 2000).

• Larger AUC at time t indicates better predictability of time to event at time

t as measured by sensitivity and specificity evaluated at time t.

• Both methods showed quite similar predictive performance. This confirms

the close agreement shown by Segal between the LARS-Lasso Cox

procedure of Gui and Li and his method based on deviance residuals.
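A sketch of the alternative formulation above on synthetic data, reusing the null_model_deviance_residuals helper sketched earlier and scikit-learn's PLSRegression (names and sizes are illustrative only):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2000))                 # gene-expression-like matrix, p >> n
time = rng.exponential(scale=np.exp(-X[:, 0]))      # toy survival times
event = rng.integers(0, 2, size=60)                 # toy censoring indicator

d = null_model_deviance_residuals(time, event)      # helper sketched earlier in this document
pls = PLSRegression(n_components=3).fit(X, d)       # ordinary univariate-response PLS on the residuals
risk_score = pls.predict(X).ravel()                 # prognostic index, assessed e.g. by time-dependent ROC
```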

Time-dependent AUC for the comparison between PLS-Cox and PLS based on deviance residuals (Rosenwald dataset)

Rosenwald A. et al., 2002. The New England Journal of Medicine, 346:1937-1947.

Heagerty P.J., Lumley T., and Pepe M., 2000. Biometrics, 56:337-344.

Limit of the methods

• One limitation of the LARS-Lasso procedure is that the number of genes

selected cannot be greater than the sample size. Moreover, if there is a group

of highly correlated variables, as with genes sharing the same biological

“pathway”, the LARS-Lasso procedure tends to select only one variable

(gene) from the group while an ideal gene selection method should perform a

grouped selection.

• Conversely, the PLS-Cox regression takes all the genes into account, which can reduce the predictive performance of the model if one expects that only a small subset of the genes is relevant for predicting survival.

• The solution could be in between.

Threshold Gradient Descent

• Friedman and Popescu (2004) proposed a stepwise optimization method called Threshold Gradient Descent (TGD) which approximates the estimates of

– PLS when using small threshold values,

– LARS-Lasso when using large threshold values.

• Gui and Li (2005) extended the TGD method to the Cox regression

model for censored survival data.

Friedman J.H. and Popescu B.E., 2004. Technical Report, Statistics Department, Stanford.

Gui J. and Li H., 2005. Pacific Symposium on Biocomputing, 10:272-283.

Threshold PLS Cox regression

• A similar approach could be used with the PLS-Cox regression by selecting, in the construction of the hth PLS component, only the variables x_i with a significant Wald test in the multivariate Cox regression on x_i and t_1, …, t_{h-1} (see the sketch below).

• The significance threshold α of the testing procedure could be used to limit the number of selected predictor variables, with 1 − α acting as the threshold τ in the TGD procedure.

• There are some arguments to motivate the use of such a shrinkage procedure when dealing with transcriptomic data.

• From a biological point of view, one should expect that only a small subset of the genes is relevant for predicting survival. Many of the thousands of genes are not informative with regard to the hazard function and only reduce the predictive performance by introducing noise into the data.
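A sketch of this thresholded weight construction, with the same lifelines-based assumptions as the earlier PLS-Cox sketches (the Wald p-value is read from the fitted model's summary; alpha and the helper name are illustrative):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumed available

def thresholded_pls_cox_weights(X, time, event, T_prev=None, alpha=0.05):
    """Weights for the h-th component, keeping only variables with a significant Wald test."""
    n, p = X.shape
    a = np.zeros(p)
    for i in range(p):
        df = pd.DataFrame({"x": X[:, i], "time": time, "event": event})
        if T_prev is not None:
            for k in range(T_prev.shape[1]):
                df[f"t{k + 1}"] = T_prev[:, k]
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        if cph.summary.loc["x", "p"] < alpha:        # Wald test of x_i adjusted for t_1, ..., t_{h-1}
            a[i] = cph.params_["x"]                  # keep the coefficient, otherwise leave it at zero
    if np.any(a):
        a /= np.linalg.norm(a)                       # w_h = a_h / ||a_h|| over the retained variables
    return a
```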
