BIOINFORMATICS

ORIGINAL PAPER

Vol. 24 no. 14 2008, pages 1632–1638 doi:10.1093/bioinformatics/btn253

Data and text mining

Sparse kernel methods for high-dimensional survival data

Ludger Evers 1,∗ and Claudia-Martina Messow 2

1 Department of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, United Kingdom and 2 Institut für Medizinische Biometrie, Epidemiologie und Informatik, Johannes Gutenberg-Universität, 55101 Mainz, Germany

Received on December 26, 2007; revised on May 8, 2008; accepted on May 29, 2008. Advance Access publication May 30, 2008. Associate Editor: Limsoon Wong

ABSTRACT

Sparse kernel methods like support vector machines (SVM) have been applied with great success to classification and (standard) regression settings. Existing support vector classification and regression techniques, however, are not suitable for partly censored survival data, which are typically analysed using Cox's proportional hazards model. As the partial likelihood of the proportional hazards model only depends on the covariates through inner products, it can be 'kernelized'. The kernelized proportional hazards model, however, yields a solution that is dense, i.e. the solution depends on all observations. One of the key features of an SVM is that it yields a sparse solution, depending only on a small fraction of the training data. We propose two methods. One is based on a geometric idea, where, akin to support vector classification, the margin between the failed observation and the observations currently at risk is maximised. The other approach is based on obtaining a sparse model by adding observations one after another, akin to the Import Vector Machine (IVM). Data examples studied suggest that both methods can outperform competing approaches.

Availability: Software is available under the GNU Public License as an R package and can be obtained from the first author's website http://www.maths.bris.ac.uk/∼maxle/software.html

Contact: [email protected]

1 INTRODUCTION

A significant proportion of medical data is time-to-event data, a situation different from, and more complex than, other data settings. Ideally, time-to-event data records the time from the beginning of the observation (e.g. the enrolment in a clinical study) until a certain event (e.g. death, occurrence of metastases, complete remission) occurs. Often it is not possible to follow each individual until the event occurs. For some individuals it might occur after the end of the study, for others it might not occur at all, and others again might be lost to follow-up. Such observations are called censored. In those cases it is usually known when the last information was available, and that until then the individual had not had an event. Whether an event is recorded for a certain individual is thus heavily influenced by the follow-up duration and by competing risks and events, which typically all differ between individuals.



∗ To whom correspondence should be addressed.


If we want to analyse the link between a large number of covariates¹, like in the analysis of gene expression microarray data, and the time until a certain event occurs, we encounter several problems. The standard regression methods for survival analysis, such as proportional hazards regression (Cox, 1972), all require more individuals than covariates (genes, etc.) in the model. Most gene expression microarray studies, however, include far fewer individuals than covariates (genes). As there is a large number of powerful methods which are able to deal with classification problems in this setting (such as support vector classification machines), a common approach is to consider survival up to a certain time as a dichotomous variable. This device clearly simplifies the task. However, it leads to a significant loss of information. First, in order to use as many events as possible a rather late cut-off seems advisable, but all individuals lost to follow-up before the cut-off will be discarded, which in turn suggests choosing a rather early cut-off. Second, all events occurring before the cut-off are treated the same, irrespective of whether the event occurred right at the beginning of the study or just before the cut-off, which again suggests using a rather early cut-off. Thus dichotomization is only viable if (almost) all individuals can be followed up to a cut-off point at which all events should have occurred, and if the order of the events occurring before the cut-off is considered uninformative. These conditions, however, are rather unrealistic. Take the example of a cardiac mortality study in a low or intermediate risk population, where individuals are not under permanent medical surveillance and, therefore, many patients are lost to follow-up, and many competing risks and events can occur over a long period of time.

Therefore, methods are needed that can deal with a large number of covariates and that can take into account the specific nature of time-to-event data. Several methods have been proposed for this problem (van 't Veer et al., 2002; Wang et al., 2005), most of which are based on ad hoc heuristic approaches. In the following, we propose two new methods which, like kernel Cox regression (Li and Luan, 2003) and partial Cox regression (Li and Gui, 2004), are closely related to Cox's proportional hazards model.

2 DATA

The survival information for individual i is contained in two variables, the observed survival time $t_i$, and an indicator variable for censoring $c_i$,


$$c_i = \begin{cases} 0 & \text{censored}\\ 1 & \text{event has occurred.} \end{cases}$$

¹ Even though all the examples considered here deal with gene-expression microarray data, we believe that our methods can be useful in other areas where a large number of potentially highly correlated variables occurs. For this reason we use the more generic term 'covariate' instead of 'gene'.

Thus, the time variable $t_i$ contains the time to event if an event has occurred, or the time up to the point of the last information that the event has not occurred for censored observations. The vector of covariates for individual i is denoted by $x_i = (x_{i1},\dots,x_{ip})$, $i = 1,\dots,n$.

We use two data sets to evaluate the predictive performance of our methods. One is the DLBCL data set of Rosenwald et al. (2002) containing gene expression data of 240 patients with untreated diffuse large-B-cell lymphoma. Patients were followed for a median of 2.8 years. During the follow-up 57% of the patients died. The expression of 7399 genes was measured using a cDNA microarray. The other data set is the one of Bhattacharjee et al. (2001) containing gene expression data of 125 patients with lung cancer. Patients were followed for a median of 35.4 months. During the follow-up 57% of the patients died. The expression of 12 600 genes was measured using an mRNA microarray. We only use the 3312 most volatile genes. We do not make use of the clustering results of Bhattacharjee et al. (2001).

In addition, we consider simulated data of 1600 genes for 200 patients. The gene expressions are drawn from a multivariate normal distribution. Except for the subgroup structure described below, the mean expression profile is set to 0. The covariance matrix is set to the sum of a block-diagonal matrix and a small perturbation drawn from $0.01\times\mathrm{Wishart}(1600)/\sqrt{1600}$. The block-diagonal matrix consists of blocks of size 100, each with an equi-correlation ranging from ±16/32 to ±31/32. In order to represent a structure of biologically different subgroups, we simulate five binary grouping variables. Of the samples, 50% are randomly assigned to Group A, and independently 50% of the samples are assigned to Group B. Groups C, D and E are constructed similarly, each consisting of 33% of the samples. The mean expression of genes 1–10 is 2 for patients in Group A. The mean expression of genes 11–30 is −3 for patients in Group B. The mean expression of genes 31–50, 51–75 and 76–100 is determined by Groups C (set to 4), D (set to 5) and E (set to −4), respectively. A similar set-up was used by Bair and Tibshirani (2004). The survival times depend on the genes through a score of the form $\alpha(x) + \langle x_{101:1600},\beta\rangle$, where the linear term only takes every 10th gene into account; the non-zero $\beta_j$ are drawn from an N(0,1) distribution. $\alpha(x)$ reflects the effect of the different subgroups and is set to 3 for patients belonging to both Group A and Group B, and to 0 otherwise. The survival times are drawn from an exponential distribution with the hazard $\lambda$ set to the aforementioned score. The observation times are drawn from a uniform distribution on [0,2]. A sketch of this simulation set-up is given below.
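To make the simulation set-up above concrete, the following Python sketch generates data of the same overall shape. It is a simplified reading rather than the authors' exact code: the block-correlated covariance and its Wishart perturbation are replaced by independent N(0,1) genes, only Groups A and B are shown, and the hazard is taken to be the exponential of the score so that it is positive (an assumption; the text only says the hazard is 'set to' the score). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 1600

# Gene expressions: independent N(0, 1) genes for brevity; the paper's
# block-correlated covariance and its Wishart perturbation are omitted.
x = rng.standard_normal((n, p))

# Two of the five binary grouping variables (Groups C, D, E are analogous).
group_a = rng.random(n) < 0.5
group_b = rng.random(n) < 0.5
x[group_a, :10] += 2.0      # genes 1-10: mean 2 for Group A
x[group_b, 10:30] += -3.0   # genes 11-30: mean -3 for Group B

# Score: subgroup effect plus a sparse linear term on genes 101-1600
# that uses every 10th gene only.
beta = np.zeros(p)
beta[100::10] = rng.standard_normal(150)
alpha = np.where(group_a & group_b, 3.0, 0.0)
score = alpha + x[:, 100:] @ beta[100:]

# Event times: exponential, with hazard exp(score) (assumption, see above);
# observation (censoring) times uniform on [0, 2].
event_time = rng.exponential(scale=np.exp(-score))
censor_time = rng.uniform(0.0, 2.0, size=n)

t = np.minimum(event_time, censor_time)       # observed time t_i
c = (event_time <= censor_time).astype(int)   # censoring indicator c_i
```

The pair (t, c) then corresponds to the notation $(t_i, c_i)$ introduced above.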

3 COX'S PROPORTIONAL HAZARDS MODEL

This section quickly reviews the proportional hazards model, mostly in order to fix notation and to allow a more formal comparison with the methods proposed in the sequel. In order to simplify the presentation, we will assume that all event times are different, i.e. there are no ties. Further, we will assume without loss of generality that the observations are ordered according to the event/censoring times. Denote with $D_t$ the set of individuals failing at time $t$, with $R^-_t$ the set of individuals at risk right before time $t$ and with $R^+_t$ the set of individuals at risk right after time $t$. Our simplifying assumptions correspond to $D_t = \{i_t\}$, $R^-_t = \{i_t,\dots,n\}$ and $R^+_t = \{i_t+1,\dots,n\}$, where $i_t$ is the individual failing at time $t$. Denote with $D = \bigcup_t D_t$ the set of all individuals who perished. Let $f_i$ be the density of the lifetime of the $i$-th individual, $F_i$ its cumulative distribution function, and $h_i(t) = f_i(t)/(1-F_i(t))$ the corresponding hazard rate. The assumption of the proportional hazards model is

$$h_i(t) = h_0(t)\exp\left(\langle x_i,\beta\rangle\right), \tag{1}$$

i.e. the hazard rates are proportional to each other. The core benefit of this assumption is that it allows decomposing the likelihood into two parts: one part that is mostly influenced by the baseline hazard rate $h_0(\cdot)$; the other part, which is independent of $h_0(\cdot)$, is called the partial likelihood (Cox, 1975). Cox proposed to estimate $\beta$ by maximizing the partial likelihood, whose logarithm is

$$\sum_{i\in D}\log\left(\frac{\exp\left(\langle x_i,\beta\rangle\right)}{\sum_{\iota\in R^-_{t_i}}\exp\left(\langle x_\iota,\beta\rangle\right)}\right). \tag{2}$$

The models proposed in the sequel are all based on the sequential classification view of the proportional hazards model. The proportional hazards model can be seen as training a sequence of classifiers predicting which of the individuals at risk will fail. It corresponds to a logistic model for the probability

$$P(\text{individual } i \text{ fails} \mid R^-_{t_i} \text{ are at risk}) = \frac{\exp\left(\langle x_i,\beta\rangle\right)}{\sum_{\iota\in R^-_{t_i}}\exp\left(\langle x_\iota,\beta\rangle\right)},$$

using the same regression coefficient $\beta$ for all individuals and for all event times. The partial log-likelihood (2) is the sum of the logarithms of these conditional probabilities. When using a large number of covariates it is often necessary to consider a penalized likelihood

$$\ell^P(\beta) = \sum_{i\in D}\log\left(\frac{\exp\left(\langle x_i,\beta\rangle\right)}{\sum_{\iota\in R^-_{t_i}}\exp\left(\langle x_\iota,\beta\rangle\right)}\right) - \frac{\lambda}{2}\|\beta\|^2. \tag{3}$$

In some circumstances, it is overly restrictive to assume that the relationship between the logarithm of the hazard ratio and the covariates is linear. In this case it seems more realistic to assume

$$h_i(t) = h_0(t)\exp\left(\varphi(x_i)\right) \tag{4}$$

instead of (1). $\varphi(\cdot)$ is then assumed to be from a reproducing kernel Hilbert space $\mathcal{H}$ (Kimeldorf and Wahba, 1971) defined by a kernel function $K(\cdot,\cdot)$.² Equation (3) then becomes

$$\ell^P(\varphi) = \sum_{i\in D}\log\left(\frac{\exp\left(\varphi(x_i)\right)}{\sum_{\iota\in R^-_{t_i}}\exp\left(\varphi(x_\iota)\right)}\right) - \frac{\lambda}{2}\|\varphi\|^2_{\mathcal{H}}. \tag{5}$$

This model was proposed by Li and Luan (2003). The representer theorem (Kimeldorf and Wahba, 1971) guarantees that the maximizer of (5) has a finite-dimensional representation:

$$\hat\varphi(\cdot) = \sum_{i=1}^{n}\alpha_i K(x_i,\cdot). \tag{6}$$

² We obtain the special case (1) by using the standard dot product $K(x_1,x_2) = \langle x_1,x_2\rangle$.
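To make (2) and (3) concrete, the following sketch evaluates the penalized partial log-likelihood for a given coefficient vector. It assumes, as in the text, tie-free event times and observations sorted by time; it is a generic illustration, and the function and argument names are not taken from the authors' R package.

```python
import numpy as np

def penalized_partial_loglik(beta, X, events, lam):
    """Penalized partial log-likelihood (3); lam = 0 gives the plain version (2).

    X      : (n, p) covariate matrix, rows sorted by observed time (no ties)
    events : (n,) censoring indicators c_i (1 = event, 0 = censored)
    lam    : ridge penalty lambda
    """
    eta = X @ beta                      # linear scores <x_i, beta>
    loglik = 0.0
    for i, c in enumerate(events):
        if c == 1:
            # Risk set R^-_{t_i}: with sorted, tie-free data this is {i, ..., n-1}.
            loglik += eta[i] - np.log(np.sum(np.exp(eta[i:])))
    return loglik - 0.5 * lam * np.dot(beta, beta)
```

Maximizing this function over beta with a general-purpose optimizer recovers the (penalized) Cox fit; for long risk sets a log-sum-exp implementation would be numerically safer.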


The optimal $\alpha = (\alpha_1,\dots,\alpha_n)$ can be found by plugging (6) into (5), yielding

$$\ell^P(\alpha) = \sum_{i\in D}\log\left(\frac{\exp\left(\langle K_i,\alpha\rangle\right)}{\sum_{\iota\in R^-_{t_i}}\exp\left(\langle K_\iota,\alpha\rangle\right)}\right) - \frac{\lambda}{2}\langle\alpha, K\alpha\rangle \tag{7}$$

with $K_i = \left(K(x_i,x_j)\right)_j$ and $K = \left(K(x_i,x_j)\right)_{ij}$. The resulting optimization problem of maximizing (7) over $\alpha$ is convex in $\alpha$, and the solution can be found by any standard unconstrained optimization method.³ This dual formulation of the problem is useful even when using a linear kernel $K(x_1,x_2) = \langle x_1,x_2\rangle$: for gene expression data the number of genes is typically much larger than the number of observations, so solving (7) involves optimizing over fewer parameters than solving (3).

In contrast to support vector machines (SVMs), this approach neither performs margin maximization nor yields a sparse solution depending only on a subset of the input data. The latter is an important property, as it makes computing future predictions cheaper and helps to obtain good generalization performance when using non-linear kernels. This suggests approximating (7) by a model that only depends on a subset of the training data, containing only the most 'important' cases. Such a model can be obtained by generalizing the Import Vector Machine (IVM, Zhu and Hastie, 2005) to the proportional hazards model.

The basic idea of the Survival IVM is to keep the penalized partial log-likelihood (7) as a criterion, but to restrict most of the $\alpha_i$ to zero. The $\alpha_i$ which are allowed to be non-zero are selected in a process very similar to forward selection, except that we select important observations rather than variables, and that we use the penalized partial log-likelihood as the criterion. The observations with non-zero $\alpha_i$ will be called import vectors. The model fitting starts with the 'empty' model, i.e. $\alpha = (\alpha_1,\dots,\alpha_n) = 0$. We use a greedy 'one step look-ahead' strategy for adding non-zero $\alpha_i$: at every step, the $\alpha_i$ which allows the largest increase of the penalized partial log-likelihood is added to the model. As optimizing the penalized partial log-likelihood over every candidate $\alpha_i$ is prohibitive, we use a second-order Taylor approximation of the penalized partial log-likelihood; a sketch of the selection step is given below. As the number of import vectors typically grows more slowly than the number of observations, computing the optimal dual parameter for an IVM can be much faster than computing the optimal dual parameter for the kernelized proportional hazards model (7). Thus IVMs are well suited for large data sets. On the other hand, the Survival IVM does not have an interpretation in terms of margin maximization. The following section proposes a generalization of the SVM to survival data which performs margin maximization.

³ See Li and Luan (2003) for details.
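The following sketch shows one reading of the greedy import-vector selection. For simplicity it refits the objective (7), restricted to the currently selected observations, exactly with a general-purpose optimizer instead of the second-order Taylor approximation described above; all function and variable names are illustrative rather than taken from the authors' package.

```python
import numpy as np
from scipy.optimize import minimize

def neg_restricted_objective(alpha_S, K_S_all, K_S_S, events, lam):
    """Negative of (7) with alpha restricted to a subset S of observations."""
    eta = K_S_all.T @ alpha_S                 # scores <K_i, alpha> for all i
    ll = 0.0
    for i, c in enumerate(events):            # data sorted by time, no ties
        if c == 1:
            ll += eta[i] - np.log(np.sum(np.exp(eta[i:])))
    return -(ll - 0.5 * lam * alpha_S @ K_S_S @ alpha_S)

def survival_ivm(K, events, lam, n_import=10):
    """Greedily add import vectors, refitting the restricted dual each time."""
    n = K.shape[0]
    selected, alpha = [], np.zeros(0)
    for _ in range(n_import):
        best_cand, best_val, best_alpha = None, np.inf, None
        for cand in range(n):
            if cand in selected:
                continue
            S = selected + [cand]
            res = minimize(neg_restricted_objective, np.append(alpha, 0.0),
                           args=(K[S, :], K[np.ix_(S, S)], events, lam))
            if res.fun < best_val:
                best_cand, best_val, best_alpha = cand, res.fun, res.x
        selected.append(best_cand)
        alpha = best_alpha
    return selected, alpha
```

This exhaustive refit over every candidate is exactly what the Taylor approximation in the text avoids; the sketch is only meant to show the structure of the forward-selection loop over observations.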

4 SURVIVAL SVM

Support vector classification machines are a classification algorithm aiming at maximizing the margin⁴ between two classes (see e.g. Burges, 1998, for an introduction). SVMs have been applied with great success to various classification problems, including classification based on gene expression data. Both theoretical and empirical investigation suggests that margin maximization is one of the reasons for the outstanding generalization performance of support vector classification machines.

⁴ The margin of a separating hyperplane is the sum of the smallest distance between the hyperplane and the closest observation to its left and the one between the hyperplane and the closest observation to its right. Figure 1 illustrates this idea.

Fig. 1. Illustration of the idea of margin maximization applied to survival data.

Starting from the conditional classification view of the proportional hazards model, it is easy to come up with a margin maximization algorithm for survival data. At every event time $t$, we construct a hyperplane $\{x : \langle x,\beta\rangle = -\gamma_t\}$ separating the individual(s) deceased at time $t$ from the individuals leaving the collective after time $t$. In contrast to the proportional hazards model, we do not use a logistic model to find the hyperplane, but rather maximize the margin as in support vector classification machines. Note that for different event times $t$ the hyperplanes are just translated; their orientation (determined by $\beta$, which does not depend on $t$) stays the same. This is in analogy to the proportional hazards model, where the same $\beta$ is used for all events as well. Using the same orientation for all events is justified by the proportional hazards assumption. In other words, the Survival SVM makes an (implicit) assumption of proportional hazards, too.⁵

⁵ Note that using different orientations $\beta_t$ for the separating hyperplanes would yield very poor generalization performance unless a tight regularization of the $\beta_t$ is used.

Figure 1 illustrates this idea. The first hyperplane separates $D_1 := \{i_1\}$ from $R^+_1 := \{i_2,i_3,i_4,i_5,i_6\}$, i.e. it separates the first individual, which dies first, from the other individuals which are still at risk right after $t = 1$. The second hyperplane separates $D_2 := \{i_2\}$ from $R^+_2 := \{i_3,i_4,i_5,i_6\}$. The third hyperplane separates $D_5 := \{i_5\}$ from $R^+_5 := \{i_6\}$. The hyperplanes shown are the hyperplanes with common orientation which maximize the margin.

The parameters $\beta$ and $\gamma_t$ defining the hyperplanes are only identifiable up to a multiplicative constant, so we can impose $\|\beta\|^2 = 1$. Our further developments however simplify if we restrict the margin to 1 and minimize the squared norm $\|\beta\|^2$ instead. The two approaches are equivalent, as they only differ by a change of scale. Mathematically, a margin of 1 corresponds to $\langle x_i,\beta\rangle - \langle x_j,\beta\rangle \geq 1$ for all pairs of perishing individuals $i$ (i.e. $i \in D$) and 'survivors' $j$ (i.e. $j \in R^+_{t_i}$). Note that there is no guarantee that a hyperplane with the aforementioned separation properties exists. Thus it seems desirable to relax the condition that the hyperplanes achieve perfect


separation. As in soft-margin SVMs, we allow some observations to lie on the 'wrong' side of the margin; however, this incurs a penalty proportional to the distance $\xi_{ij}$ between the observation and the corresponding margin separating the deceased individual $i$ from a survivor $j$. This leads to the minimization problem

$$P(\beta,\xi) = \frac{\lambda}{2}\|\beta\|^2 + \sum_{i\in D}\sum_{j\in R^+_{t_i}}\xi_{ij} \tag{8}$$

with respect to

$$\langle x_i,\beta\rangle - \langle x_j,\beta\rangle \geq 1 - \xi_{ij}, \qquad \xi_{ij} \geq 0$$

for all $(i,j) \in \left\{(i,j) : i \in D \wedge j \in R^+_{t_i}\right\} =: I$. The first part in (8) corresponds to maximizing the margin, whereas the double sum corresponds to penalizing observations that lie on the 'wrong' side of the margin. The second sum relates the objective function (8) to the concordance index of Harrell et al. (1996, Section 5.5), which is based on the number of discordant pairs $\sum_{i\in D}\sum_{j\in R^+_{t_i}} 1_{\{\langle x_i,\beta\rangle \leq \langle x_j,\beta\rangle\}}$. The second sum in (8) can thus be seen as a piecewise linear and convex approximation to this sum of indicators. As in Equation (3), the parameter $\lambda > 0$ governs the trade-off between fitting the data and a smooth solution. The corresponding Lagrangian with multipliers $\alpha_{ij},\mu_{ij} \geq 0$ is

$$L(\beta,\xi,\alpha,\mu) = \sum_{(i,j)\in I}\xi_{ij} + \frac{\lambda}{2}\|\beta\|^2 - \lambda\sum_{(i,j)\in I}\alpha_{ij}\left(\langle x_i,\beta\rangle - \langle x_j,\beta\rangle - 1 + \xi_{ij}\right) - \lambda\sum_{(i,j)\in I}\mu_{ij}\xi_{ij}. \tag{9}$$

The corresponding dual problem is to maximize

$$D(\alpha) = -\frac{\lambda}{2}\sum_{(i,j),(k,l)\in I}\alpha_{ij}\alpha_{kl}K_{ijkl} + \lambda\sum_{(i,j)\in I}\alpha_{ij} \tag{10}$$

over $0 \leq \alpha_{ij} \leq 1/\lambda$, where

$$K_{ijkl} = \langle x_i,x_k\rangle - \langle x_i,x_l\rangle - \langle x_j,x_k\rangle + \langle x_j,x_l\rangle.$$

The optimal $\hat\beta$ can be obtained from the dual parameters:

$$\hat\beta = \sum_{(i,j)\in I}\alpha_{ij}\left(x_i - x_j\right).$$

In analogy to Section 3, we might want to consider a non-linear relationship $\varphi(x)$ instead of the linear $\langle x,\beta\rangle$. This corresponds to allowing for non-linear separation between the perished individuals and the survivors. In this case, we will assume that $\varphi(\cdot)$ is from a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $K$. Then $\langle\varphi(x_i),\varphi(x_j)\rangle = K(x_i,x_j)$, and the optimal $\hat\varphi(\cdot)$ has a finite-dimensional representation

$$\hat\varphi(\cdot) = \sum_{(i,j)\in I}\alpha_{ij}\left(K(x_i,\cdot) - K(x_j,\cdot)\right).$$

When setting

$$\alpha_i = \begin{cases}\displaystyle\sum_{j\in R^+_{t_i}}\alpha_{ij} - \sum_{j\in D:\, i\in R^+_{t_j}}\alpha_{ji}, & i \in D,\\[6pt] \displaystyle -\sum_{j\in D:\, i\in R^+_{t_j}}\alpha_{ji}, & i \notin D,\end{cases}$$

$\hat\varphi(\cdot)$ can be written as in Equation (6).

Fig. 2. Illustration of the unregularised loss functions for the proportional hazards model and the Survival SVM (see Equations (3) and (8) with $\lambda/2\cdot\|\beta\|^2$ omitted): (a) proportional hazards model; (b) Survival SVM.

Equation (10) is a positive-definite quadratic program and is thus straightforward to solve. The simplest approach is sequential minimal optimization (SMO) (Platt, 1999). As there are no joint constraints on the $\alpha_{ij}$, SMO corresponds in our case to coordinate ascent and yields the updates

$$\alpha_{ij} \leftarrow \max\left(0,\;\min\left(\frac{1}{\lambda},\;\frac{1 - \sum_{(k,l)\in I,\,(k,l)\neq(i,j)}\alpha_{kl}K_{ijkl}}{K_{ijij}}\right)\right).$$

Even though the model has a large number of parameters, the optimization is rather quick, as one can take advantage of the fact that most of the $\alpha_{ij}$ will be zero. Due to this sparsity, the solution will, as in support vector classification or regression machines, only depend on a subset of the training points, called the support vectors. A sketch of this coordinate-ascent solver is given at the end of this section.

It is worth comparing the loss function used by the Survival SVM to the one used by the proportional hazards model. Consider the simple example of three individuals, where the first individual perishes before the two other individuals are censored. The unregularized loss function of the proportional hazards model is

$$\log\left(1 + \exp(\delta_2) + \exp(\delta_3)\right)$$

with $\delta_2 := \varphi(x_2) - \varphi(x_1)$ and $\delta_3 := \varphi(x_3) - \varphi(x_1)$. The unregularized loss function of the Survival SVM is

$$(1 + \delta_2)_+ + (1 + \delta_3)_+.$$

For $\delta_2 \to -\infty$ (or $\delta_3 \to -\infty$) both loss functions tend to 0. For $\delta_2 \to +\infty$ (or $\delta_3 \to +\infty$) both loss functions tend to $+\infty$. For $\delta_2 \to +\infty$ (or $\delta_3 \to +\infty$) and $\delta_3$ kept fixed (or $\delta_2$ kept fixed, respectively), the derivative of both loss functions with respect to $\delta_2$ (or $\delta_3$, respectively) tends to 1. The limits of the derivatives of the loss functions however differ if both $\delta_2$ and $\delta_3$ tend to $+\infty$. Figure 2 shows both loss functions. This suggests that the regularized proportional hazards model and the Survival SVM behave somewhat similarly, which is confirmed by the examples studied in Section 5.
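To make the coordinate-ascent (SMO) updates above concrete, here is a minimal sketch of a Survival SVM fit with a linear kernel. It builds the pair set $I$, the pairwise quantities $K_{ijkl}$ via the differences $x_i - x_j$, and cycles through the box-constrained updates; function and variable names are illustrative, and no speed-ups (caching, skipping of zero coordinates) are attempted.

```python
import numpy as np

def fit_survival_svm(X, times, events, lam, n_sweeps=50):
    """Coordinate ascent for the dual (10) with a linear kernel (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(times)                 # sort by observed time, assume no ties
    X, events = X[order], np.asarray(events)[order]
    n = len(events)

    # Pair set I = {(i, j): i in D, j in R+_{t_i}}; with sorted, tie-free data
    # the individuals at risk right after t_i are i+1, ..., n-1.
    pairs = [(i, j) for i in range(n) if events[i] == 1 for j in range(i + 1, n)]
    diffs = np.array([X[i] - X[j] for i, j in pairs])
    K = diffs @ diffs.T                       # K_ijkl = <x_i - x_j, x_k - x_l>

    alpha = np.zeros(len(pairs))
    for _ in range(n_sweeps):
        for m in range(len(pairs)):
            rest = K[m] @ alpha - K[m, m] * alpha[m]   # sum over (k, l) != (i, j)
            alpha[m] = np.clip((1.0 - rest) / K[m, m], 0.0, 1.0 / lam)

    beta_hat = diffs.T @ alpha                # beta-hat = sum_ij alpha_ij (x_i - x_j)
    return beta_hat, alpha, pairs
```

For a non-linear kernel one would replace the matrix `diffs @ diffs.T` by $K_{ijkl} = K(x_i,x_k) - K(x_i,x_l) - K(x_j,x_k) + K(x_j,x_l)$ and return the dual coefficients instead of $\hat\beta$.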

5 EMPIRICAL RESULTS

This section compares the predictive performance of Survival SVMs and Survival IVMs to the performance of competing methods using the data sets from Section 2. Three kernels are considered: the linear kernel and the Gaussian kernel $K_\gamma(x_1,x_2) = \exp\left(-\gamma\|x_1-x_2\|^2\right)$, the latter once using all genes and once using only the 25% of the genes with the largest Cox score.
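The Gaussian kernel above is the standard radial basis function kernel; for reference, a kernel matrix over the training covariates can be computed as follows (a generic sketch, not taken from the authors' package).

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clip tiny negative round-off
```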


Table 1. Results obtained for the DLBCL data (mean and SD of the five performance measures), best performing method underlined

Performance on validation data (DLBCL):

Method | 2Δl | C index | AUC | iAUC | S

Linear kernel (all genes)
SVM | 10.89 (4.97) | 0.6195 (0.0288) | 0.6498 (0.0456) | 0.6567 (0.0422) | 0.2113 (0.0873)
IVM | 10.92 (5.15) | 0.6194 (0.0300) | 0.6565 (0.0443) | 0.6590 (0.0401) | 0.1899 (0.0757)
KCR | 10.29 (5.55) | 0.5613 (0.190) | 0.5945 (0.203) | 0.5989 (0.0804) | 0.1739 (0.0864)

Gaussian kernel (all genes)
SVM | 10.22 (4.97) | 0.6153 (0.0312) | 0.6479 (0.0465) | 0.6546 (0.0432) | 0.2037 (0.0914)
IVM | 10.77 (4.95) | 0.6343 (0.0345) | 0.6733 (0.0467) | 0.6748 (0.0424) | 0.2120 (0.0884)
KCR | 10.20 (4.59) | 0.6324 (0.0292) | 0.6735 (0.0394) | 0.5950 (0.0533) | 0.2075 (0.0872)

Gaussian kernel (subset)
SVM | 11.81 (5.35) | 0.6222 (0.0331) | 0.6560 (0.0502) | 0.6759 (0.0412) | 0.2151 (0.100)
IVM | 10.43 (4.68) | 0.6328 (0.0341) | 0.6783 (0.0429) | 0.6749 (0.0408) | 0.1985 (0.087)
KCR | 8.432 (3.63) | 0.6117 (0.0206) | 0.6477 (0.0198) | 0.6738 (0.0366) | 0.1743 (0.0769)

Supervised principal components | 5.988 (4.00) | 0.5698 (0.106) | 0.5994 (0.117) | 0.5986 (0.116) | 0.1349 (0.0763)
Partial Cox regression | 10.58 (4.64) | 0.6171 (0.0271) | 0.6584 (0.0445) | 0.6591 (0.0378) | 0.1918 (0.0746)
Principal component Cox regression | 6.944 (4.57) | 0.5895 (0.0323) | 0.6220 (0.0517) | 0.6238 (0.0445) | 0.1409 (0.0777)
LASSO | 6.232 (4.42) | 0.5863 (0.0364) | 0.6158 (0.0548) | 0.6169 (0.0518) | 0.1334 (0.0859)
van 't Veer et al. | 6.575 (3.86) | 0.6011 (0.0409) | 0.6417 (0.0586) | 0.6385 (0.054) | 0.1705 (0.0911)
Wang et al. | 5.325 (4.33) | 0.5772 (0.0331) | 0.6021 (0.0499) | 0.6038 (0.0472) | 0.1180 (0.0827)

The competitors considered are: kernel Cox regression (KCR, Li and Luan, 2003), as described in Section 3; partial Cox regression (Li and Gui, 2004), a generalization of the partial least squares (PLS) algorithm to the proportional hazards model; Cox regression using the principal components as covariates; the supervised principal components algorithm of Bair and Tibshirani (2004), which includes only the features with the highest Cox scores; and the LASSO (Tibshirani, 1997). We further consider the following two methods. The method proposed by van 't Veer et al. (2002) forms a gene signature by computing the mean expression of the genes most correlated with the outcome over all patients who survived until a certain cut-off. The correlation between the expression profile of a new patient and the signature is then used as a risk score. The second method (Wang et al., 2005) consists of adding the genes one by one to a proportional hazards model based on their bootstrapped univariate P-values; the number of genes is selected by maximizing the area under the ROC curve.

Both real data sets were split into a random training set consisting of 75% of the observations and a validation set consisting of the remaining 25% of the observations. For the simulated data we created a training set, a testing set (for tuning the hyperparameters) and a validation set of 200 observations each. We carried out 100 replications of each experiment. Hyperparameters⁶ were chosen using 5-fold cross-validation in the training set for the two real data sets and by using an additional testing set for the simulated data.

⁶ The following hyperparameters were chosen using cross-validation: the penalty λ for SVM, IVM and KCR, as well as the Gaussian kernel parameter γ; the number of components for partial Cox regression and principal component Cox regression; the l1 upper bound for the LASSO; and the correlation threshold for the method by van 't Veer et al. (2002).

The following five criteria were used for assessing the performance of the different methods:

• Twice the difference in the log-likelihood between a Cox model using the predicted score as only covariate and one using no covariates at all (denoted by 2Δl).

• The concordance index (denoted by C index) of Harrell et al. (1996, Section 5.5). It is the proportion of comparable patient pairs for which the predicted score and the outcome are concordant (a small computational sketch is given below).

• The area under the ROC curve (see e.g. Hand, 1997, Ch. 7), when survival is dichotomised at the median survival time (denoted by AUC).

• The area under the ROC curve integrated over time (denoted by iAUC), which corresponds to the area under the ROC surface (as a function of the predicted score and time).

• The last criterion is based on estimating the survivor function using a Kaplan–Meier estimate (Kaplan and Meier, 1958), once for the good prognosis group and once for the bad prognosis group. The criterion (denoted by S) is defined as the integral of the difference between the two estimates of the survivor functions, i.e. $S = \int_{t_1}^{t_n}\left(\hat S_{\text{good}}(t) - \hat S_{\text{bad}}(t)\right)\,dt$. The two prognosis groups are defined by comparing the predicted score to its average. Thus this criterion is based on a dichotomization of the predicted score.

Tables 1–3 show the results obtained on unseen validation data. The best method with respect to each performance criterion is underlined. For all five performance criteria considered, the best method is, with one exception, always either an SVM or an IVM. However, the performance of the SVM, the IVM and kernel Cox regression is very similar, as the theory suggests.
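For concreteness, the following sketch computes Harrell's concordance index as used above, with one common convention for ties (pairs with equal scores count as 1/2); the paper follows Harrell et al. (1996), which may treat edge cases slightly differently.

```python
def concordance_index(score, time, event):
    """Proportion of comparable patient pairs ranked concordantly by the score.

    A pair (i, j) is comparable if the patient with the shorter observed time
    had an event; it is concordant if that patient also has the higher score.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if score[i] > score[j]:
                    concordant += 1.0
                elif score[i] == score[j]:
                    concordant += 0.5
    return concordant / comparable
```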


Table 2. Results obtained from the lung cancer data (mean and SD of the five performance measures), best performing method underlined

Performance on validation data (lung cancer):

Method | 2Δl | C index | AUC | iAUC | S

Linear kernel (all genes)
SVM | 2.027 (2.06) | 0.6032 (0.0441) | 0.6276 (0.075) | 0.6278 (0.0596) | 0.1343 (0.0905)
IVM | 1.873 (1.84) | 0.5967 (0.0461) | 0.6191 (0.0741) | 0.6333 (0.0623) | 0.1192 (0.0966)
KCR | 1.479 (1.81) | 0.5533 (0.121) | 0.5812 (0.140) | 0.5642 (0.128) | 0.0898 (0.139)

Gaussian kernel (all genes)
SVM | 2.304 (2.09) | 0.6103 (0.043) | 0.6392 (0.0761) | 0.6428 (0.0554) | 0.1510 (0.095)
IVM | 1.840 (1.76) | 0.5957 (0.0451) | 0.6290 (0.0722) | 0.6289 (0.063) | 0.1305 (0.104)
KCR | 1.894 (1.85) | 0.5988 (0.0425) | 0.6268 (0.0697) | 0.6299 (0.0563) | 0.1143 (0.117)

Gaussian kernel (subset)
SVM | 2.121 (2.08) | 0.6081 (0.0453) | 0.6438 (0.0672) | 0.6425 (0.0566) | 0.1405 (0.105)
IVM | 2.137 (2.19) | 0.6074 (0.0471) | 0.6425 (0.0674) | 0.6415 (0.0622) | 0.1380 (0.097)
KCR | 1.929 (1.97) | 0.6052 (0.0435) | 0.6400 (0.0649) | 0.6368 (0.0552) | 0.1397 (0.091)

Supervised principal components | 1.779 (1.77) | 0.5898 (0.0536) | 0.6197 (0.0934) | 0.6150 (0.0649) | 0.1251 (0.114)
Partial Cox regression | 1.513 (1.57) | 0.5869 (0.0463) | 0.6105 (0.0722) | 0.6091 (0.0634) | 0.1021 (0.0976)
Principal component Cox regression | 0.866 (1.29) | 0.5455 (0.058) | 0.5768 (0.093) | 0.5545 (0.0789) | 0.0708 (0.146)
LASSO | 0.959 (0.92) | 0.5514 (0.0624) | 0.5896 (0.0953) | 0.5368 (0.0745) | 0.1066 (0.157)
van 't Veer et al. | 1.629 (1.98) | 0.5626 (0.0754) | 0.5882 (0.109) | 0.5705 (0.0933) | 0.1097 (0.142)
Wang et al. | 1.080 (1.30) | 0.5567 (0.0553) | 0.5918 (0.0924) | 0.5729 (0.0743) | 0.1179 (0.134)

Table 3. Results obtained for the simulated data (mean and SD of the five performance measures), best performing method underlined

Performance on validation data (simulated data):

Method | 2Δl | C index | AUC | iAUC | S

Linear kernel (all genes)
SVM | 137.7 (36.5) | 0.6646 (0.0180) | 0.7150 (0.0244) | 0.7012 (0.0245) | 0.2655 (0.0322)
IVM | 134.5 (38.3) | 0.6614 (0.0206) | 0.7098 (0.0272) | 0.6976 (0.0280) | 0.2582 (0.0375)
KCR | 138.2 (37.4) | 0.6651 (0.0194) | 0.7157 (0.0261) | 0.7019 (0.0266) | 0.2696 (0.0366)

Gaussian kernel (all genes)
SVM | 136.1 (39.7) | 0.6640 (0.0205) | 0.7134 (0.0290) | 0.7010 (0.0272) | 0.2667 (0.0323)
IVM | 133.3 (36.2) | 0.6626 (0.0204) | 0.7112 (0.0279) | 0.6986 (0.0274) | 0.2647 (0.0424)
KCR | 115.4 (37.6) | 0.6466 (0.0337) | 0.6912 (0.0448) | 0.6873 (0.0456) | 0.1941 (0.0611)

Gaussian kernel (subset)
SVM | 142.9 (43.6) | 0.6671 (0.0221) | 0.7288 (0.0310) | 0.7048 (0.0296) | 0.2639 (0.0634)
IVM | 132.7 (35.3) | 0.6668 (0.0253) | 0.7153 (0.0332) | 0.7010 (0.0345) | 0.2516 (0.0409)
KCR | 134.4 (41.2) | 0.6680 (0.0209) | 0.7201 (0.0294) | 0.7049 (0.0281) | 0.2634 (0.0634)

Supervised principal components | 84.38 (49.5) | 0.6021 (0.129) | 0.6395 (0.140) | 0.6292 (0.137) | 0.1997 (0.0772)
Partial Cox regression | 56.53 (31.4) | 0.5978 (0.0293) | 0.6286 (0.0409) | 0.6235 (0.0381) | 0.1568 (0.0483)
Principal component Cox regression | 46.47 (29.8) | 0.5888 (0.0296) | 0.6206 (0.0404) | 0.6123 (0.0385) | 0.1425 (0.0479)
LASSO | 48.56 (20.1) | 0.2276 (0.0102) | 0.2442 (0.0159) | 0.2406 (0.0127) | 0.1086 (0.0229)
van 't Veer et al. | 32.39 (25.2) | 0.5726 (0.0321) | 0.5968 (0.0405) | 0.5929 (0.0409) | 0.1167 (0.0526)
Wang et al. | 64.06 (30.5) | 0.6103 (0.0254) | 0.6438 (0.0376) | 0.6350 (0.0312) | 0.1864 (0.0528)


When using other correlation structures or a lognormal survival time in the simulations, we obtained results similar to the ones reported in Table 3. The latter suggests that the models based on the proportional hazards model are robust with respect to mild violations of the proportional hazards assumption. Note that prediction of the survival time on the unseen validation data is a hard problem: even the best methods considered here are far from obtaining perfect predictions.

6 DISCUSSION

This article presented two novel methods for analysing high-dimensional survival data. The Survival IVM is based on enforcing that the dual parameter of the kernel Cox model is sparse. This is achieved by adding the observations one after another to the model. The other method proposed, the Survival SVM, is based on generalizing the idea of margin maximisation to censored survival data. Maximizing the margin leads to a solution that is sparse in the observation space. The two data examples and the simulations studied showed very promising results: the methods outperformed the other algorithms considered.

Both methods proposed in this article are sparse in the observation space; however, they are not sparse in the variable space, i.e. they depend on all covariates (genes). Whilst sparseness in the observation space can help achieve a good generalization performance and allows for very efficient computation, sparseness in the variable space is often desired from a practitioner's point of view: it guarantees that a score can be interpreted more easily, and it reduces the cost of further predictions, as fewer variables have to be measured. In our simulations, we have used the Cox score for selecting which genes were used with the Gaussian kernel. A direction for further research is to investigate how the methods can be modified so that they yield a solution that is sparse both in the observation space and in the variable space. Such an approach will be very computer-intensive. However, when using models that are sparse in the observation space, like the SVM or the IVM, such computations become cheaper to carry out, as the model depends only on a few observations.

ACKNOWLEDGEMENT

The authors would like to thank Anja Victor for the helpful discussions.

Conflict of Interest: none declared.

REFERENCES

Bair,E. and Tibshirani,R. (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology, 2, 511–522.
Bhattacharjee,A. et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Nat. Acad. Sci., 98, 13790–13795.
Burges,C.J.C. (1998) A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc., 2, 121–167.
Cox,D.R. (1972) Regression models and life-tables (with discussion). J. Roy. Stat. Soc. B, 34, 187–220.
Cox,D.R. (1975) Partial likelihood. Biometrika, 62, 269–276.
Hand,D.J. (1997) Construction and Assessment of Classification Rules. Wiley, Chichester.
Harrell,F.E. et al. (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med., 15, 361–387.
Kaplan,E.L. and Meier,P. (1958) Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53, 457–481.
Kimeldorf,G. and Wahba,G. (1971) Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33, 82–95.
Li,H. and Gui,J. (2004) Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics, 20, i208–i215.
Li,H. and Luan,Y. (2003) Kernel Cox regression models for linking gene expression profiles to censored survival data. Pac. Symp. Biocomput., 8, 65–76.
Platt,J. (1999) Fast training of support vector machines using sequential minimal optimization. In Schölkopf,B., Burges,C. and Smola,A. (eds) Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA.
Rosenwald,A. et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med., 346, 1937–1947.
Tibshirani,R. (1997) The Lasso method for variable selection in the Cox model. Stat. Med., 16, 385–395.
van 't Veer,L.J. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Vapnik,V. and Lerner,A.J. (1963) Generalized portrait method for pattern recognition. Automat. Rem. Contr., 24, 774–780.
Wang,Y. et al. (2005) Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365, 671–679.
Zhu,J. and Hastie,T. (2005) Kernel logistic regression and the import vector machine. J. Comput. Graph. Stat., 14, 185–205.
