# Gaussian Processes for Survival Analysis

Tamara Fernández, Department of Statistics, University of Oxford, Oxford, UK. fernandez@stats.ox.ac.uk
Nicolás Rivera, Department of Informatics, King's College London, London, UK. nicolas.rivera@kcl.ac.uk
Yee Whye Teh, Department of Statistics, University of Oxford, Oxford, UK. y.w.teh@stats.ox.ac.uk

Abstract

We introduce a semi-parametric Bayesian model for survival analysis. The model is centred on a parametric baseline hazard and uses a Gaussian process to model variations away from it nonparametrically, as well as dependence on covariates. Unlike many other methods in survival analysis, our framework does not impose unnecessary constraints on the hazard rate or on the survival function. Furthermore, our model handles the left, right and interval censoring mechanisms common in survival analysis. We propose an MCMC algorithm to perform inference and an approximation scheme based on random Fourier features to make computations faster. We report experimental results on synthetic and real data, showing that our model performs better than competing models such as Cox proportional hazards, ANOVA-DDP and random survival forests.

1 Introduction

Survival analysis is a branch of statistics focused on the study of time-to-event data, usually called survival times. This type of data appears in a wide range of applications such as failure times in mechanical systems, death times of patients in a clinical trial or duration of unemployment in a population. One of the main objectives of survival analysis is the estimation of the so-called survival function and the hazard function. If a random variable has density function f and cumulative distribution function F, then its survival function is S = 1 − F and its hazard is λ = f/S. While the survival function S(t) gives us the probability that a patient survives up to time t, the hazard function λ(t) is the instantaneous risk of death given survival up to time t.

Due to the nature of the studies in survival analysis, the data contain several aspects that make inference and prediction hard. One important characteristic of survival data is the presence of many covariates. Another distinctive feature of survival data is the presence of censoring. A survival time is censored when it is not fully observed, but we have an upper or lower bound for it. For instance, this happens in clinical trials when a patient drops out of the study.

There are many methods for modelling this type of data. Arguably, the most popular is the Kaplan-Meier estimator [13]. The Kaplan-Meier estimator is a very simple, nonparametric estimator of the survival function. It is very flexible and easy to compute, handles censored times and requires no prior knowledge of the nature of the data. Nevertheless, it cannot handle covariates naturally and no prior knowledge can be incorporated. A well-known method that does incorporate covariates is the Cox proportional hazards model [3]. Although this method is very popular and useful in applications, a drawback is that it imposes the strong assumption that the hazard curves are proportional and non-crossing, which is very unlikely for some data sets.

There is a vast literature on Bayesian nonparametric methods for survival analysis [9]. Some examples include the so-called neutral-to-the-right priors [5], which model survival curves as $e^{-\mu((0,t])}$, where µ is a completely random measure on R⁺.
Two common choices for µ are the Dirichlet process [8] and the beta-Stacy process [20], the latter being a bit more tractable due to its conjugacy. Other alternatives place a prior on the hazard function; one example of this is the extended gamma process [7]. The weakness of the above methods is that there is no natural nor direct way to incorporate covariates and thus they have not been extensively used by practitioners of survival analysis. More recently, [4] developed a new model called ANOVA-DDP, which mixes ideas from ANOVA and Dirichlet processes. This method successfully incorporates covariates without imposing strong constraints, though it is not clear how to incorporate expert knowledge into it. Within the context of Gaussian processes, a few models have been considered, for instance [14] and [12]. Nevertheless these models fail to go beyond the proportional hazards assumption, and going beyond it is one of the aims of this work. Another option is [11], which describes a survival model with non-proportional hazards and time-dependent covariates. Recently, we became aware of the work of [2], which uses a so-called accelerated failure times model. Here, the dependence of the failure times on covariates is modelled by rescaling time, with the rescaling factor modelled as a function of covariates with a Gaussian process prior. This model is different from our proposal, and is more complex to study and to work with. Lastly, another well-known method is the random survival forest [10]. This can be seen as a generalisation of the Kaplan-Meier estimator to several covariates. It is fast and flexible; nevertheless it cannot incorporate expert knowledge and lacks the interpretability that is fundamental for survival analysis.

In this paper we introduce a new semiparametric Bayesian model for survival analysis. Our model is able to handle censoring and covariates. Our approach models the hazard function as the product of a parametric baseline hazard and a nonparametric part. The parametric part of our model allows the inclusion of expert knowledge and provides interpretability, while the nonparametric part allows us to handle covariates and to amend incorrect or incomplete prior knowledge. The nonparametric part is given by a non-negative function of a Gaussian process on R⁺.

Given the hazard function λ of a random variable T, we can sample from it by simulating the first jump of a Poisson process with intensity λ. In our case, the intensity of the Poisson process is a function of a Gaussian process, giving what is called a Gaussian Cox process. One of the main difficulties of working with Gaussian Cox processes is the problem of learning the true intensity from the data because, in general, it is impossible to sample the whole path of a Gaussian process. Nevertheless, exact inference was shown to be tractable by [1]. Indeed, the authors developed an algorithm exploiting a trick which allows them to perform inference without sampling the whole Gaussian process, but only a finite number of points.

In this paper, we study basic properties of our prior. We also provide an inference algorithm based on a sampler proposed by [18], which is a refined version of the algorithm presented in [1]. To make the algorithm scale we introduce a random Fourier feature approximation of the Gaussian process and we supply the corresponding inference algorithm.
We demonstrate the performance of our method experimentally on synthetic and real data.

2 Model

Consider a continuous random variable T on R⁺ = [0, ∞), with density function f and cumulative distribution function F. Associated with T, we have the survival function S = 1 − F and the hazard function λ = f/S. The survival function S(t) gives us the probability that a patient survives up to time t, while the hazard function λ(t) gives us the instantaneous risk of a patient at time t.

We define a Gaussian process prior over the hazard function λ. In particular, we choose λ(t) = λ0(t)σ(l(t)), where λ0(t) is a baseline hazard function, l(t) is a centred stationary Gaussian process with covariance function κ, and σ is a positive link function. For our implementation, we choose σ as the sigmoidal function σ(x) = (1 + e^(−x))^(−1), which is a quite standard choice in applications. In this way, we generate T as the first jump of a Poisson process with intensity λ, i.e. T has density $\lambda(t)e^{-\int_0^t \lambda(s)\,ds}$. Our model for a data set of i.i.d. Ti, without covariates, is

$$l(\cdot) \sim \mathcal{GP}(0, \kappa), \qquad \lambda(t) \mid l, \lambda_0(t) = \lambda_0(t)\,\sigma(l(t)), \qquad T_i \mid \lambda \overset{\text{iid}}{\sim} \lambda(T_i)\,e^{-\int_0^{T_i} \lambda(s)\,ds}, \qquad (1)$$

which can be interpreted as a baseline hazard with a multiplicative nonparametric noise. This is an attractive feature, as an expert may choose a particular hazard function and the nonparametric noise then amends incomplete or incorrect prior knowledge. The incorporation of covariates is discussed later in this section, while censoring is discussed in section 3.

Notice that E(σ(X)) = 1/2 for a zero-mean Gaussian random variable X. Then, as we are working with a centred Gaussian process, it holds that E(λ(t)) = λ0(t)E(σ(l(t))) = λ0(t)/2. Hence, we can think of our model as a random hazard centred at λ0(t)/2 with a multiplicative noise. In the simplest scenario, we may take a constant baseline hazard λ0(t) = 2Ω with Ω > 0. In that case, we obtain a random hazard centred at Ω, which is simply the hazard function of an exponential random variable with mean 1/Ω. Another choice might be λ0(t) = 2βt^(α−1), which determines a random hazard function centred at βt^(α−1), the hazard function of the Weibull distribution, a popular default distribution in survival analysis.

In addition to the hierarchical model in (1), we include hyperparameters for the kernel κ and for the baseline hazard λ0(t). In particular for the kernel, it is common to include a length scale parameter and an overall variance.

Finally, we need to ensure the model we propose defines a well-defined survival function, i.e. S(t) → 0 as t tends to infinity. This is not trivial, as our random survival function is generated by a Gaussian process. The next proposition, proved in the supplemental material, states that under suitable regularity conditions the prior defines proper survival functions.

Proposition 1. Let (l(t))_{t≥0} ∼ GP(0, κ) be a stationary continuous Gaussian process. Suppose that κ(s) is non-increasing and that κ(s) → 0 as s → ∞. Moreover, assume there exist K > 0 and α > 0 such that λ0(t) ≥ Kt^(α−1) for all t ≥ 1. Let S(t) be the random survival function associated with (l(t))_{t≥0}; then S(t) → 0 as t → ∞ with probability 1.

Note that the above proposition is satisfied by the hazard functions of the exponential and Weibull distributions.
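To make the construction concrete, the sketch below (our own illustration, not code from the paper) draws one path of the random hazard λ(t) = λ0(t)σ(l(t)) on a time grid, using a squared exponential kernel and the Weibull-type baseline 2βt^(α−1) with hypothetical parameter values, and evaluates the resulting survival function numerically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sq_exp_kernel(t, s, variance=1.0, lengthscale=1.0):
    # Squared exponential covariance between two grids of times.
    d = t[:, None] - s[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Hypothetical settings: a time grid and a Weibull-type baseline 2*beta*t**(alpha-1).
grid = np.linspace(1e-3, 10.0, 400)
alpha, beta = 1.5, 0.5
baseline = 2.0 * beta * grid ** (alpha - 1.0)

# One draw of the centred GP l(t) on the grid, and the induced random hazard.
K = sq_exp_kernel(grid, grid) + 1e-8 * np.eye(grid.size)   # jitter for stability
l_path = np.random.multivariate_normal(np.zeros(grid.size), K)
hazard = baseline * sigmoid(l_path)                        # lambda(t) = lambda_0(t) * sigma(l(t))

# Numerical survival function S(t) = exp(-integral_0^t lambda(s) ds) for this draw.
cum_hazard = np.cumsum(hazard) * (grid[1] - grid[0])
survival = np.exp(-cum_hazard)
```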
2.1 Adding covariates

We model the relation between time and covariates through the kernel of the Gaussian process prior. A simple way to generate kernels in time and covariates is to construct kernels for each covariate and for time, and then perform basic operations on them, e.g. addition or multiplication. Let (t, X) denote a time t with covariates X ∈ R^d. Then for pairs (t, X) and (s, Y) we can construct kernels like

$$\hat K((t, X), (s, Y)) = \hat K_0(t, s) + \sum_{j=1}^d \hat K_j(X_j, Y_j),$$

or the following kernel, which is the one we use in our experiments,

$$K((t, X), (s, Y)) = K_0(t, s) + \sum_{j=1}^d X_j Y_j K_j(t, s). \qquad (2)$$

Observe that the first kernel establishes an additive relation between time and covariates, while the second creates an interaction between the values of the covariates and time. More complicated structures that include more interactions between covariates can be considered. We refer to the work of [6] for details about the construction and interpretation of operations between kernels. Observe that the new kernel produces a Gaussian process from the space of times and covariates to the real line, i.e. it has to be evaluated at a pair of a time and covariates. The new model to generate Ti, given the covariates Xi, is

$$l(\cdot) \sim \mathcal{GP}(0, K), \qquad \lambda_i(t) \mid l, \lambda_0(t), X_i = \lambda_0(t)\,\sigma(l(t, X_i)), \qquad T_i \mid \lambda_i \overset{\text{indep}}{\sim} \lambda_i(T_i)\,e^{-\int_0^{T_i} \lambda_i(s)\,ds}. \qquad (3)$$

In our construction of the kernel K, we choose all kernels Kj as stationary kernels (e.g. squared exponential), so that K is stationary with respect to time, and proposition 1 remains valid for each fixed covariate X, i.e. given a fixed covariate X, we have S_X(t) = P(T > t | X) → 0 as t → ∞.

3 Inference

3.1 Data augmentation scheme

Notice that the likelihood of the model in equation (3) involves terms of the form $\lambda_i(t)\exp(-\int_0^t \lambda_i(s)\,ds)$, as these expressions come from the density of the first jump of a non-homogeneous Poisson process with intensity λi. In general the integral is not analytically tractable since λi is defined by a Gaussian process. A numerical scheme can be used, but it is approximate and computationally expensive. Following [1] and [18], we develop a data augmentation scheme based on thinning a Poisson process that allows us to avoid numerical integration altogether.

If we want to sample a time T with covariate X, as given in equation (3), we can use the following generative process. Simulate a sequence of points g1, g2, . . . distributed according to a Poisson process with intensity λ0(t). Since we assume the user chooses a well-known parametric form for λ0, sampling the points g1, g2, . . . is tractable (in the Weibull case this can easily be done). Starting from k = 1, we accept the point gk with probability σ(l(gk, X)). If it is accepted we set T = gk; otherwise we try the next point, g_{k+1}, and repeat. We denote by G the set of rejected points, i.e. if we accepted gk, then G = {g1, . . . , g_{k−1}}. Note that the above sampling procedure only needs to evaluate the Gaussian process at the points (gk, X) rather than on the whole space. Following the above scheme to sample T, the following proposition can be shown.

Proposition 2. Let $\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$. Then

$$p(G, T \mid \lambda_0, l(t)) = \Big(\prod_{g \in G \cup \{T\}} \lambda_0(g)\Big)\, e^{-\Lambda_0(T)}\, \sigma(l(T)) \prod_{g \in G} \big(1 - \sigma(l(g))\big). \qquad (4)$$

Proof sketch. Consider a Poisson process on [0, ∞) with intensity λ0(t). The first term on the RHS of equation (4) is the density of putting points exactly at G ∪ {T}. The second term is the probability of putting no points in [0, T] \ (G ∪ {T}), i.e. $e^{-\Lambda_0(T)}$, and it is independent of the first term. The last term comes from the acceptance/rejection part of the process: the points g ∈ G are rejected with probability 1 − σ(l(g)), while the point T is accepted with probability σ(l(T)). Since the acceptance/rejection of points is independent of the Poisson process, we get the result.
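The generative process just described can be written in a few lines. The sketch below is a minimal illustration (not the authors' code) that assumes the Weibull-type baseline λ0(t) = 2βt^(α−1), for which Λ0 and its inverse are available in closed form, and takes a callable l_func standing in for the Gaussian process value l(t, X) with the covariate absorbed.

```python
import numpy as np

def sample_T_by_thinning(l_func, alpha=1.5, beta=0.5, rng=np.random.default_rng(0)):
    """Sample a survival time T as the first accepted point of a thinned Poisson
    process with baseline intensity lambda_0(t) = 2*beta*t**(alpha-1).
    Returns the accepted time T and the set G of rejected points."""
    # Lambda_0(t) = 2*beta*t**alpha/alpha, so its inverse is (alpha*u/(2*beta))**(1/alpha).
    Lambda0_inv = lambda u: (alpha * u / (2.0 * beta)) ** (1.0 / alpha)
    cum, rejected = 0.0, []
    while True:
        cum += rng.exponential(1.0)       # next arrival of a unit-rate Poisson process
        g = Lambda0_inv(cum)              # mapped to an arrival of the lambda_0 process
        if rng.uniform() < 1.0 / (1.0 + np.exp(-l_func(g))):   # accept with prob. sigma(l(g))
            return g, rejected
        rejected.append(g)

# Example with a fixed (hypothetical) GP path: a slowly varying function of time.
T, G = sample_T_by_thinning(lambda t: np.sin(t))
```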
Using the above proposition, the model of equation (1) can be reformulated as the following tractable generative model:

$$l(\cdot) \sim \mathcal{GP}(0, K), \qquad (G, T) \mid \lambda_0(t), l(t) \sim e^{-\Lambda_0(T)}\,\big(\sigma(l(T))\,\lambda_0(T)\big) \prod_{g \in G} \big(1 - \sigma(l(g))\big)\,\lambda_0(g). \qquad (5)$$

Our model states a joint distribution for the pair (G, T), where G is the set of rejected jump points of the thinned Poisson process and T is the first accepted one. To perform inference we need data (Gi, Ti, Xi), whereas we only observe the points (Ti, Xi). Thus, we need to sample the missing data Gi given (Ti, Xi). The next proposition gives us a way to do this.

Proposition 3. [18] Let T be a data point with covariate X and let G be its set of rejected points. Then the distribution of G given (T, X, λ0, l) is a non-homogeneous Poisson process with intensity λ0(t)(1 − σ(l(t, X))) on the interval [0, T].

3.2 Inference algorithm

The above data augmentation scheme suggests the following inference algorithm. For each data point (Ti, Xi) sample Gi | (Ti, Xi, λ0, l), then sample l | ((Gi, Ti, Xi)_{i=1}^n, λ0), where n is the number of data points. Observe that the sampling of l given ((Gi, Ti, Xi)_{i=1}^n, λ0) can be seen as a Gaussian process binary classification problem, where the points Gi and Ti represent the two classes. A variety of MCMC techniques can be used to sample l; see [15] for details.

For our algorithm we use the following notation. We denote the dataset as (Ti, Xi)_{i=1}^n. The set Gi refers to the set of rejected points of Ti. We denote by G = ∪_{i=1}^n Gi and T = {T1, . . . , Tn} the whole sets of rejected and accepted points, respectively. For a point t ∈ Gi ∪ {Ti} we write l(t) instead of l(t, Xi), but remember that each point has an associated covariate. For a set of points A we denote l(A) = {l(a) : a ∈ A}. Also, Λ0(t) refers to $\int_0^t \lambda_0(s)\,ds$ and Λ0^{-1} denotes its inverse function (which exists since Λ0(t) is increasing). Finally, N denotes the number of iterations we run our algorithm for. The pseudocode of our algorithm is given in Algorithm 1.

Algorithm 1: Inference Algorithm.
Input: set of times T, the Gaussian process l instantiated at T, and other initial parameters.
 1  for q = 1:N do
 2      for i = 1:n do
 3          n_i ∼ Poisson(1; Λ0(Ti));
 4          C_i ∼ U(n_i; 0, Λ0(Ti));
 5          Set A_i = Λ0^{-1}(C_i);
 6      Set A = ∪_{i=1}^n A_i;
 7      Sample l(A) | l(G ∪ T), λ0;
 8      for i = 1:n do
 9          U_i ∼ U(n_i; 0, 1);
10          Set G_i = {a ∈ A_i such that U_i < 1 − σ(l(a))};
11      Set G = ∪_{i=1}^n G_i;
12      Update the parameters of λ0(t);
13      Update l(G ∪ T) and the hyperparameters of the kernel.

Lines 2 to 11 sample the set of rejected points Gi for each survival time Ti. In particular, lines 3 to 5 use the Mapping theorem, which tells us how to map a homogeneous Poisson process into a non-homogeneous one with the appropriate intensity. Observe that it makes use of the function Λ0 and its inverse, which must be provided or be easily computable. Lines 8 to 11 classify the points drawn from the Poisson process with intensity λ0 into the sets Gi as in proposition 3. Line 7 samples the Gaussian process at the set of points A given its values at the current set G ∪ T. Observe that at the beginning of the algorithm we have G = ∅.
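The core of lines 3 to 10 of Algorithm 1 for a single survival time can be sketched as follows (our own illustration, under the same assumption that Λ0 and its inverse are available): candidate points are drawn via the Mapping theorem and then thinned with probability 1 − σ(l(a)), as in proposition 3.

```python
import numpy as np

def sample_rejected_points(T_i, l_at, Lambda0, Lambda0_inv, rng=np.random.default_rng(1)):
    """One pass of lines 3-10 of Algorithm 1 for a single time T_i.
    `l_at` stands in for sampling the GP at new points given its current state;
    `Lambda0` and `Lambda0_inv` are the cumulative baseline hazard and its inverse."""
    n_i = rng.poisson(Lambda0(T_i))                    # number of candidate points
    C_i = rng.uniform(0.0, Lambda0(T_i), size=n_i)     # uniforms on [0, Lambda_0(T_i)]
    A_i = Lambda0_inv(C_i)                             # candidates from the lambda_0 process
    U_i = rng.uniform(size=n_i)
    keep = U_i < 1.0 - 1.0 / (1.0 + np.exp(-l_at(A_i)))  # keep with prob. 1 - sigma(l(a))
    return A_i[keep]                                   # the rejected set G_i

# Example with the Weibull-type baseline used above (alpha = 1.5, beta = 0.5).
alpha, beta = 1.5, 0.5
G_i = sample_rejected_points(
    T_i=4.0,
    l_at=lambda t: np.sin(t),                          # hypothetical GP values
    Lambda0=lambda t: 2.0 * beta * t ** alpha / alpha,
    Lambda0_inv=lambda u: (alpha * u / (2.0 * beta)) ** (1.0 / alpha),
)
```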
3.3 Adding censoring

Usually, in survival analysis, we encounter three types of censoring: right, left and interval censoring. We assume each data point Ti is associated with an (observable) indicator δi denoting the type of censoring, or indicating that the time is not censored. We now describe how the algorithm described above can easily handle each type of censoring.

Right censoring: In the presence of right censoring, the likelihood for a survival time Ti is S(Ti). The related event in terms of the rejected points corresponds to accepting no location in [0, Ti). Hence, we can treat right censoring in the same way as the uncensored case, by just sampling from the distribution of the rejected jump times prior to Ti. In this case, Ti is not an accepted location, i.e. Ti is not included in the set T of lines 7 and 13.

Left censoring: In this set-up, we know the survival time is at most Ti, so the likelihood of such a time is F(Ti). Treating this type of censoring is slightly more difficult than the previous case because the event is more complex: we require at least one accepted jump time prior to Ti, which might lead to a larger set of latent variables. In order to avoid this, we proceed by imputing the true survival time Ti* using its truncated distribution on [0, Ti], and then proceed using Ti* (uncensored) instead of Ti. We can sample Ti* as follows: we sample the first point of a Poisson process with the current intensity λ; if that point falls after Ti we reject it and repeat the process until we obtain one before Ti. The imputation step has to be repeated at the beginning of each iteration.

Interval censoring: If we know that the survival time lies in the interval I = [Si, Ti], we can deal with interval censoring in the same way as left censoring, but imputing the survival time Ti* in I.

4 Approximation scheme

As shown in Algorithm 1, in line 7 we need to sample the Gaussian process (l(t))_{t≥0} at the set of points A from its conditional distribution, while in line 13 we have to update (l(t))_{t≥0} at the set G ∪ T. Both lines require matrix inversions, which scale badly for massive datasets or for data T that generate a large set G. In order to speed up inference we use a random feature approximation of the kernel [17]. We exemplify the idea on the kernel we use in our experiments, which is given by $K((t, X), (s, Y)) = K_0(t, s) + \sum_{j=1}^d X_j Y_j K_j(t, s)$, where each Kj is a squared exponential kernel with overall variance σj² and length scale parameter φj. Hence, for m > 0, the approximation of our Gaussian process is given by

$$g^m(t, X) = g_0^m(t) + \sum_{j=1}^d X_j\, g_j^m(t), \qquad (6)$$

where each $g_j^m(t) = \sum_{k=1}^m a_k^j \cos(s_k^j t) + b_k^j \sin(s_k^j t)$, and each $a_k^j$ and $b_k^j$ is an independent sample from N(0, σj²), where σj² is the overall variance of the kernel Kj. Moreover, the $s_k^j$ are independent samples from N(0, 1/(2πφj)), where φj is the length scale parameter of the kernel Kj. Notice that g^m(t, X) is a Gaussian process, since each g_j^m(t) is a sum of independent normally distributed random variables. It is known that, as m goes to infinity, the kernel of g_j^m converges to the kernel Kj. The above approximation can be applied to any stationary kernel and we refer the reader to [17] for details.

The inference algorithm for this scheme is practically the same, except for two small changes. The values l(A) in line 7 are easier to evaluate because we just need to know the values of the $a_k^j$ and $b_k^j$, and no matrix inversion is needed. In line 13 we just need to update all values $a_k^j$ and $b_k^j$; since they are independent variables, there is no need for matrix inversion.
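A minimal sketch of the random feature construction in equation (6) is given below (our own illustration, not the authors' code). Two details are assumptions made explicit in the comments: the coefficient variance is scaled by 1/m so that the approximate kernel has overall variance σj², the usual convention for random Fourier features, and the frequencies are drawn from the N(0, 1/(2πφj)) stated above.

```python
import numpy as np

def sample_rff_process(m, sigma2, phi, rng):
    """Draw one random-feature process g_j^m(t) = sum_k a_k cos(s_k t) + b_k sin(s_k t)
    approximating a squared exponential GP with overall variance sigma2 and length
    scale phi. Coefficient variance is sigma2/m (assumption: keeps the overall variance
    at sigma2); frequencies follow N(0, 1/(2*pi*phi)) as in the text."""
    a = rng.normal(0.0, np.sqrt(sigma2 / m), size=m)
    b = rng.normal(0.0, np.sqrt(sigma2 / m), size=m)
    s = rng.normal(0.0, np.sqrt(1.0 / (2.0 * np.pi * phi)), size=m)
    def g(t):
        t = np.atleast_1d(t)
        return np.cos(np.outer(t, s)) @ a + np.sin(np.outer(t, s)) @ b
    return g

# Approximate GP over time and covariates: g^m(t, X) = g_0^m(t) + sum_j X_j * g_j^m(t).
rng = np.random.default_rng(2)
d, m = 3, 50
gs = [sample_rff_process(m, sigma2=1.0, phi=1.0, rng=rng) for _ in range(d + 1)]

def g_m(t, X):
    return gs[0](t) + sum(X[j] * gs[j + 1](t) for j in range(d))

print(g_m(np.array([0.5, 1.0]), X=np.array([1.0, 0.0, 0.3])))
```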
5 Experiments

All the experiments are performed using our approximation scheme of equation (6) with a value of m = 50. Recall that for each Gaussian process we use a squared exponential kernel with overall variance σj² and length scale parameter φj. Hence, for a set of d covariates we have 2(d + 1) hyperparameters associated with the Gaussian processes. We follow a Bayesian approach and place a log-normal prior on the length scale parameters φj and a gamma prior on the variances σj² (an inverse gamma is also useful since it is conjugate). We use the elliptical slice sampler [16] to jointly update the set of coefficients {a_k^j, b_k^j} and the length scale parameters (a sketch of one such update is shown below, after the description of the baseline models).

With respect to the baseline hazard we consider two models. For the first, we choose the baseline hazard 2βt^(α−1) of a Weibull random variable. Following a Bayesian approach, we place a gamma prior on β and a uniform U(0, 2.3) prior on α. Notice that the posterior distribution of β is conjugate and thus we can easily sample from it. For α, we use a Metropolis step to sample from its posterior. Additionally, observe that for the prior distribution of α we constrain the support to (0, 2.3); the reason is that the expected size of the set G increases with α, which slows down computations. The second alternative is to choose the baseline hazard as λ0(t) = 2Ω, with a gamma prior on the parameter Ω. The posterior distribution of Ω is also gamma. We refer to these models as the Weibull model (W-SGP) and the Exponential model (E-SGP), respectively. The implementation of both models is exactly the same as in Algorithm 1 and uses the same hyperparameters described before. As the tuning of initial parameters can be hard, we use the maximum likelihood estimator for the initial parameters of the model.
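The sketch below shows one elliptical slice sampling update [16], as we use it for the stacked vector of Fourier coefficients {a_k^j, b_k^j}. It is a generic implementation of the sampler for a parameter with a zero-mean Gaussian prior, with log_lik standing in for the log-likelihood of the thinned process (the exact likelihood assembly is omitted here).

```python
import numpy as np

def elliptical_slice_update(f_current, prior_draw, log_lik, rng=np.random.default_rng(3)):
    """One elliptical slice sampling step [16] for a vector f with a zero-mean
    Gaussian prior. `prior_draw` is a fresh sample from that prior and `log_lik`
    returns the log-likelihood of the model given f."""
    nu = prior_draw
    log_u = log_lik(f_current) + np.log(rng.uniform())     # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f_current * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_u:
            return f_new
        # Shrink the bracket towards theta = 0 and try again.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy usage: coefficients with N(0, 1) prior and a hypothetical Gaussian log-likelihood.
rng = np.random.default_rng(3)
f = np.zeros(10)
f = elliptical_slice_update(f, rng.normal(size=10),
                            log_lik=lambda v: -0.5 * np.sum((v - 1.0) ** 2), rng=rng)
```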
5.1 Synthetic Data

In this section we present experiments performed on synthetic data. We carry out the experiment proposed in [4] for crossing data. We simulate n = 25, 50, 100 and 150 points from each of the following densities: p0(t) = N(3, 0.8²) and p1(t) = 0.4N(4, 1) + 0.6N(2, 0.8²), restricted to R⁺. The data contain the sampled points and a covariate indicating whether each point was sampled from the p.d.f. p0 or p1. Additionally, to each data point we add 3 noisy covariates taking random values in the interval [0, 1].

We report the estimates of the survival functions for the Weibull model in figure 1, while the results for the Exponential model are given in the supplemental material. It is clear that for the clean data (without the extra noisy covariates), the more data the better the estimate. In particular, the model perfectly detects the crossing of the survival functions. For the noisy data we can see that, with few data points, the noise affects the precision of our estimates in both models. Nevertheless, the more points, the more precise our estimates of the survival curves become. With 150 points, each group seems to be centred on the corresponding true survival function, independently of the noisy covariates. We finally remark that for the W-SGP and E-SGP models, the priors on the hazards are centred at a Weibull and an Exponential hazard, respectively. Since the synthetic data do not come from those distributions, it is harder to approximate the true survival function with few data. Indeed, we observe that our models have problems estimating the survival functions for times close to zero.

[Figure 1: Weibull model. First row: clean data; second row: data with noisy covariates. Columns correspond to 25, 50, 100 and 150 data points per group, increasing from left to right. Dots indicate data generated from p0, crosses data generated from p1. In the first row a credibility interval is shown. In the second row a curve is shown for each combination of the noisy covariates.]

5.2 Real data experiments

To compare our models we use the so-called concordance index. The concordance index is a standard measure in survival analysis which estimates how good the model is at ranking survival times. We consider a set of survival times with their respective censoring indicators and sets of covariates (T1, δ1, X1), . . . , (Tn, δn, Xn). In this particular context, we consider only right censoring.

To compute the C-index, consider all possible pairs (Ti, δi, Xi; Tj, δj, Xj) for i ≠ j. We call a pair admissible if it can be ordered. If both survival times are right-censored, i.e. δi = δj = 0, it is impossible to order them; we have the same problem if the smaller of the survival times in a pair is censored, i.e. Ti < Tj and δi = 0. All the other cases in this context are called admissible. Given just the covariates Xi, Xj and the statuses δi, δj, the model has to predict whether Ti < Tj or the other way around. We compute the C-index as the number of pairs correctly ordered by the model, given the covariates, divided by the number of admissible pairs. A larger C-index indicates that the model is better at predicting which patient dies first by observing the covariates. A C-index close to 0.5 means the predictions made by the model are close to random.
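As a reference implementation (ours, not the paper's code), the following snippet computes the C-index exactly as described above, counting admissible pairs under right censoring.

```python
import numpy as np

def concordance_index(times, events, risk):
    """C-index for right-censored data. `events[i]` is 1 if T_i is observed and 0 if
    right-censored; `risk[i]` is the predicted risk (higher = expected to die earlier).
    A pair is admissible only if the smaller time corresponds to an observed event."""
    concordant, admissible = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                admissible += 1.0
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5       # ties in predicted risk count as half
    return concordant / admissible

# Example: three patients, the last one right-censored.
print(concordance_index(np.array([2.0, 5.0, 7.0]), np.array([1, 1, 0]), np.array([0.9, 0.4, 0.1])))
```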
We run experiments on the Veteran data, available in the R package survival [19]. Veteran consists of a randomised trial of two treatment regimes for lung cancer. It has 137 samples and 5 covariates: treatment, indicating the type of treatment of the patients; their age; the Karnofsky performance score; an indicator for prior treatment; and months from diagnosis. It contains 9 censored times, corresponding to right censoring.

In the experiment we run our Weibull model (W-SGP) and Exponential model (E-SGP), as well as ANOVA-DDP, Cox proportional hazards and random survival forests. We perform 10-fold cross validation and compute the C-index for each fold. Figure 2 reports the results.

For this dataset the only significant variable corresponds to the Karnofsky performance score; in particular, as the value of this covariate increases, we expect an improved survival time. All the studied models achieve this behaviour and suggest a proportional relation between the hazards. This is observable in the C-index boxplot, where we see good results for all models under proportional hazard rates. Nevertheless, our method detects some differences between the treatments when the Karnofsky performance score is 90, as can be seen in figure 2. For the other competing models we observe an overall good result.

[Figure 2: Left: C-index for ANOVA-DDP, Cox, E-SGP, RSF and W-SGP. Middle: survival curves obtained for the combinations of scores 30 and 90 with treatments 1 (standard) and 2 (test), for Cox (step), ANOVA-DDP and E-SGP (smooth). Right: survival curves, using W-SGP, across all scores for fixed treatment 1, diagnosis time 5 months, age 38 and no prior therapy. (Best viewed in colour.)]

[Figure 3: Survival curves across all scores for fixed treatment 1, diagnosis time 5 months, age 38 and no prior therapy. Left: ANOVA-DDP; Middle: Cox proportional hazards; Right: random survival forests.]
In the case of ANOVA-DDP we observe the lowest C-index. In figure 3 we see that ANOVA-DDP seems to overestimate the survival function for lower scores. Arguably, our survival curves are also more visually pleasing than those of Cox proportional hazards and random survival forests.

6 Discussion

We introduced a Bayesian semiparametric model for survival analysis. Our model is able to deal with censoring and covariates. It can incorporate a parametric part, in which an expert can include their knowledge via the baseline hazard, while at the same time the nonparametric part allows the model to be flexible. Future work includes developing a method to choose the initial parameters, in order to avoid sensitivity problems at the beginning of the run. The construction of kernels that can be interpreted by an expert is also desirable. Finally, even though the random feature approximation is a good approach and allowed us to run our algorithm on large datasets, it is still not sufficient for datasets with a massive number of covariates, especially if we consider a large number of interactions between covariates.

Acknowledgments

YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. Tamara Fernández and Nicolás Rivera were supported by funding from Becas CHILE.

References

[1] Ryan Prescott Adams, Iain Murray, and David J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9-16. ACM, 2009.

[2] James E. Barrett and Anthony C. C. Coolen. Gaussian process regression for survival data with competing risks. arXiv preprint arXiv:1312.1591, 2013.

[3] D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 34(2):187-220, 1972.

[4] Maria De Iorio, Wesley O. Johnson, Peter Müller, and Gary L. Rosner. Bayesian nonparametric nonproportional hazards survival modeling. Biometrics, 65(3):762-771, 2009.

[5] Kjell Doksum. Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability, pages 183-201, 1974.

[6] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226-234, 2011.

[7] R. L. Dykstra and Purushottam Laud. A Bayesian nonparametric approach to reliability. The Annals of Statistics, pages 356-367, 1981.

[8] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209-230, 1973.

[9] Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. Bayesian Nonparametrics, volume 28. Cambridge University Press, 2010.

[10] Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. The Annals of Applied Statistics, pages 841-860, 2008.

[11] Heikki Joensuu, Peter Reichardt, Mikael Eriksson, Kirsten Sundby Hall, and Aki Vehtari. Gastrointestinal stromal tumor: a method for optimizing the timing of CT scans in the follow-up of cancer patients. Radiology, 271(1):96-106, 2013.

[12] Heikki Joensuu, Aki Vehtari, Jaakko Riihimäki, Toshirou Nishida, Sonja E. Steigen, Peter Brabec, Lukas Plank, Bengt Nilsson, Claudia Cirilli, Chiara Braconi, et al. Risk of recurrence of gastrointestinal stromal tumour after surgery: an analysis of pooled population-based cohorts. The Lancet Oncology, 13(3):265-274, 2012.
[13] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457-481, 1958.

[14] Sara Martino, Rupali Akerkar, and Håvard Rue. Approximate Bayesian inference for survival models. Scandinavian Journal of Statistics, 38(3):514-528, 2011.

[15] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732-1740, 2010.

[16] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. In AISTATS, volume 13, pages 541-548, 2010.

[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.

[18] Vinayak Rao and Yee W. Teh. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pages 2474-2482, 2011.

[19] Terry M. Therneau and Thomas Lumley. Package 'survival', 2015.

[20] Stephen Walker and Pietro Muliere. Beta-Stacy processes and a generalization of the Pólya-urn scheme. The Annals of Statistics, pages 1762-1780, 1997.