{"title": "Gaussian Processes for Survival Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 5021, "page_last": 5029, "abstract": "We introduce a semi-parametric Bayesian model for survival analysis. The model is centred on a parametric baseline hazard, and uses a Gaussian process to model variations away from it nonparametrically, as well as dependence on covariates. As opposed to many other methods in survival analysis, our framework does not impose unnecessary constraints on the hazard rate or on the survival function. Furthermore, our model handles left, right and interval censoring mechanisms common in survival analysis. We propose an MCMC algorithm to perform inference and an approximation scheme based on random Fourier features to make computations faster. We report experimental results on synthetic and real data, showing that our model performs better than competing models such as Cox proportional hazards, ANOVA-DDP and random survival forests.", "full_text": "Gaussian Processes for Survival Analysis\n\nTamara Fernández\nDepartment of Statistics,\nUniversity of Oxford.\nOxford, UK.\nfernandez@stats.ox.ac.uk\n\nNicolás Rivera\nDepartment of Informatics,\nKing’s College London.\nLondon, UK.\nnicolas.rivera@kcl.ac.uk\n\nYee Whye Teh\nDepartment of Statistics,\nUniversity of Oxford.\nOxford, UK.\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nWe introduce a semi-parametric Bayesian model for survival analysis. The model is centred on a parametric baseline hazard, and uses a Gaussian process to model variations away from it nonparametrically, as well as dependence on covariates. As opposed to many other methods in survival analysis, our framework does not impose unnecessary constraints on the hazard rate or on the survival function. Furthermore, our model handles left, right and interval censoring mechanisms common in survival analysis. 
We propose an MCMC algorithm to perform inference and an approximation scheme based on random Fourier features to make computations faster. We report experimental results on synthetic and real data, showing that our model performs better than competing models such as Cox proportional hazards, ANOVA-DDP and random survival forests.\n\n1 Introduction\n\nSurvival analysis is a branch of statistics focused on the study of time-to-event data, usually called survival times. This type of data appears in a wide range of applications such as failure times in mechanical systems, death times of patients in a clinical trial or duration of unemployment in a population. One of the main objectives of survival analysis is the estimation of the so-called survival function and the hazard function. If a random variable has density function f and cumulative distribution function F, then its survival function S is 1 − F, and its hazard λ is f/S. While the survival function S(t) gives us the probability a patient survives up to time t, the hazard function λ(t) is the instantaneous risk of death given that the patient has survived until t.\nDue to the nature of the studies in survival analysis, the data contain several aspects that make inference and prediction hard. One important characteristic of survival data is the presence of many covariates. Another distinctive flavour of survival data is the presence of censoring. A survival time is censored when it is not fully observable but we have an upper or lower bound on it. For instance, this happens in clinical trials when a patient drops out of the study.\nThere are many methods for modelling this type of data. Arguably, the most popular is the Kaplan-Meier estimator [13]. The Kaplan-Meier estimator is a very simple, nonparametric estimator of the survival function. It is very flexible and easy to compute; it handles censored times and requires no prior knowledge of the nature of the data. 
Nevertheless, it cannot handle covariates naturally and no prior knowledge can be incorporated. A well-known method that incorporates covariates is the Cox proportional hazards model [3]. Although this method is very popular and useful in applications, a drawback is that it imposes the strong assumption that the hazard curves are proportional and non-crossing, which is very unlikely for some data sets.\nThere is a vast literature on Bayesian nonparametric methods for survival analysis [9]. Some examples include the so-called neutral-to-the-right priors [5], which model survival curves as e^{−µ̃((0,t])}, where µ̃ is a completely random measure on R+. Two common choices for µ̃ are the Dirichlet process\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n[8] and the beta-Stacy process [20], the latter being a bit more tractable due to its conjugacy. Other alternatives place a prior on the hazard function; one example of this is the extended gamma process [7]. The weakness of the above methods is that there is no natural or direct way to incorporate covariates, and thus they have not been extensively used by practitioners of survival analysis. More recently, [4] developed a new model called ANOVA-DDP which mixes ideas from ANOVA and Dirichlet processes. This method successfully incorporates covariates without imposing strong constraints, though it is not clear how to incorporate expert knowledge. Within the context of Gaussian processes, a few models have been considered, for instance [14] and [12]. Nevertheless, these models fail to go beyond the proportional hazards assumption, and going beyond it is one of the aims of this work. Another option is [11], which describes a survival model with non-proportional hazards and time-dependent covariates. Recently, we became aware of the work of [2], which uses a so-called accelerated failure times model. 
Here, the dependence of the failure times on covariates is modelled by rescaling time, with the rescaling factor modelled as a function of covariates with a Gaussian process prior. This model is different from our proposal, and is more complex to study and to work with.\nLastly, another well-known method is the Random Survival Forest [10]. This can be seen as a generalisation of the Kaplan-Meier estimator to several covariates. It is fast and flexible; nevertheless, it cannot incorporate expert knowledge and lacks the interpretability which is fundamental in survival analysis.\nIn this paper we introduce a new semiparametric Bayesian model for survival analysis. Our model is able to handle censoring and covariates. Our approach models the hazard function as the product of a parametric baseline hazard and a nonparametric part. The parametric part of our model allows the inclusion of expert knowledge and provides interpretability, while the nonparametric part allows us to handle covariates and to amend incorrect or incomplete prior knowledge. The nonparametric part is given by a non-negative function of a Gaussian process on R+.\nGiven the hazard function λ of a random variable T, we sample from it by simulating the first jump of a Poisson process with intensity λ. In our case, the intensity of the Poisson process is a function of a Gaussian process, obtaining what is called a Gaussian Cox process. One of the main difficulties of working with Gaussian Cox processes is the problem of learning the ‘true’ intensity given the data because, in general, it is impossible to sample the whole path of a Gaussian process. Nevertheless, exact inference was proved to be tractable by [1]. Indeed, the authors developed an algorithm by exploiting a trick which allows them to perform inference by sampling the Gaussian process at just a finite number of points rather than along its whole path.\nIn this paper, we study basic properties of our prior. 
We also provide an inference algorithm based on a sampler proposed by [18], which is a refined version of the algorithm presented in [1]. To make the algorithm scale, we introduce a random Fourier feature approximation of the Gaussian process and supply the corresponding inference algorithm. We demonstrate the performance of our method experimentally by using synthetic and real data.\n\n2 Model\n\nConsider a continuous random variable T on R+ = [0, ∞), with density function f and cumulative distribution function F. Associated with T, we have the survival function S = 1 − F and the hazard function λ = f/S. The survival function S(t) gives us the probability a patient survives up to time t, while the hazard function λ(t) gives us the instantaneous risk of a patient at time t.\nWe define a Gaussian process prior over the hazard function λ. In particular, we choose λ(t) = λ0(t)σ(l(t)), where λ0(t) is a baseline hazard function, l(t) is a centred stationary Gaussian process with covariance function κ, and σ is a positive link function. For our implementation, we choose σ as the sigmoidal function σ(x) = (1 + e^{−x})^{−1}, which is a quite standard choice in applications. In this way, we generate T as the first jump of the Poisson process with intensity λ, i.e. T has density λ(t) e^{−∫_0^t λ(s) ds}. Our model for a data set of i.i.d. Ti, without covariates, is\n\nl(·) ∼ GP(0, κ),   λ(t) | l, λ0(t) = λ0(t) σ(l(t)),   Ti | λ ∼iid λ(Ti) e^{−∫_0^{Ti} λ(s) ds},   (1)\n\nwhich can be interpreted as a baseline hazard with multiplicative nonparametric noise. This is an attractive feature, as an expert may choose a particular hazard function and then the nonparametric noise amends an incomplete or incorrect prior knowledge. 
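As a concrete illustration of this generative view, the following sketch (an illustration added here, not the paper's code) samples T as the first accepted point of a thinned Poisson process, using a constant baseline hazard λ0(t) = 2Ω and a fixed function `l` standing in for a Gaussian process draw:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_survival_time(Omega, l, rng):
    """Draw T with hazard lambda(t) = 2*Omega*sigmoid(l(t)) by thinning:
    propose arrivals of a rate-2*Omega Poisson process and accept each
    proposal t with probability sigmoid(l(t))."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / (2.0 * Omega))  # next proposal arrival
        if rng.uniform() < sigmoid(l(t)):          # accept w.p. sigma(l(t))
            return t

rng = np.random.default_rng(0)
# With l identically 0 we have sigma(l(t)) = 1/2, so the effective hazard
# is Omega and T is exponential with mean 1/Omega.
samples = [sample_survival_time(1.0, lambda t: 0.0, rng) for _ in range(20000)]
```

Replacing the constant `l` with an actual GP draw turns the same loop into a sampler from the prior of model (1).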
The incorporation of covariates is discussed later in this section, while censoring is discussed in section 3.\nNotice that E(σ(X)) = 1/2 for a zero-mean Gaussian random variable X. Then, as we are working with a centred Gaussian process, it holds that E(λ(t)) = λ0(t) E(σ(l(t))) = λ0(t)/2. Hence, we can imagine our model as a random hazard centred at λ0(t)/2 with multiplicative noise. In the simplest scenario, we may take a constant baseline hazard λ0(t) = 2Ω with Ω > 0. In that case, we obtain a random hazard centred at Ω, which is simply the hazard function of an exponential random variable with mean 1/Ω. Another choice might be λ0(t) = 2βt^{α−1}, which determines a random hazard function centred at βt^{α−1}, the hazard function of the Weibull distribution, a popular default distribution in survival analysis.\nIn addition to the hierarchical model in (1), we include hyperparameters for the kernel κ and for the baseline hazard λ0(t). In particular, for the kernel it is common to include a length scale parameter and an overall variance.\nFinally, we need to ensure the model we propose defines a well-defined survival function, i.e. S(t) → 0 as t tends to infinity. This is not trivial, as our random survival function is generated by a Gaussian process. The next proposition, proved in the supplemental material, states that under suitable regularity conditions the prior defines proper survival functions.\nProposition 1. Let (l(t))_{t≥0} ∼ GP(0, κ) be a stationary continuous Gaussian process. Suppose that κ(s) is non-increasing and that lim_{s→∞} κ(s) = 0. Moreover, assume there exist K > 0 and α > 0 such that λ0(t) ≥ K t^{α−1} for all t ≥ 1. 
Let S(t) be the random survival function associated with (l(t))_{t≥0}; then lim_{t→∞} S(t) = 0 with probability 1.\n\nNote the above proposition is satisfied by the hazard functions of the Exponential and Weibull distributions.\n\n2.1 Adding covariates\n\nWe model the relation between time and covariates through the kernel of the Gaussian process prior. A simple way to generate kernels in time and covariates is to construct kernels for each covariate and for time, and then perform basic operations on them, e.g. addition or multiplication. Let (t, X) denote a time t with covariates X ∈ R^d. Then for pairs (t, X) and (s, Y) we can construct kernels like\n\nK̂((t, X), (s, Y)) = K̂_0(t, s) + Σ_{j=1}^d K̂_j(X_j, Y_j),   (2)\n\nor the following kernel, which is the one we use in our experiments,\n\nK((t, X), (s, Y)) = K_0(t, s) + Σ_{j=1}^d X_j Y_j K_j(t, s).\n\nObserve that the first kernel establishes an additive relation between time and covariates while the second creates an interaction between the value of the covariates and time. More complicated structures that include more interaction between covariates can be considered. We refer to the work of [6] for details about the construction and interpretation of the operations between kernels. Observe the new kernel produces a Gaussian process from the space of time and covariates to the real line, i.e. it has to be evaluated at a pair of a time and covariates.\nThe new model to generate Ti, assuming we are given the covariates Xi, is\n\nl(·) ∼ GP(0, K),   λ_i(t) | l, λ0(t), X_i = λ0(t) σ(l(t, X_i)),   T_i | λ_i ∼indep λ_i(T_i) e^{−∫_0^{T_i} λ_i(s) ds}.   (3)\n\nIn our construction of the kernel K, we choose all kernels K_j as stationary kernels (e.g. squared exponential), so that K is stationary with respect to time, and proposition 1 is valid for each fixed covariate X, i.e. given a fixed covariate X, we have S_X(t) = P(T > t | X) → 0 as t → ∞.\n\n3 Inference\n\n3.1 Data augmentation scheme\n\nNotice that the likelihood of the model in equation (3) has to deal with terms of the form λ_i(t) e^{−∫_0^t λ_i(s) ds}, as these expressions come from the density of the first jump of a non-homogeneous Poisson process with intensity λ_i. In general the integral is not analytically tractable, since λ_i is defined by a Gaussian process. A numerical scheme can be used, but it is approximate and computationally expensive. Following [1] and [18], we develop a data augmentation scheme based on thinning a Poisson process that allows us to efficiently avoid a numerical method.\nIf we want to sample a time T with covariate X, as given in equation (3), we can use the following generative process. Simulate a sequence of points g_1, g_2, . . . distributed according to a Poisson process with intensity λ0(t). We assume the user chooses a well-known parametric form, so that sampling the points g_1, g_2, . . . is tractable (in the Weibull case this can be done easily). Starting from k = 1, we accept the point g_k with probability σ(l(g_k, X)). If it is accepted we set T = g_k; otherwise we try the point g_{k+1} and repeat. We denote by G the set of rejected points, i.e. if we accepted g_k, then G = {g_1, . . . , g_{k−1}}. Note the above sampling procedure only needs to evaluate the Gaussian process at the points (g_k, X) instead of over the whole space.\nFollowing the above scheme to sample T, the following proposition can be shown.\n\nProposition 2. 
Let Λ0(t) = ∫_0^t λ0(s) ds. Then\n\np(G, T | λ0, l) = ( λ0(T) Π_{g∈G} λ0(g) ) e^{−Λ0(T)} ( σ(l(T)) Π_{g∈G} (1 − σ(l(g))) ).   (4)\n\nProof sketch. Consider a Poisson process on [0, ∞) with intensity λ0(t). Then, the first term on the RHS of equation (4) is the density of putting points exactly in G ∪ {T}. The second term is the probability of putting no points in [0, T] \\ (G ∪ {T}), i.e. e^{−Λ0(T)}. The second term is independent of the first one. The last term comes from the acceptance/rejection part of the process. The points g ∈ G are rejected with probability 1 − σ(l(g)), while the point T is accepted with probability σ(l(T)). Since the acceptance/rejection of points is independent of the Poisson process we get the result.\n\nUsing the above proposition, the model of equation (1) can be reformulated as the following tractable generative model:\n\nl(·) ∼ GP(0, K),   (G, T) | λ0(t), l(t) ∼ e^{−Λ0(T)} (σ(l(T)) λ0(T)) Π_{g∈G} (1 − σ(l(g))) λ0(g).   (5)\n\nOur model states a joint distribution for the pair (G, T), where G is the set of rejected jump points of the thinned Poisson process and T is the first accepted one.\nTo perform inference we need data (Gi, Ti, Xi), whereas we only observe points (Ti, Xi). Thus, we need to sample the missing data Gi given (Ti, Xi). The next proposition gives us a way to do this.\nProposition 3. [18] Let T be a data point with covariate X and let G be its set of rejected points. Then the distribution of G given (T, X, λ0, l) is a non-homogeneous Poisson process with intensity λ0(t)(1 − σ(l(t, X))) on the interval [0, T].\n\n3.2 Inference algorithm\n\nThe above data augmentation scheme suggests the following inference algorithm. For each data point (Ti, Xi), sample Gi | (Ti, Xi, λ0, l), then sample l | ((Gi, Ti, Xi)_{i=1}^n, λ0), where n is the number of data points. Observe that the sampling of l given ((Gi, Ti, Xi)_{i=1}^n, λ0) can be seen as a Gaussian process binary classification problem, where the points Gi and Ti represent two different classes. A variety of MCMC techniques can be used to sample l; see [15] for details.\nFor our algorithm we use the following notation. We denote the dataset as (Ti, Xi)_{i=1}^n. The set Gi refers to the set of rejected points of Ti. We denote G = ∪_{i=1}^n Gi and T = {T1, . . . , Tn} for the whole sets of rejected and accepted points, respectively. For a point t ∈ Gi ∪ {Ti} we write l(t) instead of l(t, Xi), but remember that each point has an associated covariate. For a set of points A we denote l(A) = {l(a) : a ∈ A}. Also, Λ0(t) refers to ∫_0^t λ0(s) ds and Λ0^{−1} denotes its inverse function (which exists since Λ0(t) is increasing). Finally, N denotes the number of iterations we are going to run our algorithm. The pseudo code of our algorithm is given in Algorithm 1.\nLines 2 to 11 sample the set of rejected points Gi for each survival time Ti. In particular, lines 3 to 5 use the Mapping theorem, which tells us how to map a homogeneous Poisson process into a non-homogeneous one with the appropriate intensity. 
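As an illustrative sketch (ours, not the paper's code), Proposition 3 combined with the Mapping theorem can be written as follows for the Weibull-type baseline λ0(t) = 2βt^{α−1}, for which Λ0(t) = (2β/α)t^α has the closed-form inverse Λ0^{−1}(u) = (αu/(2β))^{1/α}; `l` again stands in for the Gaussian process:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_rejected_points(T, Lambda0, Lambda0_inv, l, rng):
    """Proposition 3: G | T is a Poisson process with intensity
    lambda0(t) * (1 - sigmoid(l(t))) on [0, T].  First draw a process
    with intensity lambda0 via the Mapping theorem (uniforms on
    [0, Lambda0(T)] pushed through Lambda0^{-1}), then thin it."""
    n = rng.poisson(Lambda0(T))               # number of candidate points
    C = rng.uniform(0.0, Lambda0(T), size=n)  # uniforms on [0, Lambda0(T)]
    A = Lambda0_inv(C)                        # candidates ~ PP(lambda0) on [0, T]
    U = rng.uniform(size=n)
    return A[U < 1.0 - sigmoid(l(A))]         # keep the rejected locations

# Weibull-type baseline lambda0(t) = 2*beta*t^(alpha-1):
alpha, beta = 1.5, 1.0
Lambda0 = lambda t: (2.0 * beta / alpha) * t ** alpha
Lambda0_inv = lambda u: (alpha * u / (2.0 * beta)) ** (1.0 / alpha)
rng = np.random.default_rng(1)
G = sample_rejected_points(3.0, Lambda0, Lambda0_inv, lambda t: 0.0 * t, rng)
```

With l ≡ 0 the thinning probability is 1/2, so the expected number of rejected points on [0, T] is Λ0(T)/2.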
Observe it makes use of the function Λ0 and its inverse function, which must be provided or be easily computable.\n\nAlgorithm 1: Inference Algorithm.\nInput: Set of times T, the Gaussian process l instantiated at T, and other initial parameters\n1 for q = 1:N do\n2   for i = 1:n do\n3     n_i ∼ Poisson(1; Λ0(T_i));\n4     C̃_i ∼ U(n_i; 0, Λ0(T_i));\n5     Set A_i = Λ0^{−1}(C̃_i);\n6   Set A = ∪_{i=1}^n A_i\n7   Sample l(A) | l(G ∪ T), λ0\n8   for i = 1:n do\n9     U_i ∼ U(n_i; 0, 1)\n10    Set G_i = {a ∈ A_i such that U_i < 1 − σ(l(a))}\n11  Set G = ∪_{i=1}^n G_i\n12  Update parameters of λ0(t)\n13  Update l(G ∪ T) and the hyperparameters of the kernel.\n\nThe following lines classify the points drawn from the Poisson process with intensity λ0 into the sets Gi as in proposition 3. Line 7 is used to sample the Gaussian process at the set of points A given the values on the current set G ∪ T. Observe that at the beginning of the algorithm, we have G = ∅.\n\n3.3 Adding censoring\n\nUsually, in survival analysis, we encounter three types of censoring: right, left and interval censoring. We assume each data point Ti is associated with an (observable) indicator δi, denoting the type of censoring or whether the time is not censored. We describe how the algorithm described before can easily handle any type of censoring.\nRight censoring: In the presence of right censoring, the likelihood for a survival time Ti is S(Ti). The related event in terms of the rejected points corresponds to not accepting any location in [0, Ti). Hence, we can treat right censoring in the same way as the uncensored case, by just sampling from the distribution of the rejected jump times prior to Ti. In this case, Ti is not an accepted location, i.e. 
Ti is not considered in the set T in lines 7 and 13.\nLeft censoring: In this set-up, we know the survival time is at most Ti; then the likelihood of such a time is F(Ti). Treating this type of censoring is slightly more difficult than the previous case because the event is more complex. We ask for accepting at least one jump time prior to Ti, which might lead us to have a larger set of latent variables. In order to avoid this, we proceed by imputing the ‘true’ survival time T′_i by using its truncated distribution on [0, Ti]. Then we proceed using T′_i (uncensored) instead of Ti. We can sample T′_i as follows: we sample the first point of a Poisson process with the current intensity λ; if such a point is after Ti we reject it and repeat the process until we get one. The imputation step has to be repeated at the beginning of each iteration.\nInterval censoring: If we know that the survival time lies in the interval I = [Si, Ti], we can deal with interval censoring in the same way as left censoring, but imputing the survival time T′_i in I.\n\n4 Approximation scheme\n\nAs shown in Algorithm 1, in line 7 we need to sample the Gaussian process (l(t))_{t≥0} at the set of points A from its conditional distribution, while in line 13, we have to update (l(t))_{t≥0} on the set G ∪ T. Both lines require matrix inversion, which scales badly for massive datasets or for data T that generates a large set G. 
To make inference scale, we use a random feature approximation of the kernel [17].\nWe exemplify the idea on the kernel we use in our experiments, which is given by K((t, X), (s, Y)) = K_0(t, s) + Σ_{j=1}^d X_j Y_j K_j(t, s), where each K_j is a squared exponential kernel with overall variance σ_j^2 and length scale parameter φ_j. Hence, for m ≥ 0, the approximation of our Gaussian process is given by\n\ng^m(t, X) = g_0^m(t) + Σ_{j=1}^d X_j g_j^m(t),   (6)\n\nwhere each g_j^m(t) = Σ_{k=1}^m a_k^j cos(s_k^j t) + b_k^j sin(s_k^j t), and each a_k^j and b_k^j are independent samples of N(0, σ_j^2), where σ_j^2 is the overall variance of the kernel K_j. Moreover, the s_k^j are independent samples of N(0, 1/(2πφ_j)), where φ_j is the length scale parameter of the kernel K_j. Notice that g^m(t, X) is a Gaussian process, since each g_j^m(t) is a sum of independent normally distributed random variables. It is known that as m goes to infinity, the kernel of g^m(t, X) approximates the kernel K. The above approximation can be done for any stationary kernel and we refer the reader to [17] for details.\nThe inference algorithm for this scheme is practically the same, except for two small changes. The values l(A) in line 7 are easier to evaluate because we just need to know the values of the a_k^j and b_k^j, and no matrix inversion is needed. In line 13 we just need to update all values a_k^j and b_k^j. Since they are independent variables there is no need for matrix inversion.\n\n5 Experiments\n\nAll the experiments are performed using our approximation scheme of equation (6) with a value of m = 50. Recall that for each Gaussian process, we used a squared exponential kernel with overall variance σ_j^2 and length scale parameter φ_j. 
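The random Fourier feature construction behind equation (6) can be sketched as follows (our illustration, not the paper's implementation; we use the standard scaling in which the coefficients have variance σ²/m and the frequencies are drawn from the spectral density N(0, 1/φ²) of a squared exponential kernel with length scale φ — the paper's parametrization of these variances may differ):

```python
import numpy as np

def rff_draw(ts, m, var, length, rng):
    """One draw of the random Fourier feature process
    g(t) = sum_k a_k cos(s_k t) + b_k sin(s_k t), whose covariance
    approaches var * exp(-(t - s)^2 / (2 * length^2)) as m grows."""
    s = rng.normal(0.0, 1.0 / length, size=m)      # frequencies
    a = rng.normal(0.0, np.sqrt(var / m), size=m)  # cosine coefficients
    b = rng.normal(0.0, np.sqrt(var / m), size=m)  # sine coefficients
    return np.cos(np.outer(ts, s)) @ a + np.sin(np.outer(ts, s)) @ b

rng = np.random.default_rng(3)
ts = np.array([0.0, 1.0])
draws = np.array([rff_draw(ts, 100, 1.0, 1.0, rng) for _ in range(4000)])
empirical_cov = np.cov(draws.T)  # should be close to the SE kernel matrix
```

Updating such a draw only means updating the 2m coefficients per kernel, which is why no matrix inversion is needed in lines 7 and 13.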
Hence, for a set of d covariates we have a set of 2(d + 1) hyperparameters associated with the Gaussian processes. In particular, we follow a Bayesian approach and place a log-Normal prior on the length scale parameter φ_j, and a gamma prior (an inverse gamma is also useful since it is conjugate) on the variance σ_j^2. We use the elliptical slice sampler [16] for jointly updating the set of coefficients {a_k^j, b_k^j} and the length-scale parameters.\nWith respect to the baseline hazard we consider two models. For the first option, we choose the baseline hazard 2βt^{α−1} of a Weibull random variable. Following a Bayesian approach, we choose a gamma prior on β and a uniform U(0, 2.3) on α. Notice the posterior distribution for β is conjugate and thus we can easily sample from it. For α, we use a Metropolis step to sample from its posterior. Additionally, observe that for the prior distribution of α, we constrain the support to (0, 2.3). The reason for this is that the expected size of the set G increases with α, which slows down computations. The second alternative is to choose the baseline hazard as λ0(t) = 2Ω, with a gamma prior over the parameter Ω. The posterior distribution of Ω is also gamma. We refer to these models as the Weibull model (W-SGP) and the Exponential model (E-SGP), respectively.\nThe implementation of both models is exactly the same as in Algorithm 1 and uses the same hyperparameters described before. As the tuning of initial parameters can be hard, we use the maximum likelihood estimator to set the initial parameters of the model.\n\n5.1 Synthetic Data\n\nIn this section we present experiments made with synthetic data. Here we perform the experiment proposed in [4] for crossing data. 
We simulate n = 25, 50, 100 and 150 points from each of the following densities, p0(t) = N(3, 0.8^2) and p1(t) = 0.4 N(4, 1) + 0.6 N(2, 0.8^2), restricted to R+. The data contain the sample points and a covariate indicating whether each point was sampled from the p.d.f. p0 or p1. Additionally, to each data point, we add 3 noisy covariates taking random values in the interval [0, 1]. We report the estimates of the survival functions for the Weibull model in figure 1, while the results for the Exponential model are given in the supplemental material.\nIt is clear that for the clean data (without extra noisy covariates), the more data the better the estimation. In particular, the model perfectly detects the crossing of the survival functions. For the noisy data we can see that with few data points the noise seems to have an effect on the precision of our estimates in both models. Nevertheless, the more points, the more precise our estimates of the survival curves are. With 150 points, each group seems to be centred on the corresponding real survival function, independently of the noisy covariates.\nWe finally remark that for the W-SGP and E-SGP models, the priors of the hazards are centred at a Weibull and an Exponential hazard, respectively.\n\nFigure 1: Weibull Model. First row: clean data; second row: data with noisy covariates. Columns have 25, 50, 100 and 150 data points per group (shown on the X-axis), with data increasing from left to right. Dots indicate data generated from p0; crosses, from p1. In the first row a credibility interval is shown. In the second row a curve for each combination of noisy covariates is given.\n\nSince the synthetic data does not come from those distributions, it will be harder to approximate the true survival function with few data. 
Indeed, we observe our models have problems estimating the survival functions for times close to zero.\n\n5.2 Real data experiments\n\nTo compare our models we use the so-called concordance index. The concordance index is a standard measure in survival analysis which estimates how good the model is at ranking survival times. We consider a set of survival times with their respective censoring indicators and sets of covariates (T1, δ1, X1), . . . , (Tn, δn, Xn). In this particular context, we consider only right censoring.\nTo compute the C-index, consider all possible pairs (Ti, δi, Xi; Tj, δj, Xj) for i ≠ j. We call a pair admissible if it can be ordered. If both survival times are right-censored, i.e. δi = δj = 0, it is impossible to order them; we have the same problem if the smaller of the survival times in a pair is censored, i.e. Ti < Tj and δi = 0. All the other cases under this context are called admissible. Given just the covariates Xi, Xj and the statuses δi, δj, the model has to predict whether Ti < Tj or the other way around. We compute the C-index as the number of pairs which were correctly ordered by the model, given the covariates, over the number of admissible pairs. A larger C-index indicates the model is better at predicting which patient dies first by observing the covariates. If the C-index is close to 0.5, the predictions made by the model are close to random.\nWe run experiments on the Veteran data, available in the R package survival [19]. Veteran consists of a randomized trial of two treatment regimes for lung cancer. It has 137 samples and 5 covariates: treatment, indicating the type of treatment of the patients; their age; the Karnofsky performance score; an indicator of prior treatment; and months from diagnosis. 
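The admissible-pair counting described above can be written directly (a sketch we add for illustration; here `events[i] = 1` marks an observed time and 0 a right-censored one, `scores` are model risk scores where higher means expected earlier death, and the 0.5 credit for tied scores is a common convention, not something specified above):

```python
def concordance_index(times, events, scores):
    """C-index under right censoring: the fraction of admissible pairs
    ordered correctly by the risk scores.  A pair is admissible iff the
    smaller of the two times is an observed (uncensored) event."""
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:  # admissible pair
                den += 1
                if scores[i] > scores[j]:    # earlier death, higher risk
                    num += 1.0
                elif scores[i] == scores[j]:
                    num += 0.5               # tie convention (assumption)
    return num / den
```

For example, a model that ranks four uncensored patients in exactly the right order gets a C-index of 1.0, while a perfectly reversed ranking gets 0.0.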
It contains 9 censored times, corresponding to right censoring.\nIn the experiments we run our Weibull model (W-SGP), our Exponential model (E-SGP), ANOVA-DDP, Cox proportional hazards and Random Survival Forest. We perform 10-fold cross validation and compute the C-index for each fold. Figure 2 reports the results.\nFor this dataset the only significant variable corresponds to the Karnofsky performance score. In particular, as the value of this covariate increases, we expect an improved survival time. All the studied models achieve such behaviour and suggest a proportionality relation between the hazards. This is observable in the C-index boxplot, where we can observe good results for models with proportional hazard rates.\n\nFigure 2: Left: C-Index for ANOVA-DDP, COX, E-SGP, RSF, W-SGP; Middle: Survival curves obtained for the combinations of scores 30 and 90 and treatments 1 (standard) and 2 (test); Right: Survival curves, using W-SGP, across all scores for fixed treatment 1, diagnosis time 5 months, age 38 and no prior therapy. (Best viewed in colour)\n\nFigure 3: Survival curves across all scores for fixed treatment 1, diagnosis time 5 months, age 38 and no prior therapy. 
Left: ANOVA-DDP; Middle: Cox proportional hazards; Right: Random Survival Forest.\n\nNevertheless, our method detects some differences between the treatments when the Karnofsky performance score is 90, as can be seen in figure 2.\nFor the other competing models we observe overall good results. In the case of ANOVA-DDP we observe the lowest C-index. In figure 3 we see that ANOVA-DDP seems to be overestimating the survival function for lower scores. Arguably, our survival curves are also visually more pleasing than those of Cox proportional hazards and Random Survival Forest.\n\n6 Discussion\n\nWe introduced a Bayesian semiparametric model for survival analysis. Our model is able to deal with censoring and covariates. It can incorporate a parametric part, through which an expert can incorporate knowledge via the baseline hazard, while at the same time the nonparametric part allows the model to be flexible. Future work consists of creating a method to choose initial parameters, to avoid sensitivity problems at the beginning. The construction of kernels that can be interpreted by an expert is desirable as well. Finally, even though the random feature approximation is a good approach and helped us to run our algorithm on large datasets, it is still not sufficient for datasets with a massive number of covariates, especially if we consider a large number of interactions between covariates.\n\nAcknowledgments\n\nYWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. 
Tamara Fernández and Nicolás Rivera were supported by funding from Becas CHILE.

References

[1] Ryan Prescott Adams, Iain Murray, and David J.C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM, 2009.

[2] James E. Barrett and Anthony C.C. Coolen. Gaussian process regression for survival data with competing risks. arXiv preprint arXiv:1312.1591, 2013.

[3] D.R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 34(2):187–220, 1972.

[4] Maria De Iorio, Wesley O. Johnson, Peter Müller, and Gary L. Rosner. Bayesian nonparametric nonproportional hazards survival modeling. Biometrics, 65(3):762–771, 2009.

[5] Kjell Doksum. Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability, pages 183–201, 1974.

[6] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226–234, 2011.

[7] R.L. Dykstra and Purushottam Laud. A Bayesian nonparametric approach to reliability. The Annals of Statistics, pages 356–367, 1981.

[8] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.

[9] Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. Bayesian Nonparametrics, volume 28. 
Cambridge University Press, 2010.

[10] Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. The Annals of Applied Statistics, pages 841–860, 2008.

[11] Heikki Joensuu, Peter Reichardt, Mikael Eriksson, Kirsten Sundby Hall, and Aki Vehtari. Gastrointestinal stromal tumor: a method for optimizing the timing of CT scans in the follow-up of cancer patients. Radiology, 271(1):96–106, 2013.

[12] Heikki Joensuu, Aki Vehtari, Jaakko Riihimäki, Toshirou Nishida, Sonja E. Steigen, Peter Brabec, Lukas Plank, Bengt Nilsson, Claudia Cirilli, Chiara Braconi, et al. Risk of recurrence of gastrointestinal stromal tumour after surgery: an analysis of pooled population-based cohorts. The Lancet Oncology, 13(3):265–274, 2012.

[13] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.

[14] Sara Martino, Rupali Akerkar, and Håvard Rue. Approximate Bayesian inference for survival models. Scandinavian Journal of Statistics, 38(3):514–528, 2011.

[15] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.

[16] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical slice sampling. In AISTATS, volume 13, pages 541–548, 2010.

[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[18] Vinayak Rao and Yee W. Teh. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pages 2474–2482, 2011.

[19] Terry M. Therneau and Thomas Lumley. Package 'survival', 2015.

[20] Stephen Walker and Pietro Muliere. 
Beta-Stacy processes and a generalization of the Pólya-urn scheme. The Annals of Statistics, pages 1762–1780, 1997.