{"title": "Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks", "book": "Advances in Neural Information Processing Systems", "page_first": 2329, "page_last": 2337, "abstract": "Designing optimal treatment plans for patients with comorbidities requires accurate cause-specific mortality prognosis. Motivated by the recent availability of linked electronic health records, we develop a nonparametric Bayesian model for survival analysis with competing risks, which can be used for jointly assessing a patient's risk of multiple (competing) adverse outcomes. The model views a patient's survival times with respect to the competing risks as the outputs of a deep multi-task Gaussian process (DMGP), the inputs to which are the patients' covariates. Unlike parametric survival analysis methods based on Cox and Weibull models, our model uses DMGPs to capture complex non-linear interactions between the patients' covariates and cause-specific survival times, thereby learning flexible patient-specific and cause-specific survival curves, all in a data-driven fashion without explicit parametric assumptions on the hazard rates. We propose a variational inference algorithm that is capable of learning the model parameters from time-to-event data while handling right censoring. Experiments on synthetic and real data show that our model outperforms the state-of-the-art survival models.", "full_text": "Deep Multi-task Gaussian Processes for
Survival Analysis with Competing Risks

Ahmed M. Alaa
Electrical Engineering Department
University of California, Los Angeles

Mihaela van der Schaar
Department of Engineering Science
University of Oxford

Abstract

Designing optimal treatment plans for patients with comorbidities requires accurate cause-specific mortality prognosis.
Motivated by the recent availability of linked electronic health records, we develop a nonparametric Bayesian model for survival analysis with competing risks, which can be used for jointly assessing a patient's risk of multiple (competing) adverse outcomes. The model views a patient's survival times with respect to the competing risks as the outputs of a deep multi-task Gaussian process (DMGP), the inputs to which are the patients' covariates. Unlike parametric survival analysis methods based on Cox and Weibull models, our model uses DMGPs to capture complex non-linear interactions between the patients' covariates and cause-specific survival times, thereby learning flexible patient-specific and cause-specific survival curves, all in a data-driven fashion without explicit parametric assumptions on the hazard rates. We propose a variational inference algorithm that is capable of learning the model parameters from time-to-event data while handling right censoring. Experiments on synthetic and real data show that our model outperforms the state-of-the-art survival models.

1 Introduction

Designing optimal treatment plans for elderly patients or patients with comorbidities is a challenging problem: the nature (and the appropriate level of invasiveness) of the best therapeutic intervention for a patient with a specific clinical risk depends on whether this patient suffers from, or is susceptible to, other \"competing risks\" [1-3]. For instance, the decision on whether a diabetic patient who also has a renal disease should receive dialysis or a renal transplant must be based on a joint prognosis of diabetes-related complications and end-stage renal failure; overlooking the diabetes-related risks may lead to misguided therapeutic decisions [1].
The same problem arises in nephrology, where a typical patient's competing risks are peritonitis, death, kidney transplantation and transfer to haemodialysis [2]. An even more common encounter with competing risks arises in oncology and cardiovascular medicine, where the risk of a cardiac disease may alter the decision on whether a cancer patient should undergo chemotherapy or a particular type of surgery [3]. Since conventional methods for survival analysis, such as the Kaplan-Meier method and standard Cox proportional hazards regression, are not equipped to handle competing risks, alternative variants of those methods that rely on cumulative incidence estimators have been proposed and used in clinical research [1-7].

According to the most recent data brief by the Office of the National Coordinator (ONC)1, electronic health records (EHRs) are currently deployed in more than 75% of hospitals in the United States [8]. The increasing availability of data in EHRs has stimulated a great deal of research efforts that used machine learning to conduct clinical risk prognosis and survival analysis. In particular, various recent works have proposed novel methods for survival analysis based on Gaussian processes [9], \"temporal\" logistic regression [10], ranking [11], and deep neural networks [12]. All these works were restricted to the conventional survival analysis problem in which there is only one event of interest rather than a set of competing risks. (A detailed overview of previous works is provided in Section 3.)

1https://www.healthit.gov/sites/default/files/briefs/

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
The usage of machine learning to construct data-driven survival models for patients with comorbidities is an important step towards precision medicine [13].

Contribution In the light of the discussion above, we develop a nonparametric Bayesian model for survival analysis with competing risks using deep (multi-task) Gaussian processes (DMGPs) [15]. Our model relies on a novel conception of the competing risks problem as a multi-task learning problem; that is, we model the cause-specific survival times as the outputs of a random vector-valued function [14], the inputs to which are the patients' covariates. This allows us to learn a \"shared representation\" of the patients' survival times with respect to multiple related comorbidities. The proposed model is Bayesian: we assign a prior distribution over a space of vector-valued functions of the patients' covariates [16], and update the posterior distribution given a (potentially right-censored) time-to-event dataset. This process gives rise to patient-specific multivariate survival distributions, from which a patient-specific, cause-specific cumulative incidence function can be easily derived. Such a patient-specific cumulative incidence function serves as actionable information, based upon which clinicians can design personalized treatment plans. Unlike many existing parametric survival models, our model neither assumes a parametric form for the interactions between the covariates and the survival times, nor does it restrict the distribution of the survival times to a parametric model. Thus, it can flexibly describe non-proportional hazard rates with complex interactions between covariates and survival times, which are common in many diseases with heterogeneous phenotypes (such as cardiovascular diseases [2]).
Inference of the patient-specific posterior survival distribution is conducted via a variational Bayes algorithm; we use inducing variables to derive a variational lower bound on the marginal likelihood of the observed time-to-event data [17], which we maximize using the adaptive moment estimation algorithm [18]. We conduct a set of experiments on synthetic and real data showing that our model outperforms state-of-the-art survival models.

2 Preliminaries

We consider a dataset D comprising survival (time-to-event) data for n subjects who have been followed up for a finite amount of time. Let D = {X_i, T_i, k_i}_{i=1}^n, where X_i ∈ X is a d-dimensional vector of covariates associated with subject i, T_i ∈ R_+ is the time until an event occurred, and k_i ∈ K is the type of event that occurred. The set K = {∅, 1, ..., K} is a finite set of K mutually exclusive, competing events that could occur to subject i, where ∅ corresponds to right-censoring. For simplicity of exposition, we assume that only one event occurs for every patient; this corresponds, for instance, to the case when the events in K correspond to deaths due to different causes. This assumption does not simplify the problem; in fact, it implies the nonidentifiability of the event times' distribution parameters [6, 7], which makes the problem more challenging. Figure 1 depicts a time-to-event dataset D with patients dying due to either cancer or cardiovascular diseases, or having their endpoints censored. Throughout this paper, we assume independent censoring [1-7], i.e. censoring times are independent of clinical outcomes.

Define a multivariate random variable T = (T^1, ..., T^K), where T^k, k ∈ K, denotes the net survival time with respect to event k, i.e. the survival time of the subject given that only event k can occur.
We assume that T is drawn from a conditional density function that depends on the subject's covariates. For every subject i, we only observe the occurrence time for the earliest event, i.e. T_i = min(T_i^1, ..., T_i^K) and k_i = arg min_j T_i^j.

Figure 1: Depiction for the time-to-event data.

The cause-specific hazard function λ_k(t, X) represents the instantaneous risk of event k, and is formally defined as λ_k(t, X) = lim_{dt→0} (1/dt) P(t ≤ T^k < t + dt, k | T^k ≥ t, X) [6]. By the law of total probability, the overall hazard function is given by λ(t, X) = Σ_{k ∈ K} λ_k(t, X). This leads to the notion of a survival function S(t, X) = exp(−∫_0^t λ(u, X) du), which captures the probability of a subject surviving all types of risk events up to time t. The Cumulative Incidence Function (CIF), also known as the subdistribution function [2-7], is the probability of occurrence of a particular event k ∈ K by time t, and is given by F_k(t, X) = ∫_0^t λ_k(u, X) S(u, X) du. Our main goal is to estimate the CIF function using the dataset D; through these estimates, treatment plans can be set up for patients who suffer from comorbidities or are at risk of different types of diseases.

3 Survival Analysis using Deep Multi-task Gaussian Processes

We conduct patient-specific survival analysis by directly modeling the event times T as a function of the patients' covariates through the generative probabilistic model described hereunder.

Deep Multi-task Gaussian Processes (DMGPs) We assume that the net survival times for a patient with covariates X are generated via a (nonparametric) multi-output random function g(·), i.e. T = g(X), and we use Gaussian processes to model g(·).
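As a side note before specifying the model: the Section 2 quantities that the model ultimately targets can be computed numerically once cause-specific hazards are in hand, by discretizing the integrals above. The following is a minimal sketch, not part of the paper's method; the two constant hazard rates are hypothetical.

```python
import numpy as np

def cumulative_incidence(hazards, t_grid):
    """Discretize F_k(t) = int_0^t lambda_k(u) S(u) du, with
    S(u) = exp(-int_0^u sum_k lambda_k(v) dv).
    hazards: array of shape (K, T) of cause-specific hazards on t_grid."""
    dt = np.diff(t_grid, prepend=0.0)
    total = hazards.sum(axis=0)             # overall hazard lambda(t)
    S = np.exp(-np.cumsum(total * dt))      # survival function S(t)
    return np.cumsum(hazards * S[None, :] * dt, axis=1)

# Two hypothetical constant cause-specific hazards (competing risks).
t = np.linspace(0.01, 10.0, 1000)
lam = np.stack([np.full_like(t, 0.1), np.full_like(t, 0.2)])
F = cumulative_incidence(lam, t)
```

For constant hazards this recovers the closed form F_k(t) = (λ_k / λ)(1 − e^{−λ t}), which is a convenient sanity check for the discretization.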
A simple model of the form g(X) = f(X) + ϵ, with f(·) being a Gaussian process and ϵ a Gaussian noise, would constrain T to have a symmetric Gaussian distribution with a restricted parametric form conditional on X [Sec. 2, 19]. This may not be a realistic construct for many settings in which the survival times display an asymmetric distribution (e.g. cancer survival times [2]). To that end, we model g(·) as a Deep multi-task Gaussian Process (DMGP) [15]: a multi-layer cascade of vector-valued Gaussian processes that confer a greater representational power and produce outputs that are generally non-Gaussian. In particular, we assume that the net survival times T are generated via a DMGP with two layers as follows:

T = f_T(Z) + ϵ_T,  ϵ_T ∼ N(0, σ_T^2 I),
Z = f_Z(X) + ϵ_Z,  ϵ_Z ∼ N(0, σ_Z^2 I),    (1)

where σ_T^2 and σ_Z^2 are the noise variances at the two layers, f_T(·) and f_Z(·) are two Gaussian processes with hyperparameters Θ_T and Θ_Z respectively, and Z is a hidden variable that the first layer passes to the second. Based on (1), we have that g(X) = f_T(f_Z(X) + ϵ_Z) + ϵ_T. The model in (1) resembles a neural network with two layers and an infinite number of hidden nodes in each layer, but with an output that can be described probabilistically in terms of a distribution. We assume that f_T(·) has K outputs, whereas f_Z(·) has Q outputs.
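The two-layer generative process in Eq. (1) can be sketched by cascading ordinary Gaussian process draws. This is a minimal illustration under simplifying assumptions: the outputs within each layer are drawn independently with a plain squared-exponential kernel (the paper's coregionalized kernels come later, in Eq. (2)), and all sizes, length scales, and noise levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X1, X2, length=1.0):
    """Squared-exponential kernel matrix between two sets of inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def sample_layer(X, n_out, noise):
    """Draw n_out independent GP function values at inputs X, plus noise."""
    K = rbf(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(X)), K, size=n_out).T
    return f + noise * rng.standard_normal(f.shape)

n, d, Q, K_events = 50, 10, 3, 2
X = rng.standard_normal((n, d))           # covariates
Z = sample_layer(X, Q, noise=0.1)         # first layer:  Z = f_Z(X) + eps_Z
T_all = sample_layer(Z, K_events, 0.1)    # second layer: T = f_T(Z) + eps_T
T_obs = T_all.min(axis=1)                 # observed time is the earliest event
k_obs = T_all.argmin(axis=1)              # index of the observed competing event
```

As in the model itself, the toy "survival times" here are not constrained to be positive; the point of the sketch is only the cascade structure T = f_T(f_Z(X) + ϵ_Z) + ϵ_T and the min/argmin observation mechanism.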
The use of a Gaussian process with two layers allows us to jointly represent complex survival distributions and complex interactions with the covariates in a data-driven fashion, without the need to assume a predefined non-linear transformation on the output space as is the case in warped Gaussian processes [19-20].

A dataset D comprising n i.i.d. instances can be sampled from our model as follows:

f_Z ∼ GP(0, K_{Θ_Z}),
f_T ∼ GP(0, K_{Θ_T}),
Z_i ∼ N(f_Z(X_i), σ_Z^2 I),
T_i ∼ N(f_T(Z_i), σ_T^2 I),
T_i = min(T_i^1, ..., T_i^K),

for i ∈ {1, ..., n}, where K_Θ is the Gaussian process kernel with hyperparameters Θ.

Figure 2: Graphical depiction for the probabilistic model.

Figure 2 provides a graphical depiction for our model (observable variables are in double-circled nodes); the patient's covariates are the parent node; the survival time is the leaf node.

Survival Analysis as a Multi-task Learning Problem As can be seen in (1), the cause-specific net survival times are viewed as the outputs of a vector-valued function g(·). This casts the competing risks problem in a multi-task learning framework that allows finding a shared representation for the subjects' survival behavior with respect to multiple correlated comorbidities, such as renal failure, diabetes and cardiac diseases [1-3]. Such a shared representation is captured via the kernel functions for the two DMGP layers (i.e. K_{Θ_Z} and K_{Θ_T}). For both layers, we assume that the kernels follow
For both layers, we assume that the kernels follow\nan intrinsic coregionalization model [14, 16], i.e.\n\n\u2032\n\nK(cid:2)Z (x; x\n\n) = AZ kZ(x; x\n\n\u2032\n\n); K(cid:2)T (x; x\n\n\u2032\n\n) = AT kT (x; x\n\n\u2032\n\n);\n\n(2)\n\n)\n\n\u2032\n\n)\n\n((cid:0) 1\n\n\u2032\n\n; RZ = diag(\u21132\n\n1;Z; \u21132\n\n2;Z; : : : ; \u21132\n\n)T R\n\n; AT 2 RK(cid:2)K\n\n+\n\nare positive semi-de\ufb01nite matrices, kZ(x; x\n\n\u2032\n\u2032\n) are radial basis functions with automatic relevance determination, i.e. kZ(x; x\n2 (x (cid:0) x\n\u2032\n\nwhere AZ 2 RQ(cid:2)Q\n) and\n\u2032\n+\n) =\nkT (x; x\nZ (x (cid:0) x\n(cid:0)1\nd;Z); with \u2113j;Z being the length\nexp\nscale parameter of the jth feature (kT (x; x\n) can be de\ufb01ned similarly). Note that unlike regular\nGaussian processes, DMGPs are less sensitive to the selection of the parametric form of the ker-\nnel functions [15]. This because the output of the \ufb01rst layer undergoes a transformation through a\nlearned nonparametric function fZ(:), and hence the \"overall smoothness\" of the function g(X) is\ngoverned by an \"equivalent data-driven kernel\" function describing the transformation fT (fZ(:)).\nOur model adopts a Bayesian approach to multi-task learning: it posits a prior distribution on the\nmulti-output function g(X), and then conducts the survival analysis by updating the posterior distri-\nbution of the event times P(g(X)jD; (cid:2)Z; (cid:2)T ) given the evidential data in the time-to-event dataset\nD. The distribution P(g(X)jD; (cid:2)Z; (cid:2)T ) does not commit to any prede\ufb01ned parametric form since\nit is depends on a random variable transformation through a nonparametric function g(:). In Section\n4, we propose an inference algorithm for computing the posterior distribution P(TjD; X\n(cid:3)\n; (cid:2)Z; (cid:2)T )\n;D) is computed, we can di-\nfor a given out-of-sample subject with covariates X\n) for all events k 2 K as explained in Section 2. 
A pictorial\n(cid:3)\nrectly derive the CIF function Fk(t; X\nvisualization of the survival analysis procedure assuming 2 competing risks is provided in Fig. 3.\n\n(cid:3). Once P(Tj X\n(cid:3)\n\nFigure 3: Pictorial depiction for survival analysis with 2 competing risks using deep multi-task Gaussian\nprocesses. The posterior distribution of T given D is displayed in the top left panel, and the corresponding\ncumulative incidence functions for a particular patient with covariates X\nis displayed in the bottom left panel.\nThe posterior distributions on the two DMGP layers conditional on their inputs are depicted on the right panels.\n\n(cid:3)\n\nRelated Works Standard survival modeling in the statistical and medical research literature is\nlargely based on either the nonparametric Kaplan-Meier estimator [21], or the (parametric) Cox\nproportional hazard model [22]. The former is capable of learning \ufb02exible \u2013and potentially non-\nproportional\u2013 survival curves but fails to incorporate patients\u2019 covariates, whereas the latter is capa-\nble of incorporating covariates, but is restricted to rigid parametric assumptions that impose propor-\ntional hazard curves. These limitations seems to have been inherited by various recently developed\nBayesian nonparametric survival models. For instance, [24] develops a Bayesian survival model\nbased on a Dirichlet prior, and [23] develops a model based on Gaussian latent \ufb01elds, and proposes\nan inference algorithm that utilizes nested Laplace approximations; however, neither model incorpo-\nrates the individual patient\u2019s covariates, and hence both are restricted to estimating a population-level\nsurvival curves which cannot inform personalized treatment plans. 
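Returning briefly to the kernel construction in Eq. (2): the intrinsic coregionalization model can be assembled explicitly, since the full covariance over all task outputs is the Kronecker product of the input kernel matrix and the task-covariance matrix. A minimal sketch, in which the 2×2 matrix A and the length scales are hypothetical:

```python
import numpy as np

def ard_rbf(X1, X2, lengthscales):
    """RBF kernel with automatic relevance determination:
    k(x, x') = exp(-0.5 * (x - x')^T diag(l_1^2, ..., l_d^2)^{-1} (x - x'))."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return np.exp(-0.5 * (diff ** 2).sum(-1))

def icm_kernel(X1, X2, A, lengthscales):
    """Intrinsic coregionalization model K(x, x') = A k(x, x'), expanded
    into an (n1 * Q) x (n2 * Q) covariance over all task outputs."""
    return np.kron(ard_rbf(X1, X2, lengthscales), A)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
A = np.array([[1.0, 0.5],            # hypothetical positive semi-definite
              [0.5, 1.0]])           # task-covariance matrix (Q = 2 tasks)
ell = np.array([1.0, 2.0, 0.5])      # per-feature ARD length scales
K = icm_kernel(X, X, A, ell)
```

Because k(x, x) = 1 for the ARD-RBF kernel, each diagonal block of K equals A, which makes the role of A as the cross-task covariance easy to inspect.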
Contrarily, our model does not suffer from any such limitations since it learns patient-specific, nonparametric survival curves by adopting a Bayesian prior over a function space that takes the patients' covariates as an input.

A lot of interest has been recently devoted to the problem of survival analysis by the machine learning community. Recently developed survival models include random survival forests [26], deep exponential families [12], dependent logistic regressors [10], ranking algorithms [11], and semiparametric Bayesian models based on Gaussian processes [9]. All of these methods are capable of incorporating the individual patient's covariates, but none of them has considered the problem of competing risks. The problem of survival analysis with competing risks has been addressed only through two classical parametric models: (1) the Fine-Gray model, which modifies the traditional proportional hazards model by direct transformation of the CIF [4], and (2) the threshold regression (multi-state) models, which directly model net survival times as the first hitting times of a stochastic process (e.g. a Wiener process) [25]. Unlike our model, both models are limited by strong parametric assumptions on both the hazard rates and the nature of the interactions between the patient covariates and the survival curves. These limitations have been slightly alleviated in [19], which uses a Gaussian process to model the interactions between survival times and covariates. However, this model assumes a Gaussian distribution as a basis for an accelerated failure time model, which is both unrealistic (since the distribution of survival times is often asymmetric), and also hinders the nonparametric modeling of survival curves.
The model in [19] can be ameliorated via a warped Gaussian process that first transforms the survival times through a deterministic, monotonic non-linear function, and then applies Gaussian process regression on the transformed survival times [20], which would lead to more degrees of freedom in modeling the survival curves. Our model can be thought of as a generalization of a warped Gaussian process in which the deterministic non-linear transformation is replaced with another data-driven Gaussian process, which enables flexible nonparametric modeling of the survival curves. In Section 5, we demonstrate the superiority of our model via experiments on synthetic and real datasets.

4 Inference

As discussed in Section 3, conducting survival analysis requires computing the posterior probability density dP(T* | D, X*, Θ_Z, Θ_T) for a given out-of-sample point X*, with T* = g(X*). We follow an empirical Bayes approach for updating the posterior on g(·). That is, we first tune the hyperparameters Θ_Z and Θ_T using the offline dataset D, and then for any out-of-sample patient with covariates X*, we evaluate dP(T* | D, X*, Θ_Z, Θ_T) by direct Monte Carlo sampling.

We calibrate the hyperparameters by maximizing the marginal likelihood dP(D | Θ_Z, Θ_T). Note that for every subject i in D, we observe a \"label\" of the form (T_i, k_i), indicating the type of event that occurred to the subject along with the time of its occurrence. Since T_i is the smallest element in T, the label (T_i, k_i) is informative of all the events (i.e. all the learning tasks) in K \ {k_i}; we know that T_i^j ≥ T_i, ∀j ∈ K \ {k_i}. We also note that the subject's data may be right-censored, i.e. k_i = ∅, which implies that T_i^j ≥ T_i, ∀j ∈ K. Hence, the likelihood of the survival information in D is

dP({X_i, T_i, k_i}_{i=1}^n | Θ_Z, Θ_T) ∝ dP({T_i}_{i=1}^n | {X_i}_{i=1}^n, Θ_Z, Θ_T),    (3)

where T_i is a set of events given by

T_i = { {T_i^{k_i} = T_i, {T_i^j ≥ T_i}_{j ∈ K \ {k_i}}},  if k_i ≠ ∅;
        {T_i^j ≥ T_i}_{j ∈ K},                             if k_i = ∅.

We can write the conditional density in (3) by marginalizing over the conditional distribution of the hidden variable Z_i as follows:

dP({T_i}_{i=1}^n | {X_i}_{i=1}^n, Θ_Z, Θ_T) = ∫ dP({T_i}_{i=1}^n | {Z_i}_{i=1}^n, Θ_T) dP({Z_i}_{i=1}^n | {X_i}_{i=1}^n, Θ_Z).    (4)

Since the integral in (4) is intractable, we follow the variational inference scheme proposed in [15], where we tune the hyperparameters by maximizing the following variational bound on (4):

F = ∫_{Z, f_Z, f_T} Q · log( dP({T_i}_{i=1}^n, {f_T(Z_i)}_{i=1}^n, {f_Z(X_i)}_{i=1}^n, {Z_i}_{i=1}^n | {X_i}_{i=1}^n, Θ_Z, Θ_T) / Q ),

where Q is a variational distribution, and F ≤ log(dP({T_i}_{i=1}^n | {X_i}_{i=1}^n, Θ_Z, Θ_T)). Since the event T_i happens with a probability that can be written in terms of a Gaussian density conditional on f_Z and f_T, we can obtain a tractable version of the variational bound F by introducing a set of M pseudo-inputs to the two layers of the DMGP, with corresponding function values U_Z and U_T at the first and second layers [15, 17], and setting the variational distribution to Q = P(f_T(Z_i) | U_T, Z_i) q(U_T) q(Z_i) P(f_Z(X_i) | U_Z, X_i) q(U_Z), where q(Z_i) is a Gaussian distribution, whereas q(U_T) and q(U_Z) are free-form variational distributions. Given these settings, the variational lower bound can be written as [Eq. 
13, 15]

F = E[ log(dP({T_i}_{i=1}^n | {f_T(Z_i)}_{i=1}^n)) + E_{q(U_T)}[log(dP(U_T))] ]
  + E[ log(dP({Z_i}_{i=1}^n | {f_Z(X_i)}_{i=1}^n)) + E_{q(U_Z)}[log(dP(U_Z))] ],    (5)

where the first expectation is taken with respect to P(f_T(Z_i) | U_T, Z_i) q(U_T) q(Z_i) whereas the second is taken with respect to P(f_Z(X_i) | U_Z, X_i) q(U_Z). Since all the densities involved in (5) are Gaussian, F is tractable and can be written in closed form. We use the adaptive moment estimation (ADAM) algorithm to optimize F with respect to Θ_T and Θ_Z [18].

5 Experiments

In this Section, we validate our model by conducting a set of experiments on both a synthetic survival model and a real-world time-to-event dataset. In all experiments, we use the cause-specific concordance index (C-index), recently proposed in [27], as a performance metric. The cause-specific C-index quantifies the goodness of a model in ranking the subjects' survival times with respect to a particular cause/event based on their covariates: a higher C-index indicates a better performance. Formally, we define the (time-dependent) C-index for a cause k ∈ K as follows [Sec. 2.3, 27]:

C_k(t) := P( F_k(t, X_i) > F_k(t, X_j) | {k_i = k} ∧ {T_i ≤ t} ∧ {T_i < T_j ∨ k_j ≠ k} ),    (6)

where we have used the CIF F_k(t, X) as a natural choice for the prognostic score in [Eq. (2.3), 27]. The C-index defined in (6) corresponds to the probability that, for a time horizon t, a particular survival analysis method prompts an assignment of CIF functions for subjects i and j that satisfy F_k(t, X_i) > F_k(t, X_j), given that k_i = k, T_i < T_j, and that subject i was not right-censored by time t. A high C-index for cause k is achieved if the cause-specific CIF functions for a group of subjects who encounter event k are likely to be \"ordered\" in accordance with the ordering of their realized survival times.
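The cause-specific C-index in Eq. (6) can be estimated empirically by counting correctly ordered comparable pairs. The following is a minimal sketch on hypothetical toy data (it is an illustration of the definition, not the pec implementation used in the experiments):

```python
import numpy as np

def cause_specific_c_index(scores, times, events, k, t):
    """Empirical C-index for cause k at horizon t, following Eq. (6).
    scores: prognostic scores F_k(t, X_i); times, events: observed
    (T_i, k_i), with events[i] == -1 denoting right-censoring."""
    num = den = 0
    n = len(times)
    for i in range(n):
        if events[i] != k or times[i] > t:
            continue                       # i must have event k by time t
        for j in range(n):
            if j == i:
                continue
            if times[j] > times[i] or events[j] != k:
                den += 1                   # comparable pair (i, j)
                num += scores[i] > scores[j]
    return num / den

# Hypothetical toy data: a higher score should mean an earlier cause-1 event.
times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 2, -1])           # -1 = censored
scores = np.array([0.9, 0.7, 0.2, 0.1])    # stand-ins for F_1(t, X_i)
c = cause_specific_c_index(scores, times, events, k=1, t=5.0)
```

On this toy data the scores are perfectly concordant with the cause-1 event times, so the estimate equals 1; reversing the scores drives it to 0.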
In all experiments, we estimate the C-index for the survival analysis methods under consideration using the function cindex of the R-package pec2 [Sec. 3, 27].

We run the algorithm in Section 4 with Q = 3 outputs for the first layer of the DMGP, and we use the default settings prescribed in [18] for the ADAM algorithm. We compare our model with four benchmarks: the Fine-Gray proportional subdistribution hazards model (FG) [4, 28], the accelerated failure time model using multi-task Gaussian processes (MGP) [19], the cause-specific Cox proportional hazards model (Cox) [27, 28], and the threshold-regression (multi-state) first-hitting-time model with a multidimensional Wiener process (THR) [25]. The MGP benchmark is a special case of our model with 1 layer and a deterministic linear transformation of the survival times to Gaussian process outputs [Sec. 3, 19]. We run the FG and Cox benchmarks using the R libraries cmprsk and survival, whereas for the THR benchmark, we use the R-package threg3.

5.1 Synthetic Data

The goal of this Section is to demonstrate the ability of our model to cope with highly heterogeneous patient cohorts; we demonstrate this by running experiments on two synthetic models with different types of interactions between survival times and covariates. In particular, we run experiments using the synthetic survival models A and B described below; the two models correspond to two patient cohorts that differ in terms of patients' heterogeneity.
Model A:
X_i ∼ N(0, I),
T_i^1 ∼ exp(γ_1^T X_i),
T_i^2 ∼ exp(γ_2^T X_i),
T_i = min{T_i^1, T_i^2},
k_i = arg min_{k ∈ {1,2}} T_i^k,
i ∈ {1, ..., n}.

Model B:
X_i ∼ N(0, I),
T_i^1 ∼ exp(cosh(γ_1^T X_i)),
T_i^2 ∼ exp(|N(0, 1) + sinh(γ_2^T X_i)|),
T_i = min{T_i^1, T_i^2},
k_i = arg min_{k ∈ {1,2}} T_i^k,
i ∈ {1, ..., n}.

2https://cran.r-project.org/web/packages/pec/index.html
3https://cran.r-project.org/web/packages/threg/index.html

In model A, we assume that survival times are exponentially distributed with a mean parameter that comprises a simple linear function of the covariates, whereas in model B, we assume that the survival distributions are not necessarily exponential, and that their parameters depend on the covariates in a nonlinear fashion through the sinh and cosh functions. Both models have two competing risks, i.e. K = {∅, 1, 2}, and for both models we assume that each patient has d = 10 covariates that are drawn from a standard normal distribution. The parameters γ_1 and γ_2 are 10-dimensional vectors, the elements of which are drawn independently from a uniform distribution. Given a draw of γ_1 and γ_2, a dataset D with n subjects can be sampled using the models described above. We run 10,000 repeated experiments using each model, where in each experiment we draw a new γ_1, γ_2, and a dataset D with 1000 subjects; we divide D into 500 subjects for training and 500 subjects for out-of-sample testing. We compute the CIF function for the testing subjects via the different benchmarks, and based on those functions we evaluate the cause-specific C-index for time horizons t ∈ {1, 2.5, 7.5, 10}.
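The data-generation procedure for model B (the more heterogeneous cohort), including the right-censoring induction used in the experiments, can be sketched as follows. This is a minimal illustration: treating the argument of exp(·) as the mean of the exponential distribution is an assumption about the notation, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_model_b(n=1000, d=10, n_censored=100):
    """Draw one synthetic dataset from model B, then right-censor
    n_censored randomly chosen subjects via T_i ~ Uniform(0, T_i)."""
    gamma1 = rng.uniform(size=d)
    gamma2 = rng.uniform(size=d)
    X = rng.standard_normal((n, d))
    T1 = rng.exponential(np.cosh(X @ gamma1))
    T2 = rng.exponential(np.abs(rng.standard_normal(n) + np.sinh(X @ gamma2)))
    T = np.minimum(T1, T2)
    k = np.where(T1 <= T2, 1, 2)          # index of the earliest event
    cens = rng.choice(n, size=n_censored, replace=False)
    T[cens] = rng.uniform(0.0, T[cens])   # censoring time precedes the event
    k[cens] = 0                           # 0 marks right-censoring (the paper's ∅)
    return X, T, k

X, T, k = sample_model_b()
```

Note that cosh(·) ≥ 1 and |·| ≥ 0, so model B's exponential parameters are nonnegative by construction, which is what makes the sampler above well defined.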
We average the C-indexes achieved by each benchmark over the repeated experiments and report the mean value and the 95% confidence interval at each time horizon. In all experiments, we induce right-censoring on 100 subjects which we randomly pick from D; for a subject i, right-censoring is induced by altering her survival time as follows: T_i ∼ Uniform(0, T_i).

Figure 4: Results for model A.
Figure 5: Results for model B (C_1(t)).
Figure 6: Results for model B (C_2(t)).

Fig. 4, 5, and 6 depict the cause-specific C-indexes for all the survival methods under consideration when applied to the data generated by models A and B (error bars correspond to the 95% confidence intervals). As we can see, the DMGP model outperforms all other benchmarks for survival data generated by both models. For model A, we only depict C_1(t) in Fig. 4 since the results on C_2(t) are almost identical due to the symmetry of model A with respect to the two competing risks. Fig. 4 shows that, for all time horizons, the DMGP model already confers a gain in the C-index even when the data is generated by model A, which displays simple linear interactions between the covariates and the parameters of the survival time distribution. Fig. 5 and 6 show that the performance gains achieved by the DMGP are even larger under model B (for both C_1(t) and C_2(t)). This is because model B displays a highly nonlinear relationship between covariates and survival times, and in addition, it assumes a complicated form for the distributions of the survival times, all of which are features that can be captured well by a DMGP but not by the other benchmarks, which posit strict parametric assumptions. The superiority of DMGPs to MGPs shows the value of the extra representational power attained by adding multiple layers to conventional MGPs.

5.2 Real Data

More than 30 million patients in the U.S. are diagnosed with either cardiovascular disease (CVD) or cancer [1, 2, 29].
Mounting evidence suggests that CVD and cancer share a number of risk factors, and possess various biological similarities and (possible) interactions; in addition, many of the existing cancer therapies increase a patient's risk for CVD [2, 29]. Therefore, it is important that patients who are at risk of both cancer and CVD be provided with a joint prognosis of mortality due to the two competing diseases in order to properly manage therapeutic interventions. This is a challenging problem since CVD patient cohorts are very heterogeneous; CVD exhibits complex phenotypes for which mortality rates can vary as much as 10-fold among patients in the same phenotype [1, 2]. The goal of this Section is to investigate the ability of our model to accurately model survival of patients in such a highly heterogeneous cohort, with CVD and cancer as competing risks.

We conducted experiments on a real-world patient cohort extracted from a publicly accessible dataset provided by the Surveillance, Epidemiology, and End Results Program4 (SEER). The extracted cohort contains data on survival of breast cancer patients over the years 1992-2007. The total number of subjects in the cohort is 61,050, with a follow-up period restricted to 10 years. The mortality rate of the subjects within the 10-year follow-up period is 25.56%. We divided the mortality causes into: (1) death due to breast cancer (13.64%), (2) death due to CVD (4.62%), and (3) death due to other causes (7.3%), i.e. K = {∅, 1, 2, 3}.

4https://seer.cancer.gov/causespecific/
Every subject is associated with 20 covariates including: age, race, gender, morphology information (Lymphoma subtype, histological type, etc.), diagnostic confirmation, therapy information (surgery, type of surgery, etc.), tumor size and type, etc. We divide the dataset into training and testing sets, and report the C-index results obtained for all benchmarks via 10-fold cross-validation.

Figure 7: Boxplot for the cause-specific C-indexes of various methods. The x-axis contains the methods' names, and with each method, 3 boxplots corresponding to the C-indexes for the different causes are provided.

Fig. 7 depicts boxplots for the 10-year survival C-indexes (i.e., C1(10), C2(10) and C3(10)) of all benchmarks for the 3 competing risks. With respect to predicting survival times due to "other causes", the gain provided by DMGPs is marginal. We believe that this is due to the absence of covariates that are predictive of mortality due to causes other than breast cancer and CVD in the SEER dataset. The median C-index of our model is larger than that of all other benchmarks for all causes: our model provides a significant improvement in predicting breast cancer survival times while maintaining a decent gain in the accuracy of predicting survival times of CVD as well. This implies that DMGPs, by virtue of our nonparametric multi-task learning formulation, are capable of accurately (and flexibly) capturing the "shared representation" of the two "correlated" risks of breast cancer and CVD as a function of their shared risk factors (hypertension, obesity, diabetes mellitus, age, etc.). As expected, since CVD is a phenotype-rich disease, predictions of breast cancer survival are more accurate than those for CVD for all benchmarks.

The competing multi-task modeling benchmark, MGP, is inferior to our model as it restricts the survival times to an exponential-like parametric distribution (see Eq. 13 in [19]).
In contrast, our model allows for a nonparametric model of the survival curves, which appears to be crucial for modeling breast cancer survival. This is evident in the boxplots of the cause-specific Cox benchmark, which is the only benchmark that performs better on CVD than on breast cancer. Since the Cox model is restricted to a proportional hazards model with parametric, non-crossing survival curves, its poor performance in predicting breast cancer survival suggests that breast cancer patients have crossing survival curves, which signals the need for a nonparametric survival model [9]. This explains the gain achieved by DMGPs as compared to MGPs (and all other benchmarks), which posit strong parametric assumptions on the patients' survival curves.

6 Discussion

The problem of survival analysis with competing risks has recently gained significant attention in the medical community due to the realization that many chronic diseases possess a shared biology. We have proposed a survival model for competing risks that hinges on a novel multi-task learning conception of cause-specific survival analysis. Our model is liberated from the traditional parametric restrictions imposed by previous models; it allows for nonparametric learning of patient-specific survival curves and their interactions with the patients' covariates. This is achieved by modeling the patients' cause-specific survival times as a function of the patients' covariates using deep multi-task Gaussian processes. Through the personalized actionable prognoses offered by our model, clinicians can design personalized treatment plans that (hopefully) save thousands of lives annually.

References

[1] H. J. Lim, X. Zhang, R. Dyck, and N. Osgood. Methods of Competing Risks Analysis of End-stage Renal Disease and Mortality among People with Diabetes.
BMC Medical Research Methodology, 10(1): 97, 2010.
[2] P. C. Lambert, P. W. Dickman, C. P. Nelson, and P. Royston. Estimating the Crude Probability of Death due to Cancer and other Causes using Relative Survival Models. Statistics in Medicine, 29(7): 885-895, 2010.
[3] J. M. Satagopan, L. Ben-Porat, M. Berwick, M. Robson, D. Kutler, and A. Auerbach. A Note on Competing Risks in Survival Data Analysis. British Journal of Cancer, 91(7): 1229-1235, 2004.
[4] J. P. Fine and R. J. Gray. A Proportional Hazards Model for the Subdistribution of a Competing Risk. Journal of the American Statistical Association, 94(446): 496-509, 1999.
[5] M. J. Crowder. Classical Competing Risks. CRC Press, 2001.
[6] T. A. Gooley, W. Leisenring, J. Crowley, and B. E. Storer. Estimation of Failure Probabilities in the Presence of Competing Risks: New Representations of Old Estimators. Statistics in Medicine, 18(6): 695-706, 1999.
[7] A. Tsiatis. A Non-identifiability Aspect of the Problem of Competing Risks. PNAS, 72(1): 20-22, 1975.
[8] J. Henry, Y. Pylypchuk, T. Searcy, and V. Patel. Adoption of Electronic Health Record Systems among US Non-federal Acute Care Hospitals: 2008-2015. The Office of the National Coordinator, 2016.
[9] T. Fernández, N. Rivera, and Y. W. Teh. Gaussian Processes for Survival Analysis. In NIPS, 2016.
[10] C. N. Yu, R. Greiner, H. C. Lin, and V. Baracos. Learning Patient-specific Cancer Survival Distributions as a Sequence of Dependent Regressors. In NIPS, 1845-1853, 2011.
[11] H. Steck, B. Krishnapuram, C. Dehing-Oberije, P. Lambin, and V. C. Raykar. On Ranking in Survival Analysis: Bounds on the Concordance Index. In NIPS, 1209-1216, 2008.
[12] R. Ranganath, A. Perotte, N. Elhadad, and D. Blei. Deep Survival Analysis. arXiv:1608.02158, 2016.
[13] F. S. Collins and H. Varmus. A New Initiative on Precision Medicine. New England Journal of Medicine, 372(9): 793-795, 2015.
[14] M. A. Alvarez, L. Rosasco, and N. D. Lawrence.
Kernels for Vector-valued Functions: A Review. Foundations and Trends in Machine Learning, 4(3): 195-266, 2012.
[15] A. Damianou and N. Lawrence. Deep Gaussian Processes. In AISTATS, 2013.
[16] E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian Process Prediction. In NIPS, 2007.
[17] M. K. Titsias and N. D. Lawrence. Bayesian Gaussian Process Latent Variable Model. In AISTATS, 2010.
[18] D. Kingma and J. Ba. ADAM: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
[19] J. E. Barrett and A. C. C. Coolen. Gaussian Process Regression for Survival Data with Competing Risks. arXiv:1312.1591, 2013.
[20] E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian Processes. In NIPS, 2004.
[21] E. L. Kaplan and P. Meier. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, 53(282): 457-481, 1958.
[22] D. Cox. Regression Models and Life-tables. Journal of the Royal Statistical Society, 34(2): 187-220, 1972.
[23] M. De Iorio, W. O. Johnson, P. Müller, and G. L. Rosner. Bayesian Nonparametric Non-proportional Hazards Survival Modeling. Biometrics, 65(3): 762-771, 2009.
[24] S. Martino, R. Akerkar, and H. Rue. Approximate Bayesian Inference for Survival Models. Scandinavian Journal of Statistics, 38(3): 514-528, 2011.
[25] M. L. T. Lee and G. A. Whitmore. Threshold Regression for Survival Analysis: Modeling Event Times by a Stochastic Process Reaching a Boundary. Statistical Science, 501-513, 2006.
[26] H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random Survival Forests. The Annals of Applied Statistics, 841-860, 2008.
[27] M. Wolbers, P. Blanche, M. T. Koller, J. C. Witteman, and A. T. Gerds. Concordance for Prognostic Models with Competing Risks. Biostatistics, 15(3): 526-539, 2014.
[28] P. C. Austin, D. S. Lee, and J. P. Fine. Introduction to the Analysis of Survival Data in the Presence of Competing Risks.
Circulation, 133(6): 601-609, 2016.
[29] R. Koene, et al. Shared Risk Factors in Cardiovascular Disease and Cancer. Circulation, 2016.