{"title": "Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression", "book": "Advances in Neural Information Processing Systems", "page_first": 3600, "page_last": 3608, "abstract": "The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time. However, the lack of an efficient parameter learning algorithm for CT-HMM restricts its use to very small models or requires unrealistic constraints on the state transitions. In this paper, we present the first complete characterization of efficient EM-based learning methods for CT-HMM models. We demonstrate that the learning problem consists of two challenges: the estimation of posterior state probabilities and the computation of end-state conditioned statistics. We solve the first challenge by reformulating the estimation problem in terms of an equivalent discrete time-inhomogeneous hidden Markov model. The second challenge is addressed by adapting three approaches from the continuous time Markov chain literature to the CT-HMM domain. We demonstrate the use of CT-HMMs with more than 100 states to visualize and predict disease progression using a glaucoma dataset and an Alzheimer's disease dataset.", "full_text": "Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression

Yu-Ying Liu, Shuang Li, Fuxin Li, Le Song, and James M. Rehg
College of Computing, Georgia Institute of Technology, Atlanta, GA

Abstract

The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time. However, the lack of an efficient parameter learning algorithm for CT-HMM restricts its use to very small models or requires unrealistic constraints on the state transitions.
In this paper, we present the first complete characterization of efficient EM-based learning methods for CT-HMM models. We demonstrate that the learning problem consists of two challenges: the estimation of posterior state probabilities and the computation of end-state conditioned statistics. We solve the first challenge by reformulating the estimation problem in terms of an equivalent discrete time-inhomogeneous hidden Markov model. The second challenge is addressed by adapting three approaches from the continuous time Markov chain literature to the CT-HMM domain. We demonstrate the use of CT-HMMs with more than 100 states to visualize and predict disease progression using a glaucoma dataset and an Alzheimer's disease dataset.

1 Introduction

The goal of disease progression modeling is to learn a model for the temporal evolution of a disease from sequences of clinical measurements obtained from a longitudinal sample of patients. By distilling population data into a compact representation, disease progression models can yield insights into the disease process through the visualization and analysis of disease trajectories. In addition, the models can be used to predict the future course of disease in an individual, supporting the development of individualized treatment schedules and improved treatment efficiencies.
Furthermore, progression models can support phenotyping by providing a natural similarity measure between trajectories, which can be used to group patients based on their progression.

Hidden variable models are particularly attractive for modeling disease progression for three reasons: 1) they support the abstraction of a disease state via the latent variables; 2) they can deal with noisy measurements effectively; and 3) they can easily incorporate dynamical priors and constraints. While conventional hidden Markov models (HMMs) have been used to model disease progression, they are not suitable in general because they assume that measurement data is sampled regularly at discrete intervals. However, in reality patient visits are irregular in time, as a consequence of scheduling issues, missed visits, and changes in symptomatology.

A Continuous-Time HMM (CT-HMM) is an HMM in which both the transitions between hidden states and the arrival of observations can occur at arbitrary (continuous) times [1, 2]. It is therefore suitable for irregularly-sampled temporal data such as clinical measurements [3, 4, 5]. Unfortunately, the additional modeling flexibility provided by CT-HMM comes at the cost of a more complex inference procedure. In CT-HMM, not only are the hidden states unobserved, but the times at which the hidden states change are also unobserved. Moreover, multiple unobserved hidden state transitions can occur between two successive observations. A previous method addressed these challenges by directly maximizing the data likelihood [2], but this approach is limited to very small model sizes. A general EM framework for continuous-time dynamic Bayesian networks, of which CT-HMM is a special case, was introduced in [6], but that work did not address the question of efficient learning. Consequently, there is a need for efficient CT-HMM learning methods that can scale to large state spaces (e.g.
hundreds of states or more) [7].

A key aspect of our approach is to leverage the existing literature on continuous time Markov chain (CTMC) models [8, 9, 10]. These models assume that states are directly observable, but retain the irregular distribution of state transition times. EM approaches to CTMC learning compute the expected state durations and transition counts conditioned on each pair of successive observations. The key computation is the evaluation of integrals of the matrix exponential (Eqs. 12 and 13). Prior work by Wang et al. [5] used a closed-form estimator due to [8] which assumes that the transition rate matrix can be diagonalized through an eigendecomposition. Unfortunately, this is frequently not achievable in practice, limiting the usefulness of the approach. We explore two additional CTMC approaches [9] which use (1) an alternative matrix exponential on an auxiliary matrix (Expm method); and (2) a direct truncation of the infinite sum expansion of the exponential (Unif method). Neither of these approaches has been previously exploited for CT-HMM learning.

We present the first comprehensive framework for efficient EM-based parameter learning in CT-HMM, which both extends and unifies prior work on CTMC models. We show that a CT-HMM can be conceptualized as a time-inhomogeneous HMM which yields posterior state distributions at the observation times, coupled with CTMCs that govern the distribution of hidden state transitions between observations (Eqs. 9 and 10). We explore both soft (forward-backward) and hard (Viterbi decoding) approaches to estimating the posterior state distributions, in combination with three methods for calculating the conditional expectations. We validate these methods in simulation and evaluate our approach on two real-world datasets for glaucoma and Alzheimer's disease, including visualizations of the progression model and predictions of future progression.
Our approach outperforms a state-of-the-art method [11] for glaucoma prediction, which demonstrates the practical utility of CT-HMM for clinical data modeling.

2 Continuous-Time Markov Chain

A continuous-time Markov chain (CTMC) is defined by a finite and discrete state space S, a state transition rate matrix Q, and an initial state probability distribution \pi. The elements q_{ij} of Q describe the rate at which the process transitions from state i to state j for i \neq j, and the q_{ii} are specified such that each row of Q sums to zero (q_i = \sum_{j \neq i} q_{ij}, q_{ii} = -q_i) [1]. In a time-homogeneous process, in which the q_{ij} are independent of t, the sojourn time in each state i is exponentially distributed with parameter q_i, i.e. f(t) = q_i e^{-q_i t}, with mean 1/q_i. The probability that the process's next move from state i is to state j is q_{ij}/q_i. When a realization of the CTMC is fully observed, meaning that one can observe every transition time (t'_0, t'_1, \ldots, t'_{V'}) and the corresponding states Y' = \{y_0 = s(t'_0), \ldots, y_{V'} = s(t'_{V'})\}, where s(t) denotes the state at time t, the complete likelihood (CL) of the data is

CL = \prod_{v'=0}^{V'-1} (q_{y_{v'}, y_{v'+1}} / q_{y_{v'}}) (q_{y_{v'}} e^{-q_{y_{v'}} \tau_{v'}}) = \prod_{v'=0}^{V'-1} q_{y_{v'}, y_{v'+1}} e^{-q_{y_{v'}} \tau_{v'}} = \prod_{i=1}^{|S|} \prod_{j=1, j \neq i}^{|S|} q_{ij}^{n_{ij}} e^{-q_i \tau_i}   (1)

where \tau_{v'} = t'_{v'+1} - t'_{v'} is the time interval between two transitions, n_{ij} is the number of transitions from state i to j, and \tau_i is the total amount of time the chain remains in state i.

In general, a realization of the CTMC is observed only at discrete and irregular time points (t_0, t_1, \ldots, t_V), corresponding to a state sequence Y, which are distinct from the switching times. As a result, the Markov process between two consecutive observations is hidden, with potentially many unobserved state transitions. Thus both n_{ij} and \tau_i are unobserved.
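The generative process just described (an exponential sojourn with rate q_i, then a jump to state j with probability q_{ij}/q_i) can be sketched in a few lines. This is an illustrative simulation of a fully observed CTMC, not part of the learning algorithm; the function name and interface are our own:

```python
import numpy as np

def sample_ctmc(Q, pi, T, seed=None):
    """Sample one fully observed CTMC realization on [0, T].

    Q  : (S, S) rate matrix whose rows sum to zero.
    pi : (S,) initial state distribution.
    Returns (times, states): the entry times t'_v and states y_v = s(t'_v).
    """
    rng = np.random.default_rng(seed)
    S = Q.shape[0]
    state = rng.choice(S, p=pi)
    times, states = [0.0], [state]
    t = 0.0
    while True:
        qi = -Q[state, state]               # exit rate q_i of the current state
        if qi <= 0:                         # absorbing state: no further jumps
            break
        t += rng.exponential(1.0 / qi)      # exponential sojourn with mean 1/q_i
        if t > T:
            break
        jump = Q[state].copy()
        jump[state] = 0.0
        state = rng.choice(S, p=jump / qi)  # next state j w.p. q_ij / q_i
        times.append(t)
        states.append(state)
    return np.array(times), np.array(states)
```

From such a fully observed realization, the sufficient statistics n_{ij} and \tau_i of Eq. 1 can be read off directly; the learning problem below arises precisely because these jump times are not observed.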
In order to express the likelihood of the incomplete observations, we can utilize a discrete time hidden Markov model by defining a state transition probability matrix for each distinct time interval t, P(t) = e^{Qt}, where P_{ij}(t), the entry (i, j) in P(t), is the probability that the process is in state j after time t given that it is in state i at time 0. This quantity takes into account all possible intermediate state transitions and timings between i and j which are not observed. Then the likelihood of the data is

L = \prod_{v=0}^{V-1} P_{y_v, y_{v+1}}(\tau_v) = \prod_{v=0}^{V-1} \prod_{i,j=1}^{|S|} P_{ij}(\tau_v)^{I(y_v = i, y_{v+1} = j)} = \prod_{\Delta=1}^{r} \prod_{i,j=1}^{|S|} P_{ij}(\tau_\Delta)^{C(\tau = \tau_\Delta, y_v = i, y_{v+1} = j)}   (2)

where \tau_v = t_{v+1} - t_v is the time interval between two observations, I(y_v = i, y_{v+1} = j) is an indicator function that is 1 if the condition is true and 0 otherwise, \tau_\Delta, \Delta = 1, \ldots, r, are the r unique values among all time intervals \tau_v, and C(\tau = \tau_\Delta, y_v = i, y_{v+1} = j) is the total count over all successive visits for which the condition is true. Note that there is no analytic maximizer of L, due to the structure of the matrix exponential, and direct numerical maximization with respect to Q is computationally challenging. This motivates the use of an EM-based approach.

An EM algorithm for CTMC is described in [8]. Based on Eq.
1, the expected complete log-likelihood takes the form \sum_{i=1}^{|S|} \sum_{j=1, j \neq i}^{|S|} \{\log(q_{ij}) E[n_{ij}|Y, \hat{Q}_0] - q_i E[\tau_i|Y, \hat{Q}_0]\}, where \hat{Q}_0 is the current estimate for Q, and E[n_{ij}|Y, \hat{Q}_0] and E[\tau_i|Y, \hat{Q}_0] are the expected state transition count and total state duration given the incomplete observation Y and the current transition rate matrix \hat{Q}_0, respectively. Once these two expectations are computed in the E-step, the updated parameters \hat{Q} can be obtained via the M-step as

\hat{q}_{ij} = E[n_{ij}|Y, \hat{Q}_0] / E[\tau_i|Y, \hat{Q}_0], \quad i \neq j, \qquad \hat{q}_{ii} = -\sum_{j \neq i} \hat{q}_{ij}.   (3)

Now the main computational challenge is to evaluate E[n_{ij}|Y, \hat{Q}_0] and E[\tau_i|Y, \hat{Q}_0]. By exploiting the properties of the Markov process, the two expectations can be decomposed as [12]:

E[n_{ij}|Y, \hat{Q}_0] = \sum_{v=0}^{V-1} E[n_{ij}|y_v, y_{v+1}, \hat{Q}_0] = \sum_{v=0}^{V-1} \sum_{k,l=1}^{|S|} I(y_v = k, y_{v+1} = l) E[n_{ij}|y_v = k, y_{v+1} = l, \hat{Q}_0]

E[\tau_i|Y, \hat{Q}_0] = \sum_{v=0}^{V-1} E[\tau_i|y_v, y_{v+1}, \hat{Q}_0] = \sum_{v=0}^{V-1} \sum_{k,l=1}^{|S|} I(y_v = k, y_{v+1} = l) E[\tau_i|y_v = k, y_{v+1} = l, \hat{Q}_0]

where I(y_v = k, y_{v+1} = l) = 1 if the condition is true and 0 otherwise. Thus, the computation reduces to computing the end-state conditioned expectations E[n_{ij}|y_v = k, y_{v+1} = l, \hat{Q}_0] and E[\tau_i|y_v = k, y_{v+1} = l, \hat{Q}_0] for all k, l, i, j \in S. These expectations are also a key step in CT-HMM learning, and Section 4 presents our approach to computing them.

3 Continuous-Time Hidden Markov Model

In this section, we describe the continuous-time hidden Markov model (CT-HMM) for disease progression and the proposed framework for CT-HMM learning.

3.1 Model Description

In contrast to CTMC, where the states are directly observed, none of the states are directly observed in CT-HMM.
Instead, the available observational data o depends on the hidden states s via the measurement model p(o|s). In contrast to a conventional HMM, the observations (o_0, o_1, \ldots, o_V) are available only at irregularly-distributed continuous points in time (t_0, t_1, \ldots, t_V). As a consequence, there are two levels of hidden information in a CT-HMM. First, at each observation time, the state of the Markov chain is hidden and can only be inferred from measurements. Second, the state transitions in the Markov chain between two consecutive observations are also hidden. As a result, the Markov chain may visit multiple hidden states before reaching a state that emits a noisy observation. This additional complexity makes CT-HMM a more effective model for event data, in comparison to HMM and CTMC, but as a consequence the parameter learning problem is more challenging. We believe we are the first to present a comprehensive and systematic treatment of efficient EM algorithms to address these challenges.

A fully observed CT-HMM contains four sequences of information: the underlying state transition times (t'_0, t'_1, \ldots, t'_{V'}) and the corresponding states Y' = \{y_0 = s(t'_0), \ldots, y_{V'} = s(t'_{V'})\} of the hidden Markov chain, and the observed data O = (o_0, o_1, \ldots, o_V) at times T = (t_0, t_1, \ldots, t_V). Their joint complete likelihood can be written as

CL = \prod_{v'=0}^{V'-1} q_{y_{v'}, y_{v'+1}} e^{-q_{y_{v'}} \tau_{v'}} \prod_{v=0}^{V} p(o_v|s(t_v)) = \prod_{i=1}^{|S|} \prod_{j=1, j \neq i}^{|S|} q_{ij}^{n_{ij}} e^{-q_i \tau_i} \prod_{v=0}^{V} p(o_v|s(t_v)).   (4)

We will focus our development on the estimation of the transition rate matrix Q.
Estimates for the parameters of the emission model p(o|s) and the initial state distribution \pi can be obtained from the standard discrete-time HMM formulation [13], but with time-inhomogeneous transition probabilities (described below).

3.2 Parameter Estimation

Given a current estimate of the parameter \hat{Q}_0, the expected complete log-likelihood takes the form

L(Q) = \sum_{i=1}^{|S|} \sum_{j=1, j \neq i}^{|S|} \{\log(q_{ij}) E[n_{ij}|O, T, \hat{Q}_0] - q_i E[\tau_i|O, T, \hat{Q}_0]\} + \sum_{v=0}^{V} E[\log p(o_v|s(t_v))|O, T, \hat{Q}_0].   (5)

In the M-step, taking the derivative of L with respect to q_{ij}, we have

\hat{q}_{ij} = E[n_{ij}|O, T, \hat{Q}_0] / E[\tau_i|O, T, \hat{Q}_0], \quad i \neq j, \qquad \hat{q}_{ii} = -\sum_{j \neq i} \hat{q}_{ij}.   (6)

The challenge lies in the E-step, where we compute the expectations of n_{ij} and \tau_i conditioned on the observation sequence. The statistic for n_{ij} can be expressed in terms of the expectations between successive pairs of observations as follows:

E[n_{ij}|O, T, \hat{Q}_0] = \sum_{s(t_1), \ldots, s(t_V)} p(s(t_1), \ldots, s(t_V)|O, T, \hat{Q}_0) E[n_{ij}|s(t_1), \ldots, s(t_V), \hat{Q}_0]   (7)

= \sum_{s(t_1), \ldots, s(t_V)} p(s(t_1), \ldots, s(t_V)|O, T, \hat{Q}_0) \sum_{v=1}^{V-1} E[n_{ij}|s(t_v), s(t_{v+1}), \hat{Q}_0]   (8)

= \sum_{v=1}^{V-1} \sum_{k,l=1}^{|S|} p(s(t_v) = k, s(t_{v+1}) = l|O, T, \hat{Q}_0) E[n_{ij}|s(t_v) = k, s(t_{v+1}) = l, \hat{Q}_0].   (9)

In a similar way, we can obtain an expression for the expectation of \tau_i:

E[\tau_i|O, T, \hat{Q}_0] = \sum_{v=1}^{V-1} \sum_{k,l=1}^{|S|} p(s(t_v) = k, s(t_{v+1}) = l|O, T, \hat{Q}_0) E[\tau_i|s(t_v) = k, s(t_{v+1}) = l, \hat{Q}_0].   (10)

In Section 4, we present our approach to computing the end-state conditioned statistics E[n_{ij}|s(t_v) = k, s(t_{v+1}) = l, \hat{Q}_0] and E[\tau_i|s(t_v) = k, s(t_{v+1}) = l, \hat{Q}_0].
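Once the two expectation tables have been accumulated over all visits, the M-step of Eq. 6 is a simple elementwise ratio. A minimal numpy sketch (the function name, array layout, and the guard against zero dwell times are our own choices):

```python
import numpy as np

def m_step_update(exp_n, exp_tau, eps=1e-12):
    """M-step of Eq. 6: q_ij = E[n_ij]/E[tau_i] for i != j, q_ii = -sum_{j!=i} q_ij.

    exp_n   : (S, S) expected transition counts E[n_ij | O, T, Q0] (diagonal ignored).
    exp_tau : (S,)   expected total dwell times E[tau_i | O, T, Q0].
    """
    Q = exp_n / np.maximum(exp_tau, eps)[:, None]  # divide row i by E[tau_i]
    np.fill_diagonal(Q, 0.0)                       # discard any diagonal counts
    np.fill_diagonal(Q, -Q.sum(axis=1))            # enforce zero row sums
    return Q
```

By construction the returned matrix is a valid rate matrix: off-diagonal entries are non-negative and every row sums to zero.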
The remaining step is to compute the posterior state distribution at two consecutive observation times: p(s(t_v) = k, s(t_{v+1}) = l|O, T, \hat{Q}_0).

3.3 Computing the Posterior State Probabilities

The challenge in efficiently computing p(s(t_v) = k, s(t_{v+1}) = l|O, T, \hat{Q}_0) is to avoid the explicit enumeration of all possible state transition sequences and the variable time intervals between intermediate state transitions (from k to l). The key is to note that the posterior state probabilities are only needed at the times where we have observation data. We can exploit this insight to reformulate the estimation problem in terms of an equivalent discrete time-inhomogeneous hidden Markov model. Specifically, given the current estimate \hat{Q}_0, O, and T, we divide time into V intervals, each with duration \tau_v = t_v - t_{v-1}. We then make use of the transition property of the CTMC and associate each interval v with a state transition matrix P^v(\tau_v) := e^{\hat{Q}_0 \tau_v}. Together with the emission model p(o|s), we then have a discrete time-inhomogeneous hidden Markov model with joint likelihood

\prod_{v=1}^{V} [P^v(\tau_v)]_{(s(t_{v-1}), s(t_v))} \prod_{v=0}^{V} p(o_v|s(t_v)).   (11)

The formulation in Eq. 11 allows us to reduce the computation of p(s(t_v) = k, s(t_{v+1}) = l|O, T, \hat{Q}_0) to familiar operations. The forward-backward algorithm [13] can be used to compute the posterior distribution of the hidden states, which we refer to as the Soft method.
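As a sketch of the Soft method, the following code runs a scaled forward-backward pass over the equivalent time-inhomogeneous HMM of Eq. 11, with one transition matrix P^v = e^{Q \tau_v} per interval. The function name and interface are illustrative, and for simplicity a uniform initial state distribution is assumed:

```python
import numpy as np
from scipy.linalg import expm

def pairwise_posteriors(Q, obs_lik, dt):
    """Soft E-step: p(s_v = k, s_{v+1} = l | O, T, Q) for the equivalent
    discrete time-inhomogeneous HMM of Eq. 11, with P_v = expm(Q * dt_v).

    obs_lik : (V+1, S) emission likelihoods p(o_v | s = k) at each visit.
    dt      : (V,) time intervals between consecutive visits.
    """
    V1, S = obs_lik.shape
    P = [expm(Q * d) for d in dt]             # interval transition matrices
    # Forward pass, rescaled at each step to avoid underflow.
    alpha = np.zeros((V1, S))
    alpha[0] = obs_lik[0] / obs_lik[0].sum()  # uniform prior (assumption)
    for v in range(1, V1):
        a = (alpha[v - 1] @ P[v - 1]) * obs_lik[v]
        alpha[v] = a / a.sum()
    # Backward pass, with the same rescaling trick.
    beta = np.ones((V1, S))
    for v in range(V1 - 2, -1, -1):
        b = P[v] @ (obs_lik[v + 1] * beta[v + 1])
        beta[v] = b / b.sum()
    # Pairwise posteriors xi[v, k, l] = p(s_v = k, s_{v+1} = l | O).
    xi = np.zeros((V1 - 1, S, S))
    for v in range(V1 - 1):
        x = alpha[v][:, None] * P[v] * (obs_lik[v + 1] * beta[v + 1])[None, :]
        xi[v] = x / x.sum()
    return xi
```

Each V x S x S slice sums to one, and its row and column marginals give the single-visit posteriors needed elsewhere in the E-step.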
Alternatively, the MAP assignment of hidden states obtained from the Viterbi algorithm can provide an approximate distribution, which we refer to as the Hard method.

4 EM Algorithms for CT-HMM

Pseudocode for the EM algorithm for CT-HMM parameter learning is shown in Algorithm 1. Multiple variants of the basic algorithm are possible, depending on the choice of method for computing the end-state conditioned expectations, along with the choice of Hard or Soft decoding for obtaining the posterior state probabilities in Eq. 11.

Algorithm 1 CT-HMM Parameter Learning (Soft/Hard)
1: Input: data O = (o_0, \ldots, o_V) and T = (t_0, \ldots, t_V), state set S, edge set L, initial guess of Q
2: Output: transition rate matrix Q = (q_{ij})
3: Find all distinct time intervals t_\Delta, \Delta = 1, \ldots, r, from T
4: Compute P(t_\Delta) = e^{Q t_\Delta} for each t_\Delta
5: repeat
6:   Compute p(v, k, l) = p(s(t_v) = k, s(t_{v+1}) = l|O, T, Q) for all v, and the complete/state-optimized data likelihood l, using Forward-Backward (soft) or Viterbi (hard)
7:   Create the soft count table C(\Delta, k, l) from p(v, k, l) by summing probabilities over visits with the same t_\Delta
8:   Use the Expm, Unif or Eigen method to compute E[n_{ij}|O, T, Q] and E[\tau_i|O, T, Q]
9:   Update q_{ij} = E[n_{ij}|O, T, Q] / E[\tau_i|O, T, Q], and q_{ii} = -\sum_{j \neq i} q_{ij}
10: until likelihood l converges

Note that in line 7 of Algorithm 1, we group probabilities from successive visits with the same time interval and the same specified end-states in order to save computation time. This is valid because in a time-homogeneous CT-HMM, E[n_{ij}|s(t_v) = k, s(t_{v+1}) = l, \hat{Q}_0] = E[n_{ij}|s(0) = k, s(t_\Delta) = l, \hat{Q}_0], where t_\Delta = t_{v+1} - t_v, so that the expectations only need to be evaluated for each distinct time interval, rather than for each distinct visiting time (also see the discussion below Eq.
2).

4.1 Computing the End-State Conditioned Expectations

The remaining step in finalizing the EM algorithm is to discuss the computation of the end-state conditioned expectations of n_{ij} and \tau_i from Eqs. 9 and 10, respectively. The first step is to express the expectations in integral form, following [14]:

E[n_{ij}|s(0) = k, s(t) = l, Q] = \frac{q_{ij}}{P_{k,l}(t)} \int_0^t P_{k,i}(x) P_{j,l}(t - x) \, dx   (12)

E[\tau_i|s(0) = k, s(t) = l, Q] = \frac{1}{P_{k,l}(t)} \int_0^t P_{k,i}(x) P_{i,l}(t - x) \, dx.   (13)

From Eq. 12, we define \tau^{i,j}_{k,l}(t) = \int_0^t P_{k,i}(x) P_{j,l}(t - x) \, dx = \int_0^t (e^{Qx})_{k,i} (e^{Q(t-x)})_{j,l} \, dx, while \tau^{i,i}_{k,l}(t) can be similarly defined for Eq. 13 (see [6] for a similar construction). Several methods for computing \tau^{i,j}_{k,l}(t) and \tau^{i,i}_{k,l}(t) have been proposed in the CTMC literature. Metzner et al. observe that closed-form expressions can be obtained when Q is diagonalizable [8]. Unfortunately, this property is not guaranteed to hold, and in practice we find that the intermediate Q matrices are frequently not diagonalizable during EM iterations. We refer to this approach as Eigen.

An alternative is to leverage a classic method of Van Loan [15] for computing integrals of matrix exponentials. In this approach, an auxiliary matrix A = [[Q, B], [0, Q]] is constructed, where B is a matrix with dimensions identical to Q. It is shown in [15] that \int_0^t e^{Qx} B e^{Q(t-x)} \, dx = (e^{At})_{(1:n),(n+1):(2n)}, where n is the dimension of Q. Following [9], we set B = I(i, j), where I(i, j) is the matrix with a 1 in the (i, j)-th entry and 0 elsewhere. The upper-right block of e^{At} then contains \tau^{i,j}_{k,l}(t) for all k, l in the corresponding matrix entries, and we can leverage the substantial literature on numerical computation of the matrix exponential. We refer to this approach as Expm, after the popular Matlab function.
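The Van Loan construction can be sketched in a few lines (the function name is our own): build A from Q and B = I(i, j), and read off the upper-right block of e^{At}, which contains \tau^{i,j}_{k,l}(t) for all k, l simultaneously:

```python
import numpy as np
from scipy.linalg import expm

def end_state_integrals(Q, i, j, t):
    """Van Loan method: tau^{i,j}_{k,l}(t) = int_0^t P_{k,i}(x) P_{j,l}(t-x) dx
    for all (k, l) at once, as the upper-right n x n block of expm(A t) with
    A = [[Q, I(i,j)], [0, Q]]."""
    n = Q.shape[0]
    B = np.zeros((n, n))
    B[i, j] = 1.0                                  # B = I(i, j)
    A = np.block([[Q, B], [np.zeros((n, n)), Q]])  # 2n x 2n auxiliary matrix
    return expm(A * t)[:n, n:]                     # entry (k, l) = tau^{i,j}_{k,l}(t)
```

Dividing the (k, l) entry by P_{k,l}(t) and scaling by q_{ij} then yields the end-state conditioned expectation of Eq. 12; with j = i, the same construction serves Eq. 13.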
A third approach for computing the expectations, introduced by Hobolth and Jensen [9] for CTMCs, is called uniformization (Unif) and is described in the supplementary material, along with additional details for Expm.

Expm-Based Algorithm. Algorithm 2 presents pseudocode for the Expm method for computing the end-state conditioned statistics. The algorithm exploits the fact that the matrix A does not change with the time t_\Delta. Therefore, when using the scaling and squaring method [16] for computing matrix exponentials, one can easily cache and reuse the intermediate powers of A to efficiently compute e^{tA} for different values of t.

Algorithm 2 The Expm Algorithm for Computing End-State Conditioned Statistics
1: for each state i in S do
2:   for \Delta = 1 to r do
3:     D_i = (e^{t_\Delta A})_{(1:n),(n+1):(2n)} ./ P(t_\Delta), where A = [[Q, I(i,i)], [0, Q]]
4:     E[\tau_i|O, T, Q] += \sum_{(k,l) \in L} C(\Delta, k, l) (D_i)_{k,l}
5:   end for
6: end for
7: for each edge (i, j) in L do
8:   for \Delta = 1 to r do
9:     N_{ij} = q_{ij} (e^{t_\Delta A})_{(1:n),(n+1):(2n)} ./ P(t_\Delta), where A = [[Q, I(i,j)], [0, Q]]
10:    E[n_{ij}|O, T, Q] += \sum_{(k,l) \in L} C(\Delta, k, l) (N_{ij})_{k,l}
11:  end for
12: end for

4.2 Analysis of Time Complexity and Run-Time Comparisons

We conducted an asymptotic complexity analysis for all six combinations of Hard and Soft EM with the methods Expm, Unif, and Eigen for computing the conditional expectations. For both the hard and soft variants, the time complexity of Expm is O(rS^4 + rLS^3), where r is the number of distinct time intervals between observations, S is the number of states, and L is the number of edges. The soft version of Eigen has the same time complexity, but since the eigendecomposition of non-symmetric matrices can be ill-conditioned in any EM iteration [17], this method is not attractive.
Unif is based on truncating an infinite sum, and the truncation point M varies with \max_{i, t_\Delta} q_i t_\Delta, with the result that the cost of Unif varies significantly with both the data and the parameters. In comparison, Expm is much less sensitive to these values (logarithmic versus quadratic dependency). See the supplementary material for the details. We conclude that Expm is the most robust method available for the soft EM case. When the state space is large, hard EM can be used to trade off accuracy for time. In the hard EM case, Unif can be more efficient than Expm, because Unif can evaluate only the expectations specified by the required end-states from the best decoded paths, whereas Expm must always produce results for all end-states.

These asymptotic results are consistent with our experimental findings. On the glaucoma dataset from Section 5.2, using a model with 105 states, Soft Expm requires 18 minutes per iteration on a 2.67 GHz machine with unoptimized MATLAB code, while Soft Unif spends more than 105 minutes per iteration, Hard Unif spends 2 minutes per iteration, and Eigen fails.

5 Experimental Results

We evaluated our EM algorithms in simulation (Sec. 5.1) and on two real-world datasets: a glaucoma dataset (Sec. 5.2), on which we compare our prediction performance to a state-of-the-art method, and a dataset for Alzheimer's disease (AD, Sec. 5.3), on which we compare the visualized progression trends to recent findings in the literature. Our disease progression models employ 105 (glaucoma) and 277 (AD) states, representing a significant advance in the ability to work with large models (previous CT-HMM works [2, 7, 5] employed fewer than 100 states).

5.1 Simulation on a 5-State Complete Digraph

We test the accuracy of all methods on a 5-state complete digraph with synthetic data generated under different noise levels. Each q_i is randomly drawn from [1, 5], and then each q_{ij} is drawn from [0, 1] and renormalized such that \sum_{j \neq i} q_{ij} = q_i. The state chains are generated from Q such that each chain has a total duration of around T = 100 / \min_i q_i, where 1 / \min_i q_i is the largest mean holding time. The data emission model for state i is set as N(i, \sigma^2), where \sigma varies under the different noise level settings. The observations are then sampled from the state chains at intervals of 0.5 \cdot (1 / \max_i q_i), where 1 / \max_i q_i is the smallest mean holding time, which should be dense enough to make the chain identifiable. A total of 10^5 observations are sampled. The average 2-norm relative error ||\hat{q} - q|| / ||q|| is used as the performance metric, where \hat{q} is a vector containing all learned q_{ij} parameters and q is the ground truth.

The simulation results from 5 random runs are listed in Table 1. Expm and Unif produce nearly identical results, so they are combined in the table. Eigen fails at least once for each setting, but when it works it produces similar results.

Table 1: The average 2-norm relative error from 5 random runs on a 5-state complete digraph under varying noise levels. The convergence threshold is <= 10^{-8} on the relative data likelihood change.

Error          | sigma = 1/4   | sigma = 3/8   | sigma = 1/2   | sigma = 1     | sigma = 2
S(Expm,Unif)   | 0.026±0.008   | 0.032±0.008   | 0.042±0.012   | 0.199±0.084   | 0.510±0.104
H(Expm,Unif)   | 0.031±0.009   | 0.197±0.062   | 0.476±0.100   | 0.857±0.080   | 0.925±0.030

Figure 1: (a) The 2D-grid state structure for glaucoma progression modeling. (b) Illustration of the prediction of future states from s(0) = i. (c) One fold of the convergence behavior of Soft(Expm) on the glaucoma dataset.

All Soft methods achieve significantly better accuracy than Hard methods, especially when the noise level becomes higher.
This can be attributed to the maintenance of the full hidden state distribution, which makes the soft methods more robust to noise.

5.2 Application of CT-HMM to Predicting Glaucoma Progression

In this experiment, we used CT-HMM to visualize a real-world glaucoma dataset and to predict glaucoma progression. Glaucoma is a leading cause of blindness and visual morbidity worldwide [18]. This disease is characterized by a slowly progressing optic neuropathy with associated irreversible structural and functional damage. There are conflicting findings on the temporal ordering of detectable structural and functional changes, which confound glaucoma clinical assessment and treatment plans [19]. Here, we use a 2D-grid state space model with 105 states, defined by successive value bands of the two main glaucoma markers, the Visual Field Index (VFI) (functional marker) and average RNFL (Retinal Nerve Fiber Layer) thickness (structural marker), with forwarding edges (see Fig. 1(a)). More details of the dataset and model can be found in the supplementary material. We utilize Soft Expm for the following experiments, since it converges quickly (see Fig. 1(c)), has an acceptable computational cost, and exhibits the best performance.

To predict future continuous measurements, we follow a simple procedure illustrated in Fig. 1(b). Given a testing patient, Viterbi decoding is used to decode the best hidden state path for the past visits. Then, given a future time t, the most probable future state is predicted by j = \arg\max_j P_{ij}(t) (blue node), where i is the current state (black node). To predict the continuous measurements, we search, at the desired resolution, for the future times t_1 and t_2 at which the patient enters and leaves a state with the same value range as state j, for each disease marker separately. The measurement at time t can then be computed by linear interpolation between t_1 and t_2 and the two data bounds of state j for the specified marker ([b_1, b_2] in Fig.
1(b)). The mean absolute error (MAE) between the predicted values and the actual measurements was used for performance assessment. The performance of CT-HMM was compared to both conventional linear regression and Bayesian joint linear regression [11]. For the Bayesian method, the joint prior distribution of the four parameters (two intercepts and two slopes) computed from the training set [11] is used alongside the data likelihood. The results in Table 2 demonstrate the significantly improved performance of CT-HMM.

Table 2: The mean absolute error (MAE) of predicting the two glaucoma measures. (* indicates that CT-HMM performs significantly better than the competing method under a Student's t-test.)

MAE  | CT-HMM       | Bayesian Joint Linear Regression | Linear Regression
VFI  | 4.64 ± 10.06 | 5.57 ± 11.11 * (p = 0.005)       | 7.00 ± 12.22 * (p ≈ 0.000)
RNFL | 7.05 ± 6.57  | 9.65 ± 8.42 * (p ≈ 0.000)        | 18.13 ± 20.70 * (p ≈ 0.000)

In Fig. 2(a), we visualize the model trained using the entire dataset. Several dominant paths can be identified: there is an early stage consisting of RNFL thinning with intact vision (blue vertical path in the first column), and at around the RNFL range [80, 85] the transition trend reverses and VFI changes become more evident (blue horizontal paths). This L shape in the disease progression supports the finding in [20] that an RNFL thickness of around 77 microns is a tipping point at which functional deterioration becomes clinically observable alongside structural deterioration. Our 2D CT-HMM model can be used to visualize the non-linear relationship between structural and functional degeneration, yielding insights into the progression process.

5.3 Application of CT-HMM to Exploratory Analysis of Alzheimer's Disease

We now demonstrate the use of CT-HMM as an exploratory tool to visualize the temporal interaction of disease markers of Alzheimer's Disease (AD). AD is an irreversible neuro-degenerative disease that results in a loss of mental function due to the degeneration of brain tissues. An estimated 5.3 million Americans have AD, yet no prevention or cure has been found [21]. It could be beneficial to visualize the relationship between clinical, imaging, and biochemical markers as the pathology evolves, in order to better understand AD progression and develop treatments.

A 277-state CT-HMM model was constructed from a cohort of AD patients (see the supplementary material for additional details). The 3D visualization result is shown in Fig. 2(b). The state transition trends show that abnormality of the Aβ level emerges first (blue lines), while cognition scores are still normal. Hippocampus atrophy happens more often (green lines) when Aβ levels are already low and cognition has started to show abnormality. Most cognition degeneration happens (red lines) when both Aβ levels and hippocampus volume are already in abnormal stages. Our quantitative visualization results support recent findings that the decrease of the Aβ level in CSF is an early marker preceding detectable hippocampus atrophy in cognition-normal elderly [22]. The CT-HMM disease model with interactive visualization can be utilized as an exploratory tool to gain insights into disease progression and to generate hypotheses for further investigation by medical researchers.

Figure 2: Visualization scheme: (a) The strongest transition among the three instantaneous links from each state is shown in blue, while the other transitions are drawn in dotted black. The line width and the node size reflect the expected count.
The node color represents the average sojourn time (red to green: 0 to 5 years and above). (b) Similar to (a), but the strongest transition from each state is color-coded as follows: Aβ direction (blue), hippo (green), cog (red), Aβ+hippo (cyan), Aβ+cog (magenta), hippo+cog (yellow), Aβ+hippo+cog (black). The node color represents the average sojourn time (red to green: 0 to 3 years and above).

6 Conclusion

In this paper, we present novel EM algorithms for CT-HMM learning which leverage recent approaches [9] for evaluating the end-state conditioned expectations in CTMC models. To our knowledge, we are the first to develop and test the Expm and Unif methods for CT-HMM learning. We also analyze their time complexity and provide experimental comparisons among the methods under the soft and hard EM frameworks. We find that soft EM is more accurate than hard EM, and that Expm works best under soft EM. We evaluated our EM algorithms on two disease progression datasets for glaucoma and AD. We show that CT-HMM outperforms the state-of-the-art Bayesian joint linear regression method [11] for glaucoma progression prediction. This demonstrates the practical value of CT-HMM for longitudinal disease modeling and prediction.

Acknowledgments

Portions of this work were supported in part by NIH R01 EY13178-15 and by grant U54EB020404 awarded by the National Institute of Biomedical Imaging and Bioengineering through funds provided by the Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). Additionally, the collection and sharing of the Alzheimer's data was funded by ADNI under NIH U01 AG024904 and DOD award W81XWH-12-2-0012.
The research was also supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

References

[1] D. R. Cox and H. D. Miller, The Theory of Stochastic Processes. London: Chapman and Hall, 1965.
[2] C. H. Jackson, "Multi-state models for panel data: the msm package for R," Journal of Statistical Software, vol. 38, no. 8, 2011.
[3] N. Bartolomeo, P. Trerotoli, and G. Serio, "Progression of liver cirrhosis to HCC: an application of hidden Markov model," BMC Med Research Methodol., vol. 11, no. 38, 2011.
[4] Y. Liu, H. Ishikawa, M. Chen, et al., "Longitudinal modeling of glaucoma progression using 2-dimensional continuous-time hidden Markov model," Med Image Comput Comput Assist Interv, vol. 16, no. 2, pp. 444-451, 2013.
[5] X. Wang, D. Sontag, and F. Wang, "Unsupervised learning of disease progression models," in Proceedings of KDD, vol. 4, no. 1, pp. 85-94, 2014.
[6] U. Nodelman, C. R. Shelton, and D. Koller, "Expectation maximization and complex duration distributions for continuous time Bayesian networks," in Proc. Uncertainty in AI (UAI 05), 2005.
[7] J. M. Leiva-Murillo, A. Artés-Rodríguez, and E. Baca-García, "Visualization and prediction of disease interactions with continuous-time hidden Markov models," in NIPS, 2011.
[8] P. Metzner, I. Horenko, and C. Schütte, "Generator estimation of Markov jump processes based on incomplete observations nonequidistant in time," Physical Review E, vol. 76, no. 066702, 2007.
[9] A. Hobolth and J. L. Jensen, "Summary statistics for endpoint-conditioned continuous-time Markov chains," Journal of Applied Probability, vol. 48, no. 4, pp. 911-924, 2011.
[10] P. Tataru and A. Hobolth, "Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains," BMC Bioinformatics, vol. 12, no. 465, 2011.
[11] F. Medeiros, L. Zangwill, C. Girkin, et al., "Combining structural and functional measurements to improve estimates of rates of glaucomatous progression," Am J Ophthalmol, vol. 153, no. 6, pp. 1197-1205, 2012.
[12] M. Bladt and M. Sørensen, "Statistical inference for discretely observed Markov jump processes," J. R. Statist. Soc. B, vol. 39, no. 3, pp. 395-410, 2005.
[13] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, 1989.
[14] A. Hobolth and J. L. Jensen, "Statistical inference in evolutionary models of DNA sequences via the EM algorithm," Statistical Applications in Genetics and Molecular Biology, vol. 4, no. 1, 2005.
[15] C. Van Loan, "Computing integrals involving the matrix exponential," IEEE Trans. Automatic Control, vol. 23, pp. 395-404, 1978.
[16] N. Higham, Functions of Matrices: Theory and Computation. SIAM, 2008.
[17] P. Metzner, I. Horenko, and C. Schütte, "Generator estimation of Markov jump processes," Journal of Computational Physics, vol. 227, pp. 353-375, 2007.
[18] S. Kingman, "Glaucoma is second leading cause of blindness globally," Bulletin of the World Health Organization, vol. 82, no. 11, 2004.
[19] G. Wollstein, J. Schuman, L. Price, et al., "Optical coherence tomography longitudinal evaluation of retinal nerve fiber layer thickness in glaucoma," Arch Ophthalmol, vol. 123, no. 4, pp. 464-470, 2005.
[20] G. Wollstein, L. Kagemann, R. Bilonick, et al., "Retinal nerve fibre layer and visual function loss in glaucoma: the tipping point," Br J Ophthalmol, vol. 96, no. 1, pp. 47-52, 2012.
[21] The Alzheimer's Disease Neuroimaging Initiative, http://adni.loni.usc.edu.
[22] A. M. Fagan, D. Head, A. R. Shah, et al., "Decreased CSF Abeta42 correlates with brain atrophy in cognitively normal elderly," Ann Neurol., vol. 65, no. 2, pp. 176-183, 2009.