{"title": "Collapsed variational Bayes for Markov jump processes", "book": "Advances in Neural Information Processing Systems", "page_first": 3749, "page_last": 3757, "abstract": "Markov jump processes are continuous-time stochastic processes widely used in statistical applications in the natural sciences, and more recently in machine learning. Inference for these models typically proceeds via Markov chain Monte Carlo, and can suffer from various computational challenges. In this work, we propose a novel collapsed variational inference algorithm to address this issue. Our work leverages ideas from discrete-time Markov chains, and exploits a connection between these two through an idea called uniformization. Our algorithm proceeds by marginalizing out the parameters of the Markov jump process, and then approximating the distribution over the trajectory with a factored distribution over segments of a piecewise-constant function. Unlike MCMC schemes that marginalize out transition times of a piecewise-constant process, our scheme optimizes the discretization of time, resulting in significant computational savings. We apply our ideas to synthetic data as well as a dataset of check-in recordings, where we demonstrate superior performance over state-of-the-art MCMC methods.", "full_text": "Collapsed variational Bayes for Markov jump processes\n\nJiangwei Pan∗†\nDepartment of Computer Science\nDuke University\npanjiangwei@gmail.com\n\nBoqian Zhang∗\nDepartment of Statistics\nPurdue University\nzhan1977@purdue.edu\n\nVinayak Rao\nDepartment of Statistics\nPurdue University\nvarao@purdue.edu\n\nAbstract\n\nMarkov jump processes are continuous-time stochastic processes widely used in statistical applications in the natural sciences, and more recently in machine learning. Inference for these models typically proceeds via Markov chain Monte Carlo, and can suffer from various computational challenges. 
In this work, we propose a novel collapsed variational inference algorithm to address this issue. Our work leverages ideas from discrete-time Markov chains, and exploits a connection between these two through an idea called uniformization. Our algorithm proceeds by marginalizing out the parameters of the Markov jump process, and then approximating the distribution over the trajectory with a factored distribution over segments of a piecewise-constant function. Unlike MCMC schemes that marginalize out transition times of a piecewise-constant process, our scheme optimizes the discretization of time, resulting in significant computational savings. We apply our ideas to synthetic data as well as a dataset of check-in recordings, where we demonstrate superior performance over state-of-the-art MCMC methods.\n\n1 Markov jump processes\nMarkov jump processes (MJPs) (Çinlar, 1975) are stochastic processes that generalize discrete-time discrete-state Markov chains to continuous time. MJPs find wide application in fields like biology, chemistry and ecology, where they are used to model phenomena like the evolution of population sizes (Opper and Sanguinetti, 2007), gene regulation (Boys et al., 2008), or the state of a computing network (Xu and Shelton, 2010). A realization of an MJP is a random piecewise-constant function of time, transitioning between a set of states, usually of finite cardinality N (see Figure 1, left). This stochastic process is parametrized by an N × 1 distribution π giving the initial distribution over states, and an N × N rate matrix A governing the dynamics of the process. The off-diagonal element Aij (i ≠ j) gives the rate of transitioning from state i to j, and these elements determine the diagonal element Aii according to the relation Aii = −Σ_{j≠i} Aij. Thus, the rows of A sum to 0, and the negative of the diagonal element Aii gives the total rate of leaving state i. 
Simulating a trajectory from an MJP over an interval [0, T] follows what is called the Gillespie algorithm (Gillespie, 1977):\n\n1. First, at time t = 0, sample an initial state s0 from π.\n2. From here onwards, upon entering a new state i, sample the time of the next transition from an exponential with rate |Aii|, and then a new state j ≠ i with probability proportional to Aij. These latter two steps are repeated until the end of the interval, giving a piecewise-constant trajectory consisting of a sequence of holds and jumps.\n\nNote that under this formulation, it is impossible for the system to make self-transitions; these are effectively absorbed into the rate parameters Aii.\n\n∗Equal contribution\n†Now at Facebook\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: (left) a realization of an MJP, (right) sampling a path via uniformization.\n\nBayesian inference for MJPs: In practical applications, one only observes the MJP trajectory S(t) indirectly through a noisy observation process. Abstractly, this forms a hidden Markov model problem, now in continuous time. For instance, the states of the MJP could correspond to different states of a gene network, and rather than observing these directly, one only has noisy gene-expression level measurements. Alternatively, each state i can have an associated emission rate λi, and rather than directly observing S(t) or λS(t), one observes a realization of a Poisson process with intensity λS(t). The Poisson events could correspond to mutation events on a strand of DNA, with position indexed by t (Fearnhead and Sherlock, 2006). In this work, we consider a dataset of users logging their activity into the social media website FourSquare, with each 'check-in' consisting of a time and a location. 
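To make the generative process concrete, the two Gillespie steps above can be sketched in a few lines of Python (the function name and the 2-state example rate matrix are our own illustration, not from the paper):

```python
import random

def sample_mjp_path(A, pi, T, rng=random):
    """Sample an MJP trajectory on [0, T] via Gillespie's algorithm.

    A  : N x N rate matrix (list of lists), rows summing to 0
    pi : initial distribution over the N states
    Returns parallel lists of jump times and states (a piecewise-constant path).
    """
    N = len(A)
    s = rng.choices(range(N), weights=pi)[0]   # initial state s0 ~ pi
    times, states = [0.0], [s]
    t = 0.0
    while True:
        rate = -A[s][s]                        # total rate of leaving state s
        if rate <= 0:                          # absorbing state: hold until T
            break
        t += rng.expovariate(rate)             # exponential holding time
        if t >= T:
            break
        # next state j != s with probability proportional to A[s][j]
        weights = [A[s][j] if j != s else 0.0 for j in range(N)]
        s = rng.choices(range(N), weights=weights)[0]
        times.append(t)
        states.append(s)
    return times, states

# Example: a 2-state chain switching at rates 1.0 and 0.5
A = [[-1.0, 1.0], [0.5, -0.5]]
times, states = sample_mjp_path(A, [0.5, 0.5], T=10.0)
```

Note that, as in the text, the sampled path contains no self-transitions: leaving a state always means entering a different one.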
We model each user with an MJP, with different states having different distributions over check-in locations. Given a sequence of user check-ins, one is interested in quantities like the latent state of the user, various clusters of check-in locations, and the rate at which users transition from one state to another. We describe this problem and the dataset in more detail in our experiments.\nIn typical situations, the parameters π and A are themselves unknown, and it is necessary to learn these, along with the latent MJP trajectory, from the observed data. A Bayesian approach places a prior over these parameters and uses the observed data to obtain a posterior distribution. A simple and convenient prior over A is a Dirichlet-Gamma prior: this places a Dirichlet prior over π, and models the off-diagonal elements Aij as draws from a Gamma(a, b) distribution. The negative diagonal element |Aii| is then just the sum of the corresponding elements from the same row, and is marginally distributed as a Gamma((N − 1)a, b) variable. This prior is convenient in the context of MCMC sampling, allowing a Gibbs sampler that alternately samples (π, A) given an MJP trajectory S(t), and then a new trajectory S(t) given A and the observations. The first step is straightforward: given an MJP trajectory, the Dirichlet-Gamma prior is conjugate, resulting in a simple Dirichlet-Gamma posterior (but see Fearnhead and Sherlock (2006) and the next section for a slight generalization that continues to be conditionally conjugate). Similarly, recent developments in MCMC inference have made the second step fairly standard and efficient; see Rao and Teh (2014) and Hajiaghayi et al. (2014). Despite its computational simplicity, this Gibbs sampler comes at a price: it can mix slowly due to coupling between S(t) and A. 
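To illustrate the conjugacy just mentioned, the sketch below computes the Gamma posterior hyperparameters of the off-diagonal rates from a trajectory's sufficient statistics (transition counts and holding times). The routine and its names are our own; the paper does not spell this step out:

```python
def gamma_posterior_params(times, states, T, N, a, b):
    """Posterior hyperparameters for off-diagonal rates A_ij ~ Gamma(a, b),
    given a piecewise-constant trajectory (parallel lists of jump times and
    states on [0, T]).  By conjugacy, A_ij | S(t) ~ Gamma(a + n_ij, b + t_i),
    where n_ij counts i -> j transitions and t_i is the total time in state i."""
    n = [[0] * N for _ in range(N)]           # transition counts n_ij
    hold = [0.0] * N                           # total holding time per state
    for k, s in enumerate(states):
        end = times[k + 1] if k + 1 < len(times) else T
        hold[s] += end - times[k]
        if k + 1 < len(states):
            n[s][states[k + 1]] += 1
    shape = [[a + n[i][j] for j in range(N)] for i in range(N)]  # diagonal unused
    rate = [b + hold[i] for i in range(N)]     # shared rate parameter per row
    return shape, rate

# Trajectory 0 -> 1 -> 0 on [0, 5]: state 0 held for 2.0, state 1 for 3.0
shape, rate = gamma_posterior_params([0.0, 1.0, 4.0], [0, 1, 0],
                                     T=5.0, N=2, a=1.0, b=1.0)
```

Here the usual Gamma likelihood for rates applies: each off-diagonal rate is raised to its transition count and exponentially penalized by the holding time in the source state.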
Alternate approaches like particle MCMC (Andrieu et al., 2010) do not exploit the MJP structure, resulting in low acceptance rates and estimates with high variance. These challenges associated with MCMC raise the need for new techniques for Bayesian inference. Here, we bring recent ideas from variational Bayes to posterior inference for MJPs, proposing a novel and efficient collapsed variational algorithm that marginalizes out the parameter A, thereby addressing the issue of slow mixing. Our algorithm adaptively finds regions of low and high transition activity, rather than integrating these out. In our experiments, we show that this can bring significant computational benefits. Our algorithm is based on an alternate approach to sampling an MJP trajectory called uniformization (Jensen, 1953), which we describe next.\n2 Uniformization\nGiven a rate matrix A, choose an Ω > maxi |Aii|, and sample a set of times from a Poisson process with intensity Ω. These form a random discretization of time, giving a set of candidate transition times (Figure 1, top right). Next, sample a piecewise-constant trajectory by running a discrete-time Markov chain over these times, with Markov transition matrix B = (I + (1/Ω)A), and with initial distribution π. It is easy to verify that B is a valid transition matrix with at least one non-zero diagonal element. This allows the discrete-time system to move back to the same state, something impossible under the original MJP. 
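The uniformization construction just described can be sketched as follows, with the self-transitions discarded at the end as discussed below (the function name and example are our own illustration):

```python
import random

def sample_mjp_uniformization(A, pi, T, Omega, rng=random):
    """Sample an MJP path on [0, T] by uniformization: draw candidate times
    from a rate-Omega Poisson process, run a discrete-time chain with
    transition matrix B = I + A/Omega over them, then drop self-transitions.
    Requires Omega >= max_i |A_ii| so that B is a valid transition matrix."""
    N = len(A)
    B = [[(1.0 if i == j else 0.0) + A[i][j] / Omega for j in range(N)]
         for i in range(N)]
    # Candidate transition times: a rate-Omega Poisson process on [0, T]
    grid, t = [], rng.expovariate(Omega)
    while t < T:
        grid.append(t)
        t += rng.expovariate(Omega)
    s = rng.choices(range(N), weights=pi)[0]
    times, states = [0.0], [s]
    for t in grid:
        s_new = rng.choices(range(N), weights=B[s])[0]
        if s_new != s:                 # keep only the actual transitions
            times.append(t)
            states.append(s_new)
            s = s_new
    return times, states

A = [[-1.0, 1.0], [0.5, -0.5]]
times, states = sample_mjp_uniformization(A, [0.5, 0.5], T=10.0, Omega=2.0)
```

After discarding self-transitions, the returned path has the same distribution as a Gillespie sample, for any valid Ω.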
In fact, as Ω increases, the probability of self-transitions increases; at the same time, however, a large Ω implies a large number of Poisson-distributed candidate times. Thus the self-transitions serve to discard excess candidate times, and one can show (Jensen, 1953; Rao and Teh, 2014) that after discarding the self-transitions, the resulting distribution over trajectories is identical to that of an MJP with rate matrix A, for any Ω ≥ maxi |Aii| (Figure 1, bottom right).\nRao and Teh (2012) describe a generalization where, instead of a single Ω, each state i has its own dominating rate Ωi > |Aii|. The transition matrix B is now defined as Bii = 1 + Aii/Ωi, and Bij = Aij/Ωi, for all i, j ∈ (1, . . . , N), i ≠ j. Now, on entering state i, one proposes the next candidate transition time from a rate-Ωi exponential, and then samples the next state from Bi. As before, self-transitions amount to rejecting the opportunity to leave state i. Large Ωi result in more candidate transition times, but also more self-transitions. Rao and Teh (2012) show that these two effects cancel out, and the resulting path, after discarding self-transitions, is a sample from an MJP.\nAn alternate prior on the parameters of an MJP: We use uniformization to formulate a novel prior distribution over the parameters of an MJP; this will facilitate our later variational Bayes algorithm. Consider Ai, the ith row of the rate matrix A. This is specified by the diagonal element Aii, and the vector Bi := (1/|Aii|)(Ai1, · · · , Ai,i−1, 0, Ai,i+1, · · · , AiN). Recall that the latter is a probability vector, giving the probability of the next state after i. 
In Fearnhead and Sherlock (2006), the authors place a Gamma prior on |Aii|, and, effectively, a Dirichlet(α, · · · , 0, · · · , α) prior on Bi (although they treat Bi as an (N − 1)-component vector by ignoring the 0 at position i).\nWe place a Dirichlet(a, · · · , a0, · · · , a) prior on Bi for all i, with a0 at position i. Such Bi's allow self-transitions, and form the rows of the transition matrix B from uniformization. Note that under uniformization, the row Ai is uniquely specified by the pair (Ω, Bi) via the relationship Ai = Ω(Bi − 1i), where 1i is the indicator for i. We complete our specification by placing a Gamma prior over Ω.\nNote that since the rows of A sum to 0, and the rows of B sum to 1, both matrices are completely determined by N(N − 1) elements. On the other hand, our specification has N(N − 1) + 1 random variables, the additional term arising because of the prior over Ω. Given A, Ω plays no role in the generative process defined by Gillespie's algorithm, although it is an important parameter in MCMC inference algorithms based on uniformization. In our situation, B represents transition probabilities at the candidate transition times, and now Ω does carry information about A, namely the distribution over event times. Later, we will look at the implied marginal distribution over A. First, however, we consider the generalized uniformization scheme of Rao and Teh (2012). Here we have N additional parameters, Ω1 to ΩN. Again, under our model, we place Gamma priors over these Ωi's, and Dirichlet priors on the rows of the transition matrix B.\nNote that in Rao and Teh (2014, 2012), Ω is set to 2 maxi |Aii|. 
From the identity B = I + (1/Ω)A, it follows that under any prior over A, with probability 1, the smallest diagonal element of B is 1/2. Our specification avoids such a constrained prior over B, instead introducing an additional random variable Ω. Indeed, our approach is equivalent to a prior over (Ω, A), with Ω = k maxi |Aii| for some random k. We emphasize that the choice of this prior over k does not affect the generative model, only the induced inference algorithms, such as that of Rao and Teh (2014) or our proposed algorithm.\nTo better understand the implied marginal distribution over A, consider the representation of Rao and Teh (2012), with independent Gamma priors over the Ωi's. We have the following result:\nProposition 1. Place independent Dirichlet priors on the rows of B as above, and independent Gamma((N − 1)a + a0, b) priors on the Ωi. Then, the associated matrix A has off-diagonal elements that are marginally Gamma(a, b)-distributed, and negative-diagonal elements that are marginally Gamma((N − 1)a, b)-distributed, with the rows of A adding to 0 almost surely.\nThe proposition is a simple consequence of the Gamma-Dirichlet calculus: first observe that the collection of variables ΩiBij, for j ≠ i, is a vector of independent Gamma(a, b) variables. Noting that Aij = ΩiBij, we have that the off-diagonal elements of A are independent Gamma(a, b)s, for i ≠ j. Our proof is complete when we notice that the rows of A sum to 0, and that the sum of independent Gamma variables with a common rate is still Gamma-distributed, with shape parameter equal to the sum of the shapes. It is also easy to see that given A, Ωi is given by Ωi = |Aii| + ωi, where ωi ∼ Gamma(a0, b).\nIn this work, we will simplify matters by scaling all rows by a single, shared Ω. 
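As a quick numerical sanity check of Proposition 1 (our own experiment, not from the paper), one can simulate the (Ωi, Bi) construction and compare the empirical mean of an off-diagonal rate with the Gamma(a, b) mean a/b:

```python
import random

def sample_rate_row(i, N, a, a0, b, rng=random):
    """One row of A under the prior of Proposition 1:
    Omega_i ~ Gamma((N-1)a + a0, rate b), i.e. scale 1/b in the stdlib's
    parameterization; B_i ~ Dirichlet(a, ..., a0 at position i, ..., a),
    built by normalizing independent Gammas; and A_ij = Omega_i * B_ij."""
    omega = rng.gammavariate((N - 1) * a + a0, 1.0 / b)
    g = [rng.gammavariate(a0 if j == i else a, 1.0) for j in range(N)]
    total = sum(g)
    return [omega * x / total for x in g]

# Monte Carlo check: an off-diagonal entry should average about a/b
random.seed(0)
N, a, a0, b = 4, 2.0, 1.0, 3.0
draws = [sample_rate_row(0, N, a, a0, b)[1] for _ in range(20000)]
mean01 = sum(draws) / len(draws)   # should be close to a/b = 2/3
```

By the Gamma-Dirichlet calculus this is exact, not approximate: the normalized Gammas are Dirichlet and independent of their sum, so Ωi Bij recovers independent Gamma(a, b) variables off the diagonal.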
This will result in a vector of Aij's, each marginally distributed as a Gamma variable, but now positively correlated due to the common Ω. We will see that this simplification does not affect the accuracy of our method. In fact, our variational algorithm will maintain just a point estimate for Ω, so that its effect on the correlation between the Aij's is negligible.\n3 Variational inference for MJPs\nGiven noisy observations X of an MJP, we are interested in the posterior p(S(t), A|X). Following the earlier section, we choose an augmented representation, where we replace A with the pair (B, Ω). Similarly, we represent the MJP trajectory S(t) with the pair (T, U), where T is the set of candidate transition times, and U (with |U| = |T|) is the set of states at these times. For our variational algorithm, we will integrate out the Markov transition matrix B, working instead with the marginal distribution p(T, U, Ω). Such a collapsed representation avoids issues that plague MCMC and VB approaches, where coupling between trajectory and transition matrix slows down mixing/convergence. The distribution p(T, U, Ω) is still intractable however, and as is typical in variational algorithms, we will make a factorial approximation p(T, U, Ω) ≈ q(T, U)q(Ω). Writing q(T, U) = q(U|T)q(T), we shall also restrict q(T) to a delta function: q(T) = δ_T̂(T) for some T̂. In this way, finding the 'best' approximating q(T) within this class amounts to finding a 'best' discretization of time. This approach of optimizing over a time-discretization is in contrast to MCMC schemes that integrate out the time discretization, and has two advantages:\nSimplified computation: Searching over time-discretizations can be significantly more efficient than integrating them out. 
This is especially true when a trajectory involves bursts of transitions interspersed with long periods of inactivity, where schemes like Rao and Teh (2014) can be quite inefficient.\nBetter interpretability: A number of applications use MJPs as tools to segment a time interval into inhomogeneous segments. A full distribution over such an object can be hard to deal with.\nFollowing work on variational inference for discrete-time Markov chains (Wang and Blunsom, 2013), we will approximate q(U|T) factorially as q(U|T) = ∏_{t=1}^{|T|} q(ut). Finally, since we fix q(T) to a delta function, we will also restrict q(Ω) to a delta function, only representing uncertainty in the MJP parameters via the marginalized transition matrix B.\nWe emphasize that even though we optimize over time discretizations, we still maintain posterior uncertainty over the MJP state. Thus our variational approximation represents a distribution over piecewise-constant trajectories as a single discretization of time, with a probability vector over states for each time segment (Figure 2). Such an approximation does not involve too much loss of information, while being more convenient than a full distribution over trajectories, or a set of sample paths. While optimizing over trajectories, our algorithm attempts to find segments where the distribution over states is reasonably constant; if not, it will refine a segment into two smaller ones.\nOur overall variational inference algorithm then involves minimizing the Kullback-Leibler divergence between this posterior approximation and the true posterior. We do this in a coordinate-wise manner:\n1) Updating q(U|T) = ∏_{t=1}^{|T|} q(ut): Given a discretization T, and an Ω, uniformization tells us that inference over U is just inference for a discrete-time hidden Markov model. We adapt the approach of Wang and Blunsom (2013) to update q(U). 
Assume the observations X follow an exponential-family likelihood with parameter Cs for state s: p(x^l_t | St = s) = exp(φ(x^l_t)^T Cs) h(x^l_t) / Z(Cs), where Z is the normalization constant, and x^l_t is the l-th observation in the interval [Tt, Tt+1). Then for a sequence of |T| observations, we have\n\np(X, U | B, C) ∝ ∏_{t=0}^{|T|} [ B_{ut,ut+1} ∏_{l=1}^{nt} exp(φ(x^l_t)^T C_{ut}) h(x^l_t)/Z(C_{ut}) ] = [ ∏_{i=1}^{S} ∏_{j=1}^{S} B_{ij}^{#ij} ] [ ∏_{i=1}^{S} exp(φ̄_i^T Ci) ] ( ∏_{t=0}^{|T|} ∏_{l=1}^{nt} h(x^l_t)/Z(C_{ut}) ).\n\nHere nt is the number of observations in [Tt, Tt+1), #ij is the number of transitions from state i to j, φ̃_t = Σ_{l=1}^{nt} φ(x^l_t), and φ̄_i = Σ_{t s.t. ut=i} φ̃_t. Placing Dirichlet(α) priors on the rows of B, and an appropriate conjugate prior on C, we have\n\np(X, U, B, C) ∝ [ ∏_{i=1}^{S} (Γ(Sα)/Γ(α)^S) ∏_{j=1}^{S} B_{ij}^{#ij+α−1} ] [ ∏_{i=1}^{S} exp(Ci^T(φ̄_i + β)) ] ( ∏_{t=0}^{|T|} ∏_{l=1}^{nt} h(x^l_t)/Z(C_{ut}) ).\n\nIntegrating out B and C, and writing #i for the number of visits to state i, we have:\n\np(X, U) ∝ ∏_{i=1}^{S} [ (Γ(Sα)/Γ(#i + Sα)) ∏_{j=1}^{S} (Γ(#ij + α)/Γ(α)) ] ∏_{i=1}^{S} Z̄_i(φ̄_i + β).\n\nThen, writing #¬t for counts computed with the assignment at step t removed,\n\np(ut = k | ·) ∝ (#¬t_{ut−1,k} + α) · ((#¬t_{k,ut+1} + α + δ^{t−1,t+1}_k) / (#¬t_k + Sα + δ^{t−1}_k)) · (Z̄_k(φ̄¬t_k + φ̄_k(Xt) + β) / Z̄_k(φ̄¬t_k + β)),\n\nwhere δ^{t−1,t+1}_k = 1 if ut−1 = ut+1 = k (and 0 otherwise), δ^{t−1}_k = 1 if ut−1 = k, and φ̄_k(Xt) = φ̃_t is the summed statistic of the observations in segment t. Standard variational inference calculations give the solution to argmin KL(q(U, T, Ω) ∥ p(U, T, Ω | X)) as log q(ut) = Eq¬t[log p(ut | ·)] + const. We then have the update rule:\n\nq(ut = k) ∝ Eq¬t[#¬t_{ut−1,k} + α] · (Eq¬t[#¬t_{k,ut+1} + α + δ^{t−1,t+1}_k] / Eq¬t[#¬t_k + Sα + δ^{t−1}_k]) · (Eq¬t[Z̄_k(φ̄¬t_k + φ̄_k(Xt) + β)] / Eq¬t[Z̄_k(φ̄¬t_k + β)]).\n\nFor the special case of multinomial observations, we refer to Wang and Blunsom (2013).\n\nFigure 2: (left) merging two time segments; (right) splitting a time segment. Horizontal arrows are VB messages.\n\n2) Updating q(T): We perform a greedy search over the space of time-discretizations by making local stochastic updates to the current T. In every iteration, we first scan the current T to find a beneficial merge (Figure 2, left): go through the transition times in sequential or random order, merge each with the next time interval, compute the variational lower bound under this discretization, and accept if it results in an improvement. This eliminates unnecessary transition times, reducing fragmentation of the segmentation, and the complexity of the learnt model. Calculating the variational bound for the new discretization requires merging the probability vectors associated with the two time segments into a new one. One approach is to initialize this vector to some arbitrary quantity, run step 1 till the q's converge, and use the updated variational bound to accept or reject this proposal. Rather than taking this time-consuming approach, we found it adequate to set the new q to a convex combination of the old q's, each weighted by the length of its corresponding interval. 
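The length-weighted initialization of a merged segment's state distribution can be sketched as a two-line helper (our own naming):

```python
def merge_segment_posteriors(q1, len1, q2, len2):
    """Initialize the state distribution of a merged time segment as the
    convex combination of the two segments' distributions, each weighted
    by the length of its interval (the merge-proposal heuristic of step 2)."""
    total = len1 + len2
    return [(len1 * p1 + len2 * p2) / total for p1, p2 in zip(q1, q2)]

# A 3.0-long segment with q = [0.9, 0.1] merged with a 1.0-long one
# with q = [0.5, 0.5] gives approximately [0.8, 0.2].
q = merge_segment_posteriors([0.9, 0.1], 3.0, [0.5, 0.5], 1.0)
```

The weighted average keeps the new vector a valid probability distribution, so the variational bound of the merged discretization can be evaluated immediately, without re-running step 1 to convergence.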
In our experiments, we found that this performed comparably at a much lower computational cost.\nIf no merge is found, we then try to find a beneficial split. Again, we go through the time segments in some order, now splitting each interval into two. After each split, we compare the likelihood before and after the split, and accept (and return) if the improvement exceeds a threshold. Again, such a split requires computing probability vectors for the newly created segments. Here, we assign each new segment the same vector as the original segment (plus some noise to break symmetry). We then run one pass of step 1, updating the q's on either side of the new segment, and then updating the q's in the two new segments. We consider two interval-splitting schemes, bisection and random splitting.\nOverall, our approach is related to split-merge approaches for variational inference in nonparametric Bayesian models (Hughes et al., 2015); these too maintain and optimize point estimates of complex, combinatorial objects, instead of maintaining uncertainty over quantities like cluster assignments. In our real-world check-in application, we consider a situation where there is not just one MJP trajectory, but a number of trajectories corresponding to different users. In this situation, we take a stochastic variational Bayes approach, picking a random user and following the steps outlined earlier.\n3) Updating q(Ω): With a Gamma(a1, a2) prior over Ω, the posterior over Ω is also Gamma, and we could set Ω to the MAP value. We found this greedy approach to sometimes be unstable, and instead use a partial update, with the new Ω equal to the mean of the old value and the MAP value. 
Writing s for the total number of transition times in all m trajectories, this gives us Ωnew = (Ωold + (a1 + s)/(a2 + m))/2.\n4 Experiments\nWe present qualitative and quantitative experiments using synthetic and real datasets to demonstrate the accuracy and efficiency of our variational Bayes (VB) algorithm. We mostly focus on comparisons with the MCMC algorithms from Rao and Teh (2014) and Rao and Teh (2012).\nDatasets. We use a dataset of check-in sequences from 8967 FourSquare users in the year 2011, originally collected by Gao et al. (2012) for studying location-based social networks. Each check-in has a time stamp and a location (latitude and longitude), with users having 191 check-in records on average. We only consider check-ins inside a rectangle containing the United States and parts of Mexico and Canada (see Figure 3, left), and randomly select 200 such sequences for our experiments. We partition the space into a 40 × 40 grid, and define the observation distribution of each MJP state as a categorical distribution over the grid cells. See Pan et al. (2016) for more details on this application.\nWe also use two synthetic datasets in our experiments, with observations in a 5 × 5 grid. For the first dataset, we fix Ω = 20 and construct a transition matrix B for 5 states with B(i, i) = 0.8, B(i, 5) = 0.19, B(5, 5) = 0, and B(5, i) = 0.25 for i ∈ [1, 4]. By construction, these sequences can contain many short time intervals at state 5, and longer time intervals at other states.\n\nFigure 3: (left) check-ins of 500 users; (right-top) heatmap of emission matrices; (right-bottom) true and inferred trajectories: the y-values are perturbed for clarity.\n\nFigure 4: (left, middle) posterior distribution over states of two trajectories in the second synthetic dataset; (right) evolution of log p(T | Ω, X) in the VB algorithm for two sample sequences. 
We set the observation distribution of state i to have 0.2 probability on grid cells in the i-th row and 0 probability otherwise. For the second synthetic dataset, we use 10 states and draw both the transition probabilities of B and the observation probabilities from a Dirichlet(1) distribution. Given (Ω, B), we sample 50 sequences, each containing 100 evenly spaced observations.\nHyperparameters: For VB on the synthetic datasets, we place a Gamma(20, 2) prior on Ω, and Dirichlet(2) priors on the transition probabilities and the observation probabilities, while on the check-in data, a Gamma(6, 1), a Dirichlet(0.1) and a Dirichlet(0.01) are placed. For MCMC on the synthetic datasets, we place a Gamma(2, 0.2) and a Dirichlet(0.1) prior for the rate matrix, while on the check-in data, a Gamma(1, 1) and a Dirichlet(0.1) are placed.\nVisualization: We run VB on the first synthetic dataset for 200 iterations, after which we use the posterior expected counts of observations in each state to infer the output emission probabilities (see Figure 3, top-right). We then relabel the states under the posterior to best match the true states (our likelihood is invariant to state labels); Figure 3 (bottom-right) shows the true and MAP MJP trajectories of two sample sequences in the synthetic dataset. Our VB algorithm recovers the trajectories well, although it is possible to miss some short "bumps". MCMC also performs well in this case, although, as we will show, it is significantly more expensive.\nThe inferred posteriors over trajectories have more uncertainty for the second synthetic dataset. Figure 4 (left and middle) visualizes the posterior distributions of two hidden trajectories, with darker regions for higher probabilities. 
The ability to maintain posterior uncertainty about the trajectory is important in real-world applications, and is something that k-means-style approximate inference algorithms (Huggins et al., 2015) ignore.\n\nFigure 5: reconstruction error of MCMC and VB (using random and even splitting) for the (left) first and (right) second synthetic dataset. The random split scheme is in blue, the even split scheme is in red, the VB random split scheme with true omega is in orange, and MCMC is in black.\n\nFigure 6: Synthetic datasets 1 (top) and 2 (bottom): histogram of the number of transitions using VB with (left) random splitting; (middle) even splitting; (right) using MCMC.\n\nFigure 7: histogram of the number of transitions using (left) VB and (middle) MCMC; (right) transition times of 10 users using VB.\n\nInferred trajectories for real-world data. We run the VB algorithm on the check-in data using 50 states for 200 iterations. Modeling such data with MJPs will recover MJP states corresponding to cities or areas of dense population/high check-in activity. We investigate several aspects of the MJP trajectories inferred by the algorithm. Figure 4(right) shows the evolution of log p(T | Ω, X) (up to a constant factor) for two sample trajectories. This value is used to determine whether a merge or split is beneficial in our VB algorithm. It has an increasing trend for most sequences in the dataset, but can sometimes decrease as the trajectory discretization evolves. 
This is expected, since our stochastic algorithm maintains a pseudo-bound. Figure 6 shows similar results for the synthetic datasets.\nNormally, we expect a user to switch areas of check-in activity only a few times in a year. Indeed, Figure 7 (left) shows the histogram of the number of transition times across all trajectories, and the majority of trajectories have 3 or fewer transitions. We also plot the actual transition times of 10 random trajectories (right). In contrast, MCMC tends to produce more transitions, many of which are redundant. This is a side effect of uniformization in MCMC sampling, which requires a homogeneously dense Poisson-distributed trajectory discretization at every iteration.\nRunning time vs. reconstruction error. We measure the quality of the inferred posterior distributions of trajectories using a reconstruction task on the check-in data. We randomly select 100 test sequences, and randomly hold out half of the observations in each test sequence. The training data consists of the observations that are not held out, i.e., 100 full sequences and 100 half sequences. We run our VB algorithm on this training data for 200 iterations. After each iteration, we reconstruct the held-out observations as follows: given a held-out observation at time t on test sequence τ, using the maximum-likelihood grid cell to represent each state, we compute the expected grid distance between the true and predicted observations under the estimated posterior q(ut). The reconstruction error for τ is computed by averaging the grid distances over all held-out observations in τ. The overall reconstruction error is the average reconstruction error over all test sequences. Similarly, we run the MCMC algorithm on the training data for 1000 iterations, and compute the overall reconstruction error after every 10 iterations, using the last 300 iterations to approximate the posterior distribution of the MJP trajectories. 
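The per-observation error computation just described can be sketched as follows. This is our own sketch with hypothetical helper names; the paper does not specify the grid metric, so Manhattan distance is used here as a stand-in:

```python
def expected_grid_distance(q_t, state_cell, true_cell):
    """Expected grid distance for one held-out observation: each state k is
    represented by its maximum-likelihood grid cell state_cell[k], and the
    distance to the true observation's cell is averaged under the variational
    posterior q(u_t)."""
    err = 0.0
    for k, p in enumerate(q_t):
        (r1, c1), (r2, c2) = state_cell[k], true_cell
        err += p * (abs(r1 - r2) + abs(c1 - c2))   # Manhattan distance on the grid
    return err

# Two states, represented by cells (0, 0) and (2, 3); true observation in (2, 2)
e = expected_grid_distance([0.25, 0.75], [(0, 0), (2, 3)], (2, 2))
```

Averaging this quantity over the held-out observations of a sequence, and then over sequences, gives the overall reconstruction error.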
We also run an improved variant of the MCMC algorithm, where we use the generalized uniformization scheme of Rao and Teh (2012) with a different Ωi for each state. This allows coarser discretizations for some states and typically runs faster per iteration.\n\nFigure 8: (left) reconstruction error of VB and MCMC algorithms; (middle) reconstruction error using random and even splitting; (right) reconstruction error for more iterations.\n\nFigure 9: Posterior distribution over states of three trajectories in the check-in dataset.\n\nFigure 8(left) shows the evolution of reconstruction error during the algorithms. The error using VB plateaus much more quickly than under the MCMC algorithms. The error gap between MCMC and VB is because of slow mixing of the paths and parameters, as a result of the coupling between latent states and observations, as well as modeling approximations. Although the improved MCMC takes less time per iteration, it is not more effective for reconstruction in this experiment. Figure 5 shows similar results for the synthetic datasets. Figure 9 visualizes the posterior distributions of three hidden trajectories, with darker shades for higher probabilities.\nWe have chosen to split each time interval randomly in our VB algorithm. Another possibility is to simply split it evenly. Figure 8(middle) compares the reconstruction error of the two splitting schemes. 
Random splitting has lower error since it produces more successful splits; on the other hand, the running time is smaller with even splitting due to fewer transitions in the inferred trajectories. In Figure 8 (right), we resampled the training and testing sets and ran the experiment for longer; the error gap between VB and MCMC closes.
Related and future work: Posterior inference for MJPs has primarily been carried out via MCMC (Hobolth and Stone, 2009; Fearnhead and Sherlock, 2006; Bladt and Sørensen, 2005; Metzner et al., 2007). The state-of-the-art MCMC approaches are the schemes of Rao and Teh (2014, 2012), both based on uniformization. Other MCMC approaches center around particle MCMC (Andrieu et al., 2010), e.g. Hajiaghayi et al. (2014). There have also been a few deterministic approaches to posterior inference. The earliest variational approach is from Opper and Sanguinetti (2007), although they consider a different problem from ours, viz. structured systems of interacting MJPs (e.g. the population sizes of predator and prey species, or gene networks). They then use a mean-field posterior approximation where these processes are assumed independent. Our algorithm focuses on a single, simple MJP, and an interesting extension is to combine the two schemes for systems of coupled MJPs. Finally, a recent paper by Huggins et al. (2015) studies the MJP posterior using a small-variance asymptotic limit. This approach, which generalizes k-means-type algorithms to MJPs, however provides only point estimates of the MJP trajectory and parameters, and cannot represent posterior uncertainty. Additionally, it still involves coupling between the MJP parameters and trajectory, an issue we bypass with our collapsed algorithm.
There are a number of interesting extensions worth studying.
First is to consider more structured variational approximations (Wang and Blunsom, 2013) than the factorial approximations we considered here. Also of interest are extensions to more complex MJPs, with infinite state-spaces (Saeedi and Bouchard-Côté, 2011) or structured state-spaces (Opper and Sanguinetti, 2007). It is also interesting to look at different extensions of the schemes we proposed in this paper: different choices of split-merge proposals, and more complicated posterior approximations of the parameter Ω. Finally, it would be instructive to use other real-world datasets to compare our approach with more traditional MCMC approaches.

References

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B, 72(3):269–342.

Bladt, M. and Sørensen, M. (2005). Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society Series B, 67(3):395–410.

Boys, R. J., Wilkinson, D. J., and Kirkwood, T. B. L. (2008). Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18(2):125–135.

Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice Hall.

Fearnhead, P. and Sherlock, C. (2006). An exact Gibbs sampler for the Markov-modulated Poisson process. Journal of the Royal Statistical Society Series B, 68(5):767–784.

Gao, H., Tang, J., and Liu, H. (2012).
gSCorr: Modeling geo-social correlations for new check-ins on location-based social networks. In Proceedings of the 21st ACM Conference on Information and Knowledge Management. ACM.

Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry, 81(25):2340–2361.

Hajiaghayi, M., Kirkpatrick, B., Wang, L., and Bouchard-Côté, A. (2014). Efficient continuous-time Markov chain estimation. In International Conference on Machine Learning (ICML), volume 31, pages 638–646.

Hobolth, A. and Stone, E. (2009). Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Annals of Applied Statistics, 3(3):1204.

Huggins, J. H., Narasimhan, K., Saeedi, A., and Mansinghka, V. K. (2015). JUMP-means: Small-variance asymptotics for Markov jump processes. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 693–701.

Hughes, M. C., Stephenson, W. T., and Sudderth, E. B. (2015). Scalable adaptation of state complexity for nonparametric hidden Markov models. In NIPS 28, pages 1198–1206.

Jensen, A. (1953). Markoff chains as an aid in the study of Markoff processes. Skandinavisk Aktuarietidskrift, 36:87–91.

Metzner, P., Horenko, I., and Schütte, C. (2007). Generator estimation of Markov jump processes based on incomplete observations nonequidistant in time. Physical Review E, 76.

Opper, M. and Sanguinetti, G. (2007). Variational inference for Markov jump processes. In NIPS 20.

Pan, J., Rao, V., Agarwal, P., and Gelfand, A. (2016). Markov-modulated marked Poisson processes for check-in data. In International Conference on Machine Learning, pages 2244–2253.

Rao, V. and Teh, Y. W. (2014). Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research, 13.

Rao, V. A. and Teh, Y. W. (2012).
MCMC for continuous-time discrete-state systems. In Bartlett, P., Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Advances in Neural Information Processing Systems 25, pages 710–718.

Saeedi, A. and Bouchard-Côté, A. (2011). Priors over recurrent continuous time processes. In NIPS 24.

Wang, P. and Blunsom, P. (2013). Collapsed variational Bayesian inference for hidden Markov models. In AISTATS.

Xu, J. and Shelton, C. R. (2010). Intrusion detection using continuous time Bayesian networks. Journal of Artificial Intelligence Research, 39:745–774.