{"title": "Deep Poisson gamma dynamical systems", "book": "Advances in Neural Information Processing Systems", "page_first": 8442, "page_last": 8452, "abstract": "We develop deep Poisson-gamma dynamical systems (DPGDS) to model sequentially observed multivariate count data, improving previously proposed models by not only mining deep hierarchical latent structure from the data, but also capturing both first-order and long-range temporal dependencies. Using sophisticated but simple-to-implement data augmentation techniques, we derived closed-form Gibbs sampling update equations by first backward and upward propagating auxiliary latent counts, and then forward and downward sampling latent variables. Moreover, we develop stochastic gradient MCMC inference that is scalable to very long multivariate count time series. Experiments on both synthetic and a variety of real-world data demonstrate that the proposed model not only has excellent predictive performance, but also provides highly interpretable multilayer latent structure to represent hierarchical and temporal information propagation.", "full_text": "Deep Poisson gamma dynamical systems\n\nDandan Guo,\n\nHao Zhang\nNational Laboratory of Radar Signal Processing\n\nBo Chen\u2217,\n\nCollaborative Innovation Center of Information Sensing and Understanding\n\nXidian University, Xi\u2019an, China\n\ngdd_xidian@126.com, bchen@mail.xidian.edu.cn, zhanghao_xidian@163.com\n\nMingyuan Zhou\n\nMcCombs School of Business\n\nThe University of Texas at Austin\n\nAustin, TX 78712, USA\n\nmingyuan.zhou@mccombs.utexas.edu\n\nAbstract\n\nWe develop deep Poisson-gamma dynamical systems (DPGDS) to model sequen-\ntially observed multivariate count data, improving previously proposed models by\nnot only mining deep hierarchical latent structure from the data, but also capturing\nboth \ufb01rst-order and long-range temporal dependencies. 
Using sophisticated but simple-to-implement data augmentation techniques, we derive closed-form Gibbs sampling update equations by first backward and upward propagating auxiliary latent counts, and then forward and downward sampling latent variables. Moreover, we develop stochastic gradient MCMC inference that is scalable to very long multivariate count time series. Experiments on both synthetic and a variety of real-world data demonstrate that the proposed model not only has excellent predictive performance, but also provides highly interpretable multilayer latent structure to represent hierarchical and temporal information propagation.

1 Introduction

The need to model time-varying count vectors x_1, ..., x_T appears in a wide variety of settings, such as text analysis, international relation study, social interaction understanding, and natural language processing [1-9]. To model these count data, it is important to not only consider the sparsity of high-dimensional data and robustness to over-dispersed temporal patterns, but also capture complex dependencies both within and across time steps. In order to move beyond linear dynamical systems (LDS) [10] and their nonlinear generalizations [11] that often make the Gaussian assumption [12], the gamma process dynamic Poisson factor analysis (GP-DPFA) [5] factorizes the observed time-varying count vectors under the Poisson likelihood as $x_t \sim \mathrm{Pois}(\Phi\theta_t)$, and transmits temporal information smoothly by evolving the factor scores with a gamma Markov chain as $\theta_t \sim \mathrm{Gam}(\theta_{t-1}, \beta)$, which provides highly desired strong non-linearity. To further capture cross-factor temporal dependence, the Poisson-gamma dynamical system (PGDS) [7] further uses a transition matrix $\Pi$, with $\theta_t \sim \mathrm{Gam}(\Pi\theta_{t-1}, \beta)$. However, these shallow models may still have shortcomings in capturing long-range temporal dependencies [8]. 
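For intuition about the two gamma Markov chains just described, a minimal NumPy sketch (illustrative only; variable names and sizes are our own assumptions, not the papers' code) contrasting the factor-wise GP-DPFA chain $\theta_t \sim \mathrm{Gam}(\theta_{t-1}, \beta)$ with the PGDS chain $\theta_t \sim \mathrm{Gam}(\Pi\theta_{t-1}, \beta)$, whose transition matrix couples the factors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, beta = 4, 50, 1.0

# A column-stochastic transition matrix, as in PGDS (columns lie on the simplex).
Pi = rng.dirichlet(np.ones(K), size=K).T

# GP-DPFA-style chain: each factor evolves independently.
theta_indep = np.empty((T, K))
theta_indep[0] = rng.gamma(1.0, 1.0, size=K)
for t in range(1, T):
    # Gam(shape, rate=beta): mean theta_{t-1} / beta, so the chain drifts smoothly.
    theta_indep[t] = rng.gamma(theta_indep[t - 1] + 1e-12, 1.0 / beta)

# PGDS-style chain: Pi mixes the factors across consecutive time steps.
theta_coupled = np.empty((T, K))
theta_coupled[0] = rng.gamma(1.0, 1.0, size=K)
for t in range(1, T):
    theta_coupled[t] = rng.gamma(Pi @ theta_coupled[t - 1] + 1e-12, 1.0 / beta)
```

The small constant guards against a zero gamma shape once a factor has collapsed to zero.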
For example, if given $\theta_t$, then $\theta_{t+1}$ no longer depends on $\theta_{t-k}$ for all $k \geq 1$.

*Corresponding author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Deep probabilistic models are widely used to capture the relationships between latent variables across multiple stochastic layers [4, 8, 13-16]. For example, deep dynamic Poisson factor analysis (DDPFA) [8] utilizes recurrent neural networks (RNN) [3] to capture long-range temporal dependencies of the factor scores. The latent variables and RNN parameters, however, are separately inferred. Deep temporal sigmoid belief network (DTSBN) [4] is a deep dynamic generative model defined as a sequential stack of sigmoid belief networks (SBNs), whose hidden units are typically restricted to be binary. Although a deep structure is designed to describe complex long-range temporal dependencies, how the layers in DTSBN are related to each other lacks an intuitive interpretation, which is of paramount interest for a multilayer probabilistic model [15].

In this paper, we present deep Poisson gamma dynamical systems (DPGDS), a deep probabilistic dynamical model that takes advantage of a hierarchical structure to efficiently incorporate both between-layer and temporal dependencies, while providing rich interpretation. Moving beyond DTSBN with its binary hidden units, we build a deep dynamic directed network with gamma distributed nonnegative real hidden units, inferring a multilayer contextual representation of multivariate time-varying count vectors. Consequently, DPGDS can handle highly overdispersed counts, capturing the correlations between the visible/hidden features across layers and over time using the gamma belief network [15]. Combining the deep and temporal structures shown in Fig. 
1(a), DPGDS breaks the assumption that, given $\theta_t$, $\theta_{t+1}$ no longer depends on $\theta_{t-k}$ for $k \geq 1$, suggesting that it may better capture long-range temporal dependencies. As a result, the model allows more specific information, which is also more likely to exhibit fast temporal changes, to be transmitted through lower layers, while allowing more general information, which is more likely to evolve slowly over time, to be transmitted through higher layers. For example, as shown in Fig. 1(b), which is learned from GDELT2003 with DPGDS, when analyzing these international events, the factors at lower layers are more specific, discovering the relationships between different countries, whereas those at higher layers are more general, reflecting the conflicts between different areas consisting of several related countries, or the ones occurring simultaneously; moreover, the latent representation $\theta_t$ at a lower layer varies more intensely than that at a higher layer.

Distinct from DDPFA [8], which adopts a two-stage inference, the latent variables of DPGDS can be jointly trained with both a backward-upward-forward-downward (BUFD) Gibbs sampler and a sophisticated stochastic gradient MCMC (SGMCMC) algorithm that is scalable to very long multivariate time series [17-21]. Furthermore, the factors learned at each layer can refine the understanding and analysis of sequentially observed multivariate count data, which, to the best of our knowledge, may be very challenging for existing methods. Finally, based on a diverse range of real-world data sets, we show that DPGDS exhibits excellent predictive performance, inferring interpretable latent structure with well captured long-range temporal dependencies.

2 Deep Poisson gamma dynamical systems

Shown in Fig. 1(a) is the graphical representation of a three-hidden-layer DPGDS. 
Let us denote $\theta \sim \mathrm{Gam}(a, c)$ a gamma random variable with mean $a/c$ and variance $a/c^2$. Given a set of V-dimensional sequentially observed multivariate count vectors $x_1, \ldots, x_T$, represented as a $V \times T$ matrix $X$, the generative process of an L-hidden-layer DPGDS, from top to bottom, is expressed as

$$\theta_t^{(L)} \sim \mathrm{Gam}\big(\tau_0\Pi^{(L)}\theta_{t-1}^{(L)},\, \tau_0\big), \;\cdots,\; \theta_t^{(l)} \sim \mathrm{Gam}\big(\tau_0(\Phi^{(l+1)}\theta_t^{(l+1)} + \Pi^{(l)}\theta_{t-1}^{(l)}),\, \tau_0\big), \;\cdots,$$
$$\theta_t^{(1)} \sim \mathrm{Gam}\big(\tau_0(\Phi^{(2)}\theta_t^{(2)} + \Pi^{(1)}\theta_{t-1}^{(1)}),\, \tau_0\big), \quad x_t^{(1)} \sim \mathrm{Pois}\big(\delta_t^{(1)}\Phi^{(1)}\theta_t^{(1)}\big), \qquad (1)$$

where $\Phi^{(l)} \in \mathbb{R}_+^{K_{l-1} \times K_l}$ is the factor loading matrix at layer $l$, $\theta_t^{(l)} \in \mathbb{R}_+^{K_l}$ the hidden units of layer $l$ at time $t$, and $\Pi^{(l)} \in \mathbb{R}_+^{K_l \times K_l}$ a transition matrix of layer $l$ that captures cross-factor temporal dependencies. We denote $\delta_t^{(1)} \in \mathbb{R}_+$ as a scaling factor, reflecting the scale of the counts at time $t$; one may also set $\delta_t^{(1)} = \delta^{(1)}$ for $t = 1, \ldots, T$. We denote $\tau_0 \in \mathbb{R}_+$ as a scaling hyperparameter that controls the temporal variation of the hidden units. The multilayer time-varying hidden units $\theta_t^{(l)}$ are well suited for downstream analysis, as will be shown below.

DPGDS factorizes the count observation $x_t^{(1)}$ into the product of $\delta_t^{(1)}$, $\Phi^{(1)}$, and $\theta_t^{(1)}$ under the Poisson likelihood. It further factorizes the shape parameters of the gamma distributed $\theta_t^{(l)}$ of layer $l$ at time $t$ into the sum of $\Phi^{(l+1)}\theta_t^{(l+1)}$, capturing the dependence between different layers, and $\Pi^{(l)}\theta_{t-1}^{(l)}$, capturing the temporal dependence at the same layer. 
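To make the generative process in (1) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) that draws one sequence from a DPGDS; the $t = 1$ initialization is simplified to a unit-shape gamma draw rather than the weight-based prior of the full model, and all sizes are illustrative:

```python
import numpy as np

def sample_dpgds(Phi, Pi, T, tau0=1.0, delta=1.0, seed=1):
    """Draw one sequence from the DPGDS generative process in (1) (sketch).

    Phi[l] has shape (K_{l-1}, K_l) with columns on the simplex (Phi[0] maps
    layer 1 to the V-dimensional data layer); Pi[l] has shape (K_l, K_l).
    The t = 1 prior is simplified to Gam(tau0 * 1, tau0)."""
    rng = np.random.default_rng(seed)
    L = len(Pi)
    K = [p.shape[1] for p in Phi]
    theta = [np.zeros((T, k)) for k in K]
    X = np.zeros((T, Phi[0].shape[0]), dtype=np.int64)
    for t in range(T):
        for l in reversed(range(L)):          # top layer first
            shape = Pi[l] @ theta[l][t - 1] if t > 0 else np.ones(K[l])
            if l + 1 < L:                     # add the message from the layer above
                shape = shape + Phi[l + 1] @ theta[l + 1][t]
            theta[l][t] = rng.gamma(tau0 * shape + 1e-12, 1.0 / tau0)
        X[t] = rng.poisson(delta * Phi[0] @ theta[0][t])
    return X, theta

# Illustrative sizes: V = 6 "words", K1 = 4 and K2 = 3 hidden units.
rng = np.random.default_rng(0)
Phi = [rng.dirichlet(np.ones(6), size=4).T, rng.dirichlet(np.ones(4), size=3).T]
Pi = [rng.dirichlet(np.ones(4), size=4).T, rng.dirichlet(np.ones(3), size=3).T]
X, theta = sample_dpgds(Phi, Pi, T=20)
```

Note how each layer's gamma shape sums a top-down term and a temporal term, exactly as in (1).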
Figure 1: Graphical model and illustration for a three-hidden-layer deep Poisson gamma dynamical system (DPGDS). (a) The generative model; (b) Visualization of data and latent factors learned from GDELT2003, with the black, red, blue, and green lines denoting the observed data and the temporal trajectories of example latent factors at layers 1, 2, 3, respectively.

At the top layer, $\theta_t^{(L)}$ is only dependent on $\Pi^{(L)}\theta_{t-1}^{(L)}$, and at $t = 1$,

$$\theta_1^{(l)} \sim \mathrm{Gam}\big(\tau_0\Phi^{(l+1)}\theta_1^{(l+1)},\, \tau_0\big) \text{ for } l = 1, \ldots, L-1, \quad \theta_1^{(L)} \sim \mathrm{Gam}\big(\tau_0\nu^{(L)},\, \tau_0\big).$$

To complete the hierarchical model, we introduce $K_l$ factor weights $\nu^{(l)} = (\nu_1^{(l)}, \ldots, \nu_{K_l}^{(l)})$ in layer $l$ to model the strength of each factor, and for $l = 1, \ldots, L$, we let

$$\pi_k^{(l)} \sim \mathrm{Dir}\big(\nu_1^{(l)}\nu_k^{(l)}, \ldots, \nu_{k-1}^{(l)}\nu_k^{(l)}, \xi^{(l)}\nu_k^{(l)}, \nu_{k+1}^{(l)}\nu_k^{(l)}, \ldots, \nu_{K_l}^{(l)}\nu_k^{(l)}\big), \quad \nu_k^{(l)} \sim \mathrm{Gam}\big(\tfrac{\gamma_0}{K_l},\, \beta^{(l)}\big). \qquad (2)$$

Note that $\pi_k^{(l)}$ is the $k$th column of $\Pi^{(l)}$, and $\pi_{k_1 k_2}^{(l)}$ can be interpreted as the probability of transiting from topic $k_2$ of the previous time to topic $k_1$ of the current time at layer $l$. Finally, we place Dirichlet priors on the factor loadings and draw the other parameters from a noninformative gamma prior: $\phi_k^{(l)} = (\phi_{1k}^{(l)}, \ldots, \phi_{K_{l-1}k}^{(l)}) \sim \mathrm{Dir}(\eta^{(l)}, \ldots, \eta^{(l)})$, and $\delta^{(1)}, \xi^{(l)}, \beta^{(l)} \sim \mathrm{Gam}(\epsilon_0, \epsilon_0)$. Note that imposing Dirichlet distributions on the columns of $\Pi^{(l)}$ and $\Phi^{(l)}$ not only makes the latent representation more identifiable and 
interpretable, but also facilitates inference, as will be shown in the next section. Clearly, when $L = 1$, DPGDS reduces to PGDS [7]. In real-world applications, a binary observation can be linked to a latent count using the Bernoulli-Poisson link as $b = \mathbf{1}(n \geq 1)$, $n \sim \mathrm{Pois}(\lambda)$ [22]. A nonnegative-real-valued matrix can also be linked to a latent count matrix via a Poisson randomized gamma distribution as $x \sim \mathrm{Gam}(n, c)$, $n \sim \mathrm{Pois}(\lambda)$ [23].

Hierarchical structure: To interpret the hierarchical structure of (1), we notice that $\mathbb{E}\big[x_t^{(1)} \mid \theta_t^{(l)}, \{\Phi^{(p)}\}_{p=1}^{l}\big] = \big[\prod_{p=1}^{l}\Phi^{(p)}\big]\theta_t^{(l)}$ if the temporal structure is ignored. Thus it is straightforward to interpret the factors $\phi_k^{(l)}$ by projecting them to the bottom data layer as $\big[\prod_{p=1}^{l-1}\Phi^{(p)}\big]\phi_k^{(l)}$, which are often quite specific at the bottom layer and become increasingly more general when moving upwards, as will be shown below in Fig. 5(a).

Long-range temporal dependencies: Using the law of total expectations on (1), for a three-hidden-layer DPGDS shown in Fig. 1(a), we have

$$\mathbb{E}\big[x_t^{(1)} \mid \theta_{t-1}^{(1)}, \theta_{t-2}^{(2)}, \theta_{t-3}^{(3)}\big]/\delta_t^{(1)} = \Phi^{(1)}\Pi^{(1)}\theta_{t-1}^{(1)} + \Phi^{(1)}\Phi^{(2)}[\Pi^{(2)}]^2\theta_{t-2}^{(2)} + \Phi^{(1)}\Phi^{(2)}\big(\Pi^{(2)}\Phi^{(3)} + \Phi^{(3)}\Pi^{(3)}\big)[\Pi^{(3)}]^2\theta_{t-3}^{(3)}, \qquad (3)$$

which suggests that $\{\Pi^{(l)}\}_{l=1}^{L}$ play the role of transiting the latent representation across time and, different from most existing dynamic models, DPGDS can capture and transmit long-range temporal information (often general and slowly changing over time) through its higher hidden layers.

3 Scalable MCMC inference

In this paper, in each iteration, across layers and times, we first exploit a variety of data augmentation techniques for count data to 
“backward” and “upward” propagate auxiliary latent counts, with which

[Figure 1 is rendered here in the original layout; beyond the caption, its panels contain only the graphical-model diagram and the GDELT2003 temporal trajectories for country pairs such as ISR->PSE, USA->AFG, and USA->ISR.]

Figure 2: Graphical representation of the model and the data augmentation and marginalization based inference scheme. 
(a) An alternative representation of layer $l = 1$ using the relationships between the Poisson and multinomial distributions; (b) A negative binomial distribution based representation that marginalizes out the gamma from the Poisson distributions, corresponding to (4) for $t = T$; (c) An equivalent representation that introduces CRT distributed auxiliary variables, corresponding to (5); (d) An equivalent representation using P3, corresponding to (6); (e) An equivalent representation obtained by using P1, corresponding to (7); (f) A representation obtained by repeating the same augmentation-marginalization steps described in (a).

we then “downward” and “forward” sample latent variables, leading to a backward-upward-forward-downward (BUFD) Gibbs sampling algorithm.

3.1 Backward and upward propagation of latent counts

Different from PGDS, which has only backward propagation of latent counts, DPGDS has both backward and upward propagation due to its deep hierarchical structure. To derive closed-form Gibbs sampling update equations, we exploit three useful properties of count data, denoted P1, P2, and P3 [7, 24], respectively, as presented in the Appendix. Let us denote $x \sim \mathrm{NB}(r, p)$ the negative binomial distribution with probability mass function $P(x = k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)} p^k (1-p)^r$, where $k \in \{0, 1, \ldots\}$. First, we can augment each count $x_{vt}^{(1)}$ in (1) into the summation of $K_1$ latent counts that are smaller than or equal to it, as $x_{vt}^{(1)} = \sum_{k=1}^{K_1} A_{vkt}^{(1)}$, $A_{vkt}^{(1)} \sim \mathrm{Pois}(\delta_t^{(1)}\phi_{vk}^{(1)}\theta_{kt}^{(1)})$. Since $\sum_{v=1}^{V}\phi_{vk}^{(1)} = 1$ by construction, we also have $A_{\cdot kt}^{(1)} = \sum_{v=1}^{V} A_{vkt}^{(1)} \sim \mathrm{Pois}(\delta_t^{(1)}\theta_{kt}^{(1)})$, as shown in Fig. 2(a). We start with $\theta_{kT}^{(1)}$ at the last time point $T$, as none of the other time-step factors depend on it in their priors. Via P2, as shown in Fig. 
2(b), we can marginalize out $\theta_{kT}^{(1)}$ to obtain

$$A_{\cdot kT}^{(1)} \sim \mathrm{NB}\Big(\tau_0\Big[\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big],\; g\big(\zeta_T^{(1)}\big)\Big), \qquad (4)$$

where $\zeta_T^{(1)} = \ln\big(1 + \delta_T^{(1)}/\tau_0\big)$ and $g(\zeta) = 1 - \exp(-\zeta)$. In order to marginalize out $\theta_{T-1}^{(1)}$, as shown in Fig. 2(c), we introduce an auxiliary variable following the Chinese restaurant table (CRT) distribution [24] as

$$x_{kT}^{(2)} \sim \mathrm{CRT}\Big(A_{\cdot kT}^{(1)},\; \tau_0\Big[\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big]\Big). \qquad (5)$$

As shown in Fig. 2(d), we re-express the joint distribution over $A_{\cdot kT}^{(1)}$ and $x_{kT}^{(2)}$ according to P3 as

$$A_{\cdot kT}^{(1)} \sim \mathrm{SumLog}\big(x_{kT}^{(2)}, g(\zeta_T^{(1)})\big), \quad x_{kT}^{(2)} \sim \mathrm{Pois}\Big(\zeta_T^{(1)}\tau_0\Big[\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big]\Big), \qquad (6)$$

where the sum-logarithmic (SumLog) distribution is defined as in Zhou and Carin [24]. Via P1, as in Fig. 2(e), the Poisson random variable $x_{kT}^{(2)}$ in (6) can be augmented as $x_{kT}^{(2)} = x_{kT}^{(2,1)} + x_{kT}^{(2,2)}$, where

$$x_{kT}^{(2,1)} \sim \mathrm{Pois}\Big(\zeta_T^{(1)}\tau_0\sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big), \quad x_{kT}^{(2,2)} \sim \mathrm{Pois}\Big(\zeta_T^{(1)}\tau_0\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)}\Big). \qquad (7)$$

It is obvious that, due to the deep dynamic structure, the count at layer two, $x_{kT}^{(2)}$, is divided into two parts: one from time $T - 1$ at layer one, and the other from time $T$ at layer two. 
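The CRT draw in (5) is simple to implement by sequential table assignment; below is a small self-contained sketch (our own illustrative code, not from the paper), with an empirical check of the CRT mean $\mathbb{E}[\ell] = \sum_{i=0}^{n-1} r/(r+i)$:

```python
import numpy as np

def crt(n, r, rng):
    """Draw l ~ CRT(n, r): the number of occupied tables after seating n
    customers, where customer i (0-based) opens a new table w.p. r / (r + i)."""
    i = np.arange(n)
    return int(np.sum(rng.random(n) < r / (r + i)))

rng = np.random.default_rng(2)
n, r = 30, 5.0
draws = np.array([crt(n, r, rng) for _ in range(20000)])
analytic_mean = float(np.sum(r / (r + np.arange(n))))
```

The first customer always opens a table, so for $n \geq 1$ the draw lies in $\{1, \ldots, n\}$, matching the thinning of $A_{\cdot kT}^{(1)}$ into the smaller count $x_{kT}^{(2)}$.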
Furthermore, $\zeta_T^{(1)}\tau_0$ is the scaling factor at layer two, which is propagated from the one at layer one, $\delta_T^{(1)}$. Repeating the process all the way back to $t = 1$, and from $l = 1$ up to $l = L$, we are able to marginalize out all gamma latent variables $\{\Theta\}_{t=1,l=1}^{T,L}$ and provide closed-form conditional posteriors for all of them.

3.2 Backward-upward-forward-downward Gibbs sampling

Sampling auxiliary counts: This step is about the “backward” and “upward” pass. Let us denote $Z_{\cdot kt}^{(l)} = \sum_{k_l=1}^{K_l} Z_{k_l kt}^{(l)}$, $Z_{\cdot k,T+1}^{(l)} = 0$, and $x_{vt}^{(1,1)} = x_{vt}^{(1)}$. Working backward for $t = T, \ldots, 2$ and upward for $l = 1, \ldots, L$, we draw

$$\big(A_{k1t}^{(l)}, \ldots, A_{kK_lt}^{(l)}\big) \sim \mathrm{Multi}\Big(x_{kt}^{(l,l)};\; \frac{\phi_{k1}^{(l)}\theta_{1t}^{(l)}}{\sum_{k_l=1}^{K_l}\phi_{kk_l}^{(l)}\theta_{k_lt}^{(l)}}, \ldots, \frac{\phi_{kK_l}^{(l)}\theta_{K_lt}^{(l)}}{\sum_{k_l=1}^{K_l}\phi_{kk_l}^{(l)}\theta_{k_lt}^{(l)}}\Big), \qquad (8)$$

$$x_{kt}^{(l+1)} \sim \mathrm{CRT}\Big(A_{\cdot kt}^{(l)} + Z_{\cdot k,t+1}^{(l)},\; \tau_0\Big[\sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)} + \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}\Big]\Big). \qquad (9)$$

Note that, via the deep structure, the latent counts $x_{kt}^{(l+1)}$ are influenced by the effects of both time $t - 1$ at layer $l$ and time $t$ at layer $l + 1$. With $p_1 := \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}$ and $p_2 := \sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)}$, we can sample the latent counts at layers $l$ and $l + 1$ by

$$\big(x_{kt}^{(l+1,l)}, x_{kt}^{(l+1,l+1)}\big) \sim \mathrm{Multi}\big(x_{kt}^{(l+1)},\; p_1/(p_1 + p_2),\; p_2/(p_1 + p_2)\big), \qquad (10)$$

and then draw

$$\big(Z_{k1t}^{(l)}, \ldots, Z_{kK_lt}^{(l)}\big) \sim \mathrm{Multi}\Big(x_{kt}^{(l+1,l)};\; \frac{\pi_{k1}^{(l)}\theta_{1,t-1}^{(l)}}{\sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}}, \ldots, \frac{\pi_{kK_l}^{(l)}\theta_{K_l,t-1}^{(l)}}{\sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}}\Big). \qquad (11)$$

Sampling hidden units $\theta_t^{(l)}$: Given the augmented latent count variables, working forward for $t = 1, \ldots, T$ and downward for $l = L, \ldots, 1$, and calculating $\zeta_t^{(l)}$, we can sample

$$\theta_{kt}^{(l)} \sim \mathrm{Gam}\Big(A_{\cdot kt}^{(l)} + Z_{\cdot k,t+1}^{(l)} + \tau_0\Big[\sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)} + \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}\Big],\; \tau_0\big[1 + \zeta_t^{(l-1)} + \zeta_{t+1}^{(l)}\big]\Big), \qquad (12)$$

where $\zeta_t^{(0)} = \delta_t^{(1)}/\tau_0$ and $\zeta_t^{(l)} = \ln\big(1 + \zeta_t^{(l-1)} + \zeta_{t+1}^{(l)}\big)$. Note that if $\delta_t^{(1)} = \delta^{(1)}$ for $t = 1, \ldots, T$, then we may let $\zeta^{(l)} = -W_{-1}\big(-\exp(-1 - \zeta^{(l-1)})\big) - 1 - \zeta^{(l-1)}$, where $W_{-1}$ is the lower real branch of the Lambert W function [7, 25]. 
From (12), we can see that the conditional posterior of $\theta_t^{(l)}$ is parameterized not only by $\Phi^{(l+1)}\theta_t^{(l+1)}$ and $\Pi^{(l)}\theta_{t-1}^{(l)}$, which represent the information from layer $l + 1$ (downward) and time $t - 1$ (forward), respectively, but also by $A_{\cdot kt}^{(l)}$ and $Z_{\cdot k,t+1}^{(l)}$, which record the messages from layer $l - 1$ (upward) in (8) and time $t + 1$ (backward) in (11), respectively. We describe the BUFD Gibbs sampling algorithm for DPGDS in Algorithm 1 and provide more details in the Appendix.

Figure 3: Results on the bouncing ball data set. (a) Shown in the first to third columns are the top fifteen latent factors learned by a three-hidden-layer DPGDS at layers 1, 2, and 3, respectively; (b) The average prediction errors as a function of the sequence length for various algorithms.

3.3 Stochastic gradient MCMC inference

Although the proposed BUFD Gibbs sampling algorithm for DPGDS has closed-form update equations, it requires processing all time-varying vectors at each iteration and hence has limited scalability [26]. To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in Cong et al. [27] and Zhang et al. [26], which can be used to sample simplex-constrained global parameters [28] in a mini-batch based manner. It improves sampling efficiency via the use of the Fisher information matrix (FIM) [29], with adaptive step-sizes for the latent factors and transition matrices of different layers. 
More specifically, for $\pi_k^{(l)}$, column $k$ of the transition matrix $\Pi^{(l)}$ of layer $l$, the sampling can be efficiently realized as

$$\big(\pi_k^{(l)}\big)_{n+1} = \Big[\big(\pi_k^{(l)}\big)_n + \frac{\varepsilon_n}{M_k^{(l)}}\Big[\big(\rho\tilde{z}_{:k\cdot}^{(l)} + \eta_{:k}^{(l)}\big) - \big(\rho\tilde{z}_{\cdot k\cdot}^{(l)} + \eta_{\cdot k}^{(l)}\big)\big(\pi_k^{(l)}\big)_n\Big] + \mathcal{N}\Big(0,\; \frac{2\varepsilon_n}{M_k^{(l)}}\big[\mathrm{diag}\big(\pi_k^{(l)}\big)_n - \big(\pi_k^{(l)}\big)_n\big(\pi_k^{(l)}\big)_n^T\big]\Big)\Big]_\angle, \qquad (13)$$

where $M_k^{(l)}$ is calculated using the estimated FIM, both $\tilde{z}_{:k\cdot}^{(l)}$ and $\tilde{z}_{\cdot k\cdot}^{(l)}$ come from the augmented latent counts $Z^{(l)}$, $[\cdot]_\angle$ denotes the simplex constraint, and $\eta_{:k}^{(l)}$ denotes the prior of $\pi_k^{(l)}$. The update of $\Phi^{(l)}$ is the same as in Cong et al. [27], and all the other global parameters are sampled using SGNHT [20]. We provide the details of the SGMCMC for DPGDS in Algorithm 2 in the Appendix.

4 Experiments

In this section, we present experimental results on a synthetic dataset and five real-world datasets. For a fair comparison, we consider PGDS [7], GP-DPFA [5], DTSBN [4], and GPDM [11], the last of which can be considered a dynamic generalization of the Gaussian process latent variable model of Lawrence [30], using the code provided by the authors. Note that, as shown in Schein et al. [7] and Gan et al. [4], PGDS and DTSBN are state-of-the-art count time series modeling algorithms that outperform a wide variety of previously proposed ones, such as LDS [12] and DRFM [31]. The hyperparameter settings of PGDS, GP-DPFA, GPDM, TSBN, and DTSBN are the same as in their original papers [4, 5, 7, 11]. For DPGDS, we set $\tau_0 = 1$, $\gamma_0 = 100$, $\eta_0 = 0.1$, and $\epsilon_0 = 0.1$. 
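Returning to the TLASGR update in (13): as a rough illustration (our own sketch, with clip-and-renormalize standing in for the exact simplex operator $[\cdot]_\angle$ of Cong et al. [27], and all argument values illustrative), one noisy preconditioned step on a column $\pi_k^{(l)}$ can be written as:

```python
import numpy as np

def tlasgr_pi_step(pi_k, z_col, eta_col, rho, eps_n, M_k, rng):
    """One TLASGR-MCMC style update of a simplex-constrained column, after (13).

    z_col:   augmented counts from Z(l) for this column;  eta_col: Dirichlet prior;
    rho:     mini-batch reweighting factor;  M_k: FIM-based step-size scale."""
    grad = (rho * z_col + eta_col) - (rho * z_col.sum() + eta_col.sum()) * pi_k
    cov = (2.0 * eps_n / M_k) * (np.diag(pi_k) - np.outer(pi_k, pi_k))
    noise = rng.multivariate_normal(np.zeros(pi_k.size), cov)
    pi_new = pi_k + (eps_n / M_k) * grad + noise
    pi_new = np.clip(pi_new, 1e-12, None)   # crude stand-in for [.]_angle
    return pi_new / pi_new.sum()

rng = np.random.default_rng(3)
pi_new = tlasgr_pi_step(np.full(5, 0.2), np.array([3.0, 0.0, 5.0, 1.0, 2.0]),
                        np.full(5, 0.1), rho=10.0, eps_n=0.01, M_k=50.0, rng=rng)
```

The injected Gaussian noise uses the multinomial-style covariance from (13), which vanishes along the direction that leaves the simplex.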
We use $[K^{(1)}, K^{(2)}, K^{(3)}] = [200, 100, 50]$ for both DPGDS and DTSBN, and $K = 200$ for PGDS, GP-DPFA, GPDM, and TSBN. For PGDS, GP-DPFA, GPDM, and DPGDS, we run 2000 Gibbs sampling iterations as burn-in and collect 3000 samples for evaluation. We also use SGMCMC to infer DPGDS, with 5000 collection samples after 5000 burn-in steps, and use 10000 SGMCMC iterations for both TSBN and DTSBN to evaluate their performance.

4.1 Synthetic dataset

Following the literature [1, 4], we consider sequences of different lengths, including T = 10, 50, 100, 200, 300, 400, 500, and 600, and generate 50 synthetic bouncing ball videos for training and 30 for testing. Each video frame is a binary-valued image of size 30 x 30, describing the locations of three balls within the image. Both TSBN and DTSBN model it with the Bernoulli likelihood, while both PGDS and DPGDS use the Bernoulli-Poisson link [22].

As shown in Fig. 3(b), the average prediction errors of all algorithms decrease as the training sequence length increases. 
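For intuition about the Bernoulli-Poisson link $b = \mathbf{1}(n \geq 1)$, $n \sim \mathrm{Pois}(\lambda)$ used for the binary frames, here is a small illustrative sketch (our own, not the authors' code), including the conditional draw of the latent count, which is zero when $b = 0$ and zero-truncated Poisson when $b = 1$:

```python
import numpy as np

rng = np.random.default_rng(4)

def bernoulli_poisson(lam, rng):
    """Forward draw: b = 1(n >= 1) with n ~ Pois(lam)."""
    n = rng.poisson(lam)
    return int(n >= 1), n

def sample_latent_count(b, lam, rng):
    """Conditional draw of n given b: zero if b = 0, else truncated Poisson
    Pois_+(lam), here via simple rejection (fine for moderate lam)."""
    if b == 0:
        return 0
    while True:
        n = rng.poisson(lam)
        if n >= 1:
            return n

# P(b = 1) should be close to 1 - exp(-lam)
p_hat = np.mean([bernoulli_poisson(2.0, rng)[0] for _ in range(20000)])
```

This link lets the Poisson-factorization machinery of (1) operate on binary observations through the latent counts $n$.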
A higher-order TSBN, TSBN-4, performs much better than the first-order TSBN does, suggesting that using high-order messages can help TSBN better pass useful information. As discussed above, since a deep structure provides a natural way to propagate high-order information for prediction, it is not surprising to find that both DTSBN and DPGDS, which are both multilayer models, exhibit superior performance. Moreover, it is clear that the proposed DPGDS consistently outperforms DTSBN under all settings.

Table 1: Top-M results on real-world text data

Model     Top-M   GDELT (T=365)   ICEWS (T=365)   SOTU (T=225)   DBLP (T=14)    NIPS (T=17)
GPDPFA    MP      0.611+-0.001    0.607+-0.002    0.379+-0.002   0.435+-0.009   0.843+-0.005
          MR      0.145+-0.002    0.235+-0.005    0.369+-0.002   0.254+-0.005   0.050+-0.001
          PP      0.447+-0.014    0.465+-0.008    0.617+-0.013   0.581+-0.011   0.807+-0.006
PGDS      MP      0.679+-0.001    0.658+-0.001    0.375+-0.002   0.419+-0.004   0.864+-0.004
          MR      0.150+-0.001    0.245+-0.005    0.373+-0.002   0.252+-0.004   0.050+-0.001
          PP      0.420+-0.017    0.455+-0.008    0.612+-0.018   0.566+-0.008   0.802+-0.020
GPDM      MP      0.520+-0.001    0.530+-0.002    0.274+-0.001   0.388+-0.004   0.355+-0.008
          MR      0.141+-0.001    0.234+-0.001    0.261+-0.002   0.146+-0.005   0.050+-0.001
          PP      0.362+-0.021    0.185+-0.017    0.587+-0.016   0.509+-0.008   0.384+-0.028
TSBN      MP      0.594+-0.007    0.471+-0.001    0.360+-0.001   0.403+-0.012   0.788+-0.005
          MR      0.124+-0.001    0.158+-0.001    0.275+-0.001   0.194+-0.001   0.050+-0.001
          PP      0.418+-0.019    0.445+-0.031    0.611+-0.001   0.527+-0.003   0.692+-0.017
DTSBN-2   MP      0.439+-0.001    0.475+-0.002    0.370+-0.004   0.407+-0.003   0.756+-0.001
          MR      0.134+-0.001    0.208+-0.001    0.361+-0.001   0.248+-0.007   0.050+-0.001
          PP      0.391+-0.001    0.446+-0.001    0.587+-0.027   0.522+-0.005   0.737+-0.004
DTSBN-3   MP      0.411+-0.001    0.431+-0.001    0.450+-0.008   0.390+-0.002   0.774+-0.002
          MR      0.141+-0.001    0.189+-0.001    0.274+-0.001   0.252+-0.004   0.050+-0.001
          PP      0.367+-0.011    0.451+-0.026    0.548+-0.013   0.510+-0.006   0.715+-0.009
DPGDS-2   MP      0.688+-0.002    0.659+-0.001    0.379+-0.002   0.430+-0.009   0.867+-0.008
          MR      0.149+-0.001    0.242+-0.007    0.373+-0.001   0.254+-0.005   0.050+-0.001
          PP      0.443+-0.025    0.473+-0.012    0.622+-0.014   0.582+-0.007   0.814+-0.035
DPGDS-3   MP      0.689+-0.002    0.660+-0.001    0.380+-0.001   0.431+-0.012   0.887+-0.002
          MR      0.150+-0.001    0.244+-0.003    0.374+-0.002   0.255+-0.004   0.050+-0.001
          PP      0.456+-0.015    0.478+-0.024    0.628+-0.021   0.600+-0.001   0.839+-0.007

Another advantage of DPGDS is that its inferred deep latent structure often has meaningful interpretation. As shown in Fig. 3(a), for the bouncing ball data, the inferred factors at layer one represent points or pixels, those at layer two cover larger spatially contiguous regions, some of which exhibit the shape of a single bouncing ball, and those at layer three are able to capture multiple bouncing balls. In addition, we show in Appendix B the one-step prediction frames of different models.

4.2 Real-world datasets

Besides the binary-valued synthetic bouncing ball dataset, we quantitatively and qualitatively evaluate all algorithms on the following real-world datasets used in Schein et al. [7]. 
The State-of-the-Union (SOTU) dataset consists of the text of the annual SOTU speech transcripts from 1790 to 2014. The Global Database of Events, Language, and Tone (GDELT) and the Integrated Crisis Early Warning System (ICEWS) are both international relations datasets extracted from news corpora. Note that ICEWS consists of undirected pairs of countries, while GDELT consists of directed pairs. The NIPS corpus contains the text of every NIPS conference paper from 1987 to 2003. The DBLP corpus is a database of computer science research papers. Each of these datasets is summarized as a V x T count matrix, as shown in Tab. 1. Unless specified otherwise, we choose the top 1000 most frequently used terms to form the vocabulary, i.e., we set V = 1000 for all real-data experiments.

4.2.1 Quantitative comparison

For a fair and comprehensive comparison, we calculate the precision and recall at top-M [4, 5, 31, 32], which are calculated as the fraction of the top-M words that match the true ranking of the words and that appear in the top-M ranking, respectively, with M = 50. We also use the Mean Precision (MP) and Mean Recall (MR) over all the years appearing in the training set to evaluate the different models. As another criterion, the Predictive Precision (PP) measures the predictive precision for the final year, for which all the observations are held out. Similar to previous methods [4, 5], for each corpus, the entire data of the last year is held out, and for the documents in the previous years we randomly partition the words of each document into 80% / 20% in each trial; we conduct five random trials and report the sample mean and standard deviation. Note that to apply GPDM, we have used the Anscombe transform [33] to preprocess the count data to mitigate the mismatch between the data and the model assumption. The results on all five datasets are summarized in Tab. 
1, which clearly shows that the proposed DPGDS achieves the best performance on most of the evaluation criteria, and again that a deep model often improves its performance as its number of layers increases. To add more empirical study on scalability, we have also tested the efficiency of our model on a GDELT dataset (from 2001 to 2005, with a temporal granularity of 24 hrs and a total of 1825 time points), which is small enough that we can still run DPGDS-Gibbs and GPDM. Fig. 4 shows how the various algorithms progress over time, evaluated with MP. It takes about 1000 s for DTSBN and DPGDS-SGMCMC to converge, 3.5 hrs for DPGDS-Gibbs, and 5 hrs for GPDM. Clearly, our DPGDS-SGMCMC is scalable and outperforms both DTSBN and GPDM. We also present in Appendix C the results of DPGDS-SGMCMC on a very long time series, on which it becomes too expensive to run a batch learning algorithm.

Figure 4: MP as a function of time for GDELT.

4.2.2 Exploratory data analysis

Compared to previously proposed dynamical systems, the proposed DPGDS, whose inferred latent structure is simple to visualize, provides much richer interpretation. More specifically, we may not only exhibit the content of each factor (topic), but also explore both the hierarchical relationships between topics at different layers and the temporal relationships between topics at the same layer. Based on the results inferred on ICEWS 2001-2003 via a three-hidden-layer DPGDS with layer sizes 200-100-50, we show in Fig. 5 how some example topics are hierarchically and temporally related to each other, and how their corresponding latent representations evolve over time.

In Fig. 5(a), we select two large-weighted topics at the top hidden layer and move down the network to include any lower-layer topics that are connected to them with sufficiently large weights.
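This top-down selection is a simple walk over the inferred loading matrices. Below is a minimal sketch under the assumption that `Phis[l]` holds the loadings connecting layer l+1 (columns) to layer l (rows); the function name and the relative-weight threshold are our own illustrative choices, not quantities fixed in the paper:

```python
import numpy as np

def descendants(Phis, top_topics, frac=0.1):
    """Starting from `top_topics` at the top layer, collect at every lower
    layer the topics connected through weights that are at least `frac`
    of the largest weight in the corresponding column of Phis[l]."""
    selected = {len(Phis): set(top_topics)}      # layer index -> topic ids
    for l in range(len(Phis) - 1, -1, -1):       # move down the network
        children = set()
        for k in selected[l + 1]:
            col = Phis[l][:, k]                  # weights from topic k down to layer l
            children.update(np.flatnonzero(col >= frac * col.max()).tolist())
        selected[l] = children
    return selected
```

Applied to the inferred DPGDS loadings, this returns, per layer, the set of topics that form the subtree rooted at the chosen top-layer topics.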
For each topic, we list all its terms whose values are larger than 1% of the largest element of the topic. It is interesting to note that topic 2 at layer three is connected to three topics at layer two, which are characterized mainly by the interactions of Israel (ISR)-Palestinian Territory (PSE), Iraq (IRQ)-USA-Iran (IRN), and North Korea (PRK)-South Korea (KOR)-USA-China (CHN)-Japan (JPN), respectively. The activation strength of one of these three interactions, known to be dominant in general during 2001-2003, can be attributed not only to a large activation of topic 2 at layer three, but also to a large activation of some other topic at the same layer (layer two) at the previous time. For example, topic 41 of layer two on “ISR-PSE, IND-PAK, RUS-UKR, GEO-RUS, AFG-PAK, SYR-USA, MNE-SRB” could be associated with the activation of topic 46 of layer two on “IND-PAK, RUS-TUR, ISR-PSE, BLR-RUS” at the previous time; and topic 99 of layer two on “PRK-KOR, JPN-USA, CHN-USA, CHN-KOR, CHN-JPN, USA-RUS” could be associated with the activation of topic 63 of layer two on “IRN-USA, CHN-USA, AUS-CHN, CHN-KOR” at the previous time.

Another instructive observation is that topic 140 of layer one on “IRQ-USA, IRQ-GBR, IRN-IRQ, IRQ-KWT, AUS-IRQ” is related not only in hierarchy to topic 34 of the higher layer on “IRQ-USA, IRQ-GBR, GBR-USA, IRQ-KWT, IRN-IRQ, SYR-USA,” but also in time to topic 166 of the same layer on “ESP-USA, ESP-GBR, FRA-GBR, POR-USA,” which are interactions between the member states of the North Atlantic Treaty Organization (NATO).
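The 1% term-selection rule described above amounts to thresholding each topic's term weights against its largest entry. A small sketch, where the helper name and toy vocabulary are illustrative rather than taken from the paper:

```python
import numpy as np

def topic_terms(phi_k, vocab, frac=0.01):
    """List the terms of topic `phi_k` whose weights are larger than `frac`
    of the topic's largest weight, ordered from largest weight to smallest."""
    keep = np.flatnonzero(phi_k > frac * phi_k.max())
    keep = keep[np.argsort(phi_k[keep])[::-1]]   # sort surviving terms by weight
    return [vocab[i] for i in keep]

# Toy example: only the two dominant terms survive the 1% threshold
print(topic_terms(np.array([0.50, 0.004, 0.20, 0.0001]),
                  ["IRQ-USA", "rare1", "ISR-PSE", "rare2"]))
# -> ['IRQ-USA', 'ISR-PSE']
```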
Based on the transitions from topic 13 on “PRK-KOR” to both topic 140 on “IRQ-USA” and topic 77 on “ISR-PSE,” we find that the ongoing Iraq war and Israeli-Palestinian relations regained attention after the six-party talks [7].

To gain insight into the benefits attributable to the deep structure, Fig. 5(b) shows how the latent representations of several representative topics evolve over time. It is clear that, relative to the temporal factor trajectories at the bottom layer, which are specific to the bilateral interactions between two countries, those from higher layers vary more smoothly; the corresponding higher-layer topics capture the multilateral interactions between multiple closely related countries. Similar phenomena have also been demonstrated in Fig. 1(b) on GDELT 2003. Moreover, we find that a spike of the temporal trajectory of topic 166 (NATO) appears right before one of topic 140 (Iraq war), matching the above description of Fig. 5(a). Also, topic 14 of layer three and its descendants, including topic 23 of layer two and topic 48 of layer one, are mainly about a breakthrough between RUS and Azerbaijan (AZE), coinciding with Putin’s visit in January 2001. Additional example results for the topics and their hierarchical and temporal relationships, inferred by DPGDS on different datasets, are provided in the Appendix.

Figure 5: Topics and their temporal trajectories inferred by a three-hidden-layer DPGDS from the ICEWS 2001-2003 dataset (best viewed in color). (a) Some example topics that are hierarchically or temporally related; (b) The temporal trajectories of some inferred latent topics.

Figure 6: Learned transition structure on ICEWS 2001-2003 from the same DPGDS depicted in Fig. 5.
Shown in (a)-(c) are the transition matrices for layers 1, 2 and 3, respectively, with a darker color indicating a larger transition weight (between 0 and 1).

In Fig. 6, we also present a subset of the transition matrix Π(l) in each layer, corresponding to the top ten topics, some of which have been displayed in Fig. 5(b). The transition matrix Π(l) captures the cross-topic temporal dependence at layer l. From Fig. 6, besides the temporal transitions between topics at the same layer, we can also see that as the layer index l increases, the transition matrix Π(l) more closely approaches a diagonal matrix, meaning that the factors become more likely to transition to themselves. This matches the characteristic of DPGDS that topics at higher layers cover longer-range temporal dependencies and contain more general information, as shown in Fig. 5(a). With both hierarchical connections between layers and dynamic transitions at the same layer, DPGDS, distinct from the shallow PGDS, is equipped with a larger capacity to model diverse temporal patterns with the help of its deep structure.

5 Conclusions

We propose deep Poisson gamma dynamical systems (DPGDS) that take advantage of a probabilistic deep hierarchical structure to efficiently capture both across-layer and temporal dependencies. The inferred latent structure provides rich interpretation for both hierarchical and temporal information propagation. For Bayesian inference, we develop both a Backward-Upward–Forward-Downward Gibbs sampler and a stochastic gradient MCMC (SGMCMC) algorithm that is scalable to long multivariate count/binary time series.
Experimental results on a variety of datasets show that DPGDS not only exhibits excellent predictive performance, but also provides highly interpretable latent structure.

Acknowledgements

D. Guo, B. Chen, and H. Zhang acknowledge the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105) and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation.

References

[1] I. Sutskever and G. E.
Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in AISTATS, 2007.

[2] C. Wang, D. Blei, and D. Heckerman, “Continuous time dynamic topic models,” in UAI, 2008, pp. 579–586.

[3] M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in NIPS, 2013, pp. 190–198.

[4] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, “Deep temporal sigmoid belief networks for sequence modeling,” in NIPS, 2015, pp. 2467–2475.

[5] A. Acharya, J. Ghosh, and M. Zhou, “Nonparametric Bayesian factor analysis for dynamic count matrices,” in AISTATS, 2015.

[6] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei, “Dynamic Poisson factorization,” in RecSys, 2015, pp. 155–162.

[7] A. Schein, M. Zhou, and H. Wallach, “Poisson–gamma dynamical systems,” in NIPS, 2016.

[8] C. Y. Gong and W. Huang, “Deep dynamic Poisson factorization model,” in NIPS, 2017.

[9] S. A. Hosseini, K. Alizadeh, A. Khodadadi, A. Arabzadeh, M. Farajtabar, H. Zha, and H. R. Rabiee, “Recurrent Poisson factorization for temporal recommendation,” in KDD, 2017, pp. 847–855.

[10] Z. Ghahramani and S. T. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in NIPS, 1999, pp. 431–437.

[11] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in NIPS, 2006.

[12] R. E. Kalman, “Mathematical description of linear dynamical systems,” Journal of the Society for Industrial and Applied Mathematics, Series A: Control, vol. 1, no. 2, pp. 152–192, 1963.

[13] R. M. Neal, “Connectionist learning of belief networks,” Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992.

[14] R. Ranganath, L. Tang, L. Charlin, and D. M.
Blei, “Deep exponential families,” in AISTATS, 2014, pp. 762–771.

[15] M. Zhou, Y. Cong, and B. Chen, “The Poisson gamma belief network,” in NIPS, 2015, pp. 3043–3051.

[16] R. Henao, Z. Gan, J. T. Lu, and L. Carin, “Deep Poisson factor modeling,” in NIPS, 2015, pp. 2800–2808.

[17] Y. A. Ma, T. Chen, and E. B. Fox, “A complete recipe for stochastic gradient MCMC,” in NIPS, 2015, pp. 2917–2925.

[18] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in ICML, 2011, pp. 681–688.

[19] S. Patterson and Y. W. Teh, “Stochastic gradient Riemannian Langevin dynamics on the probability simplex,” in NIPS, 2013, pp. 3102–3110.

[20] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, “Bayesian sampling using stochastic gradient thermostats,” in NIPS, 2014, pp. 3203–3211.

[21] C. Li, C. Chen, D. Carlson, and L. Carin, “Preconditioned stochastic gradient Langevin dynamics for deep neural networks,” in AAAI, 2016, pp. 1788–1794.

[22] M. Zhou, “Infinite edge partition models for overlapping community detection and link prediction,” in AISTATS, 2015, pp. 1135–1143.

[23] M. Zhou, Y. Cong, and B. Chen, “Augmentable gamma belief networks,” Journal of Machine Learning Research, vol. 17, no. 163, pp. 1–44, 2016.

[24] M. Zhou and L. Carin, “Negative binomial process count and mixture modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 307–320, 2015.

[25] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, “On the Lambert W function,” Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.

[26] H. Zhang, B. Chen, D. Guo, and M.
Zhou, “WHAI: Weibull hybrid autoencoding inference for deep topic modeling,” in ICLR, 2018.

[27] Y. Cong, B. Chen, H. Liu, and M. Zhou, “Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC,” in ICML, 2017.

[28] Y. Cong, B. Chen, and M. Zhou, “Fast simulation of hyperplane-truncated multivariate normal distributions,” Bayesian Analysis, vol. 12, no. 4, pp. 1017–1037, 2017.

[29] M. A. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 2, pp. 123–214, 2011.

[30] N. D. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research, vol. 6, pp. 1783–1816, 2005.

[31] S. Han, L. Du, E. Salazar, and L. Carin, “Dynamic rank factor model for text streams,” in NIPS, 2014, pp. 2663–2671.

[32] P. Gopalan, F. J. R. Ruiz, R. Ranganath, and D. M. Blei, “Bayesian nonparametric Poisson factorization for recommendation systems,” in AISTATS, 2014.

[33] F. J. Anscombe, “The transformation of Poisson, binomial and negative-binomial data,” Biometrika, vol. 35, no. 3/4, pp. 246–254, 1948.

[34] D. B. Dunson and A. H. Herring, “Bayesian latent variable models for mixed discrete outcomes,” Biostatistics, vol. 6, no. 1, pp. 11–25, 2005.

[35] M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, “Beta-negative binomial process and Poisson factor analysis,” in AISTATS, 2012, pp. 1462–1471.

[36] M. Zhou, “Nonparametric Bayesian negative binomial factor analysis,” Bayesian Analysis, vol. 13, no. 4, pp.
1061–1089, 2018.