{"title": "Spatial Normalized Gamma Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1554, "page_last": 1562, "abstract": "Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally Dirichlet process distributed. They are used in Bayesian nonparametric models when the usual exchangebility assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each located at a point in a space such that neighboring DPs are more dependent. We describe Markov chain Monte Carlo inference, involving the typical Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence speeds on a synthetic dataset and demonstrate an application of the model to topic modeling through time.", "full_text": "Spatial Normalized Gamma Processes\n\nVinayak Rao\n\nYee Whye Teh\n\nGatsby Computational Neuroscience Unit\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nvrao@gatsby.ucl.ac.uk\n\nUniversity College London\n\nywteh@gatsby.ucl.ac.uk\n\nAbstract\n\nDependent Dirichlet processes (DPs) are dependent sets of random measures, each\nbeing marginally DP distributed. They are used in Bayesian nonparametric models\nwhen the usual exchangeability assumption does not hold. We propose a simple\nand general framework to construct dependent DPs by marginalizing and nor-\nmalizing a single gamma process over an extended space. The result is a set of\nDPs, each associated with a point in a space such that neighbouring DPs are more\ndependent. We describe Markov chain Monte Carlo inference involving Gibbs\nsampling and three different Metropolis-Hastings proposals to speed up conver-\ngence. 
We report an empirical study of convergence on a synthetic dataset and\ndemonstrate an application of the model to topic modeling through time.\n\n1\n\nIntroduction\n\nBayesian nonparametrics have recently garnered much attention in the machine learning and statis-\ntics communities, due to their elegant treatment of in\ufb01nite dimensional objects like functions and\ndensities, as well as their ability to sidestep the need for model selection. The Dirichlet process (DP)\n[1] is a cornerstone of Bayesian nonparametrics, and forms a basic building block for a wide variety\nof extensions and generalizations, including the in\ufb01nite hidden Markov model [2], the hierarchical\nDP [3], the in\ufb01nite relational model [4], adaptor grammars [5], to name just a few.\nBy itself, the DP is a model that assumes that data are in\ufb01nitely exchangeable, i.e. the ordering of\ndata items does not matter. This assumption is false in many situations and there has been a concerted\neffort to extend the DP to more structured data. Much of this effort has focussed on de\ufb01ning priors on\ncollections of dependent random probability measures. [6] expounded on the notion of dependent\nDPs, that is, a dependent set of random measures that are all marginally DPs. The property of\nbeing marginally DP here is both due to a desire to construct mathematically elegant solutions, and\nalso due to the fact that the DP and its implications as a statistical model, e.g. on the behaviour\nof induced clusterings of data or asymptotic consistency, are well-understood. In this paper, we\npropose a simple and general framework for the construction of dependent DPs on arbitrary spaces.\nThe idea is based on the fact that just as Dirichlet distributions can be generated by drawing a set\nof independent gamma variables and normalizing, the DP can be constructed by drawing a sample\nfrom a gamma process (\u0393P) and normalizing (i.e. it is an example of a normalized random measure\n[7, 8]). 
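As a concrete finite-dimensional illustration of this gamma-normalization view (the function below is ours, not from the paper): normalizing K independent Gamma(α_k, 1) draws yields a Dirichlet(α_1, ..., α_K) sample, exactly as normalizing a gamma process yields a DP.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_via_gammas(alphas, rng):
    """Draw a Dirichlet sample by normalizing independent Gamma(alpha_k, 1) variables."""
    g = rng.gamma(shape=np.asarray(alphas, dtype=float), scale=1.0)
    return g / g.sum()

p = dirichlet_via_gammas([2.0, 3.0, 5.0], rng)  # a probability vector summing to 1
```

The empirical mean of repeated draws approaches α/α_total, the Dirichlet mean, which is one quick sanity check of the construction.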
A ΓP is an example of a completely random measure [9]: it has the property that the random masses it assigns to disjoint subsets are independent. Furthermore, the restriction of a ΓP to a subset is itself a ΓP. This implies the following easy construction of a set of dependent DPs: define a ΓP over an extended space, associate each DP with a different region of the space, and define each DP by normalizing the restriction of the ΓP to the associated region. This produces a set of dependent DPs, with the amount of overlap among the regions controlling the amount of dependence. We call this model a spatial normalized gamma process (SNΓP). More generally, our construction can be extended to normalizing restrictions of any completely random measure, and we call the resulting dependent random measures spatial normalized random measures (SNRMs).

In Section 2 we briefly describe the ΓP. Then we describe our construction of the SNΓP in Section 3. We describe inference procedures based on Gibbs and Metropolis-Hastings sampling in Section 4 and report experimental results in Section 5. We conclude by discussing limitations and possible extensions of the model as well as related work in Section 6.

2 Gamma Processes

We briefly describe the gamma process (ΓP) here. A good high-level introduction can be found in [10]. Let (Θ, Ω) be a measure space on which we would like to define a ΓP. Like the DP, realizations of the ΓP are atomic measures with random weighted point masses. We can visualize the point masses θ ∈ Θ and their corresponding weights w > 0 as points in a product space Θ ⊗ [0,∞). Consider a Poisson process over this product space with mean measure

    μ(dθ dw) = α(dθ) w^{-1} e^{-w} dw.    (1)

Here α is a measure on the space (Θ, Ω) and is called the base measure of the ΓP. A sample from this Poisson process will yield an infinite set of atoms {θ_i, w_i}_{i=1}^∞, since ∫_{Θ⊗[0,∞)} μ(dθ dw) = ∞. A sample from the ΓP is then defined as

    G = Σ_{i=1}^∞ w_i δ_{θ_i} ∼ ΓP(α).    (2)

It can be shown that the total mass G(S) = Σ_{i=1}^∞ w_i 1(θ_i ∈ S) of any measurable subset S ⊂ Θ is simply gamma distributed with shape parameter α(S), hence the name gamma process. Dividing G by G(Θ), we get a normalized random measure, i.e. a random probability measure. Specifically, we get a sample from the Dirichlet process DP(α):

    D = G/G(Θ) ∼ DP(α).    (3)

Here we used an atypical parameterization of the DP in terms of the base measure α. The usual (equivalent) parameters of the DP are: strength parameter α(Θ) and base distribution α/α(Θ). Further, the DP is independent of the normalization: D ⊥⊥ G(Θ).
The gamma process is an example of a completely random measure [9]. This means that for mutually disjoint measurable subsets S_1, . . . , S_n ∈ Ω the random variables {G(S_1), . . . , G(S_n)} are mutually independent. Two straightforward consequences will be of importance in the rest of this paper. Firstly, if S ∈ Ω then the restriction G′(dθ) = G(dθ ∩ S) onto S is a ΓP with base measure α′(dθ) = α(dθ ∩ S). 
Secondly, if Θ = Θ_1 ⊗ Θ_2 is a two-dimensional space, then the projection G′′(dθ_1) = ∫_{Θ_2} G(dθ_1 dθ_2) onto Θ_1 is also a ΓP, with base measure α′′(dθ_1) = ∫_{Θ_2} α(dθ_1 dθ_2).

3 Spatial Normalized Gamma Processes

In this section we describe our proposal for constructing dependent DPs. Let (Θ, Ω) be a probability space and T an index space. We wish to construct a set of dependent random measures over (Θ, Ω), one D_t for each t ∈ T, such that each D_t is marginally DP. Our approach is to define a gamma process G over an extended space and let each D_t be a normalized restriction/projection of G. Because restrictions and projections of gamma processes are also gamma processes, each D_t will be DP distributed.
To this end, let Y be an auxiliary space and for each t ∈ T let Y_t ⊂ Y be a measurable set. For any measure μ over Θ ⊗ Y define the restricted projection μ_t by

    μ_t(dθ) = ∫_{Y_t} μ(dθ dy) = μ(dθ ⊗ Y_t).    (4)

Note that μ_t is a measure over Θ for each t ∈ T. Now let α be a base measure over the product space Θ ⊗ Y and consider a gamma process

    G ∼ ΓP(α)    (5)

over Θ ⊗ Y. Since restrictions and projections of ΓPs are ΓPs as well, G_t will be a ΓP over Θ with base measure α_t:

    G_t(dθ) = ∫_{Y_t} G(dθ dy) ∼ ΓP(α_t).    (6)

Now normalizing,

    D_t = G_t / G_t(Θ) ∼ DP(α_t).    (7)

We call the resulting set of dependent DPs {D_t}_{t∈T} spatial normalized gamma processes (SNΓPs). If the index space is continuous, {D_t}_{t∈T} can equivalently be thought of as a measure-valued stochastic process. 
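This construction rests on the Section 2 facts that a ΓP assigns independent gamma masses to disjoint sets and that normalizing yields a DP. A minimal finite-dimensional sketch of those facts, under the simplifying assumption of a uniform base measure on Θ = [0, 1] split into K bins (the binning scheme and names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_process_bins(gamma_total, K, rng):
    """K-bin approximation of a gamma process: each bin gets an independent
    Gamma(alpha(bin), 1) mass, where alpha(bin) = gamma_total / K."""
    return rng.gamma(shape=gamma_total / K, scale=1.0, size=K)

masses = gamma_process_bins(gamma_total=5.0, K=1000, rng=rng)
total = masses.sum()   # one draw of G(Theta); exactly Gamma(5, 1) distributed
D = masses / total     # normalizing gives a finite Dirichlet, the K-bin analogue of DP
```

Because a sum of independent gammas with common scale is gamma with summed shapes, the total mass is exactly Gamma(α(Θ), 1) here, mirroring the fact that G(S) ∼ Gamma(α(S)) for the true ΓP.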
The amount of dependence between D_s and D_t for s, t ∈ T is related to the amount of overlap between Y_s and Y_t. Generally, the subsets Y_t are defined so that the closer s and t are in T, the more overlap Y_s and Y_t have, and as a result the more dependent D_s and D_t are.

3.1 Examples

We give two examples of SNΓPs, both with index set T = R interpreted as the time line. Generalizations to higher-dimensional Euclidean spaces R^n are straightforward. Let H be a base distribution over Θ and γ > 0 be a concentration parameter.
The first example uses Y = R as well, with the subsets being Y_t = [t − L, t + L] for some fixed window length L > 0. The base measure is α(dθ dy) = γH(dθ) dy/2L. In this case the measure-valued stochastic process {D_t}_{t∈R} is stationary. The base measure α_t works out to be:

    α_t(dθ) = ∫_{t−L}^{t+L} γH(dθ) dy/2L = γH(dθ),    (8)

so that each D_t ∼ DP(γH) with concentration parameter γ and base distribution H. We can interpret this SNΓP as follows. Each atom in the overall ΓP G has a time-stamp y and a time-span of [y − L, y + L], so that it appears only in the DPs D_t within the window t ∈ [y − L, y + L]. As a result, two DPs D_s and D_t will share more atoms the closer s and t are to each other, and no atoms if |s − t| > 2L. Further, the dependence between D_s and D_t depends only on |s − t|, decreasing with increasing |s − t|, with independence if |s − t| > 2L.
The second example generalizes the first one by allowing different atoms to have different window lengths. Each atom now has a time-stamp y and a window length l, so that it appears in DPs in the window [y − l, y + l]. Our auxiliary space is thus Y = R ⊗ [0,∞), with Y_t = {(y, l) : |y − t| ≤ l} (see Figure 1). Let β(dl) be a distribution over window lengths in [0,∞). 
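The atom-sharing mechanism of the first example can be sketched directly (the toy time-stamps and helper function are illustrative, not from the paper): each atom's window [y − L, y + L] determines which DPs it appears in, and two DPs share no atoms once their times are more than 2L apart.

```python
import numpy as np

rng = np.random.default_rng(2)

L = 0.5                                     # fixed half-window length
stamps = rng.uniform(-2.0, 2.0, size=200)   # atom time-stamps y

def active_atoms(t, stamps, L):
    """Indices of atoms whose window [y - L, y + L] covers time t."""
    return np.flatnonzero(np.abs(stamps - t) <= L)

# D_0 and D_0.4 share the atoms whose windows cover both times...
shared = np.intersect1d(active_atoms(0.0, stamps, L), active_atoms(0.4, stamps, L))
# ...while D_0 and D_1.5 (|s - t| = 1.5 > 2L = 1.0) can share none.
```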
We use the base measure α(dθ dy dl) = γH(dθ) dy β(dl)/2l. The restricted projection is then

    α_t(dθ) = ∫_{|y−t|≤l} γH(dθ) dy β(dl)/2l = γH(dθ) ∫_0^∞ β(dl) ∫_{t−l}^{t+l} dy/2l = γH(dθ),    (9)

so that each D_t is again simply DP(γH). Now D_s and D_t will always be dependent, with the amount of dependence decreasing as |s − t| increases.

3.2 Interpretation as Mixtures of DPs

Even though the SNΓP as described above defines an uncountably infinite number of DPs, in practice we will only have observations at a finite number of times, say t_1, . . . , t_m. We define R as the smallest collection of disjoint regions of Y such that each Y_{t_j} is a union of subsets in R. Thus R = {∩_{j=1}^m S_j : S_j = Y_{t_j} or S_j = Y\Y_{t_j}, with at least one S_j = Y_{t_j} and ∩_{j=1}^m S_j ≠ ∅}. For 1 ≤ j ≤ m let R_j be the collection of regions in R contained in Y_{t_j}, so that ∪_{R∈R_j} R = Y_{t_j}. For each R ∈ R define

    G_R(dθ) = G(dθ ⊗ R).    (10)

We see that each G_R is a ΓP with base measure α_R(dθ) = α(dθ ⊗ R). Normalizing, D_R = G_R/G_R(Θ) ∼ DP(α_R), with D_R ⊥⊥ D_{R′} for distinct R, R′ ∈ R. Now

    D_{t_j}(dθ) = Σ_{R∈R_j} [G_R(Θ) / Σ_{R′∈R_j} G_{R′}(Θ)] D_R(dθ),    (11)

Figure 1: The extended space Y ⊗ L over which the overall ΓP is defined in the second example. Not shown is the Θ-space over which the DPs are defined. Also not shown is the fourth dimension W needed to define the Poisson process used to construct the ΓP. t_1, t_2, t_3 ∈ Y are three times at which observations are present. 
The subset Y_{t_j} corresponding to each t_j is the triangular area touching t_j. The regions in R are the six areas formed by various intersections of the triangular areas.

so each D_{t_j} is a mixture in which each component D_R is drawn independently from a DP. Further, the mixing proportions are Dirichlet distributed and independent of the components, by virtue of each G_R(Θ) being gamma distributed and independent of D_R. Thus we have the following equivalent construction for a SNΓP:

    D_R ∼ DP(α_R)                          for R ∈ R
    g_R ∼ Gamma(α_R(Θ))                    for R ∈ R
    π_{jR} = g_R / Σ_{R′∈R_j} g_{R′}       for R ∈ R_j
    D_{t_j} = Σ_{R∈R_j} π_{jR} D_R                        (12)

Note that the DPs in this construction are all defined only over Θ, and references to the auxiliary space Y and the base measure α are only used to define the individual base measures α_R and the shape parameters of the g_R's. Figure 1 shows the regions for the second example corresponding to observations at three times.
The mixture of DPs construction is related to the hierarchical Dirichlet process defined in [11] (not the one defined by Teh et al [3]). The difference is that the parameters of the prior over the mixing proportions exactly match the concentration parameters of the individual DPs. A consequence of this is that each mixture D_{t_j} is conveniently also a DP.

4 Inference in the SNΓP

The mixture of DPs interpretation of the SNΓP makes sampling from the model, and consequently inference via Markov chain Monte Carlo sampling, easy. In what follows, we describe both Gibbs sampling and Metropolis-Hastings based updates for a hierarchical model in which the dependent DPs act as prior distributions over a collection of infinite mixture models. Formally, our observations now lie in a measurable space (X, Σ) equipped with a set of probability measures F_θ parametrized by θ ∈ Θ. 
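The equivalent construction (12) is straightforward to simulate under a stick-breaking truncation of each region DP. The sketch below makes the paper's assumptions concrete; the truncation level, region labels and helper names are our choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_dp_sticks(alpha_mass, n_atoms, rng):
    """Truncated stick-breaking weights of a DP with concentration alpha_mass."""
    betas = rng.beta(1.0, alpha_mass, size=n_atoms)
    sticks = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return sticks / sticks.sum()            # renormalize the truncation

# Regions R with base-measure masses alpha_R(Theta); time t_j sees regions R_j.
alpha_R = {"A": 1.0, "B": 2.0, "C": 0.5}
R_j = ["A", "B"]

g = {R: rng.gamma(a) for R, a in alpha_R.items()}        # g_R ~ Gamma(alpha_R(Theta))
pi = {R: g[R] / sum(g[r] for r in R_j) for R in R_j}     # Dirichlet mixing weights
D_R = {R: sample_dp_sticks(alpha_R[R], 50, rng) for R in R_j}  # independent DPs

# D_{t_j} = sum_R pi_{jR} D_R: a probability measure (and, marginally, a DP).
D_tj = np.concatenate([pi[R] * D_R[R] for R in R_j])
```

The key structural point of (12) survives the truncation: the weights π_{jR} are Dirichlet distributed and independent of the component DPs.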
Observation i at time t_j is denoted x_{ji}, lies in region r_{ji}, and is drawn from the mixture component parametrized by θ_{ji}. Thus, augmenting (12), we have

    r_{ji} ∼ Mult({π_{jR} : R ∈ R_j})        θ_{ji} ∼ D_{r_{ji}}        x_{ji} ∼ F_{θ_{ji}}    (13)

where r_{ji} = R with probability π_{jR} for each R ∈ R_j. In words, we first pick a region r_{ji} from the set R_j, then a mixture component θ_{ji}, followed by drawing x_{ji} from the mixture distribution.

4.1 Gibbs Sampling

We derive a Gibbs sampler for the SNΓP where the region DPs D_R are integrated out and replaced by Chinese restaurants. Let c_{ji} denote the index of the cluster in D_{r_{ji}} to which observation x_{ji} is assigned. We also assume that the base distribution H is conjugate to the mixture distributions F_θ, so that the cluster parameters are integrated out as well. The Gibbs sampler iteratively resamples the remaining latent variables: the r_{ji}'s, c_{ji}'s and g_R's. In the following, let m_{jRc} be the number of observations from time t_j assigned to cluster c in the DP D_R in region R, and let f^{¬ji}_{Rc}(x_{ji}) be the density of observation x_{ji} conditioned on the other variables currently assigned to cluster c in D_R, with the cluster parameters integrated out. We denote marginal counts with dots; for example, m_{·Rc} is the number of observations (over all times) assigned to cluster c in region R. 
The superscript ¬ji means that observation x_{ji} is excluded.
r_{ji} and c_{ji} are resampled together; their conditional joint probability given the other variables is:

    p(r_{ji} = R, c_{ji} = c | others) ∝ [g_R / Σ_{r∈R_j} g_r] · [m^{¬ji}_{·Rc} / (m^{¬ji}_{·R·} + α_R(Θ))] · f^{¬ji}_{Rc}(x_{ji})    (14)

where R ∈ R_j and c denotes the index of an existing cluster in region R. The probability of x_{ji} joining a new cluster in region R is

    p(r_{ji} = R, c_{ji} = c_new | others) ∝ [g_R / Σ_{r∈R_j} g_r] · [α_R(Θ) / (m^{¬ji}_{·R·} + α_R(Θ))] · f_{Rc_new}(x_{ji}).    (15)

The updates of the g_R's are more complicated, as they are coupled and not of standard form:

    p({g_R}_{R∈R} | others) ∝ [Π_{R∈R} g_R^{α_R(Θ)+m_{·R·}−1} e^{−g_R}] Π_j (Σ_{R∈R_j} g_R)^{−m_{j··}}    (16)

To sample the g_R's we introduce auxiliary variables {A_j} to simplify the rightmost term above. In particular, using the gamma identity

    (Σ_{R∈R_j} g_R)^{−m_{j··}} = [1/Γ(m_{j··})] ∫_0^∞ A_j^{m_{j··}−1} e^{−A_j Σ_{R∈R_j} g_R} dA_j    (17)

we have that (16) is the marginal over {g_R}_{R∈R} of the distribution:

    q({g_R}_{R∈R}, {A_j}) ∝ Π_{R∈R} g_R^{α_R(Θ)+m_{·R·}−1} e^{−g_R} Π_j A_j^{m_{j··}−1} e^{−A_j Σ_{R∈R_j} g_R}    (18)

Now we can Gibbs sample the g_R's and A_j's:

    g_R | others ∼ Gamma(α_R(Θ) + m_{·R·}, 1 + Σ_{j∈J_R} A_j)    (19)
    A_j | others ∼ Gamma(m_{j··}, Σ_{R∈R_j} g_R)    (20)

Here J_R is the collection of indices j such that R ∈ R_j, and the second parameters in (19) and (20) are rate parameters, as can be read off from (18).

4.2 Metropolis-Hastings Proposals

To improve convergence and 
mixing of the Markov chain, we introduce three Metropolis-Hastings (MH) proposals in addition to the Gibbs sampling updates described above. These propose non-incremental changes in the assignment of observations to clusters and regions, allowing the Markov chain to move to modes that are hard to reach using Gibbs sampling alone.
The first proposal (Algorithm 1) proceeds like the split-merge proposal of [12]. It either splits an existing cluster in a region into two new clusters in the same region, or merges two existing clusters in a region into a single cluster. To improve the acceptance probability, we use 5 rounds of restricted Gibbs sampling [12].
The second proposal (Algorithm 2) seeks to move a chosen cluster from one region to another. The new region is chosen from among those neighbouring the current one (for example, in Figure 1 the neighbours are the four regions diagonally adjacent to the current one). To improve the acceptance probability we also resample the g_R's associated with the current and proposed regions. 
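The weight-resampling step shared by these proposals is the auxiliary-variable update (19)-(20). One sweep can be sketched as follows; the toy counts and the dictionary layout are illustrative, not from the paper, and note that NumPy's gamma sampler is parameterized by scale, so we pass 1/rate:

```python
import numpy as np

rng = np.random.default_rng(4)

def sweep(g, alpha, m_region, m_time, regions_of_time, rng):
    """One auxiliary-variable Gibbs sweep:
       A_j | . ~ Gamma(m_{j..}, rate = sum_{R in R_j} g_R)          (20)
       g_R | . ~ Gamma(alpha_R + m_{.R.}, rate = 1 + sum_{j in J_R} A_j)  (19)"""
    A = {}
    for j, regions in regions_of_time.items():
        rate = sum(g[R] for R in regions)
        A[j] = rng.gamma(m_time[j], 1.0 / rate)
    for R in g:
        rate = 1.0 + sum(A[j] for j, regs in regions_of_time.items() if R in regs)
        g[R] = rng.gamma(alpha[R] + m_region[R], 1.0 / rate)
    return g, A

alpha = {"A": 1.0, "B": 2.0}          # alpha_R(Theta), illustrative
m_region = {"A": 7, "B": 3}           # m_{.R.}: observations per region
m_time = {0: 6, 1: 4}                 # m_{j..}: observations per time
regions_of_time = {0: ["A"], 1: ["A", "B"]}   # R_j for each time j
g = {R: 1.0 for R in alpha}
g, A = sweep(g, alpha, m_region, m_time, regions_of_time, rng)
```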
The move can be invalid if the cluster contains an observation from a time point not associated with the new region; in this case the move is simply rejected.
The third proposal (Algorithm 3) seeks to combine into one step what would take two steps under the previous two proposals: splitting a cluster and moving it to a new region (or the reverse: moving a cluster into a new region and merging it with a cluster therein).

Algorithm 1 Split and Merge in the Same Region (MH1)
1: Let S0 be the current state of the Markov chain.
2: Pick a region R with probability proportional to m_{·R·} and two distinct observations in R.
3: Construct a launch state S′ by creating two new clusters, each containing one of the two observations, and running restricted Gibbs sampling.
4: if the two observations belong to the same cluster in S0 then
5:     Propose a split: run one last round of restricted Gibbs sampling to reach the proposed state S1.
6: else
7:     Propose a merge: the proposed state S1 is the (unique) state merging the two clusters.
8: end if
9: Accept the proposed state S1 with probability min(1, [p(S1) q(S′ → S0)] / [p(S0) q(S′ → S1)]), where p(S) is the posterior probability of state S and q(S′ → S) is the probability of proposing state S from the launch state S′.

Algorithm 2 Move (MH2)
1: Pick a cluster c in region R0 with probability proportional to m_{·R0c}.
2: Pick a region R1 neighbouring R0 and propose moving c to R1.
3: Propose new weights g_{R0}, g_{R1} by sampling both from (19).
4: Accept or reject the move.

Algorithm 3 Split/Merge Move (MH3)
1: Pick a region R0, a cluster c contained in R0, and a neighbouring region R1 with probability proportional to the number of observations in c that cannot be assigned to a cluster in R1.
2: if c contains observations that can be moved to R1 then
3:     Propose assigning these observations to a new cluster in R1.
4: else
5:     Pick a cluster from those in R1 and propose merging it into c.
6: end if
7: Propose new weights g_{R0}, g_{R1} by sampling from (19).
8: Accept or reject the proposal.

5 Experiments

Synthetic data    In the first of our experiments, we artificially generated 60 data points at each of 5 times by sampling from a mixture of 10 Gaussians. Each component was assigned a timespan, ranging from a single time to the entire range of five times. We modelled this data as a collection of five DP mixtures of Gaussians, with a SNΓP prior over the five dependent DPs. We used the set-up described in the second example. To encourage clusters to be shared across times (i.e. to avoid similar clusters with non-overlapping timespans), we chose the distribution over window lengths β to give larger probabilities to larger timespans. Even in this simple model, Gibbs sampling alone usually did not converge to a good optimum, remaining stuck around local maxima. Figure 2 shows the evolution of the log-likelihood for 5 different samplers: plain Gibbs sampling, Gibbs sampling augmented with each of MH proposals 1, 2 and 3, and finally a sampler that interleaved all three MH proposals with Gibbs sampling. Not surprisingly, the complete sampler converged fastest, with Gibbs sampling with MH proposal 2 (Gibbs+MH2) performing nearly as well. Gibbs+MH1 seemed to converge no faster than plain Gibbs sampling, with Gibbs+MH3 giving performance somewhere in between. The fact that Gibbs+MH2 performs so well can be explained by the easy clustering structure of the problem, so that exploring region assignments of clusters, rather than cluster assignments of observations, was the main challenge faced by the sampler (note its high acceptance rate in Figure 4).
To demonstrate how the additional MH proposals help mixing, we examined how the cluster assignment of observations varied over iterations. 
At each iteration, we construct a 600 by 600 binary matrix, with element (i, j) being 1 if observations i and j are assigned to the same cluster. In Figure 3, we plot the average L1 difference between matrices at different iteration lags. Somewhat counterintuitively, Gibbs+MH1 does much better than Gibbs sampling with all MH proposals.

Figure 2: Log-likelihoods (the coloured lines are ordered at iteration 80 like the legend).

Figure 3: Dissimilarity in clustering structure vs lag (the coloured lines are ordered like the legend).

Figure 4: Acceptance rates of the MH proposals for Gibbs+MH1+MH2+MH3 after burn-in (percentages).

    Proposal         Synthetic    NIPS
    MH-Proposal 1    0.51         0.6621
    MH-Proposal 2    11.7         0.6548
    MH-Proposal 3    0.22         0.0249

Figure 5: Evolution of the timespan of a cluster. From top to bottom: Gibbs+MH1+MH2+MH3, Gibbs+MH2 (pink), Gibbs+MH1 and Gibbs+MH3 (black), and Gibbs (magenta).

This is because the latter is simultaneously exploring the region assignment of clusters as well. In Gibbs+MH1, clusters split and merge frequently since they stay in the same regions, causing the cluster matrix to vary rapidly. In Gibbs+MH1+MH2+MH3, after a split the new clusters often move into separate regions, so it takes longer before they can merge again. Nonetheless, this demonstrates the importance of split/merge proposals like MH1 and MH3; [12] studied this in greater detail. We next examined how well the proposals explore the region assignment of clusters. In particular, at each step of the Markov chain, we picked the cluster with mean closest to the mean of one of the true Gaussian mixture components, and tracked how its timespan evolved. 
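The co-clustering diagnostic described above can be sketched directly (the toy assignment trace below is ours): build each iteration's binary same-cluster matrix, then average the entrywise L1 difference between matrices a fixed lag apart.

```python
import numpy as np

def cocluster_matrix(assignments):
    """Binary matrix with entry (i, j) = 1 iff observations i and j share a cluster."""
    a = np.asarray(assignments)
    return (a[:, None] == a[None, :]).astype(float)

def mean_l1_at_lag(assignment_trace, lag):
    """Average entrywise L1 difference between co-clustering matrices `lag` apart."""
    mats = [cocluster_matrix(a) for a in assignment_trace]
    diffs = [np.abs(mats[t + lag] - mats[t]).mean() for t in range(len(mats) - lag)]
    return float(np.mean(diffs))

# Toy trace of cluster assignments over three iterations of four observations.
trace = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
```

A flat curve over lags indicates a frozen clustering; larger values indicate that the partition is actually moving, which is the sense in which Figure 3 measures mixing.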
Figure 5 shows that without MH proposal 2, the clusters remain essentially frozen in their initial regions.

NIPS dataset    For our next experiment we modelled the proceedings of the first 13 years of NIPS. The corpus contains about 2 million word tokens spread over 1740 documents, with about 13000 unique words. We used a model that involves both the SNΓP (to capture changes in topic distributions across the years) and the hierarchical Dirichlet process (HDP) [3] (to capture differences among documents). Each document is modelled using a different DP, with the DPs in year i sharing the same base distribution D_i. On top of this, we place a SNΓP (with structure given by the second example in Section 3.1) prior on {D_i}_{i=1}^{13}. Consequently, each topic is associated with a distribution over words, and has a particular timespan. Each document in year i is a mixture over the topics whose timespans include year i. Our model allows statistical strength to be shared in a more refined manner than the HDP: instead of all DPs having the same base distribution, we have 13 dependent base distributions drawn from the SNΓP. The concentration parameters of our DPs were chosen to encourage shared topics, their magnitudes chosen to produce about 100 topics over the whole corpus on average. Figure 6 shows some of the topics identified by the model and their timespans. For inference, we used Gibbs sampling interleaved with all three MH proposals to update the SNΓP. The Markov chain was initialized randomly, except that all clusters were assigned to the top-most region (spanning the 13 years). We calculated per-word perplexity [3] on test documents (about half of all documents, withheld during training). 
We obtained an average perplexity of 3023.4, as opposed to about 3046.5 for the HDP.

Figure 6: Inferred topics with their timespans (the horizontal lines). In parentheses are the number of words assigned to each topic. On the right are the top ten most probable words in the topics.

    Topic A: function, model, data, error, learning, probability, distribution
    Topic B: model, visual, figure, image, motion, object, field
    Topic C: network, memory, neural, state, input, matrix, hopfield
    Topic D: rules, rule, language, tree, representations, stress, grammar
    Topic E: classifier, genetic, memory, classification, tree, algorithm, data
    Topic F: map, brain, fish, electric, retinal, eye, tectal
    Topic G: recurrent, time, context, sequence, gamma, tdnn, sequences
    Topic H: chain, protein, region, mouse, human, markov, sequence
    Topic I: routing, load, projection, forecasting, shortest, demand, packet

Computationally, the 3 MH steps are much cheaper than a round of Gibbs sampling. When trying to split a large cluster (or merge 2 large clusters), MH proposal 1 can still be fairly expensive because of the rounds of restricted Gibbs sampling. MH proposal 3 does not face this problem; however, we find that after the burn-in period it tends to have a low acceptance rate. We believe MH proposal 3 needs to be redesigned to produce more intelligent splits and so increase its acceptance rate. Finally, MH proposal 2 is the cheapest, both in terms of computation and book-keeping, and has a reasonably high acceptance rate. 
We ran MH-proposal 2 a hundred times between successive Gibbs sampling\nupdates. The acceptance rates of the MH proposals (given in Figure 4) are slightly lower than those\nreported by [12], where a plain DP mixture model was applied to a simple synthetic data set, and\nwhere split/merge acceptance rates were on the order of 1 to 5 percent.\n\n6 Discussion\n\nWe described a conceptually simple and elegant framework for the construction of dependent DPs\nbased on normalized gamma processes. The resulting collection of random probability measures has\na number of useful properties: the marginal distributions are DPs and the weights of shared atoms\ncan vary across DPs. We developed auxiliary variable Gibbs and Metropolis-Hastings samplers for\nthe model and applied it to time-varying topic modelling where each topic has its own time-span.\nSince [6] there has been strong interest in building dependent sets of random measures. Interestingly,\nthe property of each random measure being marginally DP, as originally proposed by [6], is often not\nmet in the literature, where dependent stochastic processes are de\ufb01ned through shared and random\nparameters [3, 14, 15, 11]. Useful dependent DPs had not been found [16] until recently, when\na \ufb02urry of models were proposed [17, 18, 19, 20, 21, 22, 23]. However most of these proposals\nhave been de\ufb01ned only for the real line (interpreted as the time line) and not for arbitrary spaces.\n[24, 25, 26, 13] proposed a variety of spatial DPs where the atoms and weights of the DPs are\ndependent through Gaussian processes. A model similar to ours was proposed recently in [23],\nusing the same basic idea of introducing dependencies between DPs through spatially overlapping\nregions. 
This model differs from ours in the content of these shared regions (breaks of a stick in their case versus a (restricted) gamma process in ours) and in the construction of the DPs (they use the stick-breaking construction of the DP; we normalize the restricted gamma process). Consequently, the nature of the dependencies between the DPs differs; for instance, their model cannot be interpreted as a mixture of DPs like ours.
There are a number of interesting future directions. First, we can allow, at additional complexity, the locations of atoms to vary using the spatial DP approach [13]. Second, more work still needs to be done to improve inference in the model, e.g. using a more intelligent MH proposal 3. Third, although we have only described spatial normalized gamma processes, it should be straightforward to extend the approach to spatial normalized random measures [7, 8]. Finally, further investigations into the properties of the SNΓP and its generalizations, including the nature of the dependency between DPs and asymptotic behaviour, are necessary for a complete understanding of these processes.

References
[1] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209-230, 1973.
[2] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14, 2002.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
[4] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 21, 2006.

[5] M. Johnson, T. L. Griffiths, and S. Goldwater. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, volume 19, 2007.

[6] S. MacEachern. Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science. American Statistical Association, 1999.

[7] L. E. Nieto-Barajas, I. Pruenster, and S. G. Walker. Normalized random measures driven by increasing additive processes. Annals of Statistics, 32(6):2343–2360, 2004.

[8] L. F. James, A. Lijoi, and I. Pruenster. Bayesian inference via classes of normalized random measures. ICER Working Papers - Applied Mathematics Series 5-2005, ICER - International Centre for Economic Research, April 2005.

[9] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.

[10] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993.

[11] P. Müller, F. A. Quintana, and G. Rosner. A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society, 66:735–749, 2004.

[12] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Technical report, Department of Statistics, University of Toronto, 2004.

[13] J. A. Duan, M. Guindani, and A. E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825, 2007.

[14] A. Rodríguez, D. B. Dunson, and A. E. Gelfand. The nested Dirichlet process. Technical Report 2006-19, Institute of Statistics and Decision Sciences, Duke University, 2006.

[15] D. B. Dunson, Y. Xue, and L. Carin. The matrix stick-breaking process: Flexible Bayes meta analysis.
Technical Report 07-03, Institute of Statistics and Decision Sciences, Duke University, 2007. http://ftp.isds.duke.edu/WorkingPapers/07-03.html.

[16] N. Srebro and S. Roweis. Time-varying topic models using dependent Dirichlet processes. Technical Report UTML-TR-2005-003, Department of Computer Science, University of Toronto, 2005.

[17] J. E. Griffin and M. F. J. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, Theory and Methods, 101:179–194, 2006.

[18] J. E. Griffin. The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. Technical report, Department of Statistics, University of Warwick, 2007.

[19] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 23, 2007.

[20] A. Ahmed and E. P. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process. In Proceedings of The Eighth SIAM International Conference on Data Mining, 2008.

[21] J. E. Griffin and M. F. J. Steel. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Technical report, University of Kent and University of Warwick, 2008.

[22] J. E. Griffin and M. F. J. Steel. Generalized spatial Dirichlet process models. Technical report, University of Kent and University of Warwick, 2009.

[23] Y. Chung and D. B. Dunson. The local Dirichlet process. Annals of the Institute of Mathematical Statistics, 2009. To appear.

[24] S. N. MacEachern, A. Kottas, and A. E. Gelfand. Spatial nonparametric Bayesian models. In Proceedings of the 2001 Joint Statistical Meetings, 2001.

[25] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, volume 14, 2002.

[26] A. E.
Gelfand, A. Kottas, and S. N. MacEachern. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association, 100(471):1021–1035, 2005.