{"title": "Construction of Dependent Dirichlet Processes based on Poisson Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1396, "page_last": 1404, "abstract": "", "full_text": "Construction of Dependent Dirichlet Processes\n\nbased on Poisson Processes\n\nDahua Lin\nCSAIL, MIT\n\ndhlin@mit.edu\n\nEric Grimson\nCSAIL, MIT\n\nwelg@csail.mit.edu\n\nJohn Fisher\nCSAIL, MIT\n\nfisher@csail.mit.edu\n\nAbstract\n\nWe present a novel method for constructing dependent Dirichlet processes. The\napproach exploits the intrinsic relationship between Dirichlet and Poisson pro-\ncesses in order to create a Markov chain of Dirichlet processes suitable for use\nas a prior over evolving mixture models. The method allows for the creation, re-\nmoval, and location variation of component models over time while maintaining\nthe property that the random measures are marginally DP distributed. Addition-\nally, we derive a Gibbs sampling algorithm for model inference and test it on both\nsynthetic and real data. Empirical results demonstrate that the approach is effec-\ntive in estimating dynamically varying mixture models.\n\n1\n\nIntroduction\n\nAs the cornerstone of Bayesian nonparametric modeling, Dirichlet processes (DP) [22] have been\napplied to a wide variety of inference and estimation problems [3, 10, 20] with Dirichlet process\nmixtures (DPMs) [15, 17] being one of the most successful. DPMs are a generalization of \ufb01nite\nmixture models that allow an inde\ufb01nite number of mixture components. The traditional DPM model\nassumes that each sample is generated independently from the same DP. This assumption is limiting\nin cases when samples come from many, yet dependent, DPs. HDPs [23] partially address this\nmodeling aspect by providing a way to construct multiple DPs implicitly depending on each other\nvia a common parent. However, their hierarchical structure may not be appropriate in some problems\n(e.g. 
temporally varying DPs).
Consider a document model where each document is generated under a particular topic and each topic is characterized by a distribution over words. Over time, topics change: some old topics fade while new ones emerge. For each particular topic, the word distribution may evolve as well. A natural approach to model such topics is to use a Markov chain of DPs as a prior, such that the DP at each time is generated by varying the previous one in three possible ways: creating a new topic, removing an existing topic, and changing the word distribution of a topic.
Since MacEachern introduced the notion of dependent Dirichlet processes (DDP) [12], a variety of DDP constructions have been developed, which are based on either weighted mixtures of DPs [6, 14, 18], generalized Chinese restaurant processes [4, 21, 24], or the stick breaking construction [5, 7]. Here, we propose a fundamentally different approach, taking advantage of the intrinsic relationship between Dirichlet processes and Poisson processes: a Dirichlet process is a normalized Gamma process, while a Gamma process is essentially a compound Poisson process. The key idea is motivated by the following observation: operations that preserve complete randomness, when applied to Poisson processes, result in a new process that remains Poisson. Consequently, one can obtain a Dirichlet process which is dependent on other DPs by applying such operations to their underlying compound Poisson processes. In particular, we discuss three specific operations: superposition, subsampling, and point transition. We develop a Markov chain of DPs by combining these operations, leading to a framework that allows creation, removal, and location variation of particles. This construction inherently comes with an elegant property that the random measure at each time is marginally DP distributed. 
Our approach relates to previous efforts in constructing dependent DPs while overcoming inherent limitations. A detailed comparison is given in section 4.

2 Poisson, Gamma, and Dirichlet Processes

Our construction of dependent Dirichlet processes rests upon the connection between Poisson, Gamma, and Dirichlet processes, as well as the concept of complete randomness. We briefly review these concepts; Kingman [9] provides a detailed exposition of the relevant theory.
Let (Ω, F_Ω) be a measurable space, and Π be a random point process on Ω. Each realization of Π uniquely corresponds to a counting measure N_Π defined by N_Π(A) ≜ #(Π ∩ A) for each A ∈ F_Ω. Hence, N_Π is a measure-valued random variable, or simply a random measure. A Poisson process Π on Ω with mean measure µ, denoted Π ∼ PoissonP(µ), is defined to be a point process such that N_Π(A) has a Poisson distribution with mean µ(A) and that for any disjoint measurable sets A1, . . . , An, the counts N_Π(A1), . . . , N_Π(An) are independent. The latter property is referred to as complete randomness. Poisson processes are the only point processes that satisfy this property [9]:
Theorem 1. A random point process Π on a regular measure space is a Poisson process if and only if N_Π is completely random. If this is true, the mean measure is given by µ(A) = E(N_Π(A)).
Consider Π∗ ∼ PoissonP(µ∗) on a product space Ω × R+. For each realization of Π∗, we define Σ∗ : F_Ω → [0, +∞] as

    Σ∗ ≜ ∑_{(θ, w_θ) ∈ Π∗} w_θ δ_θ.    (1)

Intuitively, Σ∗(A) sums up the values of w_θ with θ ∈ A. Note that Σ∗ is also a completely random measure (but not a point process in general), and is essentially a generalization of the compound Poisson process. 
As a special case, if we choose µ∗ to be

    µ∗ = µ × γ with γ(dw) = w^{-1} e^{-w} dw,    (2)

then the random measure defined in Eq.(1) is called a Gamma process with base measure µ, denoted by G ∼ ΓP(µ). Normalizing any realization of G ∼ ΓP(µ) yields a sample of a Dirichlet process, as

    D ≜ G/G(Ω) ∼ DP(µ).    (3)

In the conventional parameterization, µ is often decomposed into two parts: a base distribution p_µ ≜ µ/µ(Ω), and a concentration parameter α_µ ≜ µ(Ω).

3 Construction of Dependent Dirichlet Processes

Motivated by the relationship between Poisson and Dirichlet processes, we develop a new approach for constructing dependent Dirichlet processes (DDPs). Our approach can be described as follows: given a collection of Dirichlet processes, one can apply operations that preserve the complete randomness of their underlying Poisson processes. This yields a new Poisson process (due to theorem 1) and a related DP which depends on the source. In particular, we consider three such operations: superposition, subsampling, and point transition.
Superposition of Poisson processes: Combining a set of independent Poisson processes yields a Poisson process whose mean measure is the sum of the mean measures of the individual ones.
Theorem 2 (Superposition Theorem [9]). Let Π1, . . . , Πm be independent Poisson processes on Ω with Πk ∼ PoissonP(µk), then their union has

    Π1 ∪ ··· ∪ Πm ∼ PoissonP(µ1 + ··· + µm).    (4)

Given a collection of independent Gamma processes G1, . . . , Gm, where for each k = 1, . . . , m, Gk ∼ ΓP(µk) with underlying Poisson process Π∗_k ∼ PoissonP(µk × γ). 
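Eqs.(2)–(3) have a simple finite-dimensional shadow that is easy to check numerically: on any finite partition of Ω, the Gamma-process masses G(A_i) are independent Gamma variables, and normalizing them produces a Dirichlet vector. A minimal NumPy sketch (the partition masses below are arbitrary; this is an illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-dimensional view of D := G / G(Omega) ~ DP(mu):
# for a partition A_1, ..., A_n of Omega, the Gamma-process masses
# G(A_i) ~ Gamma(mu(A_i), 1) are independent, and the normalized vector
# (D(A_1), ..., D(A_n)) ~ Dirichlet(mu(A_1), ..., mu(A_n)).
mu_masses = np.array([0.5, 1.0, 2.5])   # arbitrary mu(A_i) for a 3-set partition
g = rng.gamma(shape=mu_masses)          # independent Gamma(mu(A_i), 1) draws
d = g / g.sum()                         # one Dirichlet(0.5, 1.0, 2.5) draw
```

The same normalization step is what turns the infinite-dimensional Gamma process into a DP in Eq.(3).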
By theorem 2, we have

    ⋃_{k=1}^m Π∗_k ∼ PoissonP( ∑_{k=1}^m (µk × γ) ) = PoissonP( ( ∑_{k=1}^m µk ) × γ ).    (5)

Due to the relationship between Gamma processes and their underlying Poisson processes, such a combination is equivalent to the direct superposition of the Gamma processes themselves, as

    G′ := G1 + ··· + Gm ∼ ΓP(µ1 + ··· + µm).    (6)

Let Dk = Gk/Gk(Ω) and gk = Gk(Ω), then Dk is independent of gk, and thus

    D′ := G′/G′(Ω) = (g1 D1 + ··· + gm Dm)/(g1 + ··· + gm) = c1 D1 + ··· + cm Dm.    (7)

Here, ck = gk / ∑_{l=1}^m gl, which has (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)). Consequently, one can construct a Dirichlet process through a random convex combination of independent Dirichlet processes. This result is summarized by the following theorem:
Theorem 3. Let D1, . . . , Dm be independent Dirichlet processes on Ω with Dk ∼ DP(µk), and let (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)) be independent of D1, . . . , Dm, then

    D1 ⊕ ··· ⊕ Dm := c1 D1 + ··· + cm Dm ∼ DP(µ1 + ··· + µm).    (8)

Here, we use the symbol ⊕ to indicate superposition via a random convex combination. 
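Theorem 3 translates directly into a sampling recipe for truncated representations: draw (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)) and form the convex combination. A hedged sketch, with each DP represented as (atoms, weights) arrays; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def superpose(dps, alphas, rng):
    """Superposition D1 (+) ... (+) Dm of truncated DPs (Theorem 3).

    Each DP is given as (atoms, weights) with weights summing to 1;
    alphas[k] = mu_k(Omega).  Returns the atoms and weights of
    c1*D1 + ... + cm*Dm with (c1, ..., cm) ~ Dirichlet(alphas).
    """
    c = rng.dirichlet(alphas)
    atoms = np.concatenate([a for a, _ in dps])
    weights = np.concatenate([c[k] * w for k, (_, w) in enumerate(dps)])
    return atoms, weights

# Two toy truncated DPs with three atoms each.
d1 = (np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2]))
d2 = (np.array([5.0, 6.0, 7.0]), np.array([0.6, 0.3, 0.1]))
atoms, weights = superpose([d1, d2], alphas=[1.0, 2.0], rng=rng)
```

Because the ck sum to one, the combined weights again sum to one, matching the left-hand side of Eq.(8).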
Let αk = µk(Ω) and α′ = ∑_{k=1}^m αk, then for each measurable subset A,

    E(D′(A)) = ∑_{k=1}^m (αk/α′) E(Dk(A)), and Cov(D′(A), Dk(A)) = (αk/α′) Var(Dk(A)).    (9)

Subsampling Poisson processes: Random subsampling of a Poisson process via independent Bernoulli trials yields a new Poisson process.
Theorem 4 (Subsampling Theorem). Let Π ∼ PoissonP(µ) be a Poisson process on the space Ω, and q : Ω → [0, 1] be a measurable function. If we independently draw z_θ ∈ {0, 1} for each θ ∈ Π with P(z_θ = 1) = q(θ), and let Πk = {θ ∈ Π : z_θ = k} for k = 0, 1, then Π0 and Π1 are independent Poisson processes on Ω, with Π0 ∼ PoissonP((1 − q)µ) and Π1 ∼ PoissonP(qµ).¹
We emphasize that subsampling is via independent Bernoulli trials rather than choosing a fixed number of particles. We use Sq(Π) := Π1 to denote the result of subsampling, where q is referred to as the acceptance function. Note that subsampling the underlying Poisson process of a Gamma process G is equivalent to subsampling the terms of G. Let G = ∑_{i=1}^∞ wi δ_{θi}, and for each i, draw zi with P(zi = 1) = q(θi). Then, we have

    G′ = Sq(G) := ∑_{i: zi=1} wi δ_{θi} ∼ ΓP(qµ).    (10)

Let D be a Dirichlet process given by D = G/G(Ω); then we can construct a new Dirichlet process D′ = G′/G′(Ω) by subsampling the terms of D and renormalizing their coefficients. This is summarized by the following theorem.
Theorem 5. Let D ∼ DP(µ) be represented by D = ∑_{i=1}^∞ ri δ_{θi}, and q : Ω → [0, 1] be a measurable function. 
For each i we independently draw zi with P(zi = 1) = q(θi); then

    D′ = Sq(D) := ∑_{i: zi=1} r′_i δ_{θi} ∼ DP(qµ),    (11)

where r′_i := ri / ∑_{j: zj=1} rj are the re-normalized coefficients for those i with zi = 1.
Let α = µ(Ω) and α′ = (qµ)(Ω), then for each measurable subset A,

    E(D′(A)) = (qµ)(A)/(qµ)(Ω) = ∫_A q dµ / ∫_Ω q dµ, and Cov(D′(A), D(A)) = (α′/α) Var(D′(A)).    (12)

Point transition of Poisson processes: The third operation moves each point independently following a probabilistic transition. Formally, a probabilistic transition is defined to be a function T : Ω × F_Ω → [0, 1] such that for each θ ∈ Ω, T(θ, ·) is a probability measure on Ω that describes the distribution of where θ moves, and for each A ∈ F_Ω, T(·, A) is integrable. T can be considered as a transformation of measures over Ω, as

    (Tµ)(A) := ∫_Ω T(θ, A) µ(dθ).    (13)

Theorem 6 (Transition Theorem). Let Π ∼ PoissonP(µ) and T be a probabilistic transition, then

    T(Π) := {T(θ) : θ ∈ Π} ∼ PoissonP(Tµ).    (14)

With a slight abuse of notation, we use T(θ) to denote an independent sample from T(θ, ·).

¹ qµ is the measure on Ω given by (qµ)(A) = ∫_A q dµ, or equivalently (qµ)(dθ) = q(θ) µ(dθ).

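Theorems 5 and 7 likewise act atom-by-atom on a truncated representation: Bernoulli-thin with acceptance q(θ) and renormalize, or move each atom by an independent draw from T(θ, ·). A sketch under those assumptions; the Gaussian random-walk transition is an arbitrary illustrative choice, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def subsample(atoms, weights, q, rng):
    """S_q(D): keep atom i with probability q(theta_i), renormalize (Theorem 5)."""
    keep = rng.random(len(atoms)) < q(atoms)
    a, w = atoms[keep], weights[keep]
    return a, w / w.sum() if w.size else w

def transition(atoms, weights, rng, sigma=0.1):
    """T(D): move each atom by an independent draw from T(theta, .) (Theorem 7).
    Here T(theta, .) = N(theta, sigma^2) for illustration; weights are unchanged."""
    return atoms + sigma * rng.standard_normal(len(atoms)), weights

atoms = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([0.4, 0.3, 0.2, 0.1])
a1, w1 = subsample(atoms, weights, q=lambda th: np.full(len(th), 0.8), rng=rng)
a2, w2 = transition(a1, w1, rng)
```

Note that subsampling changes the weights (via renormalization) but not the surviving locations, while point transition changes the locations but not the weights, mirroring Eqs.(11) and (15).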
As a consequence, we can derive a Gamma process and thus a Dirichlet process by applying the probabilistic transition to the location of each term, leading to the following:
Theorem 7. Let D = ∑_{i=1}^∞ ri δ_{θi} ∼ DP(µ) be a Dirichlet process on Ω, then

    T(D) := ∑_{i=1}^∞ ri δ_{T(θi)} ∼ DP(Tµ).    (15)

Theorems 1 and 2 are immediate consequences of the results in [9]. We derive Theorems 3 through 7 independently as part of the proposed approach. Detailed explanation of the relevant concepts and the proofs of Theorems 2 through 7 are provided in the supplement.

3.1 A Markov Chain of Dirichlet Processes

Integrating these three operations, we construct a Markov chain of DPs formulated as

    Dt = T(Sq(Dt−1)) ⊕ Ht,  with Ht ∼ DP(ν).    (16)

The model can be explained as follows: given Dt−1, we choose a subset of terms by subsampling, then move their locations via a probabilistic transition T, and finally superimpose a new DP Ht on the resultant process to form Dt. Hence, creating new particles, removing existing particles, and varying particle locations are all allowed, respectively via superposition, subsampling, and point transition. Note that while these operations are defined on the underlying Poisson processes, theorems 3, 5, and 7 allow us to operate directly on the DPs, without the need to explicitly instantiate the associated Poisson or Gamma processes. Let µt be the base measure of Dt; then

    µt = T(qµt−1) + ν.    (17)

In particular, if the acceptance probability q is a constant, then αt = qαt−1 + αν, where αt = µt(Ω) and αν = ν(Ω) are the concentration parameters. One may hold αt fixed over time by choosing appropriate values for q and αν. 
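One step of Eq.(16) then chains the three operations. The sketch below works on truncated measures, uses a standard stick-breaking sampler for Ht (a generic DP sampler, not part of the paper's derivation), assumes a constant acceptance probability q and a Gaussian random-walk T, and weights the superposition by Dir(qα_{t−1}, αν) following Theorem 3:

```python
import numpy as np

rng = np.random.default_rng(3)

def stick_breaking_dp(alpha, base_sampler, rng, trunc=50):
    """Truncated stick-breaking draw of H ~ DP(alpha * p); standard sampler."""
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return base_sampler(trunc, rng), w / w.sum()

def ddp_step(atoms, weights, alpha_prev, q, alpha_nu, base_sampler, rng, sigma=0.1):
    """One step D_t = T(S_q(D_{t-1})) (+) H_t of the chain in Eq.(16)."""
    # S_q: Bernoulli thinning with constant q, then renormalize (Theorem 5)
    keep = rng.random(len(atoms)) < q
    a, w = atoms[keep], weights[keep]
    w = w / w.sum() if w.size and w.sum() > 0 else w
    # T: independent Gaussian random-walk transition (Theorem 7)
    a = a + sigma * rng.standard_normal(len(a))
    # (+): superposition with the innovation H_t ~ DP(nu) (Theorem 3)
    h_atoms, h_w = stick_breaking_dp(alpha_nu, base_sampler, rng)
    c = rng.dirichlet([q * alpha_prev, alpha_nu])
    new_atoms = np.concatenate([a, h_atoms])
    new_weights = np.concatenate([c[0] * w, c[1] * h_w])
    return new_atoms, new_weights / new_weights.sum(), q * alpha_prev + alpha_nu

base = lambda n, rng: rng.normal(0.0, 3.0, size=n)
atoms0, w0 = stick_breaking_dp(2.0, base, rng)
atoms1, w1, alpha1 = ddp_step(atoms0, w0, 2.0, q=0.8, alpha_nu=0.4,
                              base_sampler=base, rng=rng)
```

The returned concentration follows the recursion αt = qαt−1 + αν from the text; with q = 0.8, αν = 0.4, and αt−1 = 2, the concentration stays at its fixed point 2.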
Furthermore, it can be shown that

    Cov(Dt+n(A), Dt(A)) ≤ q^n Var(Dt(A)).    (18)

The covariance with previous DPs decays exponentially when q < 1, which is often a desirable property in practice. Moreover, we note that ν and q play different roles in controlling the process. Generally, ν determines how frequently new terms appear, while q governs the life span of a term, which has a geometric distribution with mean (1 − q)^{-1}.
We aim to use the Markov chain of DPs as a prior for evolving mixture models. This provides a mechanism with which new component models can be brought in, existing components can be removed, and the model parameters can vary smoothly over time.

4 Comparison with Related Work

In his pioneering work [12], MacEachern proposed the "single-p DDP model". It considers a DDP as a collection of stochastic processes, but does not provide a natural mechanism to change the collection size over time. Müller et al. [14] formulated each DP as a weighted mixture of a common DP and an independent DP. This formulation was extended by Dunson [6] to model latent trait distributions. Zhu et al. [24] presented the time-sensitive DP, in which the contribution of each DP decays exponentially. Teh et al. [23] proposed the HDP, where each child DP takes its parent DP as the base measure. Ren et al. [18] combine the weighted mixture formulation with the HDP to construct the dynamic HDP. In contrast to the model proposed here, a fundamental difference of these models is that the marginal distribution at each node is generally not a DP.
Caron et al. [4] developed a generalized Pólya urn scheme, while Ahmed and Xing [1] developed the recurrent Chinese restaurant process (CRP). Both generalize the CRP to allow time variation while retaining the property of being marginally DP. 
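The concentration recursion αt = qαt−1 + αν converges geometrically to αν/(1 − q) for any constant q < 1, so holding αt fixed amounts to choosing αν = (1 − q)αt. A quick numeric check (the parameter values are arbitrary):

```python
# Evolution of the concentration parameter alpha_t = q * alpha_{t-1} + alpha_nu
# under a constant acceptance probability q; it converges geometrically to
# the fixed point alpha_nu / (1 - q) regardless of the starting value.
q, alpha_nu = 0.8, 0.4
alpha = 5.0                     # arbitrary initial concentration
for _ in range(200):
    alpha = q * alpha + alpha_nu
```

Here αν/(1 − q) = 0.4/0.2 = 2, so αt settles at 2 even though it starts at 5.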
The motivation underlying these methods fundamentally differs from ours, leading to distinct differences in the sampling algorithms. In particular, [4] supports innovation and deletion of particles, but does not support variation of locations. Moreover, its deletion scheme is based on the distribution in history, not on whether a component model fits the new observations. While [1] does support innovation and point transition, there is no explicit way to delete old particles. It can be considered a special case of the proposed framework in which the subsampling operation is not incorporated. We note that [1] is motivated from an algorithmic rather than a theoretical perspective.
Griffin and Steel [7] present the πDDP based on the stick breaking construction [19], reordering the stick breaking ratios for each time so as to obtain different distributions over the particles. This work is further extended [8] to generic stick breaking processes. Chung et al. [5] propose a local DP that generalizes the πDDP. Rather than reordering the stick breaking ratios, they regroup them locally such that dependent DPs can be constructed over a general covariate space. Inference in these models requires sampling a series of auxiliary variables, considerably increasing computational costs. Moreover, the local DP relies on a truncated approximation to devise the sampling scheme.
Recently, Rao and Teh [16] proposed the spatially normalized Gamma process. They construct a universal Gamma process in an auxiliary space and obtain dependent DPs by normalizing it within overlapping local regions. The theoretical foundation differs in that it does not exploit the relationship between the Gamma and Poisson processes, which is at the heart of the proposed model. In [16], the dependency is established through region overlap; in our work, it is accomplished by explicitly transferring particles from one DP to another. 
In addition, this work does not support location variation, as it relies on a universal particle pool that is fixed over time.

5 The Sampling Algorithm

We develop a Gibbs sampling procedure based on the construction of DDPs introduced above. The key idea is to derive sampling steps by exploiting the fact that our construction maintains the property of being marginally DP via connections to the underlying Poisson processes. Furthermore, the derived procedure unifies distinct aspects (innovation, removal, and transition) of our model. Let D ∼ DP(µ) be a Dirichlet process on Ω. Then given a set of samples Φ ∼ D, in which φk appears ck times, we have D|Φ ∼ DP(µ + c1 δ_{φ1} + ··· + cm δ_{φm}). Let D′ be a Dirichlet process depending on D as in Eq.(16), α0 = (qµ)(Ω), and qk = q(φk). Given Φ ∼ D, we have

    D′|Φ ∼ DP( αν pν + α0 p_{qµ} + ∑_{k=1}^m qk ck T(φk, ·) ).    (19)

Sampling from D′. Let θ1 ∼ D′. Marginalizing over D′, we get

    θ1|Φ ∼ (αν/α′_1) pν + (α0/α′_1) p_{qµ} + ∑_{k=1}^m (qk ck/α′_1) T(φk, ·),  with α′_1 = αν + α0 + ∑_{k=1}^m qk ck.    (20)

Thus we sample θ1 from three types of sources: the innovation distribution pν, the q-subsampled base distribution p_{qµ}, and the transition distributions T(φk, ·). In doing so, we first sample a variable u1 that indicates which source to sample from. Specifically, when u1 = −1, u1 = 0, or u1 = l > 0, we respectively sample θ1 from pν, p_{qµ}, or T(φl, ·). The probabilities of these cases are αν/α′_1, α0/α′_1, and ql cl/α′_1, respectively. 
After u1 is obtained, we then draw θ1 from the indicated source.
The next issue is how to update the posterior given θ1 and u1. The answer depends on the value of u1. When u1 = −1 or 0, θ1 is a new particle, and we have

    D′|θ1, {u1 ≤ 0} ∼ DP( αν pν + α0 p_{qµ} + ∑_{k=1}^m qk ck T(φk, ·) + δ_{θ1} ).    (21)

If u1 = l > 0, we know that the particle φl is retained in the subsampling process (i.e. the corresponding Bernoulli trial outputs 1), and the transited version T(φl) is determined to be θ1. Hence,

    D′|θ1, {u1 = l > 0} ∼ DP( αν pν + α0 p_{qµ} + ∑_{k≠l} qk ck T(φk, ·) + (cl + 1) δ_{θ1} ).    (22)

With this posterior distribution, we can subsequently draw the second sample and so on. This process generalizes the Chinese restaurant process in several ways: (1) it allows either inheriting previous particles or drawing new ones; (2) it uses qk to control the chance that we sample a previous particle; (3) the transition T allows smooth variation when we inherit a previous particle.
Inference with Mixture Models. We use the Markov chain of DPs as the prior of evolving mixture models. The generative process is formulated as

    θ1, . . . , θn ∼ D′ i.i.d.,  and  xi ∼ L(θi), i = 1, . . . , n.    (23)

Here, L(θi) is the observation model parameterized by θi. According to the analysis above, we derive an algorithm to sample θ1, . . . , θn conditioned on the observations x1, . . . , xn as follows.
Initialization. (1) Let m̃ denote the number of particles, which is initialized to be m and will increase as we draw new particles from pν or p_{qµ}. 
(2) Let wk denote the prior weights of the different sampling sources, which may also change during sampling. In particular, we set wk = qk ck for k > 0, w−1 = αν, and w0 = α0. (3) Let ψk denote the particles, whose values are decided when a new particle or the transited version of a previous one is sampled. (4) The label li indicates to which particle θi corresponds, and the counter rk records the number of times that ψk has been sampled (set to 0 initially). (5) We compute the expected likelihood, given by F(k, i) := E_{pk}(f(xi|θ)). Here, f(xi|θ) is the likelihood of xi with respect to the parameter θ, and pk is pν, p_{qµ}, or T(φk, ·), respectively, when k = −1, k = 0, or k ≥ 1.
Sequential Sampling. For each i = 1, . . . , n, we first draw the indicator ui with probability P(ui = k) ∝ wk F(k, i). Depending on the value of ui, we sample θi from different sources. For brevity, let p|x denote the posterior distribution derived from the prior distribution p conditioned on the observation x. (1) If ui = −1 or 0, we draw θi from pν|xi or p_{qµ}|xi, respectively, and then add it as a new particle. Concretely, we increase m̃ by 1, let ψm̃ = θi, rm̃ = wm̃ = 1, and set li = m̃. Moreover, we compute F(m̃, i) = f(xi|ψm̃) for each i. (2) Suppose ui = k > 0. If rk = 0, then it is the first time we have drawn ui = k. Since ψk has not been determined, we sample θi ∼ T(φk, ·)|xi and then set ψk = θi. If rk > 0, the k-th particle has been sampled before, so we can simply set θi = ψk. In both cases, we set the label li = k, increase the weight wk and the counter rk by 1, and update F(k, i) to f(xi|ψk) for each i.
Note that this procedure is inefficient in that it samples each particle φk based merely on the first observation with label k. 
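The indicator draw P(ui = k) ∝ wk F(k, i) is the computational core of the sequential step. A hedged sketch with a 1-D Gaussian observation model; pν, p_{qµ}, T(φk, ·), and the likelihood below are illustrative stand-ins, and F(k, i) is estimated by Monte Carlo rather than in closed form:

```python
import numpy as np

rng = np.random.default_rng(4)

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def draw_indicator(x, particles, qc, alpha_nu, alpha_0, rng, n_mc=500):
    """Draw the source indicator u_i with P(u_i = k) proportional to w_k * F(k, i).

    particles are the previous-phase atoms phi_k and qc[k] = q_k * c_k.
    Illustrative stand-ins: p_nu and p_{q mu} are standard normals,
    T(phi_k, .) = N(phi_k, 0.1^2), and f(x | theta) = N(x; theta, 1).
    F(k, i) = E_{p_k}[f(x_i | theta)] is estimated by Monte Carlo.
    """
    weights = np.concatenate([[alpha_nu, alpha_0], qc])        # w_{-1}, w_0, w_k
    f_new = normal_pdf(x, rng.standard_normal(n_mc), 1.0).mean()
    f_old = [normal_pdf(x, phi + 0.1 * rng.standard_normal(n_mc), 1.0).mean()
             for phi in particles]
    scores = weights * np.concatenate([[f_new, f_new], f_old])
    labels = [-1, 0] + list(range(1, len(particles) + 1))
    return labels[rng.choice(len(scores), p=scores / scores.sum())]

u = draw_indicator(x=2.1, particles=np.array([2.0, -3.0]),
                   qc=np.array([4.0, 1.0]), alpha_nu=0.5, alpha_0=0.5, rng=rng)
```

An observation near a heavily weighted previous particle (here x = 2.1 near φ1 = 2.0) will usually inherit its label, while an outlying observation tends to trigger the innovation source u = −1 or 0.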
Therefore, we use this procedure for bootstrapping, and then run a Gibbs sampling scheme that iterates between parameter updates and label updates.
(Parameter update): We resample each particle ψk from its source distribution conditioned on all samples with label k. In particular, for k ∈ [1, m] with rk > 0, we draw ψk ∼ T(φk, ·)|{xi : li = k}, and for k ∈ [m + 1, m̃], we draw ψk ∼ p|{xi : li = k}, where p = p_{qµ} or pν, depending on which source ψk was initially sampled from. After updating ψk, we update F(k, i) accordingly.
(Label update): The label updating is similar to the bootstrapping procedure described above. The only difference is that when we update a label from k to k′, we need to decrease the weight and counter for k. If rk decreases to zero, we remove ψk, and reset wk to qk ck when k ≤ m.
At the end of each phase t, we sample ψk ∼ T(φk, ·) for each k with rk = 0. In addition, for each such particle, we update the acceptance probability as qk ← qk · q(φk), which is the prior probability that the particle φk will survive in the next phase. MATLAB code is available at http://code.google.com/p/ddpinfer/.

6 Experimental Results

Here we present experimental results on both synthetic and real data. In the synthetic case, we compare our method with a dynamic FMM in modeling mixtures of Gaussians whose number and centers evolve over time. For real data, we test the approach in modeling the motion of people in crowded scenes and the trends of research topics reflected in index terms.

6.1 Simulations on Synthetic Data

The data for the simulations were synthesized as follows. 
We initialized the model with two Gaussian components, and added new components following a temporal Poisson process (one per 20 phases on average). For each component, the life span has a geometric distribution with mean 40, the mean evolves independently as a Brownian motion, and the variance is fixed to 1. We performed the simulation for 80 phases, and at each phase, we drew 1000 samples for each active component. At each phase, we sample for 5000 iterations, discarding the first 2000 for burn-in, and collecting a sample every 100 iterations for performance evaluation. The particles of the last iteration at each phase were incorporated into the model as a prior for sampling in the next phase. We obtained the label for each observation by majority voting based on the collected samples, and evaluated the performance by measuring the dissimilarity between the resultant clusters and the ground truth using the variation of information criterion [13]. Under each parameter setting, we repeated the experiment 20 times, using the median of the dissimilarities for comparison.
We compare our approach (D-DPMM) with dynamic finite mixtures (D-FMM), which assumes a fixed number of Gaussians whose centers vary as Brownian motion.

Figure 1: The simulation results: (a) compares the performance between D-DPMM and D-FMM with differing numbers of components; the upper graph shows the median of the distance between the resulting clusters and the ground truth at each phase, and the lower graph shows the actual numbers of clusters. (b) shows the performance of D-DPMM with different values of the acceptance probability, under different data sizes. (c) shows the performance of D-DPMM with different values of the diffusion variance, under different data sizes.

From Figure 1(a), we observe that when the fixed number K of components equals the actual number, the two methods yield comparable performance, while when they are not equal, the errors of D-FMM increase substantially. In particular, K less than the actual number results in significant underfitting (e.g. D-FMM with K = 2 or 3 at phases 30−50 and 66−76); when K is greater than the actual number, samples from the same component are divided into multiple groups and assigned to different components (e.g. D-FMM with K = 5 at phases 1−10 and 30−50). In all cases, D-DPMM consistently outperforms D-FMM due to its ability to adjust the number of components to adapt to changes in the observations.
We also studied how the design parameters impact performance. In Figure 1(b), we see that setting the acceptance probability q to 0.1 tends to create new components rather than inherit them from previous phases, leading to poor performance when the number of samples is limited. If we set q = 0.9, the components in previous phases have a higher survival rate, resulting in more reliable estimation of the component parameters from multiple phases. Figure 1(c) shows the effect of the diffusion variance that controls the parameter variation. When it is small, the parameter in the next phase is tied tightly to the previous value; when it is large, the estimation relies essentially on new observations. Both cases lead to performance degradation on small datasets, which indicates that it is important to maintain a balance between inheritance and innovation. Our framework provides the flexibility to attain such a balance, and cross-validation can be used to set these parameters automatically.

6.2 Real Data Applications

Modeling People Flows. It was observed [11] that the majority of people walking in crowded areas such as a rail station tend to follow motion flows. Typically, there are several flows at a time, and each flow may last for a period. 
In this experiment, we apply our approach to extract the flows. The test was conducted on video acquired in New York's Grand Central Station, which comprises 90,000 frames over one hour (25 fps). A low-level tracker was used to obtain the tracks of people, which were then processed by a rule-based filter that discards obviously incorrect tracks. We adopt the flow model described in [11], which uses an affine field to capture the motion patterns of each flow. The observation for this model is in the form of location-velocity pairs. We divided the entire sequence into 60 phases (one per minute), extracted location-velocity pairs from all tracks, and randomly chose 3000 pairs per phase for model inference. The algorithm infers 37 flows in total, while at each phase, the number of active flows ranges from 10 to 18. Figure 2(a) shows the timelines of the top 20 flows (in terms of the numbers of assigned observations). We compare the performance of our method with D-FMM by measuring the average likelihood on a disjoint dataset. The value for our method is −3.34, while those for D-FMM are −6.71, −5.09, −3.99, −3.49, and −3.34 when K is set to 10, 20, 30, 40, and 50, respectively.

Figure 2: The experimental results on real data. (a) left: the timelines of the top 20 flows; right: illustration of the first two flows. (Larger illustrations are in the supplement.) (b) left: the timelines of the top 10 topics; right: the two leading keywords for these topics. (A list with more keywords is in the supplement.)

Consequently, with a much smaller number of components (12 active components on average), our method attains modeling accuracy similar to that of a D-FMM with 50 components.
Modeling Paper Topics. Next we analyze the evolution of paper topics in IEEE Trans. on PAMI. By parsing the webpages of IEEE Xplore, we collected the index terms of 3014 papers published in PAMI from January 1990 to May 2010. We first compute the similarity between each pair of papers as the relative fraction of overlapping index terms, and then derive a 12-dimensional feature vector for each paper using spectral embedding [2] of the similarity matrix. We run our algorithm on these features with each phase corresponding to a year. Each cluster of papers is deemed a topic. For each topic, we compute the histogram of index terms and sort them in decreasing order of frequency. Figure 2(b) shows the timelines of the top 10 topics, together with the top two index terms for each. Not surprisingly, we see that topics such as "neural networks" arise early and then diminish, while "image segmentation" and "motion estimation" persist.

7 Conclusion and Future Directions

We developed a principled framework for constructing dependent Dirichlet processes. In contrast to most DP-based approaches, our construction is motivated by the intrinsic relation between Dirichlet processes and compound Poisson processes. In particular, we discussed three operations: superposition, subsampling, and point transition, each of which produces DPs that depend on others. We further combined these operations to derive a Markov chain of DPs, leading to a prior over mixture models that allows creation, removal, and location variation of component models under a unified formulation. We also presented a Gibbs sampling algorithm for inferring the models.
The simulations on synthetic data and the experiments on modeling people flows and paper topics clearly demonstrate that the proposed method is effective in estimating mixture models that evolve over time.
This framework can be further extended in several directions. The fact that every completely random point process is a Poisson process suggests that any operation preserving complete randomness can be applied to obtain dependent Poisson processes, and thus dependent DPs. Such operations are by no means restricted to the three discussed in this paper. For example, random merging and random splitting of particles also possess this property, which would lead to an extended framework that allows merging and splitting of component models. Furthermore, while we focused on Markov chains in this paper, the framework generalizes straightforwardly to any acyclic network of DPs. It is also interesting to study how it can be generalized to undirected networks or even continuous covariate spaces. We believe that, as a starting point, this paper will stimulate further efforts to exploit the relation between Poisson processes and Dirichlet processes.

[Figure 2(b) legend, top 10 topics by leading keywords: 1 motion estimation, video sequences; 2 pattern recognition, pattern clustering; 3 statistical models, optimization problem; 4 discriminant analysis, information theory; 5 image segmentation, image matching; 6 face recognition, biological; 7 image representation, feature extraction; 8 photometry, computational geometry; 9 neural nets, decision theory; 10 image registration, image color analysis.]

References
[1] A. Ahmed and E. Xing. Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant Process: with Applications to Evolutionary Clustering. In Proc. of SDM'08, 2008.
[2] F. R. Bach and M. I. Jordan. Learning spectral clustering. In Proc.
of NIPS'03, 2003.
[3] J. Boyd-Graber and D. M. Blei. Syntactic Topic Models. In Proc. of NIPS'08, 2008.
[4] F. Caron, M. Davy, and A. Doucet. Generalized Polya Urn for Time-varying Dirichlet Process Mixtures. In Proc. of UAI'07, 2007.
[5] Y. Chung and D. B. Dunson. The Local Dirichlet Process. Annals of the Institute of Statistical Mathematics, 2009.
[6] D. B. Dunson. Bayesian Dynamic Modeling of Latent Trait Distributions. Biostatistics, 7(4), 2006.
[7] J. E. Griffin and M. F. J. Steel. Order-Based Dependent Dirichlet Processes. Journal of the American Statistical Association, 101(473):179-194, 2006.
[8] J. E. Griffin and M. F. J. Steel. Time-Dependent Stick-Breaking Processes. Technical report, 2009.
[9] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993.
[10] J. J. Kivinen, E. B. Sudderth, and M. I. Jordan. Learning Multiscale Representations of Natural Scenes Using Dirichlet Processes. In Proc. of ICCV'07, 2007.
[11] D. Lin, E. Grimson, and J. Fisher. Learning Visual Flows: A Lie Algebraic Approach. In Proc. of CVPR'09, 2009.
[12] S. N. MacEachern. Dependent Nonparametric Processes. In Proceedings of the Section on Bayesian Statistical Science, 1999.
[13] M. Meila. Comparing Clusterings: An Axiomatic View. In Proc. of ICML'05, 2005.
[14] P. Muller, F. Quintana, and G. Rosner. A Method for Combining Inference across Related Nonparametric Bayesian Models. J. R. Statist. Soc. B, 66(3):735-749, 2004.
[15] R. M. Neal. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2):249-265, 2000.
[16] V. Rao and Y. W. Teh. Spatial Normalized Gamma Processes. In Proc. of NIPS'09, 2009.
[17] C. E. Rasmussen. The Infinite Gaussian Mixture Model. In Proc. of NIPS'00, 2000.
[18] L. Ren, D. B. Dunson, and L. Carin.
The Dynamic Hierarchical Dirichlet Process. In Proc. of ICML'08, 2008.
[19] J. Sethuraman. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4(2):639-650, 1994.
[20] K.-A. Sohn and E. Xing. Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space. In Proc. of NIPS'07, 2007.
[21] N. Srebro and S. Roweis. Time-Varying Topic Models using Dependent Dirichlet Processes, 2005.
[22] Y. W. Teh. Dirichlet Process, 2007.
[23] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
[24] X. Zhu and J. Lafferty. Time-Sensitive Dirichlet Process Mixture Models, 2005.