{"title": "Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2699, "page_last": 2707, "abstract": "Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning.  In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.", "full_text": "Truly Nonparametric Online Variational Inference\n\nfor Hierarchical Dirichlet Processes\n\nMichael Bryant and Erik B. Sudderth\n\nDepartment of Computer Science, Brown University, Providence, RI\n\nmbryantj@gmail.com, sudderth@cs.brown.edu\n\nAbstract\n\nVariational methods provide a computationally scalable alternative to Monte Carlo\nmethods for large-scale, Bayesian nonparametric learning. In practice, however,\nconventional batch and online variational methods quickly become trapped in lo-\ncal optima. In this paper, we consider a nonparametric topic model based on the\nhierarchical Dirichlet process (HDP), and develop a novel online variational in-\nference algorithm based on split-merge topic updates. 
We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.\n\n1 Introduction\n\nBayesian nonparametric methods provide an increasingly important framework for unsupervised learning from structured data. For example, the hierarchical Dirichlet process (HDP) [1] provides a general approach to joint clustering of grouped data, and leads to effective nonparametric topic models. While nonparametric methods are best motivated by their potential to capture the details of large datasets, practical applications have been limited by the poor computational scaling of conventional Monte Carlo learning algorithms.\n\nMean field variational methods provide an alternative, optimization-based framework for nonparametric learning [2, 3]. Aiming at larger-scale applications, recent work [4] has extended online variational methods [5] for the parametric latent Dirichlet allocation (LDA) topic model [6] to the HDP. While this online approach can produce reasonable models of large data streams, we show that the variational posteriors of existing algorithms often converge to poor local optima. Multiple runs are usually necessary to show robust performance, reducing the desired computational gains. 
Furthermore, by applying a fixed truncation to the number of posterior topics or clusters, conventional variational methods limit the ability of purportedly nonparametric models to fully adapt to the data.\n\nIn this paper, we propose novel split-merge moves for online variational inference for the HDP (oHDP) which result in much better predictive performance. We validate our approach on two corpora, one with millions of documents. We also propose an alternative, direct assignment HDP representation which is faster and more accurate than the Chinese restaurant franchise representation used in prior work [4]. Additionally, the inclusion of split-merge moves during posterior inference allows us to dynamically vary the truncation level throughout learning. While conservative truncations can be theoretically justified for batch analysis of fixed-size datasets [2], our data-driven adaptation of the truncation level is far better suited to large-scale analysis of streaming data.\n\nFigure 1: Directed graphical representation of a hierarchical Dirichlet process topic model, in which an unbounded collection of topics φk model the Nj words in each of D documents. Topics occur with frequency πj in document j, and with frequency β across the full corpus.\n\nSplit-merge proposals have been previously investigated for Monte Carlo analysis of nonparametric models [7, 8, 9]. They have also been used for maximum likelihood and variational analysis of parametric models [10, 11, 12, 13]. These deterministic algorithms validate split-merge proposals by evaluating a batch objective on the entire dataset, an approach which is unexplored for nonparametric models and infeasible for online learning. 
We instead optimize the variational objective via stochastic gradient ascent, and split or merge based on only a noisy estimate of the variational lower bound. Over time, these local decisions lead to global estimates of the number of topics present in a given corpus. We review the HDP and conventional variational methods in Sec. 2, develop our novel split-merge procedure in Sec. 3, and evaluate on various document corpora in Sec. 4.\n\n2 Variational Inference for Bayesian Nonparametric Models\n\n2.1 Hierarchical Dirichlet processes\n\nThe HDP is a hierarchical nonparametric prior for grouped mixed-membership data. In its simplest form, it consists of a top-level DP and a collection of D bottom-level DPs (indexed by j) which share the top-level DP as their base measure:\n\nG0 ∼ DP(γH), Gj ∼ DP(αG0), j = 1, . . . , D.\n\nHere, H is a base measure on some parameter space, and γ > 0, α > 0 are concentration parameters. Using a stick-breaking representation [1] of the global measure G0, the HDP can be expressed as\n\nG0 = Σ_{k=1}^∞ βk δ_{φk}, Gj = Σ_{k=1}^∞ πjk δ_{φk}.\n\nThe global weights β are drawn from a stick-breaking distribution β ∼ GEM(γ), and atoms are independently drawn as φk ∼ H. Each Gj shares atoms with the global measure G0, and the lower-level weights are drawn πj ∼ DP(αβ). For this direct assignment representation, the k indices for each Gj index directly into the global set of atoms. To complete the definition of the general HDP, parameters ψjn ∼ Gj are then drawn for each observation n in group j, and observations are drawn xjn ∼ F(ψjn) for some likelihood family F. Note that ψjn = φ_{zjn} for some discrete indicator zjn.\n\nIn this paper we focus on an application of the HDP to modeling document corpora. 
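For intuition, the stick-breaking construction above is easy to simulate under truncation. The following is an illustrative NumPy sketch, not code from the paper; all function names, the truncation level, and the finite Dirichlet projection of DP(αβ) are our own choices:

```python
import numpy as np

def sample_gem(gamma, K, rng):
    """Draw the first K stick-breaking weights of beta ~ GEM(gamma);
    the final entry collects the leftover mass of the infinite tail."""
    v = rng.beta(1.0, gamma, size=K)                       # stick proportions v_k
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * stick_left                                  # beta_k = v_k * prod_{l<k} (1 - v_l)
    return np.append(beta, 1.0 - beta.sum())

rng = np.random.default_rng(0)
beta = sample_gem(gamma=1.0, K=50, rng=rng)                # global topic weights
alpha = 0.5
pi_j = rng.dirichlet(alpha * beta + 1e-12)                 # document weights pi_j ~ DP(alpha * beta), truncated
```

Under the truncation, the lower-level draw πj ∼ DP(αβ) reduces to a finite Dirichlet with parameter vector αβ, which is what the last line samples.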
The topics φk ∼ Dirichlet(η) are distributions on a vocabulary of W words. The global topic weights, β ∼ GEM(γ), are still drawn from a stick-breaking prior. For each document j, document-specific topic frequencies are drawn πj ∼ DP(αβ). Then for each word index n in document j, a topic indicator is drawn zjn ∼ Categorical(πj), and finally a word is drawn wjn ∼ Categorical(φ_{zjn}).\n\n2.2 Batch Variational Inference for the HDP\n\nWe use variational inference [14] to approximate the posterior of the latent variables (φ, β, π, z) — the topics, global topic weights, document-specific topic weights, and topic indicators, respectively — with a tractable distribution q, indexed by a set of free variational parameters. Appealing to mean field methods, our variational distribution is fully factorized, and is of the form\n\nq(φ, β, π, z | λ, θ, ϕ) = q(β) Π_{k=1}^∞ q(φk | λk) Π_{j=1}^D q(πj | θj) Π_{n=1}^{Nj} q(zjn | ϕjn), (1)\n\nwhere D is the number of documents in the corpus and Nj is the number of words in document j. Individual distributions are selected from appropriate exponential families:\n\nq(β) = δ_{β∗}(β), q(φk | λk) = Dirichlet(φk | λk), q(πj | θj) = Dirichlet(πj | θj), q(zjn) = Categorical(zjn | ϕjn),\n\nwhere δ_{β∗}(β) denotes a degenerate distribution at the point β∗.1 In our update derivations below, we use ϕjw to denote the shared ϕjn for all word tokens in document j of type w.\n\nSelection of an appropriate truncation strategy is crucial to the accuracy of variational methods for nonparametric models. 
Here, we truncate the topic indicator distributions by fixing q(zjn = k) = 0 for k > K, where K is a threshold which varies dynamically in our later algorithms. With this assumption, the topic distributions with indices greater than K are conditionally independent of the observed data; we may thus ignore them and tractably update the remaining parameters with respect to the true, infinite model. A similar truncation has been previously used in the context of an otherwise more complex collapsed variational method [3]. Desirably, this truncation is nested, such that increasing K always gives potentially improved bounds, but does not require the computation of infinite sums, as in [16]. In contrast, approximations based on truncations of the stick-breaking topic frequency prior [2, 4] are not nested, and their artifactual placement of extra mass on the final topic K is less suitable for our split-merge online variational inference.\n\nVia standard convexity arguments [14], we lower bound the marginal log likelihood of the observed data using the expected complete-data log likelihood and the entropy of the variational distribution:\n\nL(q) := Eq[log p(φ, β, π, z, w | α, γ, η)] − Eq[log q(φ, π, z | λ, θ, ϕ)]\n= Eq[log p(w | z, φ)] + Eq[log p(z | π)] + Eq[log p(π | αβ)] + Eq[log p(φ | η)] + Eq[log p(β | γ)] − Eq[log q(z | ϕ)] − Eq[log q(π | θ)] − Eq[log q(φ | λ)]\n= Σ_{j=1}^D { Eq[log p(wj | zj, φ)] + Eq[log p(zj | πj)] + Eq[log p(πj | αβ)] − Eq[log q(zj | ϕj)] − Eq[log q(πj | θj)] + (1/D)( Eq[log p(φ | η)] + Eq[log p(β | γ)] − Eq[log q(φ | λ)] ) }, (2)\n\nand maximize this quantity by coordinate ascent on the variational parameters. 
The expectations are with respect to the variational distribution. Each expectation depends on only a subset of the variational parameters; we leave off particular subscripts for notational clarity. Note that the expansion of the variational lower bound in (2) contains all terms inside a summation over documents. This is the key observation that allowed [5] to develop an online inference algorithm for LDA. A full expansion of the variational objective is given in the supplemental material. Taking derivatives of L(q) with respect to each of the variational parameters yields the following updates:\n\nϕjwk ∝ exp { Eq[log φkw] + Eq[log πjk] }, (3)\nθjk ← αβk + Σ_{w=1}^W nw(j) ϕjwk, (4)\nλkw ← η + Σ_{j=1}^D nw(j) ϕjwk. (5)\n\nHere, nw(j) is the number of times word w appears in document j. The expectations in (3) are\n\nEq[log φkw] = Ψ(λkw) − Ψ(Σ_i λki), Eq[log πjk] = Ψ(θjk) − Ψ(Σ_i θji),\n\nwhere Ψ(x) is the digamma function, the first derivative of the log of the gamma function.\n\nIn evaluating our objective, we represent β∗ as a (K + 1)-dimensional vector containing the probabilities of the first K topics, and the total mass of all other topics. While β∗ cannot be optimized in closed form, it can be updated via gradient-based methods; we use a variant of L-BFGS. Drawing a parallel between variational inference and the expectation maximization (EM) algorithm, we label the document-specific updates of (ϕj, θj) the E-step, and the corpus-wide updates of (λ, β) the M-step.\n\n1 We expect β to have small posterior variance in large datasets, and using a point estimate β∗ simplifies variational derivations for our direct assignment formulation. 
As empirically explored for the HDP-PCFG [15], updates to the global topic weights have much less predictive impact than improvements to topic distributions.\n\n2.3 Online Variational Inference\n\nBatch variational inference requires a full pass through the data at each iteration, making it computationally infeasible for large datasets and impossible for streaming data. To remedy this, we adapt and improve recent work on online variational inference algorithms [4, 5].\n\nThe form of the lower bound in (2), as a scaled expectation with respect to the document collection, suggests an online learning algorithm. Given a learning rate ρt satisfying Σ_{t=0}^∞ ρt = ∞ and Σ_{t=0}^∞ ρt² < ∞, we can optimize the variational objective stochastically. Each update begins by sampling a “mini-batch” of documents S, of size |S|. After updating the mini-batch of document-specific parameters (ϕj, θj) by iterating (3, 4), we update the corpus-wide parameters as\n\nλkw ← (1 − ρt)λkw + ρt λ̂kw, (6)\nβ∗k ← (1 − ρt)β∗k + ρt β̂k, (7)\n\nwhere λ̂kw is a set of sufficient statistics for topic k, computed from a noisy estimate of (5):\n\nλ̂kw = η + (D/|S|) Σ_{j∈S} nw(j) ϕjwk. (8)\n\nThe candidate topic weights β̂ are found via gradient-based optimization on S. The resulting inference algorithm is similar to conventional batch methods, but is applicable to streaming, big data.\n\n3 Split-Merge Updates for Online Variational Inference\n\nWe develop a data-driven split-merge algorithm for online variational inference for the HDP, referred to as oHDP-SM. The algorithm dynamically expands and contracts the truncation level K by splitting and merging topics during specialized moves which are interleaved with standard online variational updates. 
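For reference, the standard online variational updates (3)-(8) that these moves interleave with can be sketched in a few lines of NumPy. This is a simplified illustration under our own naming, with a fixed iteration count in place of a convergence check, and omitting the gradient-based β∗ update:

```python
import numpy as np
from scipy.special import digamma

def doc_e_step(n_j, E_log_phi, alpha, beta, n_iter=50):
    """Iterate updates (3)-(4) for one document.
    n_j: (W,) word counts; E_log_phi: (K, W) expected log topic-word probs."""
    K, W = E_log_phi.shape
    theta = alpha * beta + n_j.sum() / K            # arbitrary positive initialization
    for _ in range(n_iter):
        E_log_pi = digamma(theta) - digamma(theta.sum())
        log_phi = E_log_pi[:, None] + E_log_phi     # update (3), unnormalized
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                      # normalize over topics k
        theta = alpha * beta + phi @ n_j            # update (4)
    return phi, theta

def online_m_step(lam, docs, phis, D, eta, rho_t):
    """Blend old topics with the noisy minibatch estimate: updates (6) and (8)."""
    stats = sum(phi * n_j for phi, n_j in zip(phis, docs))  # n_w(j) * phi_jwk, summed over j in S
    lam_hat = eta + (D / len(docs)) * stats                 # noisy estimate (8)
    return (1.0 - rho_t) * lam + rho_t * lam_hat            # stochastic update (6)
```

Each minibatch thus touches only |S| documents, while the D/|S| rescaling keeps λ̂ an unbiased estimate of the batch statistics in (5).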
The resulting model truly allows the number of topics to grow with the data. As such, we do not have to employ the technique of [4, 3] and other truncated variational approaches of setting K above the expected number of topics and relying on the inference to infer a smaller number. Instead, we initialize with small K and let the inference discover new topics as it progresses, similar to the approach used in [17]. One can see how this property would be desirable in an online setting, as documents seen after many inference steps may still create new topics.\n\n3.1 Split: Creation of New Topics\n\nGiven the result of analyzing one mini-batch, q∗ = {(ϕj, θj)_{j=1}^{|S|}, λ, β∗}, and the corresponding value of the lower bound L(q∗), we consider splitting topic k into two topics k′, k′′.2 The split procedure proceeds as follows: (1) initialize all variational posteriors to break symmetry between the new topics, using information from the data; (2) refine the new variational posteriors using a restricted iteration; (3) accept or reject the split via the change in variational objective value.\n\nInitialize new variational posteriors To break symmetry, we initialize the new topic posteriors (λk′, λk′′) and topic weights (β∗k′, β∗k′′) using sufficient statistics from the previous iteration:\n\nλk′ = (1 − ρt)λk, λk′′ = ρt λ̂k,\nβ∗k′ = (1 − ρt)β∗k, β∗k′′ = ρt β̂k.\n\nIntuitively, we expect the sufficient statistics to provide insight into how a topic was actually used during the E-step. 
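In code, the global-parameter half of this initialization amounts to scaling topic k's current parameters and its minibatch sufficient statistics. A minimal sketch with our own names, assuming the sufficient statistics λ̂ and β̂ from the previous iteration are available as arrays:

```python
import numpy as np

def split_topic(lam, beta, lam_hat, beta_hat, k, rho_t):
    """Split topic k: row k becomes k' = (1 - rho_t) * (old parameters),
    and a new row k'' = rho_t * (previous sufficient statistics) is appended."""
    lam_new = np.vstack([lam, rho_t * lam_hat[k]])   # k'' appended at the end
    lam_new[k] = (1.0 - rho_t) * lam[k]              # k' replaces k in place
    beta_new = np.append(beta, rho_t * beta_hat[k])
    beta_new[k] = (1.0 - rho_t) * beta[k]
    return lam_new, beta_new
```

As footnote 2 notes, topic k′ replaces k while k′′ is appended as a new topic; the ordering is chosen for convenience rather than necessity.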
The minibatch-specific parameters {ϕj, θj}_{j=1}^{|S|} are then initialized as follows,\n\nϕjwk′ = ωk ϕjwk, ϕjwk′′ = (1 − ωk) ϕjwk,\nθjk′ = ωk θjk, θjk′′ = (1 − ωk) θjk,\n\nwith the weights defined as ωk = βk′ / (βk′ + βk′′).\n\n2 Technically, we replace topic k with topic k′ and add k′′ as a new topic. In practice, we found that the order of topics in the global stick-breaking distribution had little effect on overall algorithm performance.\n\nAlgorithm 1 Restricted iteration\n1: initialize (λℓ, βℓ) for ℓ ∈ {k′, k′′}\n2: for j ∈ S do\n3: initialize (ϕj, θj) for ℓ ∈ {k′, k′′}\n4: while not converged do\n5: update (ϕj, θj) for ℓ ∈ {k′, k′′} using (3, 4)\n6: end while\n7: update (λℓ, βℓ) for ℓ ∈ {k′, k′′} using (6, 7)\n8: end for\n\nRestricted iteration After initializing the variational parameters for the new topics, we update them through a restricted iteration of online variational inference. The restricted iteration consists of restricted analogues to both the E-step and the M-step, where all parameters except those for the new topics are held constant. This procedure is similar to, and inspired by, the “partial E-step” for split-merge EM [10] and restricted Gibbs updates for split-merge MCMC methods [7].\n\nAll values of ϕjwℓ and θjℓ, ℓ ∉ {k′, k′′}, remain unchanged. It is important to note that even though these values are not updated, they are still used in the calculations for both the variational expectation of πj and the normalization of ϕ. 
In particular,\n\nϕjwk′ = exp { Eq[log φk′w] + Eq[log πjk′] } / Σ_{ℓ∈T} exp { Eq[log φℓw] + Eq[log πjℓ] },\nEq[log πjk′] = Ψ(θjk′) − Ψ(Σ_{k∈T} θjk),\n\nwhere T is the original set of topics, minus k, plus k′ and k′′. The expected log word probabilities Eq[log φk′w] and Eq[log φk′′w] are computed using the newly updated λ values.\n\nEvaluate Split Quality Let ϕsplit for minibatch S be ϕ as defined above, but with ϕjwk replaced by the ϕjwk′ and ϕjwk′′ learned in the restricted E-step. Let θsplit, λsplit, and β∗split be defined similarly. Now we have a new model state qsplit(k) = {(ϕsplit, θsplit)_{j=1}^{|S|}, λsplit, β∗split}. We calculate L(qsplit(k)), and if L(qsplit(k)) > L(q∗), we update the model state q∗ ← qsplit(k), accepting the split. If L(qsplit(k)) < L(q∗), then we go back and test another split, until all splits are tested. In practice we limit the maximum number of allowed splits each iteration to a small constant. If we wish to allow the model to expand the number of topics more quickly, we can increase this number. Finally, it is important to note that all aspects of the split procedure are driven by the data — the new topics are initialized using data-driven proposals, refined by re-running the variational E-step, and accepted based on an unbiased estimate of the change in the variational objective.\n\n3.2 Merge: Removal of Redundant Topics\n\nConsider a candidate merge of two topics, k′ and k′′, into a new topic k. 
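As detailed next, merging sums the corresponding variational parameters, and candidate pairs are screened by the sample covariance of the document-level weights θ. Both operations are a few lines of code; this sketch uses our own names and a simple threshold:

```python
import numpy as np

def merge_candidates(theta, thresh=0.0):
    """Return topic pairs whose document weights co-vary: entries of the
    sample covariance of the (num_docs, K) matrix theta above thresh."""
    C = np.cov(theta, rowvar=False)
    K = C.shape[0]
    return [(a, b) for a in range(K) for b in range(a + 1, K) if C[a, b] > thresh]

def merge_topics(lam, beta, a, b):
    """Merge topic b into topic a by summing variational parameters, then drop b."""
    lam = lam.copy()
    beta = beta.copy()
    lam[a] += lam[b]
    beta[a] += beta[b]
    return np.delete(lam, b, axis=0), np.delete(beta, b)
```

The intuition behind the screen is the one given in the text: two copies of a topic tend to be used in the same documents, so their θ columns have positive covariance.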
For batch variational methods, it is straightforward to determine whether such a merge will increase or decrease the variational objective by combining all parameters for all documents,\n\nϕjwk = ϕjwk′ + ϕjwk′′, θjk = θjk′ + θjk′′, βk = βk′ + βk′′, λk = λk′ + λk′′,\n\nand computing the difference in the variational objective before and after the merge. Because many terms cancel, computing this bound change is fairly computationally inexpensive, but it can still be computationally infeasible to consider all pairs of topics for large K. Instead, we identify potential merge candidates by looking at the sample covariance of the θj vectors across the corpus (or minibatch). Topics with positive covariance above a certain threshold have the quantitative effects of their merge evaluated. Intuitively, if there are two copies of a topic or a topic is split into two pieces, they should tend to be used together, and therefore have positive covariance. For consistency in notation, we call the model state with topics k′ and k′′ merged qmerge(k′,k′′).\n\nCombining this merge procedure with the previous split proposals leads to the online variational method of Algorithm 2. In an online setting, we can only compute unbiased noisy estimates of the true difference in the variational objective; split or merge moves that increase the expected variational objective are not guaranteed to do so for the objective evaluated over the entire corpus. The uncertainty associated with the online method can be mitigated to some extent by using large minibatches. Confidence intervals for the expected change in the variational objective can be computed, and might be useful in a more sophisticated acceptance rule. Note that our usage of a nested family of variational bounds is key to the accuracy and stability of our split-merge acceptance rules.\n\nAlgorithm 2 Online variational inference for the HDP + split-merge\n1: initialize (λ, β∗)\n2: for t = 1, 2, . . . do\n3: for j ∈ minibatch S do\n4: initialize (ϕj, θj)\n5: while not converged do\n6: update (ϕj, θj) using (3, 4)\n7: end while\n8: end for\n9: for pairs of topics {k′, k′′} ∈ K × K with Cov(θjk′, θjk′′) > 0 do\n10: if L(qmerge(k′,k′′)) > L(q) then\n11: q ← qmerge(k′,k′′)\n12: end if\n13: end for\n14: update (λ, β∗) using (6, 7)\n15: for k = 1, 2, . . . , K do\n16: compute L(qsplit(k)) via restricted iteration\n17: if L(qsplit(k)) > L(q) then\n18: q ← qsplit(k)\n19: end if\n20: end for\n21: end for\n\n4 Experimental Results\n\nTo demonstrate the effectiveness of our split-merge moves, we compare three algorithms: batch variational inference (bHDP), online variational inference without split-merge (oHDP), and online variational inference with split-merge (oHDP-SM). On the NIPS corpus we also compare these three methods to collapsed Gibbs sampling (CGS) and the CRF-style oHDP model (oHDP-CRF) proposed by [4].3 We test the models on one synthetic and two real datasets:\n\nBars: A 20-topic bars dataset of the type introduced in [18], where topics can be viewed as bars on a 10 × 10 grid. The vocabulary size is 100, with a training set of 2000 documents and a test set of 200 documents, 250 words per document.\nNIPS: 1,740 documents from the Neural Information Processing Systems conference proceedings, 1988-2000. The vocabulary size is 13,649, and there are 2.3 million tokens in total. 
We randomly divide the corpus into a 1,392-document training set and a 348-document test set.\nNew York Times: The New York Times Annotated Corpus4 consists of over 1.8 million articles appearing in the New York Times between 1987 and 2007. The vocabulary is pruned to 8,000 words. We hold out a randomly selected subset of 5,000 test documents, and use the remainder for training.\n\nAll values of K given for oHDP-SM models are initial values — the actual truncation levels fluctuate during inference. While the truncation level K is different from the actual number of topics assigned non-negligible mass, the split-merge model tends to merge away unused topics, so these numbers are usually fairly close. Hyperparameters are initialized to consistent values across all algorithms and datasets, and learned via Newton-Raphson updates (or in the case of CGS, resampled). We use the same learning rate schedule across all online algorithms. As suggested by [4], we set ρt = (τ + t)^{−κ} where τ = 1, κ = 0.5. Empirically, we found that slower learning rates could result in greatly reduced performance, across all models and datasets.\n\n3 For CGS we use the code available at http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz, and for oHDP-CRF we use the code at http://www.cs.princeton.edu/~chongw/software/onlinehdp.tar.gz.\n4 http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T19\n\nTo compare algorithm performance, we use per-word heldout likelihood, similarly to the metrics of [3, 19, 4]. We randomly split each test document in Dtest into 80%-20% pieces, wj1 and wj2. Then, using φ̄ as the variational expectation of the topics from training, we learn π̄j on wj1 and approximate the probability of wj2 as Π_{w∈wj2} Σ_k π̄jk φ̄kw. 
The overall test metric is then\n\nE = ( Σ_{j∈Dtest} Σ_{w∈wj2} log( Σ_k π̄jk φ̄kw ) ) / ( Σ_{j∈Dtest} |wj2| ).\n\n4.1 Bars\n\nFor the bars data, we initialize eight oHDP-SM runs with K = {2, 5, 10, 20, 40, 50, 80, 100}, eight runs of oHDP with K = 20, and eight runs with K = 50. As seen in Figure 2(a), the oHDP algorithm converges to local optima, while the oHDP-SM runs all converge to the global optimum. More importantly, all split-merge methods converge to the correct number of topics, while oHDP uses either too few or too many topics. Note that the data-driven split-merge procedure allows splitting and merging of topics to mostly cease once the inference has converged (Figure 2(d)).\n\n4.2 NIPS\n\nWe compare oHDP-SM, oHDP, bHDP, oHDP-CRF, and CGS in Figure 2. Shown are two runs of oHDP-SM with K = {100, 300}, two runs each of oHDP and bHDP with K = {300, 1000}, and one run each of oHDP-CRF and CGS with K = 300. All the runs displayed are the best runs from a larger sample of trials. Since oHDP and bHDP will use only a subset of topics under the truncation, setting K much higher results in comparable numbers of topics as oHDP-SM. We set |S| = 200 for the online algorithms, and run all methods for approximately 40 hours of CPU time.\n\nThe non-split-merge methods reach poor local optima relatively quickly, while the split-merge algorithms continue to improve. Notably, both oHDP-CRF and CGS perform much worse than any of our methods. It appears that the CRF model performs very poorly for small datasets, and CGS reaches a mode quickly but does not mix between modes. Even though the split-merge algorithms improve in part by adding topics, they are using their topics much more effectively (Figure 2(h)). We speculate that for the NIPS corpus especially, the reason that models achieve better predictive likelihoods with more topics is due to the bursty properties of text data [20]. 
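The per-word heldout metric E defined at the start of this section can be computed with a short routine. This is a sketch under our own naming, assuming each π̄j has already been fit on the corresponding wj1 piece:

```python
import numpy as np

def per_word_heldout_ll(pi_bar, phi_bar, held_out):
    """Metric E: average log p(w) over all held-out tokens.
    pi_bar: (num_docs, K) document weights learned on w_j1;
    phi_bar: (K, W) expected topics; held_out: list of arrays of word ids (w_j2)."""
    total_ll, total_tokens = 0.0, 0
    for pi_j, w2 in zip(pi_bar, held_out):
        pred = pi_j @ phi_bar                  # mixture predictive distribution over the vocabulary
        total_ll += np.log(pred[w2]).sum()
        total_tokens += len(w2)
    return total_ll / total_tokens
```

Because the denominator counts tokens rather than documents, scores are comparable across corpora with very different document lengths.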
Figure 3 illustrates the topic refinement and specialization which occurs in successful split proposals.\n\n4.3 New York Times\n\nAs batch variational methods and samplers are not feasible for such a large dataset, we compare two runs of oHDP with K = {300, 500} to a run of oHDP-SM with K = 200 initial topics. We also use a larger minibatch size of |S| = 10,000; split-merge acceptance decisions can sometimes be unstable with overly small minibatches. Figure 2(c) shows an inherent problem with oHDP for very large datasets — when truncated to K = 500, the algorithm uses all of its available topics and exhibits overfitting. For oHDP-SM, however, predictive likelihood improves over a substantially longer period and overfitting is greatly reduced.\n\n5 Discussion\n\nWe have developed a novel split-merge online variational algorithm for the hierarchical DP. This approach leads to more accurate models and better predictive performance, as well as a model that is able to adapt the number of topics more freely than conventional approximations based on fixed truncations. Our moves are similar in spirit to split-merge samplers, but by evaluating their quality stochastically using streaming data, we can rapidly adapt model structure to large-scale datasets.\n\nWhile many papers have tried to improve conventional mean field methods via higher-order variational expansions [21], local optima can make the resulting algorithms compare unfavorably to Monte Carlo methods [3]. Here we pursue the complementary goal of more robust, scalable optimization of simple variational objectives. Generalization of our approach to more complex hierarchies of DPs, or basic DP mixtures, is feasible. 
We believe similar online learning methods will\nprove effective for the combinatorial structures of other Bayesian nonparametric models.\n\nAcknowledgments We thank Dae Il Kim for his assistance with the experimental results.\n\n7\n\n\f\u22122.5\n\n\u22123\n\n\u22123.5\n\n\u22124\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \n\ng\no\n\nl\n \n\nd\nr\no\nw\n\u2212\nr\ne\np\n\nBars\n\n \n\noHDP\u2212SM\noHDP, K=50\noHDP, K=20\n\n\u22124.5\n\n \n0\n\n50\n\n100\n\n150\n\n200\n\n250\n\n300\n\n350\n\n400\n\n\u22127.4\n\n\u22127.5\n\n\u22127.6\n\n\u22127.7\n\n\u22127.8\n\n\u22127.9\n\n\u22128\n\n\u22128.1\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \n\ng\no\n\nl\n \n\nd\nr\no\nw\n\u2212\nr\ne\np\n\n \n0\n\niteration\n\n(a)\n\noHDP\u2212SM, K=2,100\noHDP, K=50\noHDP, K=20\n\n \n\n600\n\nd\ne\ns\nu\n \ns\nc\np\no\n\ni\n\nt\n \n\n#\n\n500\n\n400\n\n300\n\n200\n\n100\n\n80\n\n70\n\n60\n\n50\n\n40\n\n30\n\n20\n\n10\n\nd\ne\ns\nu\n\n \ns\nc\np\no\n\ni\n\nt\n \n\n#\n\nNIPS\n\n \n\n \n\nNew York Times\n\n\u22127.56\n\n\u22127.58\n\n\u22127.6\n\n\u22127.62\n\n\u22127.64\n\n\u22127.66\n\n\u22127.68\n\n\u22127.7\n\n\u22127.72\n\n\u22127.74\n\n\u22127.76\n\n\u22127.78\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \n\ng\no\n\nl\n \n\nd\nr\no\nw\n\u2212\nr\ne\np\n\noHDP\u2212SM, K=100\noHDP\u2212SM, K=300\noHDP, K=300\noHDP, K=1000\nbHDP, K=300\nbHDP, K=1000\noHDP\u2212CRF, K=300\nCGS, K=300\n\noHDP, K=300\noHDP, K=500\noHDP\u2212SM, K=200\n\n2.5\n\n5\n\n7.5\n\n10\n\n12.5\n\ndocuments seen\n\n15\n5\nx 10\n\n\u22127.8\n\n \n0\n\n0.5\n\n1\n\n1.5\n2.5\ndocuments seen\n\n2\n\n3\n\n3.5\n\n4\n6\nx 10\n\n(b)\n\n(c)\n\n550\n\n500\n\n450\n\n400\n\n350\n\n300\n\n250\n\n200\n\nd\ne\ns\nu\n \ns\nc\np\no\nt\n \n#\n\ni\n\n0\n\n \n0\n\n50\n\n100\n\niteration\n\n(d)\n\n150\n\n200\n\n0\n\n0\n\n2.5\n\n5\n\n7.5\n\n10\n\n12.5\n\ndocuments seen\n\n15\n5\nx 10\n\n150\n\n0\n\n0.5\n\n1\n\n1.5\n2.5\ndocuments seen\n\n2\n\n3\n\n3.5\n\n4\n6\nx 
10\n\n(e)\n\n(f)\n\n\u22122.5\n\n\u22123\n\n\u22123.5\n\n\u22124\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \n\ng\no\n\nl\n \n\nd\nr\no\nw\n\u2212\nr\ne\np\n\n\u22124.5\n\n0\n\n\u22127.4\n\n\u22127.5\n\n\u22127.6\n\n\u22127.7\n\n\u22127.8\n\n\u22127.9\n\n\u22128\n\n\u22128.1\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \n\ng\no\n\nl\n \n\nd\nr\no\nw\n\u2212\nr\ne\np\n\n\u22127.56\n\n\u22127.58\n\n\u22127.6\n\n\u22127.62\n\n\u22127.64\n\n\u22127.66\n\n\u22127.68\n\n\u22127.7\n\n\u22127.72\n\n\u22127.74\n\n\u22127.76\n\n\u22127.78\n\nd\no\no\nh\n\ni\nl\n\ne\nk\n\ni\nl\n \ng\no\nl\n \nd\nr\no\nw\n\u2212\nr\ne\np\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n0\n\n100\n\n200\n\n300\n\n400\n\n500\n\n600\n\n\u22127.8\n\n150\n\n200\n\n250\n\n300\n\n350\n\n400\n\n450\n\n500\n\n550\n\n# topics used\n\n(g)\n\n# topics used\n\n(h)\n\n# topics used\n\n(i)\n\nFigure 2: Trace plots of heldout likelihood and number of topics used. Across all datasets, common color\nindicates common algorithm, while for NIPS and New York Times, line type indicates different initializations.\nTop: Test log likelihood for each dataset. Middle: Number of topics used per iteration. Bottom: A plot of\nper-word log likelihood against number of topics used. 
Note particularly plot (h), where for every cardinality\nof used topics shown, there is a split-merge method outperforming a conventional method.\n\nOriginal topic\n\npatterns\npattern\ncortex\nneurons\nneuronal\n\nsingle\n\nresponses\n\ninputs\ntype\n\nactivation\n\n40,000\npatterns\npattern\ncortex\nneurons\nneuronal\nresponses\n\nsingle\ninputs\n\ntemporal\nactivation\npatterns\nneuronal\npattern\nneurons\ncortex\ninputs\n\nactivation\n\ntype\n\npreferred\n\npeak\n\n80,000\npatterns\npattern\ncortex\nneurons\nneuronal\nresponses\n\nsingle\n\ntemporal\n\ninputs\ntype\n\nneuronal\npatterns\npattern\nneurons\ncortex\n\nactivation\ndendrite\ninputs\npeak\n\npreferred\n\n120,000\npatterns\npattern\ncortex\nneurons\nresponses\nneuronal\n\nsingle\ntype\n\nnumber\ntemporal\nneuronal\nneurons\nactivation\n\ncortex\ndendrite\npreferred\npatterns\n\npeak\n\npyramidal\n\ninputs\n\n160,000\npatterns\npattern\ncortex\nneurons\nresponses\n\ntype\n\n200,000\npatterns\npattern\ncortex\nneurons\nresponses\n\ntype\n\n240,000\npatterns\npattern\ncortex\n\nresponses\n\ntypes\ntype\n\nbehavioral\n\nbehavioral\n\nbehavioral\n\ntypes\n\nneuronal\n\nsingle\n\nneuronal\ndendritic\n\npeak\n\nactivation\n\ncortex\n\npyramidal\n\nmsec\n\ufb01re\n\ntypes\nform\n\nneuronal\nneuronal\ndendritic\n\n\ufb01re\npeak\n\nactivation\n\nmsec\n\npyramidal\n\ncortex\n\nform\n\nneurons\n\nareas\n\nneuronal\ndendritic\n\npostsynaptic\n\n\ufb01re\n\ncortex\n\nactivation\n\npeak\nmsec\n\ndendrites\n\ninputs\n\npostsynaptic\n\ninputs\n\npyramidal\n\ninputs\n\nFigure 3: The evolution of a split topic. The left column shows the topic directly prior to the split. After\n240,000 more documents have been analyzed, subtle differences become apparent: the top topic covers terms\nrelating to general neuronal behavior, while the bottom topic deals more speci\ufb01cally with neuron \ufb01ring.\n\n8\n\n\fReferences\n\n[1] Y.W. Teh, M. Jordan, and M. Beal. Hierarchical Dirichlet processes. JASA, 2006.\n\n[2] D. 
Blei and M. Jordan. Variational methods for Dirichlet process mixtures. Bayesian Analysis, 1:121-144, 2005.

[3] Y.W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. NIPS, 2008.

[4] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. AISTATS, 2011.

[5] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. NIPS, 2010.

[6] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 2003.

[7] S. Jain and R. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158-182, 2004.

[8] D.B. Dahl. Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Technical report, Texas A&M University, 2005.

[9] C. Wang and D. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. ArXiv e-prints, January 2012.

[10] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton. SMEM algorithm for mixture models. Neural Computation, 2000.

[11] K. Kurihara and M. Welling. Bayesian K-means as a 'Maximization-Expectation' algorithm. SIAM Conference on Data Mining (SDM), 2006.

[12] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15, 2002.

[13] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. NIPS, 2000.

[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 1999.

[15] P. Liang, S. Petrov, D. Klein, and M. Jordan. The infinite PCFG using hierarchical Dirichlet processes. Empirical Methods in Natural Language Processing, 2007.

[16] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational Dirichlet process mixtures. NIPS, 2007.

[17] D. Blei and C. Wang. Variational inference for the nested Chinese restaurant process. NIPS, 2009.

[18] T.L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228-5235, 2004.

[19] A. Asuncion, M. Welling, P. Smyth, and Y.W. Teh. On smoothing and inference for topic models. UAI, 2009.

[20] G. Doyle and C. Elkan. Accounting for word burstiness in topic models. ICML, 2009.

[21] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1-305, 2008.