{"title": "Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 1129, "page_last": 1136, "abstract": "Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as $O(k-1) + \\log m /m$, where $k$ is the number of topics in the model and $m$ is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation.", "full_text": "Relative Performance Guarantees for\n\nApproximate Inference in Latent Dirichlet Allocation\n\nIndraneel Mukherjee\n\nDavid M. Blei\n\nDepartment of Computer Science\n\n{imukherj,blei}@cs.princeton.edu\n\nPrinceton, NJ 08540\n\nPrinceton University\n\n35 Olden Street\n\nAbstract\n\nHierarchical probabilistic modeling of discrete data has emerged as a powerful\ntool for text analysis. Posterior inference in such models is intractable, and prac-\ntitioners rely on approximate posterior inference methods such as variational in-\nference or Gibbs sampling. 
There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as $O(k-1) + \\sqrt{\\log m/m}$, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation.\n\n1 Introduction\n\nHierarchical probabilistic models of discrete data have emerged as a powerful tool for large-scale text analysis. Based on latent semantic indexing (LSI) [1] and probabilistic latent semantic indexing (pLSI) [2], hierarchical topic models [3, 4] have been extended and applied to sequential settings [5, 6], authorship [7], email [8], social networks [9, 10], computer vision [11, 12], bioinformatics [5, 13], information retrieval [14], and other application areas [15, 16, 17, 18]. See [19] for a good review.\nA topic model posits a generative probabilistic process of a document collection using a small number of distributions over words, which are called topics. Conditioned on the observed documents, the distribution of the underlying latent variables is inferred to probabilistically partition the data according to their hidden themes. 
Research in topic models has involved tailoring the latent structure to new kinds of data and designing new posterior inference algorithms to infer that latent structure.\nIn generative models, such as latent Dirichlet allocation (LDA) and its extensions, inferring the posterior of the latent variables is intractable [3, 4]. (Some topic models, such as LSI and pLSI, are not fully generative.) Several algorithms have emerged in recent years to approximate such posteriors, including mean-field variational inference [3], expectation propagation [20], collapsed Gibbs sampling [19] and, most recently, collapsed variational inference [21]. Choosing from among the several available algorithms is difficult. There has been some empirical comparison in the topic modeling literature [4, 19], but little theoretical guidance.\n\nWe provide some of the first theoretical understanding of which of the available techniques is appropriate, and in which data analysis settings. We analyze two variational inference algorithms for topic models, mean field variational inference (VB) [3] and collapsed variational inference (CVB) [21]. “Collapsing,” or marginalizing out, a latent variable is a known technique for speeding up the convergence of Gibbs samplers, and CVB brought this idea to the world of variational algorithms. Empirically, CVB was more accurate than VB for LDA [21]. The advantage of CVB applied to Dirichlet process mixtures was less conclusive [22].\nVariational algorithms minimize the distance between a simple distribution of the latent variables and the true posterior. This is equivalent to maximizing a lower bound on the log probability of a document. We prove that the uncollapsed variational bound on the log probability of a document approaches the collapsed variational bound as the number of words in the document increases. 
This supports the empirical improvement observed for LDA, where documents are relatively short, and the smaller improvement observed in the DP mixture, which is akin to inference in a single long document. We also show how the number of topics and the sparsity of those topics affect the performance of the two algorithms.\n\nWe prove that the difference between the two bounds decreases as $O(k-1) + \\sqrt{\\log m/m}$, where k is the number of topics in the model, and m is the number of words in the document. Thus, the advantage of CVB over VB is lost for longer documents. We examine the consequences of the theory on both simulated and real text data, exploring the relative advantage of CVB under different document lengths, topic sparsities, and numbers of topics. The consequences of our theory lead to practical guidelines for choosing an appropriate variational algorithm.\n\n2 Posterior inference for latent Dirichlet allocation\n\nLatent Dirichlet allocation (LDA) is a model of an observed corpus of documents. Each document is a collection of m words $x_{1:m}$, where each word is from a fixed vocabulary χ of size N. The model parameters are k topics, $\\beta_1, \\ldots, \\beta_k$, each of which is a distribution on χ, and a k-vector $\\vec{\\alpha}$, which is the parameter to a Dirichlet over the (k-1)-simplex. The topic matrix β denotes the N × k matrix whose columns are the topic distributions.\nGiven the topic matrix and Dirichlet parameters, LDA assumes that each document arises from the following process. First, choose topic proportions $\\theta \\sim D(\\vec{\\alpha})$. Then, for each word choose a topic assignment $z_i \\sim \\theta$. Finally, choose the word $x_i \\sim \\beta_{z_i}$. This describes a joint probability distribution of the observed and latent variables $p(\\vec{x}, \\vec{z}, \\theta \\mid \\vec{\\alpha}, \\beta)$.\nAnalyzing data with LDA involves two tasks. 
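(Aside: the generative process just described can be simulated in a few lines. The sketch below is illustrative only, not the paper's code; the vocabulary size, topic count, and hyperparameter values are arbitrary choices.)

```python
import numpy as np

def generate_document(m, beta, alpha, rng):
    """Sample one document of m words from the LDA generative process.

    beta: (N, k) topic matrix whose columns are distributions over an
    N-word vocabulary; alpha: length-k Dirichlet parameter.
    """
    N, k = beta.shape
    theta = rng.dirichlet(alpha)           # topic proportions for the document
    z = rng.choice(k, size=m, p=theta)     # a topic assignment for each word
    x = np.array([rng.choice(N, p=beta[:, zi]) for zi in z])  # the words
    return x, z, theta

rng = np.random.default_rng(0)
k, N = 5, 50                               # illustrative sizes
beta = rng.dirichlet(np.full(N, 0.1), size=k).T   # k topics as columns
x, z, theta = generate_document(100, beta, np.full(k, 1.0 / k), rng)
```

Varying the Dirichlet parameter used to draw the topics trades off the sparse-versus-dense topic regimes that the analysis below turns out to depend on.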
In parameter estimation, we find the topics and Dirichlet parameters that maximize the likelihood of an observed corpus. In posterior inference, we fix the model and compute the posterior distribution of the latent structure that underlies a particular document. Here, we focus on posterior inference. (Parameter estimation crucially depends on posterior inference via the expectation-maximization algorithm.)\nGiven a document $\\vec{x}$, the posterior distribution of the latent variables is $p(\\theta, \\vec{z} \\mid \\vec{x}) = p(\\theta, \\vec{z}, \\vec{x}) / p(\\vec{x})$. This distribution is infeasible to compute exactly because of the difficulty in computing the normalizing constant, i.e., the marginal probability of the document,\n\n$p(\\vec{x}) = \\frac{\\Gamma(\\sum_z \\alpha_z)}{\\prod_z \\Gamma(\\alpha_z)} \\int \\Big( \\prod_z \\theta_z^{\\alpha_z - 1} \\Big) \\Big( \\sum_{\\vec{z}} \\prod_i \\beta_{z_i, x_i} \\theta_{z_i} \\Big) \\, d\\theta.$\n\nApproximating the posterior is equivalent to approximating the normalizing constant.\nVariational methods approximate an intractable posterior by finding the member of a simpler family of distributions that is closest to it, where closeness is measured by relative entropy. This is equivalent to minimizing Jensen's bound on the negative log probability of the data [23]. We will analyze two alternative variational methods.\n\nVariational inference for LDA In the variational inference algorithm for LDA introduced in [3] (VB), the posterior $p(\\theta, \\vec{z} \\mid \\vec{x})$ is approximated by a fully-factorized variational distribution\n\n$q(\\theta, \\vec{z} \\mid \\vec{\\gamma}, \\phi_{1:m}) = q(\\theta \\mid \\vec{\\gamma}) \\prod_i q(z_i \\mid \\phi_i).$\n\nHere $q(\\theta \\mid \\vec{\\gamma})$ is a Dirichlet distribution with parameters $\\vec{\\gamma}$, and each $q(z_i \\mid \\phi_i)$ is a multinomial distribution on the set of k topic indices. 
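Concretely, a draw from this factorized family costs one Dirichlet draw for theta and m independent multinomial draws for the topic assignments. A minimal sketch (illustrative names and parameter values, not the paper's implementation):

```python
import numpy as np

def sample_q(gamma, phi, rng):
    """Draw (theta, z) from q(theta | gamma) * prod_i q(z_i | phi_i).

    gamma: length-k Dirichlet parameter; phi: (m, k) array whose rows are
    the multinomial parameters phi_i over the k topic indices.
    """
    theta = rng.dirichlet(gamma)                     # q(theta | gamma)
    z = np.array([rng.choice(len(gamma), p=p) for p in phi])  # independent z_i
    return theta, z

rng = np.random.default_rng(1)
k, m = 4, 6                                          # illustrative sizes
gamma = np.ones(k)
phi = rng.dirichlet(np.ones(k), size=m)              # each row sums to one
theta, z = sample_q(gamma, phi, rng)
```

Note that the z_i are drawn independently of theta and of each other; that independence is exactly the restriction on this family discussed in the text.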
This family does not contain the true posterior. In the true posterior, the latent variables are dependent; in this family of distributions, they are independent [3].\nThe algorithm seeks to find the variational parameters that minimize the relative entropy between the true posterior and the approximation, $RE(q(\\theta, \\vec{z} \\mid \\vec{\\gamma}, \\phi_{1:m}) \\,\\|\\, p(\\theta, \\vec{z} \\mid \\vec{x}))$. This is equivalent to finding the optimal parameters $\\vec{\\gamma}^*, \\phi^*_{1:m}$ as follows:\n\n$(\\vec{\\gamma}^*, \\phi^*_{1:m}) = \\arg\\min_{\\vec{\\gamma}, \\phi_{1:m}} E_{q(\\theta, \\vec{z} \\mid \\vec{\\gamma}, \\phi_{1:m})} \\left[ \\log \\frac{q(\\theta, \\vec{z} \\mid \\vec{\\gamma}, \\phi_{1:m})}{p(\\theta, \\vec{z}, \\vec{x})} \\right].$\n\nThe expression minimized by $\\vec{\\gamma}^*, \\phi^*_{1:m}$ is also known as the variational free energy of $(\\vec{\\gamma}, \\phi_{1:m})$ and will be denoted by $F(\\vec{x}, \\vec{\\gamma}, \\phi_{1:m})$. Note that $F(\\vec{x}, \\vec{\\gamma}^*, \\phi^*_{1:m})$ is Jensen's bound on the negative log probability of $\\vec{x}$. The value of the objective function is a measure of the quality of the VB approximation. We denote this with\n\n$VB(\\vec{x}) \\triangleq \\min_{\\vec{\\gamma}, \\phi_{1:m}} F(\\vec{x}, \\vec{\\gamma}, \\phi_{1:m}).$ (1)\n\nCollapsed variational inference for LDA The collapsed variational inference algorithm (CVB) reformulates the LDA model by marginalizing out the topic proportions θ. 
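Marginalizing θ can be done in closed form: integrating the Dirichlet prior against the topic-assignment counts yields a Dirichlet-multinomial (Pólya) distribution over the assignments. The sketch below illustrates this standard identity (it is not code from the paper) and checks that the resulting probabilities sum to one over all assignments:

```python
import itertools
from math import lgamma, exp

def log_collapsed_pz(z, alpha):
    """log p(z | alpha) for LDA with the topic proportions theta integrated out.

    Uses the Dirichlet-multinomial identity:
    p(z) = Gamma(sum_j a_j) / Gamma(m + sum_j a_j)
           * prod_j Gamma(n_j + a_j) / Gamma(a_j),
    where n_j counts occurrences of topic j among the m assignments.
    """
    k, m = len(alpha), len(z)
    counts = [sum(1 for zi in z if zi == j) for j in range(k)]
    a0 = sum(alpha)
    return (lgamma(a0) - lgamma(m + a0)
            + sum(lgamma(counts[j] + alpha[j]) - lgamma(alpha[j]) for j in range(k)))

# sanity check: p(z | alpha) sums to 1 over all k^m assignments
alpha, m = [0.5, 1.0, 2.0], 4
total = sum(exp(log_collapsed_pz(z, alpha))
            for z in itertools.product(range(len(alpha)), repeat=m))
```

Because p(z) depends on z only through the counts, collapsing couples all of the assignments together, which is the source of both the tighter bound and the extra computational cost discussed below.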
This yields a formulation where the topic assignments z are fully dependent, but where the dimensionality of the latent space has been reduced.\nThe variational family in CVB is a fully-factorized product of multinomial distributions,\n\n$q(\\vec{z} \\mid \\phi_{1:m}) = \\prod_i q(z_i \\mid \\phi_i).$\n\nCVB finds the optimal variational parameters $\\phi^*_{1:m}$ as follows:\n\n$\\phi^*_{1:m} = \\arg\\min_{\\phi_{1:m}} E_{q(\\vec{z} \\mid \\phi_{1:m})} \\left[ \\log \\frac{q(\\vec{z} \\mid \\phi_{1:m})}{p(\\vec{z}, \\vec{x})} \\right].$\n\nIt approximates the negative log probability of $\\vec{x}$ with the collapsed variational free energy $F(\\vec{x}, \\phi_{1:m})$, which is the expression that $\\phi^*_{1:m}$ minimizes. Analogous to VB, CVB's performance is measured by\n\n$CVB(\\vec{x}) \\triangleq \\min_{\\phi_{1:m}} F(\\vec{x}, \\phi_{1:m}).$ (2)\n\nBoth $CVB(\\vec{x})$ and $VB(\\vec{x})$ approximate the negative log probability of $\\vec{x}$ by Jensen's inequality. It has been shown that $CVB(\\vec{x})$ will always be a better bound than $VB(\\vec{x})$ [21].\n\nEfficiency of the algorithms Both VB and CVB proceed by coordinate descent to reach a local minimum of their respective free energies. CVB achieves higher accuracy at the price of increased computation. Each coordinate update for VB requires $O(mk)$ time, where m is the length of a document and k is the number of topics. Each coordinate update for CVB requires $O(m^2 k)$ time. The CVB updates are prohibitive for large documents and, moreover, are numerically unstable. Both shortcomings are overcome in [21] by substituting exact computations with an efficient second-order Taylor approximation. This approximation, however, does not yield a proper bound.1 It is thus inappropriate for computing held-out probability, a typical measure of quality of a topic model. 
For such a quantity, an exact CVB implementation takes quadratic time.\n\n3 Relative performance of VB and CVB\n\nWe try to obtain a theoretical handle on the size of the advantage of CVB over VB, and how it is affected by the length of the document, the number of topics, and the structure of those topics. Our main result states that for sufficiently large documents, the difference in approximation quality decreases with document length and converges to a constant that depends on the number of topics.\n\n1The first-order Taylor approximation yields an upper bound, but it turns out to be too inaccurate. Such an estimate can yield bounds worse than those achieved by VB.\n\nTheorem 1. Consider any LDA model with k topics, and a document consisting of m words $x_1, \\ldots, x_m$, where m is sufficiently large. Recall that $VB(\\vec{x})$ and $CVB(\\vec{x})$, defined in (1) and (2), are the free energies measured by VB and CVB respectively. Then,\n\n$0 \\le [VB(\\vec{x}) - CVB(\\vec{x})] \\le O(k-1) + o(1)$ (3)\n\nfor this model. Here o(1) goes to 0 at least as fast as $\\sqrt{\\log m / m}$.\n\nA strength of Theorem 1 is that it holds for any document, and not necessarily one generated by an LDA model. In previous work on analyzing mean-field variational inference, [24] analyze the performance of VB for posterior inference in a Gaussian mixture model. Unlike the assumptions in Theorem 1, their analysis requires that the data be generated by a specific model.\nTopic models are often evaluated and compared by approximation of the per-word log probability. Concerning this quantity, the following corollary is immediate because the total free energy scales with the length of the document.\nCorollary 1. The per-word free energy change, as well as the percentage free energy change, between VB and CVB goes to zero with the length of the document.\n\nOur results are stated in log-space. 
The bound on the difference in free energy is equivalent to a bound on the ratio of the probabilities obtained by VB and CVB. Since the probability of a document falls exponentially fast with the number of words, the additive difference in the probability estimates of VB and CVB is again negligible for large documents.\nCorollary 2. For sufficiently long documents, the difference in probability estimates of CVB and VB decreases as cm−k for some constant c < 1 whose value depends on the model parameters β.\n\nThe upper bound in (3) is nearly tight. When all topics are uniform distributions, the difference in the free energy estimates is Ω(k) for long documents.\n\n3.1 Proof Sketch\n\nWe sketch the proof of Theorem 1. The full proof is in the supporting material. We first introduce some notation. We denote a vector with an arrow, like $\\vec{\\nu}$. All vectors have k real coordinates; $\\nu_j$ will denote its coordinates, with $j \\in [k] = \\{1, \\ldots, k\\}$. When iterating over indices in [k], we will use the variable j. To iterate from 1 to m we will use i.\nWe state three lemmas which are needed to prove (3). The left inequality in (3) follows from the fact that CVB optimizes over a larger family of distributions [21]. We concentrate on the right inequality. The first step is to carry out calculations similar to [24] to arrive at the following.\nLemma 1. Suppose $q(\\vec{z}) = \\prod_i q_i(z_i)$ is the optimal approximation to the posterior $p(\\vec{z} \\mid \\vec{x})$. Then,\n\n$VB(\\vec{x}) - CVB(\\vec{x}) \\le \\sum_j \\big( E_{q(\\vec{z})}[\\log \\Gamma(m_j + \\alpha_j)] - \\log \\Gamma(\\gamma_j + \\alpha_j) \\big),$ (4)\n\nwhere $\\gamma_j = \\sum_i q_i(Z_i = j)$ for all $j \\in [k]$, and $m_j$ is the number of occurrences of the topic j in $\\vec{z}$.\n\nNote that to analyze the term $E_{q(\\vec{z})}[\\log \\Gamma(m_j + \\alpha_j)]$ corresponding to a particular topic j, we need to consider only those positions i where $q_i(Z_i = j) \\ne 0$; we denote the number of such positions by $N_j$. The difficulty in analyzing arbitrary documents lies in working with the right-hand side of (4) without any prior knowledge about the $q_i$'s. This was overcome by the following lemma.\nLemma 2. Suppose $X_i$ is a Bernoulli random variable with probability $q_i$, for i = 1 to m. Let $f : \\mathbb{R} \\to \\mathbb{R}$ be convex, and $\\gamma \\in [0, m]$. Then the following optimization problem is solved when each $q_i = \\gamma/m$:\n\n$\\max_{q_1, \\ldots, q_m} E[f(X_1 + \\cdots + X_m)]$ subject to $q_i \\in [0, 1]$ and $q_1 + \\cdots + q_m = \\gamma.$\n\nAs an immediate corollary of the previous two lemmas and the fact that log Γ is convex, we get\n\n$VB(\\vec{x}) - CVB(\\vec{x}) \\le \\sum_j E[\\log \\Gamma(m_j + \\alpha_j)] - \\log \\Gamma(\\gamma_j + \\alpha_j),$\n\nwhere $m_j$ is now a Binomial random variable with probability $\\gamma_j/m$ and number of trials m. The last piece of the proof is the following concentration lemma.\nLemma 3. Let X be the number of heads in m coin tosses, each with probability q. We require $m > q^{-(2+o(1))}$. Let $a > 0$ be a constant. Then\n\n$0 \\le E[\\log \\Gamma(X + a)] - \\log \\Gamma(E[X + a]) \\le O(1-q) + o(1),$ (5)\n\nwhere $o(1) = O(\\sqrt{\\log m / m})$.\n\nThe requirement of $m > 1/q^{2+o(1)}$ is necessary, and translates to the condition that document lengths be greater than $(N_j/\\gamma_j)^{2+o(1)}$ for Theorem 1 to hold. This gives an implicit lower bound on the required length of a document which depends on the sparsity of the topics. (Sparse topics place their mass on few words, i.e., low entropy, and dense topics spread their mass on more words, i.e., high entropy.) When the vocabulary is large, dense topics require long documents for the theory to take effect. This is supported by our simulations.\n\n(a) Difference in total free energy estimates (b) Percentage difference in free energy estimates\n\nFigure 1: Results on synthetic text data. We sample k topics from a symmetric Dirichlet distribution with parameter $\\beta_{param}$. We sample 10 documents from LDA models with these topics. We consider prefixes of varying lengths for each document. For each prefix length, the VB and CVB free energies are averaged over the 10 documents. The curves obtained show how the advantage of CVB over VB changes with the length of a document, the number of topics, and the sparsity of the topics.\n\n4 Empirical results\n\nWe studied the results of this theory on synthetic and real text data. We implemented the algorithms described in [3] and [21]. While these algorithms are only guaranteed to find a local optimum of the objective, we aim to study whether our theorem about the global optimum is borne out in practice.\n\nSynthetic data The synthetic data was generated as follows. We first sampled k topics $\\beta_1, \\ldots, \\beta_k$ independently from a symmetric Dirichlet distribution with parameter $\\beta_{param}$. We then sampled a corpus of 10 documents, each of length 5000, from an LDA model with these topics and Dirichlet hyper-parameter 1/k. 
The vocabulary size was 10,000.\nFor each document, we considered sub-documents of the first m words, with lengths as small as 100. On each sub-document, we ran both VB and CVB initialized from a common point. For every sub-document length, the average converged value of the free energy was recorded for both algorithms. Thus, we obtained a trajectory representing how the advantage of CVB over VB changes with the number of words m.\nWe repeated this simulation with different values of k to reveal the dependence of this advantage on the number of topics. Moreover, we investigated the dependence of the advantage on topic sparsity: we repeated the experiment with three different values of the Dirichlet parameter $\\beta_{param}$ for the topic matrix. The topics become sparse rapidly as $\\beta_{param}$ decreases.\nThe results of this study are in Figure 1. We see similar trends across all data. The advantage decreases with document length m and increases with the number of topics k. The theory predicts that the difference in free energy converges to a constant, implying that the percentage advantage decays as O(1)/m. Figure 1 reveals this phenomenon. Moreover, the constant is estimated to be on the order of k, implying that the advantage is higher for more topics. Comparing the curves for different values of k reveals this fact. Finally, for denser topic models the performances of CVB and VB converge only for very long documents, as was discussed at the end of Section 3.1. When $\\beta_{param} = 0.1$, CVB retains its advantage even for documents 5000 words long.\n\nReal-world corpora We studied the relative performance of the algorithms on two text data sets. First, we examined 3800 abstracts from the ArXiv, an on-line repository of scientific pre-prints. We restricted attention to 5000 vocabulary terms, removing very frequent and very infrequent terms. Second, we examined 1000 full documents from the Yale Law Journal. 
Again, we used a vocabulary of 5000 terms. Each data set was split into a training and a test corpus. The ArXiv test corpus contained 2000 short documents. The Yale Law test corpus contained 200 documents of lengths between 1,000 and 10,000 words.\nFor each data set, we fit LDA models with different numbers of topics to the training corpus (k = 5, 10, 25, 50), and then evaluated the models on the held-out test set. In Figure 2, we plot the percentage difference of the per-word variational free energies achieved by CVB and VB as a function of document length and number of topics. We also plot the difference in the total free energy. As for the simulated data, the graphs match our theory; the percentage decrease in per-word free energy goes to zero with increasing document length, and the absolute difference approaches a constant. The difference is more pronounced as the number of topics increases.\nThe predicted trends occur even for short documents containing around a hundred words. Topics estimated from real-world data tend to be sparse. The issues seen with dense topics on simulated data are thus not relevant for real-world applications.\n\n5 Conclusion\n\nWe have provided a theoretical analysis of the relative performance of the two variational inference algorithms for LDA. We showed that the advantage of CVB decreases as document length increases, and increases with the number of topics and the density of the topic distributions. Our simulations on synthetic and real-world data empirically confirm our theoretical bounds and their consequences. Unlike previous analyses of variational methods, our theorem does not require that the observed data arise from the assumed model.\nSince the approximation to the likelihood based on CVB is more expensive to compute than for VB, this theory can inform our choice of a good variational approximation. Shorter documents and models with more topics lend themselves to analysis with CVB. 
Longer documents and models with fewer topics lend themselves to VB. One might use both, within the same data set, depending on the length of the document.\n\nFigure 2: Experiments with the two text data sets described in Section 4. We fit LDA models with numbers of topics equal to 5, 10, 25, 50, and evaluated the models on a held-out corpus. We plot the percentage difference of the per-word variational free energies achieved by CVB and VB as a function of document length. We also plot the difference in the total free energy. The percentage decrease in per-word free energy goes to zero with increasing document length, and the absolute difference approaches a constant. The difference is higher for larger k.\n\n(a) ArXiv data set (b) Yale Law data set\n\n[Figure 2 panels plot the percentage change in per-word free energy and the difference in total free energies against the number of words, smoothed by moving averages, for k = 5, 10, 25, 50.]\n\nIn one strain of future work, we will analyze the consequences of the approximate posterior inference algorithm on parameter estimation. Our results regarding the sparsity of topics indicate that CVB is a better algorithm early in the EM algorithm, when topics are dense, and that VB will be more efficient as the fitted topics become more sparse.\n\nReferences\n\n[1] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.\n[2] T. Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.\n[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. 
Journal of Machine Learning Research, 3:993–1022, 2003.\n[4] W. Buntine and A. Jakulin. Discrete component analysis. In Subspace, Latent Structure and Feature Selection. Springer, 2006.\n[5] M. Girolami and A. Kaban. Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles. In NIPS 16, pages 9–16. MIT Press, 2004.\n[6] H. Wallach. Topic modeling: Beyond bag of words. In Proceedings of the 23rd International Conference on Machine Learning, 2006.\n[7] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press, 2004.\n[8] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, University of Massachusetts, Amherst, 2004.\n[9] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. arXiv, May 2007.\n[10] D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW Conference, pages 173–182, 2006.\n[11] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Vision and Pattern Recognition, pages 524–531, 2005.\n[12] B. Russell, A. Efros, J. Sivic, W. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1605–1614, 2006.\n[13] S. Rogers, M. Girolami, C. Campbell, and R. Breitling. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2):143–156, 2005.\n[14] X. Wei and B. Croft. LDA-based document models for ad-hoc retrieval. 
In SIGIR, 2006.\n[15] D. Mimno and A. McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In Joint Conference on Digital Libraries, 2007.\n[16] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.\n[17] C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In NIPS 19, 2006.\n[18] D. Andrzejewski, A. Mulhern, B. Liblit, and X. Zhu. Statistical debugging using latent topic models. In European Conference on Machine Learning, 2007.\n[19] T. Griffiths and M. Steyvers. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006.\n[20] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002.\n[21] Y. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pages 1353–1360, 2006.\n[22] K. Kurihara, M. Welling, and Y. Teh. Collapsed variational Dirichlet process mixture models. 2007.\n[23] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.\n[24] K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.\n", "award": [], "sourceid": 434, "authors": [{"given_name": "Indraneel", "family_name": "Mukherjee", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}