{"title": "Rethinking LDA: Why Priors Matter", "book": "Advances in Neural Information Processing Systems", "page_first": 1973, "page_last": 1981, "abstract": "Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such smoothing parameters\" have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document-topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic-word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling.\"", "full_text": "Rethinking LDA: Why Priors Matter\n\nHanna M. Wallach David Mimno Andrew McCallum\n\nDepartment of Computer Science\n\nUniversity of Massachusetts Amherst\n\n{wallach,mimno,mccallum}@cs.umass.edu\n\nAmherst, MA 01003\n\nAbstract\n\nImplementations of topic models typically use symmetric Dirichlet priors with\n\ufb01xed concentration parameters, with the implicit assumption that such \u201csmoothing\nparameters\u201d have little practical effect. In this paper, we explore several classes\nof structured priors for topic models. We \ufb01nd that an asymmetric Dirichlet prior\nover the document\u2013topic distributions has substantial advantages over a symmet-\nric prior, while an asymmetric prior over the topic\u2013word distributions provides no\nreal bene\ufb01t. 
Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling.\n\n1 Introduction\n\nTopic models such as latent Dirichlet allocation (LDA) [3] have been recognized as useful tools for analyzing large, unstructured collections of documents. There is a significant body of work applying LDA to a wide variety of tasks, including analysis of news articles [14], study of the history of scientific ideas [2, 9], topic-based search interfaces (e.g., http://rexa.info/) and navigation tools for digital libraries [12]. In practice, users of topic models are typically faced with two immediate problems: first, extremely common words tend to dominate all topics; second, there is relatively little guidance available on how to set T, the number of topics, or studies regarding the effects of using a suboptimal setting for T. Standard practice is to remove \u201cstop words\u201d before modeling, using a manually constructed, corpus-specific stop word list, and to optimize T by either analyzing probabilities of held-out documents or resorting to a more complicated nonparametric model. Additionally, there has been relatively little work in the machine learning literature on the structure of the prior distributions used in LDA: most researchers simply use symmetric Dirichlet priors with heuristically set concentration parameters. Asuncion et al. 
[1] recently advocated inferring the concentration parameters of these symmetric Dirichlets from data, but to date there has been no rigorous scientific study of the priors used in LDA\u2014from the choice of prior (symmetric versus asymmetric Dirichlets) to the treatment of hyperparameters (optimize versus integrate out)\u2014and the effects of these modeling choices on the probability of held-out documents and, more importantly, the quality of inferred topics. In this paper, we demonstrate that practical implementation issues (handling stop words, setting the number of topics) and theoretical issues involving the structure of Dirichlet priors are intimately related.\n\nWe start by exploring the effects of classes of hierarchically structured Dirichlet priors over the document\u2013topic distributions and topic\u2013word distributions in LDA. Using MCMC simulations, we find that using an asymmetric, hierarchical Dirichlet prior over the document\u2013topic distributions and a symmetric Dirichlet prior over the topic\u2013word distributions results in significantly better model performance, measured both in terms of the probability of held-out documents and in the quality of inferred topics. Although this hierarchical Bayesian treatment of LDA produces good results, it is computationally intensive. We therefore demonstrate that optimizing the hyperparameters of asymmetric, nonhierarchical Dirichlets as part of an iterative inference algorithm results in similar performance to the full Bayesian model while adding negligible computational cost beyond standard inference techniques. Finally, we show that using optimized Dirichlet hyperparameters results in dramatically improved consistency in topic usage as T is increased. 
By decreasing the sensitivity of the model to the number of topics, hyperparameter optimization results in robust, data-driven models with substantially less model complexity and computational cost than nonparametric models. Since the priors we advocate (an asymmetric Dirichlet over the document\u2013topic distributions and a symmetric Dirichlet over the topic\u2013word distributions) have significant modeling benefits and can be implemented using highly efficient algorithms, we recommend them as a new standard for LDA.\n\n2 Latent Dirichlet Allocation\n\nLDA is a generative topic model for documents W = {w(1), w(2), . . . , w(D)}. A \u201ctopic\u201d t is a discrete distribution over words with probability vector \u03c6_t. A Dirichlet prior is placed over \u03a6 = {\u03c6_1, . . . , \u03c6_T}. In almost all previous work on LDA, this prior is assumed to be symmetric (i.e., the base measure is fixed to a uniform distribution over words) with concentration parameter \u03b2:\n\nP(\u03a6) = \u220f_t Dir(\u03c6_t; \u03b2u) = \u220f_t [\u0393(\u03b2) / \u220f_w \u0393(\u03b2/W)] \u220f_w \u03c6_{w|t}^{\u03b2/W \u2212 1} \u03b4(\u2211_w \u03c6_{w|t} \u2212 1).   (1)\n\nEach document, indexed by d, has a document-specific distribution over topics \u03b8_d. The prior over \u0398 = {\u03b8_1, . . . , \u03b8_D} is also assumed to be a symmetric Dirichlet, this time with concentration parameter \u03b1. The tokens in every document w(d) = {w(d)_n}_{n=1}^{Nd} are associated with corresponding topic assignments z(d) = {z(d)_n}_{n=1}^{Nd}, drawn i.i.d. from the document-specific distribution over topics, while the tokens are drawn i.i.d. from the topics\u2019 distributions over words \u03a6 = {\u03c6_1, . . . , \u03c6_T}:\n\nP(z(d) | \u03b8_d) = \u220f_n \u03b8_{z(d)_n|d}   and   P(w(d) | z(d), \u03a6) = \u220f_n \u03c6_{w(d)_n|z(d)_n}.   (2)\n\nDirichlet\u2013multinomial conjugacy allows \u0398 and \u03a6 to be marginalized out. For real-world data, documents W are observed, while the corresponding topic assignments Z are unobserved. Variational methods [3, 16] and MCMC methods [7] are both effective at inferring the latent topic assignments Z. Asuncion et al. [1] demonstrated that the choice of inference method has negligible effect on the probability of held-out documents or inferred topics. We use MCMC methods throughout this paper\u2014specifically Gibbs sampling [5]\u2014since the internal structure of hierarchical Dirichlet priors is typically inferred using a Gibbs sampling algorithm, which can be easily interleaved with Gibbs updates for Z given W. The latter is accomplished by sequentially resampling each topic assignment z(d)_n from its conditional posterior given W, \u03b1u, \u03b2u and Z\\d,n (the current topic assignments for all tokens other than the token at position n in document d):\n\nP(z(d)_n | W, Z\\d,n, \u03b1u, \u03b2u) \u221d P(w(d)_n | z(d)_n, W\\d,n, Z\\d,n, \u03b2u) P(z(d)_n | Z\\d,n, \u03b1u) \u221d [(N\\d,n_{w(d)_n|z(d)_n} + \u03b2/W) / (N\\d,n_{z(d)_n} + \u03b2)] \u00d7 [(N\\d,n_{z(d)_n|d} + \u03b1/T) / (N_d \u2212 1 + \u03b1)],   (3)\n\nwhere sub- or super-script \u201c\\d,n\u201d denotes a quantity excluding data from position n in document d.\n\n3 Priors for LDA\n\nThe previous section outlined LDA as it is most commonly used\u2014namely with symmetric Dirichlet priors over \u0398 and \u03a6 with fixed concentration parameters \u03b1 and \u03b2, respectively. 
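The collapsed Gibbs update in (3) is simple to implement once the four count arrays are maintained. The following sketch is ours, not the authors' implementation, and all variable names are assumptions; it resamples a single token's topic assignment under symmetric priors with concentrations alpha and beta:

```python
import numpy as np

def resample_token(d, n, w, z, N_wt, N_t, N_td, N_d, alpha, beta, rng):
    """Resample one topic assignment z[d][n] from its conditional posterior
    (equation 3), under symmetric Dirichlet priors.

    N_wt: W x T word-topic counts, N_t: length-T topic totals,
    N_td: T x D document-topic counts, N_d: length-D document lengths."""
    W, T = N_wt.shape
    t_old = z[d][n]
    # Exclude this token from the counts (the "\d,n" quantities).
    N_wt[w, t_old] -= 1
    N_t[t_old] -= 1
    N_td[t_old, d] -= 1
    # Unnormalized conditional posterior, vectorized over all T topics.
    p = ((N_wt[w, :] + beta / W) / (N_t + beta)) \
        * ((N_td[:, d] + alpha / T) / (N_d[d] - 1 + alpha))
    t_new = rng.choice(T, p=p / p.sum())
    # Restore the counts under the newly sampled assignment.
    N_wt[w, t_new] += 1
    N_t[t_new] += 1
    N_td[t_new, d] += 1
    z[d][n] = t_new
    return t_new
```

Sweeping this update over every token position, interleaved with any hyperparameter sampling, gives the Gibbs sampler used throughout the experiments below.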
The simplest way to vary this choice of prior for either \u0398 or \u03a6 is to infer the relevant concentration parameter from data, either by computing a MAP estimate [1] or by using an MCMC algorithm such as slice sampling [13]. A broad Gamma distribution is an appropriate choice of prior for both \u03b1 and \u03b2.\n\nFigure 1: (a)\u2013(e): LDA with (a) symmetric Dirichlet priors over \u0398 and \u03a6, (b) a symmetric Dirichlet prior over \u0398 and an asymmetric Dirichlet prior over \u03a6, (d) an asymmetric Dirichlet prior over \u0398 and a symmetric Dirichlet prior over \u03a6, (e) asymmetric Dirichlet priors over \u0398 and \u03a6. (c) Generating {z(d)_n}_{n=1}^{4} = (t, t', t, t) from the asymmetric predictive distribution for document d; (f) generating {z(d)_n}_{n=1}^{4} = (t, t', t, t) and {z(d')_n}_{n=1}^{4} = (t', t', t', t') from the asymmetric, hierarchical predictive distributions for documents d and d', respectively.\n\nAlternatively, the uniform base measures in the Dirichlet priors over \u0398 and \u03a6 can be replaced with nonuniform base measures m and n, respectively. Throughout this section we use the prior over \u0398 as a running example; however, the same construction and arguments also apply to the prior over \u03a6. In section 3.1, we describe the effects on the document-specific conditional posterior distributions, or predictive distributions, of replacing u with a fixed asymmetric (i.e., nonuniform) base measure m. 
In section 3.2, we then treat m as unknown, and take a fully Bayesian approach, giving m a Dirichlet prior (with a uniform base measure and concentration parameter \u03b1') and integrating it out.\n\n3.1 Asymmetric Dirichlet Priors\n\nIf \u0398 is given an asymmetric Dirichlet prior with concentration parameter \u03b1 and a known (nonuniform) base measure m, the predictive probability of topic t occurring in document d given Z is\n\nP(z(d)_{Nd+1} = t | Z, \u03b1m) = \u222b d\u03b8_d P(t | \u03b8_d) P(\u03b8_d | Z, \u03b1m) = (N_{t|d} + \u03b1 m_t) / (N_d + \u03b1).   (4)\n\nIf topic t does not occur in z(d), then N_{t|d} will be zero, and the probability of generating z(d)_{Nd+1} = t will be m_t. In other words, under an asymmetric prior, N_{t|d} is smoothed with a topic-specific quantity \u03b1 m_t. Consequently, different topics can be a priori more or less probable in all documents.\n\nOne way of describing the process of generating from (4) is to say that generating a topic assignment z(d)_n is equivalent to setting the value of z(d)_n to the value of some document-specific draw from m. While this interpretation provides little benefit in the case of fixed m, it is useful for describing the effects of marginalizing over m on the predictive distributions (see section 3.2). Figure 1c depicts the process of drawing {z(d)_n}_{n=1}^{4} using this interpretation. When drawing z(d)_1, there are no existing document-specific draws from m, so a new draw \u03b3_1 must be generated, and z(d)_1 is assigned the value of this draw (t in figure 1c). Next, z(d)_2 is drawn by either selecting \u03b3_1, with probability proportional to the number of topic assignments that have been previously \u201cmatched\u201d to \u03b3_1, or a new draw from m, with probability proportional to \u03b1. In figure 1c, a new draw is selected, so \u03b3_2 is drawn from m and z(d)_2 is assigned its value, in this case t'. The next topic assignment z(d)_3 is drawn in the same way: existing draws \u03b3_1 and \u03b3_2 are selected with probabilities proportional to the numbers of topic assignments to which they have previously been matched, while with probability proportional to \u03b1, z(d)_3 is matched to a new draw from m. In figure 1c, \u03b3_1 is selected and z(d)_3 is assigned the value of \u03b3_1. In general, the probability of a new topic assignment being assigned the value of an existing document-specific draw \u03b3_i from m is proportional to N(i)_d, the number of topic assignments previously matched to \u03b3_i. The predictive probability of topic t in document d is therefore\n\nP(z(d)_{Nd+1} = t | Z, \u03b1m) = [\u2211_{i=1}^{I} N(i)_d \u03b4(\u03b3_i \u2212 t) + \u03b1 m_t] / (N_d + \u03b1),   (5)\n\nwhere I is the current number of draws from m for document d. Since every topic assignment is matched to a draw from m, \u2211_{i=1}^{I} N(i)_d \u03b4(\u03b3_i \u2212 t) = N_{t|d}. Consequently, (4) and (5) are equivalent.\n\n3.2 Integrating out m\n\nIn practice, the base measure m is not fixed a priori and must therefore be treated as an unknown quantity. We take a fully Bayesian approach, and give m a symmetric Dirichlet prior with concentration parameter \u03b1' (as shown in figures 1d and 1e). 
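For intuition before integrating out m, the fixed-m predictive distribution (4) is trivial to compute directly; a minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def predictive_topic_probs(N_td_col, alpha, m):
    """Predictive distribution over topics for one document, equation (4):
    P(z = t | Z, alpha * m) = (N_{t|d} + alpha * m_t) / (N_d + alpha).
    With a uniform base measure m this reduces to the usual alpha/T smoothing."""
    N_d = N_td_col.sum()
    return (N_td_col + alpha * m) / (N_d + alpha)
```

For an empty document the predictive distribution is exactly m, so a nonuniform base measure makes some topics a priori more probable in every document, which is the behavior exploited throughout this section.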
This prior over m induces a hierarchical Dirichlet prior over \u0398. Furthermore, Dirichlet\u2013multinomial conjugacy then allows m to be integrated out. Giving m a symmetric Dirichlet prior and integrating it out has the effect of replacing m in (5) with a \u201cglobal\u201d P\u00f3lya conditional distribution, shared by the document-specific predictive distributions. Figure 1f depicts the process of drawing eight topic assignments\u2014four for document d and four for document d'. As before, when a topic assignment is drawn from the predictive distribution for document d, it is assigned the value of an existing (document-specific) internal draw \u03b3_i with probability proportional to the number of topic assignments previously matched to that draw, and the value of a new draw \u03b3_{i'} with probability proportional to \u03b1. However, since m has been integrated out, the new draw must be obtained from the \u201cglobal\u201d distribution. At this level, \u03b3_{i'} is treated as if it were a topic assignment, and assigned the value of an existing global draw \u03b3_j with probability proportional to the number of document-level draws previously matched to \u03b3_j, and the value of a new global draw, from u, with probability proportional to \u03b1'. Since the internal draws at the document level are treated as topic assignments at the global level, there is a path from every topic assignment to u, via the internal draws. The predictive probability of topic t in document d given Z is now\n\nP(z(d)_{Nd+1} = t | Z, \u03b1, \u03b1'u) = \u222b dm P(z(d)_{Nd+1} = t | Z, \u03b1m) P(m | Z, \u03b1'u) = [N_{t|d} + \u03b1 (\u02c6N_t + \u03b1'/T) / (\u2211_t \u02c6N_t + \u03b1')] / (N_d + \u03b1),   (6)\n\nwhere I and J are the current numbers of document-level and global internal draws, respectively, N_{t|d} = \u2211_{i=1}^{I} N(i)_d \u03b4(\u03b3_i \u2212 t) as before, and \u02c6N_t = \u2211_{j=1}^{J} N(j) \u03b4(\u03b3_j \u2212 t). The quantity N(j) is the total number of document-level internal draws matched to global internal draw \u03b3_j. Since some topic assignments will be matched to existing document-level draws, \u2211_d \u03b4(N_{t|d} > 0) \u2264 \u02c6N_t \u2264 N_t, where \u2211_d \u03b4(N_{t|d} > 0) is the number of unique documents in Z in which topic t occurs.\n\nAn important property of (6) is that if concentration parameter \u03b1' is large relative to \u2211_t \u02c6N_t, then the counts \u02c6N_t and \u2211_t \u02c6N_t are effectively ignored. In other words, as \u03b1' \u2192 \u221e the hierarchical, asymmetric Dirichlet prior approaches a symmetric Dirichlet prior with concentration parameter \u03b1.\n\nFor any given Z for real-world documents W, the internal draws and the paths from Z to u are unknown. Only the value of each topic assignment is known, and hence N_{t|d} for each topic t and document d. In order to compute the conditional posterior distribution for each topic assignment (needed to resample Z) it is necessary to infer \u02c6N_t for each topic t. These values can be inferred by Gibbs sampling the paths from Z to u [4, 15]. Resampling the paths from Z to u can be interleaved with resampling Z itself. Removing z(d)_n = t from the model prior to resampling its value consists of decrementing N_{t|d} and removing its current path to u. Similarly, adding a newly sampled value z(d)_n = t' into the model consists of incrementing N_{t'|d} and sampling a new path from z(d)_n to u.\n\n4 Comparing Priors for LDA\n\nTo investigate the effects of the priors over \u0398 and \u03a6, we compared the four combinations of symmetric and asymmetric Dirichlets shown in figure 1: symmetric priors over both \u0398 and \u03a6 (denoted SS), a symmetric prior over \u0398 and an asymmetric prior over \u03a6 (denoted SA), an asymmetric prior over \u0398 and a symmetric prior over \u03a6 (denoted AS), and asymmetric priors over both \u0398 and \u03a6 (denoted AA).\n\nFigure 2: (a) log P(W, Z | \u2126) (patent abstracts) for SS, SA, AS and AA, computed every 20 iterations and averaged over 5 Gibbs sampling runs. AS (red) and AA (black) perform similarly and converge to higher values of log P(W, Z | \u2126) than SS (blue) and SA (green). (b) Histograms of 4000 (iterations 1000\u20135000) concentration parameter values for AA (patent abstracts). Note the log scale for \u03b2': the prior over \u03a6 approaches a symmetric Dirichlet, making AA equivalent to AS. (c) log P(W, Z | \u2126) for all three data sets at T = 50. AS is consistently better than SS. SA is poor (not shown). AA is capable of matching AS, but does not always.\n\nTable 1: Data set statistics. D is the number of documents, \u00afNd is the mean document length, N is the number of tokens, W is the vocabulary size. \u201cStop\u201d indicates whether stop words were present (yes) or not (no).\n\nData set | D | \u00afNd | N | W | Stop\nPatent abstracts | 1016 | 101.87 | 103499 | 6068 | yes\n20 Newsgroups | 540 | 148.17 | 80012 | 14492 | no\nNYT articles | 1768 | 270.06 | 477465 | 41961 | no\n\n
Each combination was used to model three collections of documents: patent abstracts about carbon nanotechnology, New York Times articles, and 20 Newsgroups postings. Due to the computationally intensive nature of the fully Bayesian inference procedure, only a subset of each collection was used (see table 1). In order to stress each combination of priors with respect to skewed distributions over word frequencies, stop words were not removed from the patent abstracts.\n\nThe four models (SS, SA, AS, AA) were implemented in Java, with integrated-out base measures, where appropriate. Each model was run with T \u2208 {25, 50, 75, 100} for five runs of 5000 Gibbs sampling iterations, using different random initializations. The concentration parameters for each model (denoted by \u2126) were given broad Gamma priors and inferred using slice sampling [13]. During inference, log P(W, Z | \u2126) was recorded every twenty iterations. These values, averaged over the five runs for T = 50, are shown in figure 2a. (Results for other values of T are similar.) There are two distinct patterns: models with an asymmetric prior over \u0398 (AS and AA; red and black, respectively) perform very similarly, while models with a symmetric prior over \u0398 (SS and SA; blue and green, respectively) also perform similarly, with significantly worse performance than AS and AA. Results for all three data sets are summarized in figure 2c, with the log probability divided by the number of tokens in the collection. SA performs extremely poorly on NYT and 20 Newsgroups, and is therefore not shown. AS consistently achieves better likelihood than SS. The fully asymmetric model, AA, is inconsistent, matching AS on the patents and 20 Newsgroups but doing poorly on NYT. 
This is most likely due to the fact that although AA can match AS, it has many more degrees of freedom and therefore a much larger space of possibilities to explore.\n\nWe also calculated the probability of held-out documents using the \u201cleft-to-right\u201d evaluation method described by Wallach et al. [17]. These results are shown in figure 3a, and exhibit a similar pattern to the results in figure 2a\u2014the best-performing models are those with an asymmetric prior over \u0398.\n\nFigure 3: (a) Log probability of held-out documents (patent abstracts). These results mirror those in figure 2a. AS (red) and AA (black) again perform similarly, while SS (blue) and SA (green) are also similar, but exhibit much worse performance. (b) \u03b1m_t values and the most probable words for topics obtained with T = 50. For each model, topics were ranked according to usage and the topics at ranks 1, 5, 10, 20 and 30 are shown. AS and AA are robust to skewed word frequency distributions and tend to sequester stop words in their own topics.\n\nWe can gain intuition about the similarity between AS and AA by examining the values of the sampled concentration parameters. As explained in section 3.2, as \u03b1' or \u03b2' grows large relative to \u2211_t \u02c6N_t or \u2211_w \u02c6N_w, an asymmetric Dirichlet prior approaches a symmetric Dirichlet with concentration parameter \u03b1 or \u03b2. Histograms of 4000 concentration parameter values (from iterations 1000\u20135000) from the five Gibbs runs of AA with T = 50 are shown in figure 2b. The values for \u03b1, \u03b1' and \u03b2 are all relatively small, while the values for \u03b2' are extremely large, with a median around exp 30. In other words, given the values of \u03b2', the prior over \u03a6 is effectively a symmetric prior over \u03a6 with concentration parameter \u03b2. These results demonstrate that even when the model can use an asymmetric prior over \u03a6, a symmetric prior gives better performance. We therefore advocate using model AS.\n\nIt is worth noting the robustness of AS to stop words. Unlike SS and SA, AS effectively sequesters stop words in a small number of more frequently used topics. The remaining topics are relatively unaffected by stop words. Creating corpus-specific stop word lists is seen as an unpleasant but necessary chore in topic modeling. Also, for many specialized corpora, once standard stop words have been removed, there are still other words that occur with very high probability, such as \u201cmodel,\u201d \u201cdata,\u201d and \u201cresults\u201d in machine learning literature, but are not technically stop words. If LDA cannot handle such words in an appropriate fashion then they must be treated as stop words and removed, despite the fact that they play meaningful semantic roles. The robustness of AS to stop words has implications for HMM-LDA [8], which models stop words using a hidden Markov model and \u201ccontent\u201d words using LDA, at considerable computational cost. AS achieves the same robustness to stop words much more efficiently. 
Although there is empirical evidence that topic models that use asymmetric Dirichlet priors with optimized hyperparameters, such as Pachinko allocation [10] and Wallach\u2019s topic-based language model [18], are robust to the presence of extremely common words, these studies did not establish whether the robustness was a function of a more complicated model structure or if careful consideration of hyperparameters alone was sufficient. We demonstrate that AS is capable of learning meaningful topics even with no stop word removal. For efficiency, we do not necessarily advocate doing away with stop word lists entirely, but we argue that using an asymmetric prior over \u0398 allows practitioners to use a standard, conservative list of determiners, prepositions and conjunctions that is applicable to any document collection in a given language, rather than hand-curated corpus-specific lists that risk removing common but meaningful terms.\n\n5 Efficiency: Optimizing rather than Integrating Out\n\nInference in the full Bayesian formulation of AS is expensive because of the additional complexity in sampling the paths from Z to u and maintaining hierarchical data structures. It is possible to retain the theoretical and practical advantages of using AS without sacrificing the advantages of simple, efficient models by directly optimizing m, rather than integrating it out. The concentration parameters \u03b1 and \u03b2 may also be optimized (along with m for \u03b1 and by itself for \u03b2). In this section, we therefore compare the fully Bayesian version of AS with optimized AS, using SS as a baseline. Wallach [19] compared several methods for jointly estimating the maximum likelihood concentration parameter and asymmetric base measure for a Dirichlet\u2013multinomial model. We use the most efficient of these methods. 
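As one concrete instance of this kind of procedure (a sketch under our own assumptions, not the exact update from [19]), the classic fixed-point iteration for Dirichlet-multinomial maximum likelihood jointly recovers the concentration parameter and base measure from the topic-count matrix:

```python
import numpy as np

def digamma(x):
    """Vectorized digamma via the recurrence psi(x) = psi(x + 1) - 1/x
    and a standard asymptotic series for large arguments."""
    x = np.atleast_1d(np.asarray(x, dtype=float)).copy()
    out = np.zeros_like(x)
    for _ in range(10):                 # shift all arguments above 6
        mask = x < 6
        if not mask.any():
            break
        out[mask] -= 1.0 / x[mask]
        x[mask] += 1.0
    inv2 = 1.0 / (x * x)
    out += np.log(x) - 0.5 / x - inv2 * (1/12. - inv2 * (1/120. - inv2 / 252.))
    return out

def dirichlet_multinomial_mle(N_td, n_iter=200):
    """Minka-style fixed-point iteration for the maximum-likelihood
    parameters a_t = alpha * m_t of a Dirichlet-multinomial, given a
    T x D matrix of topic counts (one column per document).  Returns
    the vector a; then alpha = a.sum() and m = a / a.sum()."""
    T, D = N_td.shape
    N_d = N_td.sum(axis=0)              # document lengths
    a = np.full(T, 1.0 / T)             # initial guess
    for _ in range(n_iter):
        num = (digamma(N_td + a[:, None]) - digamma(a[:, None])).sum(axis=1)
        den = (digamma(N_d + a.sum()) - digamma(a.sum())).sum()
        a = np.maximum(a * num / den, 1e-10)  # floor keeps digamma finite
    return a
```

Interleaving a few such optimization steps with Gibbs sweeps over Z is what makes the optimized model (ASO, below) so much cheaper than the fully hierarchical sampler.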
The advantage of optimizing m is considerable: although it is likely that further optimizations would reduce the difference, 5000 Gibbs sampling iterations (including sampling \u03b1, \u03b1' and \u03b2) for the patent abstracts using fully Bayesian AS with T = 25 took over four hours, while 5000 Gibbs sampling iterations (including hyperparameter optimization) took under 30 minutes.\n\nTable 2: log P(W, Z | \u2126) / N for T = 50 (left) and log P(W_test | W, Z, \u2126) / N_test for varying values of T (right) for the patent abstracts. AS and ASO (optimized hyperparameters) consistently outperform SS except for ASO with T = 25. Differences between AS and ASO are inconsistent and within standard deviations.\n\nModel | Patents | NYT | 20 NG\nASO | -6.65 \u00b1 0.04 | -9.24 \u00b1 0.01 | -8.27 \u00b1 0.01\nAS | -6.62 \u00b1 0.03 | -9.23 \u00b1 0.01 | -8.28 \u00b1 0.01\nSS | -6.91 \u00b1 0.01 | -9.26 \u00b1 0.01 | -8.31 \u00b1 0.01\n\nModel | T=25 | T=50 | T=75 | T=100\nASO | -6.18 | -6.12 | -6.12 | -6.08\nAS | -6.15 | -6.13 | -6.11 | -6.10\nSS | -6.18 | -6.18 | -6.16 | -6.13\n\nTable 3: Average VI distances between multiple runs of each model with T = 50 on (left) patent abstracts and (right) 20 newsgroups. ASO partitions are approximately as similar to AS partitions as they are to other ASO partitions. ASO and AS partitions are both further from SS partitions, which tend to be more dispersed.\n\nPatent abstracts | ASO | AS | SS\nASO | 3.36 \u00b1 0.03 | 3.43 \u00b1 0.05 | 3.50 \u00b1 0.07\nAS | \u2014 | 3.36 \u00b1 0.02 | 3.56 \u00b1 0.07\nSS | \u2014 | \u2014 | 3.49 \u00b1 0.04\n\n20 Newsgroups | ASO | AS | SS\nASO | 4.37 \u00b1 0.08 | 4.34 \u00b1 0.09 | 5.43 \u00b1 0.05\nAS | \u2014 | 4.18 \u00b1 0.09 | 5.39 \u00b1 0.06\nSS | \u2014 | \u2014 | 5.93 \u00b1 0.03\n\nIn order to establish that optimizing m is a good approximation to integrating it out, we computed log P(W, Z | \u2126) and the log probability of held-out documents for fully Bayesian AS, optimized AS (denoted ASO) and, as a baseline, SS (see table 2). AS and ASO consistently outperformed SS, except for ASO when T = 25. Since twenty-five is a very small number of topics, this is not a cause for concern. Differences between AS and ASO are inconsistent and within standard deviations. From the point of view of log probabilities, ASO therefore provides a good approximation to AS.\n\nWe can also compare topic assignments. Any set of topic assignments can be thought of as a partition of the corresponding tokens into T topics. In order to measure similarity between two sets of topic assignments Z and Z' for W, we can compute the distance between these partitions using variation of information (VI) [11, 6] (see suppl. mat. for a definition of VI for topic models). 
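Computed from the joint contingency table of topic labels, VI is direct to implement; the following sketch is ours (the paper's formal definition is in its supplementary material):

```python
import numpy as np

def variation_of_information(z1, z2):
    """VI between two partitions of the same tokens, given as integer
    topic-label arrays: VI = H(Z) + H(Z') - 2 I(Z; Z'), computed from
    the joint label contingency table."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    joint = np.zeros((z1.max() + 1, z2.max() + 1))
    np.add.at(joint, (z1, z2), 1.0)     # contingency counts in O(N)
    p = joint / z1.size

    def entropy(q):
        q = q[q > 0]
        return float(-(q * np.log(q)).sum())

    h1, h2 = entropy(p.sum(axis=1)), entropy(p.sum(axis=0))
    mi = h1 + h2 - entropy(p.ravel())   # mutual information
    return h1 + h2 - 2.0 * mi
```

VI is zero for identical partitions and is unchanged under any relabeling of the topics, which is exactly what is needed when comparing Gibbs runs whose topic indices carry no intrinsic meaning.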
VI has several attractive properties: it is a proper distance metric, it is invariant to permutations of the topic labels, and it can be computed in O(N + T T') time, i.e., time that is linear in the number of tokens and the product of the numbers of topics in Z and Z'. For each model (AS, ASO and SS), we calculated the average VI distance between all 10 unique pairs of topic assignments from the 5 Gibbs runs for that model, giving a measure of within-model consistency. We also calculated the between-model VI distance for each pair of models, averaged over all 25 unique pairs of topic assignments for that pair. Table 3 indicates that ASO partitions are approximately as similar to AS partitions as they are to other ASO partitions. ASO and AS partitions are both further away from SS partitions, which tend to be more dispersed. These results confirm that ASO is indeed a good approximation to AS.\n\n6 Effect on Selecting the Number of Topics\n\nSelecting the number of topics T is one of the most problematic modeling choices in finite topic modeling. Not only is there no clear method for choosing T (other than evaluating the probability of held-out data for various values of T), but the degree to which LDA is robust to a poor setting of T is not well-understood. Although nonparametric models provide an alternative, they lose the substantial computational efficiency advantages of finite models. We explore whether the combination of priors advocated in the previous sections (model AS) can improve the stability of LDA to different values of T, while retaining the static memory management and simple inference algorithms of finite models. Ideally, if LDA has sufficient topics to model W well, the assignments of tokens to topics should be relatively invariant to an increase in T\u2014i.e., the additional topics should be seldom used. 
For example, if ten topics is sufficient to accurately model the data, then increasing the number of topics to twenty should not significantly affect inferred topic assignments. If this is the case, then using large T should not have a significant impact on either Z or the speed of inference, especially as recently introduced sparse sampling methods allow models with large T to be trained efficiently [20]. Figure 4a shows the average VI distance between topic assignments (for the patent abstracts) inferred by models with T = 25 and models with T ∈ {50, 75, 100}. AS and AA, the bottom two lines, are much more stable (smaller average VI distances) than SS and SA at 50 topics and remain so as T increases: even at 100 topics, AS has a smaller VI distance to a 25 topic model than SS at 50 topics.

Figure 4: (a) Topic consistency measured by average VI distance from models with T = 25. As T increases, AS (red) and AA (black) produce Zs that stay significantly closer to those obtained with T = 25 than SA (green) and SS (blue). (b) Assignments of tokens (patent abstracts) allocated to the largest topic in a 25 topic model, as T increases. For AS, the topic is relatively intact, even at T = 100: 80% of the tokens assigned to the topic at T = 25 are assigned to seven topics. For SS, the topic has been subdivided across many more topics.

Figure 4b provides intuition for this difference: for AS, the tokens assigned to the largest topic at T = 25 remain within a small number of topics as T is increased, while for SS, topic usage is more uniform and increasing T causes the tokens to be divided among many more topics. These results suggest that for AS, new topics effectively "nibble away" at existing topics, rather than splitting them more uniformly.
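The per-topic breakdown behind Figure 4b is simple to reproduce. The sketch below (a hypothetical helper, not the authors' code) returns, for the tokens assigned to one topic under a small-T model, the fraction landing in each topic of a larger-T model:

```python
from collections import Counter

def topic_dispersion(z_small, z_large, topic):
    """Fractions of `topic`'s tokens (under the small-T assignment)
    that land in each topic of the large-T assignment, sorted in
    descending order. A concentrated result means the topic stayed
    intact as T grew; a flat result means it was subdivided."""
    dest = Counter(zl for zs, zl in zip(z_small, z_large) if zs == topic)
    total = sum(dest.values())
    return [c / total for _, c in dest.most_common()]
```

Under the "nibbling" behavior observed for AS, most of the mass stays in a handful of topics; under SS, it spreads across many.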
We therefore argue that the risk of using too many topics is lower than the risk of using too few, and that practitioners should be comfortable using larger values of T.

7 Discussion

The previous sections demonstrated that AS achieves the best performance of the four prior combinations (AA, SA, SS and AS), measured in several ways. However, it is worth examining why this combination of priors results in superior performance. The primary assumption underlying topic modeling is that a topic should capture semantically-related word co-occurrences. Topics must also be distinct in order to convey information: knowing only a few co-occurring words should be sufficient to resolve semantic ambiguities. A priori, we therefore do not expect that a particular topic's distribution over words will be like that of any other topic. An asymmetric prior over Φ is therefore a bad idea: the base measure will reflect corpus-wide word usage statistics, and a priori, all topics will exhibit those statistics too. A symmetric prior over Φ only makes a prior statement (determined by the concentration parameter β) about whether topics will have more sparse or more uniform distributions over words, so the topics are free to be as distinct and specialized as is necessary. However, it is still necessary to account for power-law word usage. A natural way of doing this is to expect that certain groups of words will occur more frequently than others in every document in a given corpus. For example, the words "model," "data," and "algorithm" are likely to appear in every paper published in a machine learning conference. These assumptions lead naturally to the combination of priors that we have empirically identified as superior: an asymmetric Dirichlet prior over Θ that serves to share commonalities across documents and a symmetric Dirichlet prior over Φ that serves to avoid conflicts between topics.
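Concretely, this prior combination changes only the document–topic term of the standard collapsed Gibbs sampler for LDA: with an asymmetric Dirichlet(α m) prior over Θ and a symmetric Dirichlet(β) prior over Φ, the conditional probability of assigning a token of word type w in document d to topic t is proportional to (n_dt + α m_t)(n_tw + β)/(n_t + βV). The sketch below (illustrative, corresponding to the ASO setting with a fixed, already-optimized base measure m; variable names are not from the paper's implementation) performs one sweep over a single document:

```python
import random

def gibbs_step(doc_words, z, n_dt, n_tw, n_t, alpha_m, beta, V, rng):
    """One collapsed Gibbs sweep over one document's tokens.

    alpha_m[t] = alpha * m[t] is the asymmetric Dirichlet prior over
    the document-topic distribution; beta is the symmetric prior over
    topic-word distributions. Counts are updated in place:
      n_dt[t]    tokens in this document assigned to topic t
      n_tw[t][w] corpus tokens of word w assigned to topic t
      n_t[t]     corpus tokens assigned to topic t
    """
    T = len(alpha_m)
    for i, w in enumerate(doc_words):
        t_old = z[i]
        # Remove the current assignment from all counts.
        n_dt[t_old] -= 1; n_tw[t_old][w] -= 1; n_t[t_old] -= 1
        # Conditional p(z_i = t | rest): asymmetric document-topic term,
        # symmetric topic-word term.
        weights = [(n_dt[t] + alpha_m[t]) *
                   (n_tw[t][w] + beta) / (n_t[t] + beta * V)
                   for t in range(T)]
        # Draw the new topic from the unnormalized weights.
        u = rng.random() * sum(weights)
        t_new, acc = 0, weights[0]
        while acc < u:
            t_new += 1
            acc += weights[t_new]
        z[i] = t_new
        n_dt[t_new] += 1; n_tw[t_new][w] += 1; n_t[t_new] += 1
```

Relative to the usual symmetric sampler, the only change is replacing a single scalar α with the vector α m_t, which is why the added cost is negligible.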
Since these priors can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend them as a new standard for LDA.

8 Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by CIA, NSA and NSF under NSF grant number IIS-0326249, and in part by subcontract number B582467 from Lawrence Livermore National Security, LLC under prime contract number DE-AC52-07NA27344 from DOE/NNSA. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

[1] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[2] D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

[4] P. J. Cowans. Probabilistic Document Modelling. PhD thesis, University of Cambridge, 2006.

[5] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[6] S. Goldwater and T. L. Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. In Association for Computational Linguistics, 2007.

[7] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1):5228–5235, 2004.

[8] T. L.
Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum. Integrating topics and syntax. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 536–544. The MIT Press, 2005.

[9] D. Hall, D. Jurafsky, and C. D. Manning. Studying the history of ideas using topic models. In Proceedings of EMNLP 2008, pages 363–371.

[10] W. Li and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pages 633–640, 2007.

[11] M. Meilă. Comparing clusterings by the variation of information. In Conference on Learning Theory, 2003.

[12] D. Mimno and A. McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, pages 376–385, Vancouver, BC, Canada, 2007.

[13] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003.

[14] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In Intelligence and Security Informatics, Lecture Notes in Computer Science, 2006.

[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.

[16] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 18, 2006.

[17] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[18] H. M. Wallach. Topic modeling: Beyond bag-of-words.
In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984, Pittsburgh, Pennsylvania, 2006.

[19] H. M. Wallach. Structured Topic Models for Language. PhD thesis, University of Cambridge, 2008.

[20] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of KDD, 2009.