{"title": "Improving Topic Coherence with Regularized Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 504, "abstract": "Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.", "full_text": "Improving Topic Coherence with\n\nRegularized Topic Models\n\nDavid Newman\n\nUniversity of California, Irvine\n\nnewman@uci.edu\n\nEdwin V. Bonilla\n\nWray Buntine\n\nNICTA & Australian National University\n\n{edwin.bonilla, wray.buntine}@nicta.com.au\n\nAbstract\n\nTopic models have the potential to improve search and browsing by extracting\nuseful semantic themes from web pages and other text documents. When learned\ntopics are coherent and interpretable, they can be valuable for faceted browsing,\nresults set diversity analysis, and document retrieval. However, when dealing with\nsmall collections or noisy text (e.g. web search result snippets or blog posts),\nlearned topics can be less coherent, less interpretable, and less useful. 
To overcome\nthis, we propose two methods to regularize the learning of topic models.\nOur regularizers work by creating a structured prior over words that reflect broad\npatterns in the external data. Using thirteen datasets we show that both regularizers\nimprove topic coherence and interpretability while learning a faithful representation\nof the collection of interest. Overall, this work makes topic models more\nuseful across a broader range of text data.\n\n1 Introduction\n\nTopic modeling holds much promise for improving the ways users search, discover, and organize\nonline content by automatically extracting semantic themes from collections of text documents.\nLearned topics can be useful in user interfaces for ad-hoc document retrieval [18]; driving faceted\nbrowsing [14]; clustering search results [19]; or improving display of search results by increasing\nresult diversity [10]. When the text being modeled is plentiful, clear and well written (e.g. large\ncollections of abstracts from scientific literature), learned topics are usually coherent, easily understood,\nand fit for use in user interfaces. However, topics are not always consistently coherent, and\neven with relatively well written text, one can learn topics that are a mix of concepts or hard to\nunderstand [1, 6]. This problem is exacerbated for content that is sparse or noisy, such as blog posts,\ntweets, or web search result snippets. Take for instance the task of learning categories in clustering\nsearch engine results. A few searches with Carrot2, Yippee, or WebClust quickly demonstrate that\nconsistently learning meaningful topic facets is a difficult task [5].\nOur goal in this paper is to improve the coherence, interpretability and ultimate usability of learned\ntopics. To achieve this we propose QUAD-REG and CONV-REG, two new methods for regularizing\ntopic models, which produce more coherent and interpretable topics. 
Our work is predicated on\nrecent evidence that a pointwise mutual information-based score (PMI-Score) is highly correlated\nwith human-judged topic coherence [15, 16]. We develop two Bayesian regularization formulations\nthat are designed to improve PMI-Score. We experiment with five search result datasets from\n7M Blog posts, four search result datasets from 1M News articles, and four datasets of Google\nsearch results. Using these thirteen datasets, our experiments demonstrate that both regularizers\nconsistently improve topic coherence and interpretability, as measured separately by PMI-Score and\nhuman judgements. To the best of our knowledge, our models are the first to address the problem\nof learning topics when dealing with limited and/or noisy text content. This work opens up new\napplication areas for topic modeling.\n\n2 Topic Coherence and PMI-Score\n\nTopics learned from a statistical topic model are formally a multinomial distribution over words,\nand are often displayed by printing the 10 most probable words in the topic. These top-10 words\nusually provide sufficient information to determine the subject area and interpretation of a topic,\nand distinguish one topic from another. However, topics learned on sparse or noisy text data are\noften less coherent, difficult to interpret, and not particularly useful. Some of these noisy topics\ncan be vaguely interpretable, but contain (in the top-10 words) one or two unrelated words \u2013 while\nother topics can be practically incoherent. In this paper we wish to improve topic models learned on\ndocument collections where the text data is sparse and/or noisy. We postulate that using additional\n(possibly external) data will regularize the learning of the topic models.\nTherefore, our goal is to improve topic coherence. 
Topic coherence \u2013 meaning semantic coherence\n\u2013 is a human judged quality that depends on the semantics of the words, and cannot be measured\nby model-based statistical measures that treat the words as exchangeable tokens. Fortunately, recent\nwork has demonstrated that it is possible to automatically measure topic coherence with near-human\naccuracy [16, 15] using a score based on pointwise mutual information (PMI). In that work they\nshowed (using 6000 human evaluations) that the PMI-Score broadly agrees with human-judged\ntopic coherence. The PMI-Score is motivated by measuring word association between all pairs of\nwords in the top-10 topic words. PMI-Score is defined as follows:\n\n$$\\text{PMI-Score}(\\mathbf{w}) = \\frac{1}{45} \\sum_{i<j} \\text{PMI}(w_i, w_j), \\quad i, j \\in \\{1, \\ldots, 10\\} \\qquad (1)$$\n\n$$\\text{where } \\text{PMI}(w_i, w_j) = \\log \\frac{P(w_i, w_j)}{P(w_i) P(w_j)}, \\qquad (2)$$\n\nand 45 is the number of PMI scores over the set of distinct word pairs in the top-10 words. A key\naspect of this score is that it uses external data \u2013 that is data not used during topic modeling. This\ndata could come from a variety of sources, for example the corpus of 3M English Wikipedia articles.\nFor this paper, we will use both PMI-Score and human judgements to measure topic coherence.\nNote that we can measure the PMI-Score of an individual topic, or for a topic model of T topics (in\nthat case PMI-Score will refer to the average of T PMI-Scores). This PMI-Score \u2013 and the idea of\nusing external data to measure it \u2013 forms the foundation of our idea for regularization.\n\n3 Regularized Topic Models\n\nIn this section we describe our approach to regularization in topic models by proposing two different\nmethods: (a) a quadratic regularizer (QUAD-REG) and (b) a convolved Dirichlet regularizer\n(CONV-REG). 
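As an aside on the coherence measure of Section 2, the PMI-Score in Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name pmi_score and the dictionary inputs p_word and p_pair are hypothetical, the probabilities are assumed to have been estimated beforehand from external data, and letting never-co-occurring pairs contribute zero is an assumption that the text does not specify.

```python
import math
from itertools import combinations


def pmi_score(top_words, p_word, p_pair):
    # Average PMI over all distinct pairs of the given top words (Equations 1-2).
    # p_word maps word -> P(w); p_pair maps frozenset((wi, wj)) -> P(wi, wj).
    # Both are assumed to come from external data (hypothetical inputs).
    pairs = list(combinations(top_words, 2))
    total = 0.0
    for wi, wj in pairs:
        joint = p_pair.get(frozenset((wi, wj)), 0.0)
        if joint > 0.0:
            total += math.log(joint / (p_word[wi] * p_word[wj]))
        # Assumption: pairs never seen together contribute 0 to the sum.
    return total / len(pairs)


# Toy example with three words and made-up probabilities.
p_word = {'a': 0.1, 'b': 0.2, 'c': 0.5}
p_pair = {frozenset(('a', 'b')): 0.04, frozenset(('a', 'c')): 0.05}
score = pmi_score(['a', 'b', 'c'], p_word, p_pair)
```

With ten top words the list of pairs has 45 entries, which matches the 1/45 factor in Equation (1).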
We start by introducing the standard notation in topic modeling and the baseline\nlatent Dirichlet allocation method (LDA, [4, 9]).\n\n3.1 Topic Modeling and LDA\n\nTopic models are a Bayesian version of probabilistic latent semantic analysis [11]. In standard\nLDA topic modeling each of D documents in the corpus is modeled as a discrete distribution over T\nlatent topics, and each topic is a discrete distribution over the vocabulary of W words. For document\nd, the distribution over topics, $\\theta_{t|d}$, is drawn from a Dirichlet distribution Dir[$\\alpha$]. Likewise, each\ndistribution over words, $\\phi_{w|t}$, is drawn from a Dirichlet distribution, Dir[$\\beta$].\nFor the ith token in a document, a topic assignment, $z_{id}$, is drawn from $\\theta_{t|d}$ and the word, $x_{id}$, is\ndrawn from the corresponding topic, $\\phi_{w|z_{id}}$. Hence, the generative process in LDA is given by:\n\n$$\\theta_{t|d} \\sim \\text{Dirichlet}[\\alpha], \\qquad \\phi_{w|t} \\sim \\text{Dirichlet}[\\beta] \\qquad (3)$$\n\n$$z_{id} \\sim \\text{Mult}[\\theta_{t|d}], \\qquad x_{id} \\sim \\text{Mult}[\\phi_{w|z_{id}}]. \\qquad (4)$$\n\nWe can compute the posterior distribution of the topic assignments via Gibbs sampling by writing\ndown the joint probability, integrating out $\\theta$ and $\\phi$, and following a few simple mathematical\nmanipulations to obtain the standard Gibbs sampling update:\n\n$$p(z_{id} = t \\mid x_{id} = w, \\mathbf{z}^{\\neg i}) \\propto \\frac{N^{\\neg i}_{wt} + \\beta}{N^{\\neg i}_{t} + W\\beta} \\left(N^{\\neg i}_{td} + \\alpha\\right), \\qquad (5)$$\n\nwhere $\\mathbf{z}^{\\neg i}$ denotes the set of topic assignment variables except the ith variable; $N_{wt}$ is the number\nof times word w has been assigned to topic t; $N_{td}$ is the number of times topic t has been assigned\nto document d, and $N_t = \\sum_{w=1}^{W} N_{wt}$.\n\nGiven samples from the posterior distribution we can compute point estimates of the document-topic\nproportions $\\theta_{t|d}$ and the word-topic probabilities $\\phi_{w|t}$. 
We will denote henceforth $\\phi_t$ as the vector\nof word probabilities for a given topic t and analogously for other variables.\n\n3.2 Regularization via Structured Priors\n\nTo learn better topic models for small or noisy collections we introduce structured priors on $\\phi_t$ based\nupon external data, which has a regularization effect on the standard LDA model. More specifically,\nour priors on $\\phi_t$ will depend on the structural relations of the words in the vocabulary as given by\nexternal data, which will be characterized by the $W \\times W$ \u201ccovariance\u201d matrix C. Intuitively, C\nis a matrix that captures the short-range dependencies between words in the external data. More\nimportantly, we are only interested in relatively frequent terms from the vocabulary, so C will be a\nsparse matrix and hence computations are still feasible for our methods to be used in practice.\n\n3.3 Quadratic Regularizer (QUAD-REG)\n\nHere we use a standard quadratic form with a trade-off factor. Therefore, given a matrix of word\ndependencies C, we can use the prior:\n\n$$p(\\phi_t \\mid C) \\propto \\left(\\phi_t^T C \\phi_t\\right)^{\\nu} \\qquad (6)$$\n\nfor some power $\\nu$. Note we do not know the normalization factor but for our purposes of MAP\nestimation we do not need it. 
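To make the quadratic prior in Equation (6) concrete, the following minimal Python sketch evaluates its unnormalized log, nu * log(phi^T C phi), for two toy topics. The function name quad_log_prior, the 3-word vocabulary and the entries of C are hypothetical values chosen only for illustration; they are not taken from the paper's data.

```python
import math


def quad_log_prior(phi, C, nu):
    # Unnormalized log of Equation (6): nu * log(phi^T C phi).
    W = len(phi)
    quad = sum(phi[i] * C[i][j] * phi[j] for i in range(W) for j in range(W))
    return nu * math.log(quad)


# Hypothetical dependency matrix: words 0 and 1 are related, word 2 is not.
C = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

related = [0.5, 0.5, 0.0]    # mass on two related words
unrelated = [0.5, 0.0, 0.5]  # mass on two unrelated words

lp_related = quad_log_prior(related, C, 2.0)
lp_unrelated = quad_log_prior(unrelated, C, 2.0)
```

In this toy setting the topic that puts its mass on related words receives the larger prior value, which is the behaviour the regularizer is meant to encourage.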
The log posterior (omitting irrelevant constants) is given by:\n\n$$\\mathcal{L}_{\\text{MAP}} = \\sum_{i=1}^{W} N_{it} \\log \\phi_{i|t} + \\nu \\log\\left(\\phi_t^T C \\phi_t\\right). \\qquad (7)$$\n\nOptimizing Equation (7) with respect to $\\phi_{w|t}$ subject to the constraints $\\sum_{i=1}^{W} \\phi_{i|t} = 1$, we obtain\nthe following fixed point update:\n\n$$\\phi_{w|t} \\leftarrow \\frac{1}{N_t + 2\\nu} \\left( N_{wt} + 2\\nu \\, \\frac{\\phi_{w|t} \\sum_{i=1}^{W} C_{iw} \\phi_{i|t}}{\\phi_t^T C \\phi_t} \\right). \\qquad (8)$$\n\nWe note that unlike other topic models in which a covariance or correlation structure is used (as\nin the correlated topic model, [3]) in the context of correlated priors for $\\theta_{t|d}$, our method does not\nrequire the inversion of C, which would be impractical for even modest vocabulary sizes.\nBy using the update in Equation (8) we obtain the values for $\\phi_{w|t}$. This means we no longer have\nneat conjugate priors for $\\phi_{w|t}$ and thus the sampling in Equation (5) does not hold. Instead, at the\nend of each major Gibbs cycle, $\\phi_{w|t}$ is re-estimated and the corresponding Gibbs update becomes:\n\n$$p(z_{id} = t \\mid x_{id} = w, \\mathbf{z}^{\\neg i}, \\phi_{w|t}) \\propto \\phi_{w|t} \\left(N^{\\neg i}_{td} + \\alpha\\right). \\qquad (9)$$\n\n3.4 Convolved Dirichlet Regularizer (CONV-REG)\n\nAnother approach to leveraging information on word dependencies from external data is to consider\nthat each $\\phi_t$ is a mixture of word probabilities $\\psi_t$, where the coefficients are constrained by the\nword-pair dependency matrix C:\n\n$$\\phi_t \\propto C \\psi_t \\quad \\text{where } \\psi_t \\sim \\text{Dirichlet}(\\gamma \\mathbf{1}). \\qquad (10)$$\n\nEach topic has a different $\\psi_t$ drawn from a Dirichlet, thus the model is a convolved Dirichlet. This\nmeans that we convolve the supplied topic to include a spread of related words. 
Then we have that (recalling from Equation (10) that $\\phi_t \\propto C \\psi_t$ with $\\psi_t \\sim \\text{Dirichlet}(\\gamma \\mathbf{1})$):\n\n$$p(w \\mid z = t, C, \\psi_t) = \\prod_{i=1}^{W} \\left( \\sum_{j=1}^{W} C_{ij} \\psi_{j|t} \\right)^{N_{it}}. \\qquad (11)$$\n\nWe obtain the MAP solution to $\\psi_t$ by optimizing:\n\n$$\\mathcal{L}_{\\text{MAP}} = \\sum_{i=1}^{W} N_{it} \\log \\sum_{j=1}^{W} C_{ij} \\psi_{j|t} + \\sum_{j=1}^{W} (\\gamma - 1) \\log \\psi_{j|t} \\quad \\text{s.t.} \\quad \\sum_{j=1}^{W} \\psi_{j|t} = 1. \\qquad (12)$$\n\nSolving for $\\psi_{w|t}$ we obtain:\n\n$$\\psi_{w|t} \\propto \\sum_{i=1}^{W} \\frac{N_{it} C_{iw}}{\\sum_{j=1}^{W} C_{ij} \\psi_{j|t}} \\, \\psi_{w|t} + \\gamma. \\qquad (13)$$\n\nWe follow the same semi-collapsed inference procedure used for QUAD-REG, with the updates in\nEquations (13) and (10) producing the values for $\\phi_{w|t}$ to be used in the semi-collapsed sampler (9).\n\nTable 1: Search result datasets came from a collection of 7M Blogs, a collection of 1M News articles,\nand the web. The first two collections were indexed with Lucene. The queries below were issued to\ncreate five Blog datasets, four News datasets, and four Web datasets. Search result set sizes ranged\nfrom 1000 to 18,590. For Blogs and News, half of each dataset was set aside for Test, and Train was\nsampled from the remaining half. For Web, Train was the top-40 search results.\n\nCollection | Name | Query | # Results | DTest | DTrain\nBlogs | beijing | beijing olympic ceremony | 5024 | 2512 | 39\nBlogs | climate | climate change | 14,932 | 7466 | 58\nBlogs | obama | obama debate | 18,590 | 9295 | 72\nBlogs | palin | palin interview | 10,478 | 5239 | 40\nBlogs | vista | vista problem | 4214 | 2107 | 32\nNews | baseball | major league baseball game team player | 3774 | 1887 | 29\nNews | drama | television show series drama | 3024 | 1512 | 23\nNews | health | health medicine insurance | 1655 | 828 | 25\nNews | legal | law legal crime court | 2976 | 1488 | 23\nWeb | depression | depression | 1000 | 1000 | 40\nWeb | migraine | migraine | 1000 | 1000 | 40\nWeb | america | america | 1000 | 1000 | 40\nWeb | south africa | south africa | 1000 | 1000 | 40\n\n4 Search Result 
Datasets\n\nText datasets came from a collection of 7M Blogs (from ICWSM 2009), a collection of 1M News\narticles (LDC Gigaword), and the Web (using Google\u2019s search). Table 1 shows a summary of the\ndatasets used. These datasets provide a diverse range of content for topic modeling. Blogs are often\nwritten in a chatty and informal style, which tends to produce topics that are difficult to interpret.\nNews articles are edited to a higher standard, so learned topics are often fairly interpretable when\none models, say, thousands of articles. However, our News experiments use only 23-29 articles for training, limiting the\ndata for topic learning. Snippets from web search results present perhaps the most sparse data. For\neach dataset we created the standard bag-of-words representation and performed fairly standard\ntokenization. We created a vocabulary of terms that occurred at least five times (or two times, for the\nWeb datasets), after excluding stopwords. We learned the topic models on the Train data set, setting\nT = 15 for Blogs datasets, T = 10 for News datasets, and T = 8 for the Web datasets.\nConstruction of C: The word co-occurrence data for regularization was obtained from the entire\nLDC dataset of 1M articles (for News), a subset of the 7M blog posts (for Blogs), and using all 3M\nEnglish Wikipedia articles (for Web). Word co-occurrence was computed using a sliding window\nof ten words to emphasize short-range dependency. Note that we only kept positive PMI values.\nFor each dataset we created a $W \\times W$ matrix of co-occurrence counts using the 2000-most frequent\nterms in the vocabulary for that dataset, thereby maintaining reasonably good sparsity for these data.\nSelecting most-frequent terms makes sense because our objective is to improve PMI-Score, which\nis defined over the top-10 topic words, which tend to involve relatively high-frequency terms. 
Using\nhigh-frequency terms also avoids potential numerical problems of large PMI values arising from\nco-occurrence of rare terms.\n\nFigure 1: PMI-Score and test perplexity of regularized methods vs. LDA on Blogs, T = 15. Both\nregularization methods improve PMI-Score and perplexity for all datasets, with the exception of\n\u2018vista\u2019 where QUAD-REG has slightly higher perplexity.\n\n5 Experiments\n\nIn this section we evaluate our regularized topic models by reporting the average PMI-Score over 10\ndifferent runs, each computed using Equations (1) and (2) (and then in Section 5.4, we use human\njudgements). Additionally, we report the average test data perplexity over 10 samples from the\nposterior across ten independent chains, where each perplexity is calculated using:\n\n$$\\text{Perp}(x^{\\text{test}}) = \\exp\\left( -\\frac{1}{N^{\\text{test}}} \\log p(x^{\\text{test}}) \\right), \\qquad \\log p(x^{\\text{test}}) = \\sum_{dw} N^{\\text{test}}_{dw} \\log \\sum_{t} \\phi_{w|t} \\theta_{t|d} \\qquad (14)$$\n\n$$\\theta_{t|d} = \\frac{\\alpha + N_{td}}{T\\alpha + N_d}, \\qquad \\phi_{w|t} = \\frac{\\beta + N_{wt}}{W\\beta + N_t}. \\qquad (15)$$\n\nThe document mixture $\\theta_{t|d}$ is learned from test data, and the log probability of the test words is computed\nusing this mixture. Each $\\phi_{w|t}$ is computed by Equation (15) for the baseline LDA model, and\nit is used directly for the QUAD-REG and CONV-REG methods. For the Gibbs sampling algorithms\nwe set $\\alpha = 0.05 \\frac{N}{DT}$ and $\\beta = 0.01$ (initially). This setting of $\\alpha$ allocates 5% of the probability mass\nfor smoothing. We ran the sampling for 300 iterations, applied the fixed point iterations (on the\nregularized models) 10 times every 20 Gibbs iterations, and ran 10 different random initializations\n(computing the average over these runs). We used T = 10 for the News datasets, T = 15 for the Blogs\ndatasets and T = 8 for the Web datasets. Note that test perplexity is computed on DTest (Table 1) that\nis at least an order of magnitude larger than the training data. 
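The perplexity computation in Equation (14) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function name perplexity and the nested-list layout of the inputs are hypothetical, the estimates phi and theta are assumed to have been computed already (for example via Equation (15)), and every word that occurs in the test data is assumed to have non-zero probability under the mixture.

```python
import math


def perplexity(test_counts, phi, theta):
    # test_counts[d][w]: number of times word w occurs in test document d.
    # phi[t][w]: word-topic probabilities; theta[d][t]: document-topic proportions.
    num_topics = len(phi)
    log_p = 0.0
    n_test = 0
    for d, doc in enumerate(test_counts):
        for w, n in enumerate(doc):
            if n > 0:
                # Mixture probability of word w in document d: sum_t phi * theta.
                p = sum(phi[t][w] * theta[d][t] for t in range(num_topics))
                log_p += n * math.log(p)
                n_test += n
    return math.exp(-log_p / n_test)


# Sanity check with made-up numbers: two topics that are both uniform over
# a 4-word vocabulary give a perplexity equal to the vocabulary size.
phi_uniform = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
theta_toy = [[0.5, 0.5]]
perp = perplexity([[1, 2, 0, 3]], phi_uniform, theta_toy)
```

Lower values indicate that the model assigns higher probability to the held-out words.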
After some preliminary experiments,\nwe fixed QUAD-REG\u2019s regularization parameter to $\\nu = 0.5 \\frac{N}{T}$.\n\n5.1 Results\n\nFigures 1 and 2 show the average PMI-Scores and average test perplexities for the Blogs and News\ndatasets. For Blogs (Figure 1) we see that our regularized models consistently improve PMI-Score\nand test perplexity on all datasets with the exception of the \u2018vista\u2019 dataset where QUAD-REG has\nslightly higher perplexity. For News (Figure 2) we see that both regularization methods improve\nPMI-Score and perplexity for all datasets. Hence, we can conclude that our regularized models not\nonly provide a good characterization of the collections but also improve the coherence of the learned\ntopics as measured by the PMI-Score. It is reasonable to expect both PMI-Score and perplexity to\nimprove as semantically related words should be expected in topic models, so with little data, our\nregularizers push both measures in a positive direction.\n\n5.2 Coherence of Learned Topics\n\nTable 2 shows selected topics learned by LDA and our QUAD-REG model. To obtain correspondence\nof topics (for this experiment), we initialized the QUAD-REG model with the converged LDA\nmodel. Overall, our regularized model tends to learn topics that are more focused on a particular\nsubject, contain fewer spurious words, and therefore are easier to interpret. The following list\nexplains how the regularized version of the topic is more useful:\n\n[Figure 1 plot: panels PMI-Score and Test Perplexity; series LDA, Quad-Reg, Conv-Reg; datasets beijing, climate, obama, palin, vista.]\n\nFigure 2: PMI-Score and test perplexity of regularized methods vs. LDA on News, T = 10. Both\nregularization methods improve PMI-Score and perplexity for all datasets.\n\nTable 2: Selected topics improved by regularization. 
Each pair first shows an LDA topic and the\ncorresponding topic produced by QUAD-REG (initialized from the converged LDA model). QUAD-REG\u2019s\nPMI-Scores were always better than LDA\u2019s on these examples. The regularized versions tend\nto be more focused on a particular subject and easier to interpret.\n\nName | Model | Topic\nbeijing | LDA | girl phony world yang fireworks interest maybe miaoke peiyi young\nbeijing | REG | girl yang peiyi miaoke lin voice real lip music sync\nobama | LDA | palin biden sarah running mccain media hilton stein paris john\nobama | REG | palin sarah mate running biden vice governor selection alaska choice\ndrama | LDA | wire david place police robert baltimore corner friends com simon\ndrama | REG | drama episode characters series cop cast character actors detective emmy\nlegal | LDA | saddam american iraqi iraq judge against charges minister thursday told\nlegal | REG | iraqi saddam iraq military crimes tribunal against troops accused officials\n\nbeijing: QUAD-REG has better focus on the names and issues involved in the controversy over the\nChinese replacing the young girl doing the actual singing at the Olympic opening ceremony\nwith the girl who lip-synched.\n\nobama: QUAD-REG focuses on Sarah Palin\u2019s selection as a GOP Vice Presidential candidate, while\nLDA has a less clear theme including the story of Paris Hilton giving Palin fashion advice.\n\ndrama: QUAD-REG learns a topic related to television police dramas, while LDA narrowly focuses\non David Simon\u2019s The Wire along with other scattered terms: robert and friends.\n\nlegal: LDA topic is somewhat related to Saddam Hussein\u2019s appearance in court, but includes\nuninteresting terms such as: thursday, and told. The QUAD-REG topic is an overall better\ncategory relating to the tribunal and charges against Saddam Hussein.\n\n5.3 Modeling of Google Search Results\n\nAre our regularized topic models useful for building facets in a clustering-web-search-results type\nof application? 
Figure 3 (top) shows the average PMI-Score (mean \u00b1 two standard errors over\n10 runs) for the four searches described in Table 1 (Web dataset) and the average perplexity using\ntop-1000 results as test data (bottom). In all cases QUAD-REG and CONV-REG learn better topics,\nas measured by PMI-Score, compared to those learned by LDA. Additionally, whereas QUAD-REG\nexhibits slightly higher values of perplexity compared to LDA, CONV-REG consistently improved\nperplexity on all four search datasets. This level of improvement in PMI-Score through regularization\nwas not seen in News or Blogs likely because of the greater sparsity in these data.\n\n[Figure 2 plot: panels PMI-Score and Test Perplexity; series LDA, Quad-Reg, Conv-Reg; datasets baseball, drama, health, legal.]\n\nFigure 3: PMI-Score and test perplexity of regularized methods vs. LDA on Google search results.\nBoth methods improve PMI-Score and CONV-REG also improves test perplexity, which is computed\nusing top-1000 results as test data (therefore top-1000 test perplexity is not reported).\n\n5.4 Human Evaluation of Regularized Topic Models\n\nSo far we have evaluated our regularized topic models by assessing (a) how faithful their representation\nis to the collection of interest, as measured by test perplexity, and (b) how coherent they are,\nas given by the PMI-Score. Ultimately, we have hypothesized that humans will find our regularized\ntopic models more semantically coherent than baseline LDA and therefore more useful for tasks\nsuch as document clustering, search and browsing. To test this hypothesis we performed further experiments\nwhere we asked humans to directly compare our regularized topics with LDA topics and\nchoose which is more coherent. 
As our experimental results in this section show, our regularized\ntopic models significantly outperform LDA based on actual human judgements.\nTo evaluate our models with human judgements we used Amazon Mechanical Turk (AMT,\nhttps://www.mturk.com) where we asked workers to compare topic pairs (one topic given by one\nof our regularized models and the other topic given by LDA) and to answer explicitly which topic\nwas more coherent according to how clearly it represented a single theme/idea/concept. To keep\nthe cognitive load low (while still having a fair and sound evaluation of the topics) we described\neach topic by its top-10 words. We provided an additional option \u201c...Can\u2019t decide...\u201d indicating\nthat the user could not find a qualitative difference between the topics presented. We also included\ncontrol comparisons to filter out bad workers. These control comparisons were done by replacing\na randomly-selected topic word with an intruder word. To have aligned (matched) pairs of topics,\nthe sampling procedure of our regularized topic models was initialized with LDA\u2019s topic assignment\nobtained after convergence of Gibbs sampling. These experiments produced a total of 3650\ntopic-comparison human evaluations and the results can be seen in Figure 4.\n\n6 Related Work\n\nSeveral authors have investigated the use of domain knowledge from external sources in topic modeling.\nFor example, [7, 8] propose a method for combining topic models with ontological knowledge\nto tag web pages. They constrain the topics in an LDA-based model to be amongst those in the given\nontology. [20] also use statistical topic models with a predefined set of topics to address the task of\nquery classification. Our goal is different to theirs in that we are not interested in constraining the\nlearned topics to those in the external data but rather in improving the topics in small or noisy collections\nby means of regularization. 
In a similar vein, [2] incorporate domain knowledge into topic\nmodels by encouraging some word pairs to have similar probability within a topic. Their method,\nlike ours, is based on replacing the standard Dirichlet prior over word-topic probabilities. However,\nunlike our approach that is entirely data-driven, it appears that their method relies on interactive\nfeedback from the user or on the careful selection of words within an ontological concept.\nThe effect of structured priors in LDA has been investigated by [17] who showed that learning\nhierarchical Dirichlet priors over the document-topic distribution can provide better performance\nthan using a symmetric prior. Our work is motivated by the fact that priors matter but is focused on a\nrather different use case of topic models, i.e. when we are dealing with small or noisy collections and\nwant to improve the coherence of the topics by re-defining the prior on the word-topic distributions.\nPriors that introduce correlations in topic models have been investigated by [3]. Unlike our work\nthat considers priors on the word-topic distributions ($\\phi_{w|t}$), they introduce a correlated prior on the\ntopic proportions ($\\theta_{t|d}$). In our approach, considering similar priors for $\\phi_{w|t}$ to those studied by [3]\nwould be infeasible as they would require the inverse of a $W \\times W$ covariance matrix.\nNetwork structures associated with a collection of documents are used in [12] in order to \u201csmooth\u201d\nthe topic distributions of the PLSA model [11]. Our methods are different in that they do not require\nthe collection under study to have an associated network structure as we aim at addressing the\ndifferent problem of regularizing topic models on small or noisy collections. Additionally, their work\nis focused on regularizing the document-topic distributions instead of the word-topic distributions.\nFinally, the work in [13], contemporary to ours, also addresses the problem of improving the quality\nof topic models. However, our approach focuses on exploiting the knowledge provided by external\ndata given the noisy and/or small nature of the collection of interest.\n\n[Figure 3 plot: panels PMI-Score and Test Perplexity; series LDA top-40, Quad-Reg top-40, Conv-Reg top-40, and (PMI-Score panel only) LDA top-1000; datasets depression, migraine, america, south africa.]\n\nFigure 4: The proportion of times workers in Amazon Mechanical Turk selected each topic model as\nshowing better coherence. In nearly all cases our regularized models outperform LDA. CONV-REG\noutperforms LDA in 11 of 13 datasets. QUAD-REG never performs worse than LDA (at the dataset\nlevel). On average (from 3650 topic comparisons) workers selected QUAD-REG as more coherent\n57% of the time while they selected LDA as more coherent only 37% of the time. Similarly, they\nchose CONV-REG\u2019s topics as more coherent 56% of the time, and LDA as more coherent only 39%\nof the time. These results are statistically significant at the 5% level of significance when performing\na paired t-test on the total values across all datasets. Note that the two bars corresponding to each\ndataset do not add up to 100% as the remaining mass corresponds to \u201c...Can\u2019t decide...\u201d responses.\n\n7 Discussion & Conclusions\n\nIn this paper we have proposed two methods for regularization of LDA topic models based upon\nthe direct inclusion of word dependencies in our word-topic prior distributions. We have shown that\nour regularized models can improve the coherence of learned topics significantly compared to the\nbaseline LDA method, as measured by the PMI-Score and assessed by human workers in Amazon\nMechanical Turk. 
While our focus in this paper has been on small, and small and noisy datasets, we\nwould expect our regularization methods also to be effective on large and noisy datasets. Note that\nmixing and rate of convergence may be more of an issue with larger datasets, since our regularizers\nuse a semi-collapsed Gibbs sampler. We will address these large noisy collections in future work.\n\nAcknowledgments\n\nNICTA is funded by the Australian Government as represented by the Department of Broadband,\nCommunications and the Digital Economy and the Australian Research Council through the ICT\nCentre of Excellence program. DN was also supported by an NSF EAGER Award, an IMLS Research\nGrant, and a Google Research Award.\n\n[Figure 4 plot: two panels, QuadReg vs. LDA and ConvReg vs. LDA; y-axis % Time Method is Better; x-axis the thirteen datasets.]\n\nReferences\n[1] L. AlSumait, D. Barbar\u00e1, J. Gentle, and C. Domeniconi. Topic significance ranking of LDA generative models. In ECML/PKDD, 2009.\n[2] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, 2009.\n[3] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.\n[4] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[5] Claudio Carpineto, Stanislaw Osinski, Giovanni Romano, and Dawid Weiss. A survey of web clustering engines. ACM Comput. Surv., 41(3), 2009.\n[6] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.\n[7] Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, and Mark Steyvers. Modeling documents by combining semantic concepts with unsupervised statistical learning. 
In ISWC, 2008.\n[8] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Combining concept hierarchies and statistical topic models. In CIKM, 2008.\n[9] T. Griffiths and M. Steyvers. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning, 2006.\n[10] Shengbo Guo and Scott Sanner. Probabilistic latent maximal marginal relevance. In SIGIR, 2010.\n[11] Thomas Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999.\n[12] Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. In WWW, 2008.\n[13] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In EMNLP, 2011.\n[14] D.M. Mimno and A. McCallum. Organizing the OCA: learning faceted subjects from a library of digital books. In JCDL, 2007.\n[15] D. Newman, J.H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL HLT, 2010.\n[16] D. Newman, Y. Noh, E. Talley, S. Karimi, and T. Baldwin. Evaluating topic models for digital libraries. In JCDL, 2010.\n[17] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, 2009.\n[18] Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR, 2006.\n[19] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jinwen Ma. Learning to cluster web search results. In SIGIR, 2004.\n[20] Haijun Zhai, Jiafeng Guo, Qiong Wu, Xueqi Cheng, Huawei Sheng, and Jin Zhang. Query classification based on regularized correlated topic model. In Proceedings of the International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009.\n", "award": [], "sourceid": 366, "authors": [{"given_name": "David", "family_name": "Newman", "institution": null}, {"given_name": "Edwin", "family_name": "Bonilla", "institution": null}, {"given_name": "Wray", "family_name": "Buntine", "institution": null}]}