{"title": "A Reduction for Efficient LDA Topic Reconstruction", "book": "Advances in Neural Information Processing Systems", "page_first": 7869, "page_last": 7879, "abstract": "We present a novel approach for LDA (Latent Dirichlet Allocation) topic reconstruction. The main technical idea is to show that the distribution over the documents generated by LDA can be transformed into a distribution for a much simpler generative model in which documents are generated from {\\em the same set of topics} but have a much simpler structure: documents are single topic and topics are chosen uniformly at random. Furthermore, this reduction is approximation preserving, in the sense that approximate distributions -- the only ones we can hope to compute in practice -- are mapped into approximate distributions in the simplified world. This opens up the possibility of efficiently reconstructing LDA topics in a roundabout way. Compute an approximate document distribution from the given corpus, transform it into an approximate distribution for the single-topic world, and run a reconstruction algorithm in the uniform, single-topic world -- a much simpler task than direct LDA reconstruction. Indeed, we show the viability of the approach by giving very simple algorithms for a generalization of two notable cases that have been studied in the literature, $p$-separability and Gibbs sampling for matrix-like topics.", "full_text": "A Reduction for Efficient LDA Topic Reconstruction

Matteo Almanza∗
Sapienza University
Rome, Italy
almanza@di.uniroma1.it

Flavio Chierichetti†
Sapienza University
Rome, Italy
flavio@di.uniroma1.it

Alessandro Panconesi‡
Sapienza University
Rome, Italy
ale@di.uniroma1.it

Andrea Vattani
Spiketrap
San Francisco, CA, USA
avattani@cs.ucsd.edu

Abstract

We present a novel approach for LDA (Latent Dirichlet Allocation) topic reconstruction. 
The main technical idea is to show that the distribution over the documents generated by LDA can be transformed into a distribution for a much simpler generative model in which documents are generated from the same set of topics but have a much simpler structure: documents are single topic and topics are chosen uniformly at random. Furthermore, this reduction is approximation preserving, in the sense that approximate distributions — the only ones we can hope to compute in practice — are mapped into approximate distributions in the simplified world. This opens up the possibility of efficiently reconstructing LDA topics in a roundabout way. Compute an approximate document distribution from the given corpus, transform it into an approximate distribution for the single-topic world, and run a reconstruction algorithm in the uniform, single-topic world — a much simpler task than direct LDA reconstruction. We show the viability of the approach by giving very simple algorithms for a generalization of two notable cases that have been studied in the literature, p-separability and matrix-like topics.

1 Introduction

Latent Dirichlet Allocation (henceforth LDA) is a well-known paradigm for topic reconstruction (Blei et al., 2003). The general goal of topic reconstruction is, given a corpus of documents, to reconstruct the topics. LDA is a generative model according to which documents are generated from a given set of unknown topics, where each topic is modelled as a probability distribution over the words. One of the main motivations behind LDA is to allow documents to talk about multiple topics, a goal achieved by the following mechanism. To generate a document containing ℓ words we first select a probability distribution, the so-called admixture, over the topics. The admixture is randomly drawn from a Dirichlet distribution, hence the name. 
Then, the words of the document are selected one after the other in sequence by first selecting a topic at random according to the admixture, and then by randomly selecting a word according to the selected topic (which, as remarked, is just a probability distribution over the words). In this way all topics contribute to generate a document, to a degree specified for each document by a random admixture. To generate another document, another admixture is selected at random, and the same process is repeated. And so on.

In this paper we are interested in the problem of LDA topic identifiability which, roughly speaking, can be stated as follows: given a corpus of documents generated by the mechanism just described, reconstruct as efficiently and accurately as possible the K unknown topics (in the paper K will always denote the number of topics). LDA is actually more general than a mechanism for generating corpora of text documents, but it helps the intuition to consider it as a generative framework for text documents and we will stick to this scenario.

∗Supported in part by the ERC Starting Grant DMAP 680153, and by the "Dipartimenti di Eccellenza 2018-2022" grant awarded to the Dipartimento di Informatica at Sapienza.
†Supported in part by the ERC Starting Grant DMAP 680153, by a Google Focused Research Award, and by the "Dipartimenti di Eccellenza 2018-2022" grant awarded to the Dipartimento di Informatica at Sapienza.
‡Supported in part by the ERC Starting Grant DMAP 680153, by a Google Focused Research Award, by the "Dipartimenti di Eccellenza 2018-2022" grant awarded to the Dipartimento di Informatica at Sapienza, and by BiCi – Bertinoro international Center for informatics.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

This paradigm has attracted a lot of interest, e.g., (Hong & Davison, 2010; Weng et al.
, 2010; Zhao\net al. , 2011; Yan et al. , 2013; Sridhar, 2015; Alvarez-Melis & Saveski, 2016; Li et al. , 2016;\nHajjem & Latiri, 2017). Several algorithms for LDA topic reconstruction have been proposed (see,\nfor instance, (Arora et al. , 2012, 2013; Anandkumar et al. , 2013; Bansal et al. , 2014)). In this\npaper we continue this line of research by presenting a novel approach, the main thrust of which\nis, loosely speaking, that of reducing the problem of topic identi\ufb01ability in the LDA framework to\nthe problem of topic identi\ufb01ability under a much more constrained and simpler generative model.\nThe simpli\ufb01ed generative mechanism we have in mind is the following. The admixture, instead of\nbeing randomly selected anew for each document from a Dirichlet distribution, will stay put: when\ngenerating a document, a topic is selected uniformly at random with probability 1/K. The second\nfeature of the simpli\ufb01ed framework is that documents are single topic, i.e. once a topic is selected, all\nthe words in the document are chosen according to the distribution speci\ufb01ed by that topic. We shall\nrefer to this mechanism as single topic allocation, denoted as STA. In a nutshell, the contribution of\nthis paper is to show that if we have an ef\ufb01cient and accurate algorithm for STA topic identi\ufb01ability\n\u2014 a task seemingly much less daunting than its LDA counterpart \u2014 we can use it for ef\ufb01cient and\naccurate reconstruction of topics under the LDA paradigm. More precisely, we can do this in the\ncase of uniform LDA, i.e. when the admixtures come from a symmetric Dirichlet distribution with a\ngiven parameter \u03b1, which is a very important and commonly adopted special case (Blei et al. , 2003).\nHistorically, STA-type models have been considered before the advent of LDA (see, e.g., (Nigam\net al. , 2000)), whose main motivation, as mentioned, was precisely that of allowing documents\nto be mixtures of topics. 
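To make the contrast between the two generative mechanisms concrete, here is a minimal sketch in plain Python. The function names and the representation of topics as lists of word probabilities are ours, not the paper's; this is an illustration under those assumptions, not the authors' implementation.

```python
import random

def sample_lda_document(topics, alpha, length, rng):
    """Generate one document under symmetric LDA: draw an admixture from
    Dirichlet(alpha, ..., alpha), then for each word draw a topic from the
    admixture and a word from that topic's distribution."""
    K = len(topics)
    # A symmetric Dirichlet sample is a vector of normalized Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(gammas)
    admixture = [g / total for g in gammas]
    doc = []
    for _ in range(length):
        t = rng.choices(range(K), weights=admixture)[0]
        w = rng.choices(range(len(topics[t])), weights=topics[t])[0]
        doc.append(w)
    return doc

def sample_sta_document(topics, length, rng):
    """Generate one document under STA: pick a single topic uniformly at
    random, then draw every word of the document from that one topic."""
    t = rng.randrange(len(topics))
    return [rng.choices(range(len(topics[t])), weights=topics[t])[0]
            for _ in range(length)]
```

Note that the only difference is where the admixture lives: LDA redraws a Dirichlet admixture per document and a topic per word, while STA fixes the uniform admixture and one topic per document.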
In a way, our result vindicates STA in the sense that it shows that LDA reconstruction is not more general than STA reconstruction.

The main technical tool to achieve this is a reduction between the two paradigms, STA and uniform LDA. Given a set T of K topics and a Dirichlet parameter α, let D = Dℓ be the distribution that they induce via LDA over the documents of a given length ℓ. Similarly, let S = Sℓ denote the distribution induced by STA over the documents of the same length ℓ when the same set of topics T is used. In a companion paper (Chierichetti et al., 2018), we show that there is a reduction such that S can be computed from D and α, and vice versa. In that paper this fact is used to derive impossibility results about LDA topic reconstruction whose gist is the following: unless the length of the documents is greater than or equal to the number of topics, identifying them is impossible. Here, we show how to exploit this reduction in the opposite direction: if we have an efficient algorithm for identifying the topics under STA then, thanks to the reduction, we can also use it to identify them under LDA.

Note that the above reduction deals with the exact probability distributions Dℓ and Sℓ over the documents, something which is helpful when impossibility results are concerned, but that becomes an issue if we are seeking reconstruction algorithms that have to be deployed in practice, and which have a limited number of documents to analyze. A first contribution of this paper is to show a robust version of the above reduction. Fix a set of topics T, and suppose we have an approximation D̃ℓ of the true distribution Dℓ induced by LDA when T is the set of topics. 
In practice, D̃ℓ can be obtained from a large enough corpus of documents in a rather straightforward manner. Suppose also that, as is customarily assumed in practice, we know the value of the Dirichlet parameter α. The robust version of the reduction, on input D̃ℓ and α, produces a distribution S̃ℓ which is a good approximation of Sℓ, the true distribution induced by STA when T is the set of topics.

This result suggests an intriguing possibility, namely that LDA topics could be identified in a rather roundabout way by means of the following pipeline. Starting from a document corpus generated by LDA from a set of hidden topics T that we wish to reconstruct, compute D̃, an approximation of the true document distribution D. Apply the robust version of the reduction to D̃ and α (the Dirichlet parameter which, as remarked, is assumed to be known in practice) to obtain S̃, an approximation of the true distribution S induced by STA from the same set of topics T. Suppose now that we have an efficient algorithm that, given S̃, outputs T′, a good approximation of the set T of the unknown topics we are looking for. With such an algorithm we can solve LDA identifiability via single-topic distributions!

An algorithm capable of producing such a good approximation T′ from S̃ is called robust in this paper. As hinted at by the above discussion, the second contribution of this paper is to show that the pipeline just described can be made to work. We provide a robust algorithm with provable guarantees with which we can solve in one stroke a natural generalization of two notable cases that have been studied in the literature. The first concerns so-called separable topics (Arora et al., 2012, 2013). 
A set of topics is p-separable if, for each topic T there is a word w such that T assigns probability at least p to w and every other topic assigns it probability zero. These special words are called anchor words. Thus, separability occurs when each topic is essentially identified uniquely by its anchor word. This setup has received considerable attention and several algorithms for LDA reconstruction have been proposed. One of the virtues of the p-separability assumption is that it makes it possible to derive algorithms with provable guarantees. For instance, the main result of Arora et al. (2012) states that there is an algorithm such that if a set of LDA topics are p-separable then they are identifiable within additive error δ in the ℓ∞-norm, provided that the corpus contains

Θ( K⁶ / (δ² p⁶ γ² ℓ) · log m )    (1)

many documents, or more. In the expression, m is the size of the vocabulary, ℓ is the length of the documents and γ is the condition number of the topic-topic covariance matrix. As remarked by the same authors, however, this algorithm is computationally impractical. A follow-up paper shows how to mitigate the problem by implementing the main steps in a different way (Arora et al., 2013). The resulting algorithm is much more efficient but, unfortunately, heuristic in nature, thus losing one of the nice features of its computationally more expensive predecessor.

The second scenario we tackle is that of Griffiths & Steyvers (2004) in which Gibbs sampling is proposed as a heuristic without any performance guarantees for LDA topic reconstruction. In that paper, Gibbs sampling is applied to a dataset whose underlying set of topics is assumed to have the following structure. The vocabulary consists of an n × n matrix — each entry is a word (the authors of Griffiths & Steyvers (2004) consider 5 × 5 matrices, i.e. 
25 words in total). There are 2n topics, each corresponding to a row or a column of the matrix. The topic corresponding to a given row has all zero entries except for that row, whose entries are uniformly 1/n. Topics corresponding to columns are defined analogously. Note that this set of topics is not p-separable, since every word has positive probability in at least two topics (its row, and its column).

Both scenarios can be captured at once with the following natural definition. A set T of topics is (p, t)-separable if, for every topic T ∈ T, there is a set ST of t words such that (i) the product of the probabilities assigned by T to the words of ST is at least p, and moreover (ii) for every other topic T′ ∈ T − {T} there exists a word w ∈ ST such that T′ assigns probability zero to w. It can be checked that p-separability is (p, 1)-separability and that the matrix scenario is (p, 2)-separable (with p = n⁻² for n × n matrices, n ≥ 2). In practice, (p, t)-separability captures the notion that every topic is uniquely identified by a set of t words. We shall refer to these sets as anchor sets. With this terminology, p-separability is just (p, 1)-separability with singleton anchor sets.

In this paper we give an algorithm for LDA topic reconstruction (under (p, 1)-separability) that, starting from a random LDA corpus over a vocabulary of m words consisting of

Θ( K² · max(1, K² α²) / (δ² · p²) · log m )

many documents of (at least) 2 words each, computes a set of topics T′ which is an approximation of the true set of topics T with error δ (in ℓ∞-norm)⁴. Asymptotically, this compares favourably to the bound of Equation (1), but it is also the case that the algorithm is very simple and efficient. The Dirichlet parameter α is typically assumed to be O(1/K), in which case the term max(1, K² α²) resolves to a constant, and the number of documents required for reconstruction becomes Θ( K² / (δ² · p²) · log m ).

⁴More precisely, there exists a bijection φ : T → T′ such that, for each T ∈ T, |T − φ(T)|∞ ≤ δ.

Note that the Dirichlet distribution is such that, as α goes to zero, the admixture becomes more and more polarized, in the sense that the documents resemble more and more single-topic documents, which intuitively facilitates topic reconstruction. When α moves in the other direction toward larger and larger values, the admixture creates documents in which all topics are equally represented, which makes reconstruction more expensive in the sense that the size of corpora must become bigger and bigger. These considerations apply to all algorithms, but we note that our dependence on K and p is much milder than those of the other algorithms we are discussing.

It is interesting to compare the overall structure of our algorithm to that in Arora et al. (2012). The first step of the latter is to project points into a low-dimensional space, where computation is cheaper, by preserving distances. In a very loose sense, this is equivalent to our reduction, which transforms the distribution of documents of length 2 from LDA to STA. The second step is a very natural one: try to identify the anchor words, using a simple combinatorial procedure (or, more generally, the t-anchor sets, starting from documents of length t + 1). The third step is again very natural: use the anchors to build the topics. It is here that the full advantage of our approach becomes evident. 
Our algorithm attempts the reconstruction in the single topic world — a much less daunting prospect than reconstruction in the full-fledged LDA world. As a result, our third step is a very simple procedure — in the LDA world one would have had to pay the price of heavy-duty linear algebra computations.

In order to deal with (p, t)-separable topics the algorithm only needs documents of length t + 1. Therefore, in order to reconstruct p-separable topics (t = 1) it only needs bigrams, and in the matrix case (t = 2) only trigrams! Clearly, this has a significant positive impact on efficiency.

We also present a comparative experimental evaluation which shows that our approach compares favorably to those of (Arora et al., 2012, 2013; Griffiths & Steyvers, 2004; Anandkumar et al., 2014).

The paper is organized as follows. We start in § 2 with some quick preliminaries. In § 3 we give the reduction from LDA to STA, followed by § 4 in which a robust algorithm for STA topic reconstruction is presented, with which we solve the (p, t)-separable case for t = 1, 2, which subsumes both p-separability and matrix-like topics. § 5 presents our experiments. The proofs missing from the main body of the paper can be found in the Supplementary Material archive.

2 Preliminaries

Throughout the paper, we will use V to denote the underlying vocabulary and assume without loss of generality that m := |V| ≥ 2, since the case m = 1 is trivial (there can be only one topic).

We will only deal with LDA when the admixtures come from a symmetric Dirichlet distribution whose parameter will be denoted by α. Since this is the only case we consider and there is no danger of confusion, we will sometimes omit to specify that we are dealing with symmetric LDA.

We will use the following notation. 
Given a set of K topics T and a Dirichlet parameter α, D^T_ℓ will denote the distribution induced by LDA over the documents of length ℓ. When there is no danger of confusion, subscripts and superscripts will be dropped. Similarly, S^T_ℓ will refer to the distribution induced by STA over the documents of length ℓ. And, likewise, subscripts and superscripts will be dropped when no danger of confusion may arise.

3 A Reduction from LDA to STA

In this section we give the approximation preserving reduction from LDA to STA. As usual, in the background we have a set of unknown topics T that we wish to reconstruct. The reduction takes as input the Dirichlet parameter α, an approximation D̃ of the document distribution D generated by LDA with topics T, and gives as output an approximation S̃ of the document distribution S generated by STA with the same set of topics T. The point of departure is a reduction between the two true distributions D and S established by (Chierichetti et al., 2018, Section 4).

Definition 1. Given a permutation π ∈ Sym([ℓ]), let Cπ be the partition of [ℓ] into the cycles of π:

Cπ = {S | S ⊆ [ℓ] and the elements of S form a cycle in π}.

Furthermore, for d ∈ V^ℓ and S = {i1, i2, . . . , i|S|} ⊆ [ℓ] with i1 < i2 < . . . < i|S|, let d|S be the document containing the words d(i1), . . . , d(i|S|) in this order (that is, let it be the document that is obtained by removing from d the words whose positions in d are not in S).

For example, if π = (163)(25)(4) then Cπ = {{1, 3, 6}, {2, 5}, {4}}. And, if d = w1w2w3w4w5w6 and S = {1, 3, 6} then d|S = w1w3w6.

Theorem 2 (Reduction from LDA to STA (Chierichetti et al., 2018)). Let T be any set of K topics on a vocabulary V and consider any d ∈ V^ℓ. 
Then, for any α > 0,

S^T_ℓ(d) = Γ(Kα + ℓ) / (Γ(Kα + 1) · Γ(ℓ)) · D^{T,α}_ℓ(d) − 1/(Kα · Γ(ℓ)) · Σ_{π ∈ Sym([ℓ]) : |Cπ| ≥ 2} ∏_{S ∈ Cπ} ( Kα · S^T_{|S|}(d|S) ).    (2)

Equation (2) looks rather formidable, but the point is that it can be taken as a blackbox to transform one probability distribution into the other. Note that the equation is recursive — it specifies how to compute the STA distribution Sℓ over documents of length ℓ, from Dℓ and the STA distributions S1, . . . , Sℓ−1 over documents of length less than ℓ. In the base case — documents of length one — the two distributions D1 and S1 coincide and thus the induction can be kick-started.

The next lemma tells us how to compute a good approximation D̃ of the true document distribution D induced by LDA starting from a corpus.

Lemma 3 (LDA Probabilities Approximation). Fix ℓ ≥ 1, and ξ ∈ (0, 1). Let X1, . . . , Xn be n iid samples from D^{T,α}_ℓ. For i ∈ [ℓ], and for a document d ∈ [m]^i, let nd be the number of samples having d as a prefix, nd = |{j | j ∈ [n] ∧ d is a prefix of Xj}|. For i ∈ [ℓ], and for a document d ∈ [m]^i, let D̃i(d) = nd/n be the empirical fraction of the samples whose i-prefix is equal to d. Then,

(a) If n ≥ ⌈(2/ξ²) · ℓ · ln m⌉, with probability at least 1 − O(m^{−ℓ}), for every document d of length i ≤ ℓ, it holds that |D^{T,α}_i(d) − D̃i(d)| ≤ ξ.

(b) For any q > 0, if n ≥ ⌈(9/(q · ξ²)) · ℓ · ln m⌉, with probability at least 1 − O(m^{−ℓ}), for every document d of length i ≤ ℓ such that D^{T,α}_i(d) ≥ q, it holds that D̃i(d) = (1 ± ξ) D^{T,α}_i(d).

The next theorem establishes our main result of this section, namely that Equation (2) is approximation preserving.

Theorem 4 (Single-Topic Probabilities Approximation). Fix ξ ∈ (0, 1). Given an approximation D̃i(d) of D^{T,α}_i(d), i ∈ {1, 2}, define S̃1 = D̃1, and S̃2(ww′) = (Kα + 1) · D̃2(ww′) − Kα · S̃1(w) · S̃1(w′). Then,

(a) If for every document d of length i ≤ 2 it holds |D^{T,α}_i(d) − D̃i(d)| ≤ ξ/(4(Kα + 1)), then |S^T_i(d) − S̃i(d)| ≤ ξ.

(b) If, for a given word w, it holds D̃1(w) = (1 ± ξ/(4Kα + 1)) D^{T,α}_1(w) and D̃2(ww) = (1 ± ξ/(4Kα + 1)) D^{T,α}_2(ww), then S̃2(ww) = (1 ± ξ) S^T_2(ww).

4 Robust Algorithms for STA Topic Identifiability

In this section we give an algorithm for identifying p-separable topics (or, equivalently, (p, 1)-separable topics). As usual, we have a set T of topics in the background that we wish to identify. The first step is to identify anchor words or their proxies. 
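For documents of length two, the empirical estimation step (in the spirit of Lemma 3) and the reduction of Theorem 4 amount to only a few lines. The following is a sketch under our own naming conventions; documents are tuples of word ids, and distributions are plain dictionaries.

```python
from collections import Counter

def estimate_lda_marginals(corpus):
    """Empirical estimates of D1 (first-word marginal) and D2 (two-word
    prefix distribution) from a corpus of documents of length >= 2."""
    n = len(corpus)
    d1 = Counter(doc[0] for doc in corpus)
    d2 = Counter(doc[:2] for doc in corpus)
    D1 = {w: c / n for w, c in d1.items()}
    D2 = {pair: c / n for pair, c in d2.items()}
    return D1, D2

def reduce_to_sta(D1, D2, K, alpha):
    """Length-2 case of the reduction (Theorem 4):
    S1 = D1 and S2(w w') = (K*alpha + 1)*D2(w w') - K*alpha*S1(w)*S1(w')."""
    S1 = dict(D1)
    S2 = {(w, v): (K * alpha + 1) * p - K * alpha * S1.get(w, 0.0) * S1.get(v, 0.0)
          for (w, v), p in D2.items()}
    return S1, S2
```

Note that for alpha = 0 the reduction is the identity on D2, consistent with the intuition that a fully polarized admixture already produces single-topic documents.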
By proxy, or quasi-anchor word, we mean that the word has "large" probability in one topic and very small probabilities in the remaining ones.

We begin with a technical lemma stating that if a vector has a coordinate that is very large with respect to the others, then all of its ℓp-norms are close to one another. Loosely speaking, the lemma says that if a word is an anchor word or a quasi-anchor word then, if we look at the vector consisting of the probabilities assigned to this word by the topics, the ℓp-norms of the vector are close.

Lemma 5. Let v ∈ R^n, and suppose that |v|∞ = (1 − ε) · |v|1, for some ε ∈ [0, 1). Then, for each p ≥ 1, (1 − ε)^p · |v|1^p ≤ |v|p^p ≤ (1 − ε)^{p−1} · |v|1^p.

The next theorem tells us how to spot anchor words. The idea is that if a word w is an anchor word then there is a signal telling us so. Consider the two documents w and ww. The signal is the ratio S^T_2(ww) / (K · S^T_1(w)²). If w is an anchor word this ratio equals 1, and if w is "far" from being an anchor word then the ratio is bounded below 1. In fact, the theorem tells us more. If we have two good approximations S̃1(w) and S̃2(ww) of, respectively, S^T_1(w) and S^T_2(ww), then the ratio ρw = S̃2(ww) / (K · S̃1(w)²) will have (approximately) the same properties. Since we are dealing with an approximation of the true distribution S, this tells us that we will be able to spot anchors even in this case.

Now, fix a word w of the dictionary and let xw be the (unknown) vector of its probabilities in the K topics, so that S^T_1(w) = K^{−1} · |xw|1 and S^T_2(ww) = K^{−1} · |xw|2².

Theorem 6. Let ξ ∈ (0, 1) and w ∈ V be any word. Suppose that S̃1(w) = (1 ± ξ) S^T_1(w) and S̃2(ww) = (1 ± ξ) S^T_2(ww). 
De\ufb01ne \u03c1w = (cid:101)S2(ww)\nK ((cid:101)S1(w))2 .\n\n1 (w) = K\u22121 \u00b7 |xw|1 and ST\n\n1 (w) and ST\n\n1 (w) and\n\n(1\u2212\u03be)2\n\n.\n\n(1+\u03be)2\n\n2 (ww).\n\n\u2264 \u03c1w \u2264 (1\u2212\u0001w)(1+\u03be)\n\n(1+\u03be)2 . Moreover, if \u03c1w \u2265 1\u2212\u03be\n\n1 (w) and (cid:101)S2(ww) = (1 \u00b1 \u03be)ST\n\nThen, if \u0001w is such that |xw|\u221e = (1 \u2212 \u0001w) \u00b7 |xw|1, it holds (1\u2212\u0001w)2(1\u2212\u03be)\nConsider the quantity \u03c1w de\ufb01ned by the previous theorem and suppose that \u03c1w \u2265 1\u2212\u03be/(1+\u03be)2. The\nnext lemma says that if w is an anchor word, then \u03c1w satis\ufb01es the inequality. And, viceversa, if \u03c1w\nsatis\ufb01es it, then w must be either an anchor word or a quasi-anchor word, which can also be used for\ntopic reconstruction.\n\nK ((cid:101)S1(w))2 , and \u0001w be such that |xw|\u221e = (1 \u2212 \u0001w) \u00b7 |xw|1.\n(1+\u03be)2 then \u0001w \u2264 6\u03be.\n\nLemma 7. Let \u03be \u2208 (0, 1). Suppose that (cid:101)S1(w) = (1 \u00b1 \u03be)ST\nLet \u03c1w = (cid:101)S2(ww)\nIf \u0001w = 0 then \u03c1w \u2265 1\u2212\u03be\nThe previous lemma gives us a simple test to identify anchor words or quasi-anchor words. We\nknow that each anchor word is uniquely associated with one topic \u2014 the one that assigns to it non\nzero probability. We will see later that \u03be can be chosen in a way that quasi-anchor words too can be\nassociated with one topic \u2014 the one assigning it a much larger probability than the other topics. The\nnext lemma tells us how to determine whether two different words insist on the same topic.\nWe say that a topic j is dominant for a word w, if (i) w has a unique largest probability in the topics,\nand (ii) its largest probability is in topic j. We say that the words w, w(cid:48) are codominated, if there\nexists a topic j such that j is dominant for both w and w(cid:48).\n1 (w), and that |xw|\u221e = (1 \u2212\n\nTheorem 8. 
For w \u2208 {w1, w2}, suppose that (cid:101)D1(w) = (1 \u00b1 \u03be)DT\n\u0001w) \u00b7 |xw|1. Suppose further that (cid:101)D2(w1w2) = (1 \u00b1 \u03be)DT\nDe\ufb01ne \u03c4 (w1, w2) := (cid:101)D2(w1w2)\n(cid:101)D1(w1)\u00b7(cid:101)D1(w2)\n\u03c4 (w1, w2) \u2265 (1 \u2212 \u03be)\n\n. If the words w1 and w2 are co-dominated, then\n\n(1 + \u03be)2 \u00b7 K\u03b1 + K(1 \u2212 \u00011)(1 \u2212 \u00012)\n\n2 (w1w2).\n\nK\u03b1 + 1\n\n,\n\notherwise\n\n\u03c4 (w1, w2) \u2264 (1 + \u03be)\n(1 \u2212 \u03be)2\n\nK\u03b1 + K(\u0001w1 + \u0001w2 + \u0001w1 \u0001w2 )\n\nK\u03b1 + 1\n\n.\n\n1\u22124\u0001\n\n\u03b1+1 , where \u0001 = maxw\u2208A \u0001w. Suppose that (cid:101)D1(w) = (1 \u00b1 \u03be)DT\n\nThe next corollary gives a simple way to determine which quasi-anchor words belong to the same\ntopic.\nCorollary 9. Let A, |A| > K, be a set of quasi-anchor words w with |xw|\u221e = (1 \u2212 \u0001w) \u00b7 |xw|1. Let\n1 (w) for w \u2208 A, and let E be\n\u03be < 1\n6\n\nthe maximal subset of(cid:0)A\n(cid:1) such that (cid:101)D2(w1w2) = (1 \u00b1 \u03be)DT\nIf E contains all the co-dominated pairs of words, the correct partitioning of A according to the K\ntopics T can be obtained by iteratively assigning to the same group the pair of words {w1, w2} \u2208 E\nwith largest \u03c4 (w1, w2) := (cid:101)D2(w1w2)\n(cid:101)D1(w1)\u00b7(cid:101)D1(w2)\n\n2 (w1w2) for each {w1, w2} \u2208 E.\n\nuntil reaching K groups.\n\n2\n\n6\n\n\fWe then have the main theorem of this section which gives the full algorithm for topic reconstruction\nin the p-separable (equivalent to the (p, 1)-separable) case.\nTheorem 10 (Main Result). Suppose that T is a set of K = |T | topics, and let \u03b4 \u2264 1/48. There\nexists an algorithm that, under the p-separability assumption, and under the LDA model DT ,\u03b1, with\nprobability 1 \u2212 o(1) reconstructs each topic in T to within an (cid:96)\u221e additive error upper bounded by \u03b4,\n. 
The algorithm runs in O(n).\nby accessing n = \u0398\n\n(cid:18) K2\u00b7max((K\u03b1)2,1)\n\n\u00b7 ln m\n\n(cid:19)\n\n\u03b42p2\n\niid samples from DT ,\u03b1\n\n2\n\nAlgorithm 1 is a version of the method analyzed in Theorem 10. The most notable feature of our\nalgorithm is its simplicity.\n\nAlgorithm 1 The Algorithm for reconstructing (p, 1)-separable topics.\nRequire: K, p > 0, \u03b4, corpus C of documents, \u03b1 parameter of the symmetric LDA mixture,\n1: Let W be the set of words w whose empirical fraction in C is at least p/2K.\napproximations (cid:101)D1 and (cid:101)D2 of D1 and D2 .\n2: For each w, w(cid:48) \u2208 W , estimate the empirical fraction of the document ww(cid:48) in C \u2014 that is, obtain\n3: Apply the reduction of Theorem 2 to estimate the uniform single-topic probabilities (cid:101)S1(w) and (cid:101)S2(ww(cid:48)).\n4: For each w \u2208 W , compute \u03c1w := (cid:101)S2(ww)\nK ((cid:101)S1(w))2 and add w to the set A of quasi-anchors if \u03c1w \u2265 1\u2212\u03b4\n(1+\u03b4)2 .\n6: For each wi, return a topic whose probability on word w \u2208 V is (cid:101)S2(wiw)/(cid:101)S1(wi).\n\n5: Use Corollary 9 on A to obtain K pairwise non-codominated quasi-anchor words w1, w2, . . . , wK.\n\n4.1 The general (p, t)-separable case\n\nThe algorithm we have developed in the previous section can be generalized to work for (p, t)-\nseparable topics (this is what we need to deal with the topic structure of Grif\ufb01ths & Steyvers (2004)).\nThe generalization is quite straightforward and is a natural extension of Algorithm 1 but, for lack of\nspace, we defer it to the full paper. We will however compare our generalized algorithm to Gibbs\nsampling \u2014 the method used by Grif\ufb01ths & Steyvers (2004) \u2014 in the next section.\n\n5 Experimental Results\n\nWe compare our approach5 to three state-of-the-art algorithms: GIBBS sampling6, a popular heuristic\napproach, the algorithm from (Arora et al. 
, 2013) for $p$-separable instances, referred to as RECOVER from now on, and the implementation of Yau (2018) of the tensor-based algorithm (henceforth TENSOR) introduced in (Anandkumar et al., 2014). Each of these algorithms was executed on the same computer: an Intel Xeon CPU E5-2650 v4, 2.20GHz, with 64GiB of DDR4 RAM. We used a single core per algorithm.

The topics. For the experiments we generated families of $k$ topics in various ways, for $k = 10, 25, 50$. The family NIPS TOPICS was generated by running Gibbs sampling on the NIPS dataset (Newman, 2008). Since these topics are not $p$-separable in general, a second family was generated by adding anchor words artificially. A third family, SYNTHETIC, was generated by sampling from a uniform Dirichlet distribution with parameter $\beta = 1$ and, to enforce $p$-separability, anchor words were added. Finally, a fourth family consisted of GRID topics: the prototypical grid-like topics of sizes $7 \times 7$ and $5 \times 5$ (introduced by Griffiths & Steyvers (2004)); notice that these are $(p, 2)$-separable but not $(p, 1)$-separable. In each instance except grid topics, the number of words of the vocabulary was set to $m = 400$.

The corpora. From each of the sets of topics specified above, we generated a corpus of $n$ documents of length $\ell$, for $n = 10^4, 10^5, 10^6$ and $\ell = 2, 3, 10, 100$. Because of space constraints, we will only show results for $n = 10^6$.

[5] Our implementation can be downloaded from https://github.com/matteojug/lda-sta.
[6] We use the popular MALLET library (McCallum, 2002), http://mallet.cs.umass.edu/, with a 200-iteration burn-in period and 1000 iterations.

We are interested in two aspects of performance, the wall-clock running time and the quality of the reconstruction, measured as the $\ell_\infty$ distance between the true set of topics and the reconstruction.
To assess this, we computed the best possible matching between the two families of topics as follows. Consider a bipartite graph with the true topics on one side of the bipartition and the reconstructed topics on the other. Between every pair of topics on opposite sides, there is an edge of weight equal to their $\ell_\infty$ distance. The quality of the reconstruction is given by the minimum-cost perfect matching in this graph. All algorithms were run on a single thread.

Figure 1: The top-left plot shows the wall-clock time (in seconds, on a log scale) required by the algorithms with NIPS TOPICS, with documents of length $\ell = 2$ and 10 topics (TENSOR is not shown since it requires $\ell > 2$). The top-right plot shows (on a linear scale) the $\ell_\infty$ error of the algorithms on the same instance; observe that STA is faster than the other two algorithms by more than one order of magnitude, while its error is almost as good as that of RECOVER. The bottom-left plot shows the wall-clock time (in seconds, on a log scale) required by the algorithms with 10 SYNTHETIC topics, with documents of length $\ell = 3$. As before, STA is faster than the other algorithms by more than one order of magnitude and its error is almost as good as that of RECOVER.

Conceptually, our algorithm implements the following pipeline, $C \xrightarrow{(1)} L \xrightarrow{(2)} S \xrightarrow{(3)} \mathcal{T}$, where the first step, starting from the corpus $C$, computes the approximation to the distribution induced over the documents by LDA; the second step implements the reduction from the latter to the STA-induced distribution; and the third step is Algorithm 1. We implemented the steps of this pipeline with several optimizations.
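As an aside, the matching-based quality measure described above admits a compact implementation. The following is a minimal sketch (not the paper's released code) using SciPy's Hungarian-method solver; the function name, the row-per-topic array layout, and the choice of reporting the largest matched $\ell_\infty$ distance are our own assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reconstruction_error(true_topics, recon_topics):
    """l_inf error of a topic reconstruction via a min-cost perfect matching.

    Both arguments are (K x m) arrays whose rows are topics, i.e.
    probability distributions over the m words. Edge weights of the
    bipartite graph are pairwise l_inf distances; the Hungarian method
    finds the minimum-cost perfect matching, and we report the largest
    distance among the matched pairs.
    """
    # cost[i, j] = max_w |true_topics[i][w] - recon_topics[j][w]|
    cost = np.abs(true_topics[:, None, :] - recon_topics[None, :, :]).max(axis=2)
    rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
    return cost[rows, cols].max()
```

Note that if the reconstructed topics are merely a permutation of the true ones, the reported error is 0; reporting the largest matched distance rather than the total matching cost keeps the measure on the $\ell_\infty$ scale.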
In particular, we did not fully compute the approximate distributions $L$ and $S$: rather, we lazily computed only the entries requested by Algorithm 1.

Algorithm 1 simply picks the first two words of a document and throws the rest away, seemingly a rather wasteful thing to do. A natural alternative is to feed the algorithm with all pairs of words from the document, hoping that the correlations so introduced can be safely ignored. This variant, which we call STA in the following, was consistently more accurate than Algorithm 1, at the expense of a small increase in the running time. Therefore this is the implementation that we discuss. In the case of grid-like topics, STA is the version of Algorithm 1 for the $(p, 2)$-separable case.

Wall-clock time. The two plots on the left of Figure 1 compare the running times of the algorithms with corpora of documents of length $\ell = 2, 3$, with 10 topics. As expected, STA for documents of length 2 is much faster than the other algorithms (while GIBBS is especially cumbersome), and its reconstruction quality is close to the best one. This figure exemplifies the general picture: a similar outcome was observed for all values of $n$ and all topic families.

Figure 2: On the left: wall-clock time (in seconds, on a log scale) required by the algorithms with 10 SYNTHETIC topics, with documents of length $\ell = 100$. On the right: the $\ell_\infty$ error (on a linear scale) of the algorithms on the same instance.

Precision of the reconstruction. Figure 1 exemplifies the general picture that emerges from our tests for short documents and 1-separable topics. RECOVER and STA have the smallest reconstruction errors. As expected, GIBBS did not work well with very short documents.
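To illustrate the all-pairs variant described above, the empirical one-word and two-word statistics it consumes can be computed along the following lines (our own sketch under our reading of the text, not the released implementation; function and variable names are hypothetical):

```python
from collections import Counter
from itertools import combinations

def empirical_distributions(corpus):
    """Empirical one-word and two-word document statistics.

    `corpus` is a list of documents, each a list of words. Instead of
    keeping only the first two words of each document (as Algorithm 1
    does), every unordered pair of words in a document contributes to
    the pair counts; the pairs coming from the same document are
    correlated, which this variant deliberately ignores.
    """
    ones, pairs = Counter(), Counter()
    for doc in corpus:
        ones.update(doc)
        # canonical (sorted) pairs, so that (w, w') and (w', w) coincide
        pairs.update(tuple(sorted(p)) for p in combinations(doc, 2))
    n1, n2 = sum(ones.values()), sum(pairs.values())
    d1 = {w: c / n1 for w, c in ones.items()}
    d2 = {p: c / n2 for p, c in pairs.items()}
    return d1, d2
```

A document of length $\ell$ contributes $\ell(\ell-1)/2$ pairs, so a document with $\ell = 100$ yields 4950 correlated pairs rather than the single pair used by Algorithm 1.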
Therefore we tested the algorithms with documents of length $\ell = 100$. In Figure 2, we show that STA gives the best reconstruction, and it is the fastest algorithm by at least one order of magnitude.

Figure 3: On the left: wall-clock time (in seconds, on a log scale) required by the algorithms on a $5 \times 5$ GRID with 10 topics, with documents of length $\ell = 10$. On the right: the $\ell_\infty$ error (on a linear scale) of the algorithms on the same instance. Recall that, here, STA is the version of Algorithm 1 for $(p, 2)$-separability. On this instance, RECOVER is the fastest algorithm; observe, though, that RECOVER returns topics that are very far from the original ones, since this instance is not $p$-separable.

Grid. In a final set of experiments, we considered the prototypical GRID instances of sizes $7 \times 7$ and $5 \times 5$ (introduced in Griffiths & Steyvers (2004)). In Figure 3, we see that STA and GIBBS provide an $\ell_\infty$ error smaller by an order of magnitude than that of RECOVER (and 4 times smaller than that of TENSOR). Moreover, the running time of STA is at least one order of magnitude smaller than that of GIBBS.

Assessment. A clear picture emerges from our experiments. STA offers a pretty good reconstruction, while being extremely competitive in terms of running time. We see this as an encouraging proof of concept that warrants further investigation of the approach introduced in this paper, that is, reducing LDA reconstruction to the much simpler problem of STA reconstruction. A more careful implementation of our algorithms could further increase the speed of our approach, while more ideas seem to be needed to improve the quality of the reconstruction.
Our experiments show that this could be a worthwhile endeavor.

References

Alvarez-Melis, David, & Saveski, Martin. 2016. Topic Modeling in Twitter: Aggregating Tweets by Conversations. In: ICWSM.

Anandkumar, Anima, Hsu, Daniel J., Janzamin, Majid, & Kakade, Sham M. 2013. When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity. Pages 1986–1994 of: Burges, Christopher J. C., Bottou, Léon, Ghahramani, Zoubin, & Weinberger, Kilian Q. (eds), NIPS.

Anandkumar, Animashree, Ge, Rong, Hsu, Daniel, Kakade, Sham M., & Telgarsky, Matus. 2014. Tensor Decompositions for Learning Latent Variable Models. Journal of Machine Learning Research, 15, 2773–2832.

Arora, Sanjeev, Ge, Rong, & Moitra, Ankur. 2012. Learning Topic Models – Going Beyond SVD. Pages 1–10 of: Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science. FOCS '12. Washington, DC, USA: IEEE Computer Society.

Arora, Sanjeev, Ge, Rong, Halpern, Yonatan, Mimno, David M., Moitra, Ankur, Sontag, David, Wu, Yichen, & Zhu, Michael. 2013. A Practical Algorithm for Topic Modeling with Provable Guarantees. Pages 280–288 of: ICML (2). JMLR Workshop and Conference Proceedings, vol. 28. JMLR.org.

Bansal, Trapit, Bhattacharyya, Chiranjib, & Kannan, Ravindran. 2014. A Provable SVD-based Algorithm for Learning Topics in Dominant Admixture Corpus. Pages 1997–2005 of: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada.

Blei, David M., Ng, Andrew Y., & Jordan, Michael I. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3(Mar.), 993–1022.

Chierichetti, Flavio, Panconesi, Alessandro, & Vattani, Andrea. 2018. The Equivalence of Single-Topic and LDA Topic Reconstruction. Zenodo, 10.5281/zenodo.1470295.

Griffiths, T. L., & Steyvers, M. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228–5235.

Hajjem, Malek, & Latiri, Chiraz. 2017. Combining IR and LDA Topic Modeling for Filtering Microblogs. Pages 761–770 of: Zanni-Merk, Cecilia, Frydman, Claudia S., Toro, Carlos, Hicks, Yulia, Howlett, Robert J., & Jain, Lakhmi C. (eds), KES. Procedia Computer Science, vol. 112. Elsevier.

Hong, Liangjie, & Davison, Brian D. 2010. Empirical Study of Topic Modeling in Twitter. Pages 80–88 of: Proceedings of the First Workshop on Social Media Analytics. SOMA '10. New York, NY, USA: ACM.

Li, Chenliang, Wang, Haoran, Zhang, Zhiqian, Sun, Aixin, & Ma, Zongyang. 2016. Topic Modeling for Short Texts with Auxiliary Word Embeddings. Pages 165–174 of: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. New York, NY, USA: ACM.

McCallum, A. K. 2002. MALLET: A Machine Learning for Language Toolkit.

Newman, David. 2008. NIPS Dataset.

Nigam, Kamal, McCallum, Andrew, Thrun, Sebastian, & Mitchell, Tom M. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39, 103–134.

Sridhar, Vivek Kumar Rangarajan. 2015. Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words. Pages 192–200 of: Blunsom, Phil, Cohen, Shay B., Dhillon, Paramveer S., & Liang, Percy (eds), VS@HLT-NAACL. The Association for Computational Linguistics.

Weng, Jianshu, Lim, Ee-Peng, Jiang, Jing, & He, Qi. 2010. TwitterRank: Finding Topic-sensitive Influential Twitterers. Pages 261–270 of: Proceedings of the Third ACM International Conference on Web Search and Data Mining. WSDM '10. New York, NY, USA: ACM.

Yan, Xiaohui, Guo, Jiafeng, Lan, Yanyan, & Cheng, Xueqi. 2013. A Biterm Topic Model for Short Texts. In: WWW.

Yau, Chyi-Kwei. 2018. tensor-lda. https://github.com/chyikwei/tensor-lda.

Zhao, Wayne Xin, Jiang, Jing, Weng, Jianshu, He, Jing, Lim, Ee-Peng, Yan, Hongfei, & Li, Xiaoming. 2011. Comparing Twitter and Traditional Media Using Topic Models. Pages 338–349 of: Proceedings of the 33rd European Conference on Advances in Information Retrieval. ECIR '11. Berlin, Heidelberg: Springer-Verlag.