{"title": "Scalable Inference for Logistic-Normal Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2445, "page_last": 2453, "abstract": "Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise.", "full_text": "Scalable Inference for Logistic-Normal Topic Models

Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng and Bo Zhang
State Key Lab of Intelligent Tech. & Systems; Tsinghua National TNList Lab;
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{chenjf10,wangzi10}@mails.tsinghua.edu.cn;
{dcszj,dcszb}@mail.tsinghua.edu.cn; xunzheng@cs.cmu.edu

Abstract

Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. 
To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise.

1 Introduction

In Bayesian models, though conjugate priors normally result in easier inference problems, non-conjugate priors can be more expressive in capturing desired model properties. One popular example is admixture topic models, which have obtained much success in discovering latent semantic structures from data. For the most popular model, latent Dirichlet allocation (LDA) [5], a Dirichlet distribution is used as the conjugate prior for the multinomial mixing proportions. But a Dirichlet prior is unable to model topic correlation, which is important for understanding and visualizing the semantic structure of complex data, especially in large-scale applications. One elegant extension of LDA is the logistic-normal topic model (aka correlated topic model, CTM) [3], which uses a logistic-normal prior to capture the correlation structure among topics effectively. Along this line, many subsequent extensions have been developed, including dynamic topic models [4], which deal with time series via a dynamic linear system on the Gaussian variables, and infinite CTMs [11], which can resolve the number of topics from data.

The modeling flexibility comes with computational cost. Although significant progress has been made on developing scalable inference algorithms for LDA using either distributed [10, 16, 1] or online [7] learning methods, the inference of logistic-normal topic models still remains a challenge, due to the non-conjugate prior. Existing algorithms for learning logistic-normal topic models mainly rely on approximate techniques, e.g., variational inference with unwarranted mean-field assumptions [3]. 
Although variational methods have a deterministic objective to optimize and are usually efficient, they can only achieve an approximate solution, and if the mean-field assumptions are not made appropriately, the approximation can be unsatisfactory. Furthermore, existing algorithms can only deal with small corpora and learn a limited number of topics. It is important to develop scalable algorithms in order to apply the models to large collections of documents, which are becoming increasingly common in both scientific and engineering fields.

To address the limitations listed above, we develop a scalable Gibbs sampling algorithm for logistic-normal topic models, without making any restricting assumptions on the posterior distribution. Technically, to deal with the non-conjugate logistic-normal prior, we introduce auxiliary Polya-Gamma variables [13], following the statistical ideas of data augmentation [17, 18, 8]; the augmented posterior distribution leads to conditional distributions from which we can draw samples easily, without accept/reject steps. Moreover, the auxiliary variables are locally associated with each individual document, and this locality naturally allows us to develop a distributed sampler by splitting the documents into multiple subsets and allocating them to multiple machines. The global statistics can be updated asynchronously without sacrificing the predictive ability on unseen testing documents. We successfully apply the scalable inference algorithm to learn a correlation graph of thousands of topics on large corpora with millions of documents. These are the largest automatically learned topic correlation structures to our knowledge.

2 Logistic-Normal Topic Models

Let W = \{w_d\}_{d=1}^D be a set of documents, where w_d = \{w_{dn}\}_{n=1}^{N_d} denotes the words appearing in document d of length N_d. 
A hierarchical Bayesian topic model posits each document as an admixture of K topics, where each topic \Phi_k is a multinomial distribution over a V-word vocabulary. For a logistic-normal topic model (e.g., CTM), the generating process of document d is:

\eta_d \sim \mathcal{N}(\mu, \Sigma), \quad \theta_d^k = \frac{e^{\eta_d^k}}{\sum_{j=1}^K e^{\eta_d^j}}, \quad \forall n \in \{1, \cdots, N_d\}: \; z_{dn} \sim \mathrm{Mult}(\theta_d), \; w_{dn} \sim \mathrm{Mult}(\Phi_{z_{dn}}),

where Mult(\cdot) denotes the multinomial distribution; z_{dn} is a K-dimensional binary vector with only one nonzero element; and \Phi_{z_{dn}} denotes the topic selected by the nonzero entry of z_{dn}. For Bayesian CTM, the topics are samples drawn from a prior, e.g., \Phi_k \sim \mathrm{Dir}(\beta), where Dir(\cdot) is a Dirichlet distribution. Note that for identifiability, we normally assume \eta_d^K = 0.

Given a set of documents W, CTM infers the posterior distribution p(\eta, Z, \Phi | W) \propto p_0(\eta, Z, \Phi) p(W | Z, \Phi) by Bayes' rule. This problem is generally hard because of the non-conjugacy between the normal prior and the logistic transformation function (which can be seen as a likelihood model for \theta_d). Existing approaches resort to variational approximation methods [3] with strict factorization assumptions. To avoid mean-field assumptions and improve the inference accuracy, below we present a partially collapsed Gibbs sampler, which is simple to implement and can be naturally parallelized for large-scale applications.

3 Gibbs Sampling with Data Augmentation

We now present a block-wise Gibbs sampling algorithm for logistic-normal topic models. To improve mixing rates, we first integrate out the Dirichlet variables \Phi, by exploiting the conjugacy between a Dirichlet prior and the multinomial likelihood. 
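For concreteness, the generating process above can be simulated as follows. This is a minimal sketch; all sizes, the symmetric Dirichlet parameter, and the identity covariance are illustrative assumptions, not settings from the paper:

```python
import numpy as np

# Illustrative sizes: topics, vocabulary, documents, words per document.
rng = np.random.default_rng(0)
K, V, D, N_d = 5, 50, 3, 20

mu = np.zeros(K - 1)                             # eta_d^K = 0 for identifiability
Sigma = np.eye(K - 1)                            # assumed covariance (illustrative)
Phi = rng.dirichlet(np.full(V, 0.01), size=K)    # Phi_k ~ Dir(beta)

docs = []
for d in range(D):
    eta = np.append(rng.multivariate_normal(mu, Sigma), 0.0)
    theta = np.exp(eta) / np.exp(eta).sum()      # logistic transformation
    z = rng.choice(K, size=N_d, p=theta)         # z_dn ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=Phi[k]) for k in z])  # w_dn ~ Mult(Phi_{z_dn})
    docs.append((z, w))
```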
Specifically, we can integrate out \Phi and perform Gibbs sampling for the marginalized distribution:

p(\eta, Z | W) \propto p(W | Z) \prod_{d=1}^D \Big( \prod_{n=1}^{N_d} \theta_d^{z_{dn}} \Big) \mathcal{N}(\eta_d | \mu, \Sigma) \propto \prod_{k=1}^K \frac{\delta(C_k + \beta)}{\delta(\beta)} \prod_{d=1}^D \Big( \prod_{n=1}^{N_d} \frac{e^{\eta_d^{z_{dn}}}}{\sum_{j=1}^K e^{\eta_d^j}} \Big) \mathcal{N}(\eta_d | \mu, \Sigma),

where C_k^t is the number of times topic k is assigned to the term t over the whole corpus; C_k = \{C_k^t\}_{t=1}^V; and \delta(\mathbf{x}) = \frac{\prod_{i=1}^{\dim(\mathbf{x})} \Gamma(x_i)}{\Gamma(\sum_{i=1}^{\dim(\mathbf{x})} x_i)} is a function defined with the Gamma function \Gamma(\cdot).

3.1 Sampling Topic Assignments

When the variables \eta = \{\eta_d\}_{d=1}^D are given, we draw samples from p(Z | \eta, W). In our Gibbs sampler, this is done by iteratively drawing a sample for each word in each document. The local conditional distribution is:

p(z_{dn}^k = 1 | Z_{\neg n}, w_{dn}, W_{\neg dn}, \eta) \propto p(w_{dn} | z_{dn}^k = 1, Z_{\neg n}, W_{\neg dn}) e^{\eta_d^k} \propto \frac{C_{k, \neg n}^{w_{dn}} + \beta_{w_{dn}}}{\sum_{j=1}^V C_{k, \neg n}^j + \sum_{j=1}^V \beta_j} e^{\eta_d^k},   (1)

where C_{\cdot, \neg n} indicates that term n is excluded from the corresponding document or topic.

3.2 Sampling Logistic-Normal Parameters

When the topic assignments Z are given, we draw samples from the posterior distribution p(\eta | Z, W) \propto \prod_{d=1}^D \big( \prod_{n=1}^{N_d} \frac{e^{\eta_d^{z_{dn}}}}{\sum_{j=1}^K e^{\eta_d^j}} \big) \mathcal{N}(\eta_d | \mu, \Sigma), which is a Bayesian logistic regression model with Z as the multinomial observations. 
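The per-token conditional in Eq. (1) can be sketched as follows. This is a minimal sketch; the count-array layout, the symmetric Dirichlet parameter beta, and the function name are our illustrative assumptions:

```python
import numpy as np

def sample_topic(rng, C, eta_d, w, beta):
    """Draw z_dn from Eq. (1) for a token of word type w.

    C: K x V topic-word counts with the current token already excluded
       (the C_{k, -n} statistics); eta_d: length-K parameters of document d.
    """
    V = C.shape[1]
    # Unnormalized conditional of Eq. (1), assuming a symmetric beta.
    weights = (C[:, w] + beta) / (C.sum(axis=1) + V * beta) * np.exp(eta_d)
    return rng.choice(len(eta_d), p=weights / weights.sum())
```

Normalizing the weights turns the unnormalized conditional into a proper draw over the K topics.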
Though it is hard to draw samples directly due to the non-conjugacy, we can leverage recent advances in data augmentation to solve this inference task efficiently, with analytical local conditionals for Gibbs sampling, as detailed below.

Specifically, we have the likelihood of "observing" the topic assignments z_d for document d^1 as p(z_d | \eta_d) = \prod_{n=1}^{N_d} \frac{e^{\eta_d^{z_{dn}}}}{\sum_{j=1}^K e^{\eta_d^j}}. Following Holmes & Held [8], the likelihood for \eta_d^k conditioned on \eta_d^{\neg k} is:

\ell(\eta_d^k | \eta_d^{\neg k}) = \prod_{n=1}^{N_d} \Big( \frac{e^{\rho_d^k}}{1 + e^{\rho_d^k}} \Big)^{z_{dn}^k} \Big( \frac{1}{1 + e^{\rho_d^k}} \Big)^{1 - z_{dn}^k} = \frac{(e^{\rho_d^k})^{C_d^k}}{(1 + e^{\rho_d^k})^{N_d}},

where \rho_d^k = \eta_d^k - \zeta_d^k; \zeta_d^k = \log\big( \sum_{j \neq k} e^{\eta_d^j} \big); and C_d^k = \sum_{n=1}^{N_d} z_{dn}^k is the number of words assigned to topic k in document d. Therefore, we have the conditional distribution

p(\eta_d^k | \eta_d^{\neg k}, Z, W) \propto \ell(\eta_d^k | \eta_d^{\neg k}) \mathcal{N}(\eta_d^k | \mu_d^k, \sigma_k^2),   (2)

where \mu_d^k = \mu_k - \Lambda_{kk}^{-1} \Lambda_{k \neg k} (\eta_d^{\neg k} - \mu^{\neg k}) and \sigma_k^2 = \Lambda_{kk}^{-1}; \Lambda = \Sigma^{-1} is the precision matrix of the Gaussian distribution.

This is the posterior distribution of a Bayesian logistic model with a Gaussian prior, where the z_{dn}^k are binary response variables. Due to the non-conjugacy between the normal prior and the logistic likelihood, we do not have an analytical form of this posterior distribution. Although standard Monte Carlo methods (e.g., rejection sampling) can be applied, they normally require a good proposal distribution and may have trouble dealing with accept/reject rates. 
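The conditional Gaussian prior parameters (\mu_d^k, \sigma_k^2) in Eq. (2) follow directly from the precision matrix \Lambda = \Sigma^{-1}; a minimal sketch (the function and variable names are ours):

```python
import numpy as np

def conditional_prior(mu, Lambda, eta_d, k):
    """Return (mu_d^k, sigma_k^2) of p(eta_d^k | eta_d^{-k}) under N(mu, Sigma).

    Lambda is the precision matrix Sigma^{-1}; eta_d[k] itself is unused.
    """
    idx = np.arange(len(mu)) != k          # the "not k" index set
    sigma2_k = 1.0 / Lambda[k, k]          # sigma_k^2 = Lambda_kk^{-1}
    mu_dk = mu[k] - sigma2_k * Lambda[k, idx] @ (eta_d[idx] - mu[idx])
    return mu_dk, sigma2_k
```

For Sigma = I the conditional reduces to the marginal N(mu_k, 1), which gives a quick sanity check.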
Data augmentation techniques have been developed for this problem; e.g., [8] presented a two-layer data augmentation representation with logistic distributions, and [9] applied another data augmentation with uniform variables and truncated Gaussian distributions, which may involve sophisticated accept/reject strategies [14]. Below, we develop a simple exact sampling method without a proposal distribution.

Our method is based on a new data augmentation representation, following the recent developments in Bayesian logistic regression [13], which is a direct data augmentation scheme with only one layer of auxiliary variables and does not need tuning to achieve optimal performance. Specifically, for the above posterior inference problem, we can show the following lemma.

Lemma 1 (Scale Mixture Representation). The likelihood \ell(\eta_d^k | \eta_d^{\neg k}) can be expressed as

\frac{(e^{\rho_d^k})^{C_d^k}}{(1 + e^{\rho_d^k})^{N_d}} = \frac{1}{2^{N_d}} e^{\kappa_d^k \rho_d^k} \int_0^\infty e^{-\lambda_d^k (\rho_d^k)^2 / 2} p(\lambda_d^k | N_d, 0) \, d\lambda_d^k,

where \kappa_d^k = C_d^k - N_d / 2 and p(\lambda_d^k | N_d, 0) is the Polya-Gamma distribution PG(N_d, 0).

The lemma suggests that p(\eta_d^k | \eta_d^{\neg k}, Z, W) is a marginal distribution of the complete distribution

p(\eta_d^k, \lambda_d^k | \eta_d^{\neg k}, Z, W) \propto \frac{1}{2^{N_d}} \exp\Big( \kappa_d^k \rho_d^k - \frac{\lambda_d^k (\rho_d^k)^2}{2} \Big) p(\lambda_d^k | N_d, 0) \mathcal{N}(\eta_d^k | \mu_d^k, \sigma_k^2).

Therefore, we can draw samples from the complete distribution. 
By discarding the augmented variable \lambda_d^k, we get samples from the posterior distribution p(\eta_d^k | \eta_d^{\neg k}, Z, W).

For \eta_d^k: we have p(\eta_d^k | \eta_d^{\neg k}, Z, W, \lambda_d) \propto \exp\big( \kappa_d^k \eta_d^k - \frac{\lambda_d^k (\eta_d^k - \zeta_d^k)^2}{2} \big) \mathcal{N}(\eta_d^k | \mu_d^k, \sigma_k^2) = \mathcal{N}(\gamma_d^k, (\tau_d^k)^2), where the posterior mean is \gamma_d^k = (\tau_d^k)^2 (\sigma_k^{-2} \mu_d^k + \kappa_d^k + \lambda_d^k \zeta_d^k) and the variance is (\tau_d^k)^2 = (\sigma_k^{-2} + \lambda_d^k)^{-1}. Therefore, we can easily draw a sample from a univariate Gaussian distribution.

For \lambda_d^k: the conditional distribution of the augmented variable is p(\lambda_d^k | Z, W, \eta) \propto \exp\big( -\frac{\lambda_d^k (\rho_d^k)^2}{2} \big) p(\lambda_d^k | N_d, 0) = PG(\lambda_d^k; N_d, \rho_d^k), which is again a Polya-Gamma distribution, by the construction of the general PG(a, b) class through an exponential tilting of the PG(a, 0) density [13]. To draw samples from the Polya-Gamma distribution, note that a naive implementation using the infinite sum-of-Gammas representation is not efficient, and it also involves a potentially inaccurate step of truncating the infinite sum. Here we adopt the exact method proposed in [13], which generates a PG(N_d, \rho_d^k) variate by drawing N_d samples from PG(1, \rho_d^k). 
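A minimal sketch of the two augmented conditional draws. The truncated sum-of-Gammas sampler below is only a stand-in for the exact Polya-Gamma sampler of [13], and the function names are ours:

```python
import numpy as np

def sample_pg(rng, b, c, trunc=200):
    """Approximate PG(b, c) draw via the truncated series representation [13]."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=trunc)                  # g_k ~ Gamma(b, 1)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

def sample_eta_k(rng, lam, kappa, zeta, mu_dk, sigma2_k):
    """Draw eta_d^k from its Gaussian conditional N(gamma_d^k, (tau_d^k)^2)."""
    tau2 = 1.0 / (1.0 / sigma2_k + lam)                # (tau_d^k)^2
    gamma = tau2 * (mu_dk / sigma2_k + kappa + lam * zeta)
    return rng.normal(gamma, np.sqrt(tau2))
```

Alternating the two draws gives the augmented Gibbs step for one coordinate \eta_d^k.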
Since N_d is normally large, we will develop a fast and effective approximation in the next section.

1 Due to the independence, we can treat documents separately.

Figure 3: (a) frequency of f(z) with z \sim PG(m, \rho), for m = 1, 2, 4, 8 and m = n (exact); and (b) frequency of samples from \eta_d^k \sim p(\eta_d^k | \eta_d^{\neg k}, Z, W). Though z is not from the exact distribution, the distribution of \eta_d^k is very accurate. The parameters \rho_d^k = -4.19, C_d^k = 19, N_d = 1155, \mu_d^k = 0.40, \sigma_k^2 = 0.31, and \zeta = 5.35 are from a real distribution when training on the NIPS data set.

3.3 Fully-Bayesian Models

We can treat \mu and \Sigma as random variables and perform fully-Bayesian inference, by using the conjugate Normal-Inverse-Wishart prior, p_0(\mu, \Sigma) = \mathcal{NIW}(\mu_0, \rho, \kappa, W), that is

\Sigma | \kappa, W \sim \mathcal{IW}(\Sigma; \kappa, W^{-1}), \quad \mu | \Sigma, \mu_0, \rho \sim \mathcal{N}(\mu; \mu_0, \Sigma / \rho),

where \mathcal{IW}(\Sigma; \kappa, W^{-1}) = \frac{|W|^{\kappa/2}}{2^{\kappa M / 2} \Gamma_M(\kappa/2)} |\Sigma|^{-(\kappa + M + 1)/2} \exp\big( -\frac{1}{2} \mathrm{Tr}(W \Sigma^{-1}) \big) is the inverse Wishart distribution and (\mu_0, \rho, \kappa, W) are hyper-parameters. 
Then, the conditional distribution is

p(\mu, \Sigma | \eta, Z, W) \propto p_0(\mu, \Sigma) \prod_d p(\eta_d | \mu, \Sigma) = \mathcal{NIW}(\mu_0', \rho', \kappa', W'),   (3)

which is still a Normal-Inverse-Wishart distribution due to the conjugacy, with parameters

\mu_0' = \frac{\rho}{\rho + D} \mu_0 + \frac{D}{\rho + D} \bar{\eta}, \quad \rho' = \rho + D, \quad \kappa' = \kappa + D, \quad W' = W + Q + \frac{\rho D}{\rho + D} (\bar{\eta} - \mu_0)(\bar{\eta} - \mu_0)^\top,

where \bar{\eta} = \frac{1}{D} \sum_d \eta_d is the empirical mean of the data and Q = \sum_d (\eta_d - \bar{\eta})(\eta_d - \bar{\eta})^\top.

4 Parallel Implementation and Fast Approximate Sampling

The above Gibbs sampler can be naturally parallelized to extract large correlation graphs from millions of documents, due to the following observations.

First, both \eta_d and \lambda_d are conditionally independent across documents given \mu and \Sigma, which makes it natural to distribute documents over machines and infer the local \eta_d and \lambda_d. No communication is needed for this sampling step. Second, the global variables \mu and \Sigma can be inferred and broadcast to every machine after each iteration. As mentioned in Section 3.3, this involves: 1) computing the NIW posterior parameters, and 2) sampling from Eq. (3). Notice that the \eta_d contribute to the posterior parameters \mu_0' and W' through simple summations, so we can perform a local summation on each machine, followed by a global aggregation. Similarly, the NIW sample can be drawn distributively, by computing the sample covariance of x_1, \cdots, x_{\kappa'}, drawn from \mathcal{N}(x | 0, W') distributively after broadcasting W'. Finally, the topic assignments z_d are conditionally independent given the topic counts C_k. 
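The NIW posterior update in Eq. (3) can be sketched as follows (a minimal sketch; eta is the D x K matrix of per-document parameters and the function name is ours):

```python
import numpy as np

def niw_posterior(eta, mu0, rho, kappa, W):
    """Return (mu0', rho', kappa', W') of the NIW posterior in Eq. (3)."""
    D = eta.shape[0]
    eta_bar = eta.mean(axis=0)                       # empirical mean of the data
    Q = (eta - eta_bar).T @ (eta - eta_bar)          # scatter matrix
    mu0_post = (rho * mu0 + D * eta_bar) / (rho + D)
    diff = (eta_bar - mu0)[:, None]
    W_post = W + Q + (rho * D / (rho + D)) * (diff @ diff.T)
    return mu0_post, rho + D, kappa + D, W_post
```

Because eta_bar and Q are plain sums over documents, each machine can compute partial sums locally and only the aggregates need to be communicated, which is what the parallel implementation in Section 4 exploits.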
We synchronize C_k globally by leveraging the recent advances on scalable inference of LDA [1, 16], which implement a general framework to synchronize such counts.

To further speed up the inference algorithm, we design a fast approximate sampling method to draw PG(n, \rho) samples, reducing the time complexity from O(n) in [13] to O(1). Specifically, Polson et al. [13] show how to efficiently generate PG(1, \rho) random variates. Due to the additive property of the Polya-Gamma distribution, y \sim PG(n, \rho) if x_i \sim PG(1, \rho) and y = \sum_{i=1}^n x_i. However, this sampler can be slow when n is large. For our Gibbs sampler, n is the document length, often around hundreds. Fortunately, an effective approximation can be developed to achieve constant-time sampling. Since n is relatively large, the sum variable y should be almost normally distributed, according to the central limit theorem; Fig. 3(a) confirms this intuition. Consider another PG variable z \sim PG(m, \rho). If both m and n are large, y and z are both approximately normal, so we can apply a simple linear transformation of z to approximate y. Specifically, we have f(z) = \sqrt{\mathrm{Var}(y) / \mathrm{Var}(z)} (z - \mathbb{E}[z]) + \mathbb{E}[y], where \mathbb{E}[y] = \frac{n}{2\rho} \tanh(\rho/2) from [12], and \mathrm{Var}(y) = \frac{n}{m} \mathrm{Var}(z) since both y and z are sums of PG(1, \rho) variates. It can be shown that f(z) and y have the same mean and variance. In practice, we found that even when m = 1, the algorithm can still draw good samples from p(\eta_d^k | \eta_d^{\neg k}, Z, W) (see Fig. 3(b)). Hence, we are able to speed up the Polya-Gamma sampling process significantly by applying this approximation. More empirical analysis can be found in the appendix.

Furthermore, we can perform sparsity-aware fast sampling [19] in the Gibbs sampler. Specifically, let

A_k = \frac{C_{k, \neg n}^{w_{dn}}}{\sum_{j=1}^V C_{k, \neg n}^j + \sum_{j=1}^V \beta_j} e^{\eta_d^k}, \quad B_k = \frac{\beta_{w_{dn}}}{\sum_{j=1}^V C_{k, \neg n}^j + \sum_{j=1}^V \beta_j} e^{\eta_d^k};

then Eq. (1) can be written as p(z_{dn}^k = 1 | Z_{\neg n}, w_{dn}, W_{\neg dn}, \eta) \propto A_k + B_k. Let Z_A = \sum_k A_k and Z_B = \sum_k B_k. We can show that the sampling of z_{dn} can be done by sampling from Mult(A / Z_A) or Mult(B / Z_B), due to the fact:

p(z_{dn}^k = 1 | Z_{\neg n}, w_{dn}, W_{\neg dn}, \eta) = \frac{A_k}{Z_A + Z_B} + \frac{B_k}{Z_A + Z_B} = (1 - p) \frac{A_k}{Z_A} + p \frac{B_k}{Z_B},   (4)

where p = \frac{Z_B}{Z_A + Z_B}. Note that Eq. (4) is a marginalization with respect to an auxiliary binary variable. Thus a sample of z_{dn} can be drawn by flipping a coin with probability p of being heads. If it is tails, we draw z_{dn} from Mult(A / Z_A); otherwise from Mult(B / Z_B). The advantage is that we only need to consider the non-zero entries of A to sample from Mult(A / Z_A). In fact, A has few non-zero entries due to the sparsity of the topic counts C_k. Thus, the time complexity is reduced from O(K) to O(s(K)), where s(K) is the average number of non-zero entries in C_k. In practice, C_k is very sparse, hence s(K) is much smaller than K when K is large. To sample from Mult(B / Z_B), we iterate over all K potential assignments; but since p is typically small, the O(K) time complexity is acceptable.

With the above techniques, the time complexity per document of the Gibbs sampler is O(N_d s(K)) for sampling z_d, O(K^2) for computing (\mu_d^k, \sigma_k^2), and O(SK) for sampling \eta_d with Eq. (2), where S is the number of sub-burn-in steps over sampling \eta_d^k. Thus the overall time complexity is O(N_d s(K) + K^2 + SK), which is higher than the O(N_d s(K)) complexity of LDA [1] when K is large, indicating a cost for the enriched representation of CTM compared to LDA.

5 Experiments

We now present qualitative and quantitative evaluations to demonstrate the efficacy and scalability of the Gibbs sampler for CTM (denoted by gCTM). 
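The constant-time moment-matching approximation f(z) from Section 4 can be sketched as follows. The series-based PG(1, \rho) helper is only a truncated stand-in for the exact sampler of [13], and the function names are ours:

```python
import numpy as np

def pg1(rng, rho, trunc=200):
    """Truncated-series stand-in for an exact PG(1, rho) sampler [13]."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(1.0, 1.0, size=trunc)
    return (g / ((k - 0.5) ** 2 + (rho / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

def approx_pg(rng, n, rho, m=1):
    """O(m) approximation of a PG(n, rho) draw via the linear map f(z)."""
    z = sum(pg1(rng, rho) for _ in range(m))               # z ~ PG(m, rho)
    mean = lambda a: a / (2.0 * rho) * np.tanh(rho / 2.0)  # E[PG(a, rho)], from [12]
    # Var(y) = (n/m) Var(z), so the scale factor is sqrt(n/m).
    return np.sqrt(n / m) * (z - mean(m)) + mean(n)
```

By construction f(z) matches the mean and variance of the exact PG(n, \rho) draw while costing only m (typically m = 1) PG(1, \rho) draws.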
Experiments are conducted on a 40-node cluster, where each node is equipped with two 6-core CPUs (2.93GHz). For all the experiments, unless explicitly mentioned otherwise, we set the hyper-parameters as \beta = 0.01, T = 350, S = 8, m = 1, \rho = \kappa = 0.01D, \mu_0 = 0, and W = \kappa I, where T is the number of burn-in steps. We use M to denote the number of machines and P to denote the number of CPU cores. For baselines, we compare with the variational CTM (vCTM) [3] and the state-of-the-art LDA implementation, Yahoo! LDA (Y!LDA) [1]. In order to achieve a fair comparison, for both vCTM and gCTM we select T such that the models converge sufficiently, as we shall discuss in Section 5.3.

Data Sets: Experiments are conducted on several benchmark data sets, including the NIPS paper abstracts, 20Newsgroups, and NYTimes (New York Times) corpora from [2] and the Wikipedia corpus from [20]. All the data sets are randomly split into training and testing sets. Following the settings in [3], we partition each document in the testing set into an observed part and a held-out part.

5.1 Qualitative Evaluation

We first examine the correlation structure of 1,000 topics learned by CTM using our scalable sampler on the NYTimes corpus with 285,000 documents. Since the entire correlation graph is too large, we build a 3-layer hierarchy by clustering the learned topics, with their learned correlation strength as the similarity measure. Fig. 4 shows a part of the hierarchy^2, where subgraph A represents the top layer with 10 clusters, subgraphs B and C are two second-layer clusters, and D and E are two correlation subgraphs consisting of leaf nodes (i.e., learned topics). 
To represent their semantic meanings, we present the 4 most frequent words for each topic; and for each topic cluster, we also show the most frequent words by building a hyper-topic that aggregates all the included topics. On the top layer, the font size of each word in a word cloud is proportional to its frequency in the hyper-topic. Clearly, we can see that many topics have strong correlations and the structure is useful to help humans understand and browse the large collection of topics. With 40 machines, our parallel Gibbs sampler finishes the training in 2 hours, which means that we are able to process a real-world corpus at considerable speed. More details on scalability are provided below.

2 The entire correlation graph can be found at http://ml-thu.net/~scalable-ctm

Figure 4: A hierarchical visualization of the correlation graph with 1,000 topics learned from 285,000 articles of the NYTimes. A denotes the top-layer subgraph with 10 big clusters; B and C denote two second-layer clusters; and D and E are two subgraphs with leaf nodes (i.e., topics). The number attached to a cluster (e.g., 113) denotes the number of topics it contains. We present the most frequent words of each topic cluster. Edges denote a correlation (above some threshold) and the distance between two nodes represents the strength of their correlation. 
The node size of a cluster is determined by the number of topics included in that cluster.

Figure 5: (a)(b): Perplexity and training time of vCTM, single-core gCTM, and multi-core gCTM on the NIPS data set; (c)(d): Perplexity and training time of single-machine gCTM, multi-machine gCTM, and multi-machine Y!LDA on the NYTimes data set.

5.2 Performance

We begin with an empirical assessment on the small NIPS data set, whose training set contains 1.2K documents. Fig. 5(a)&(b) show the performance of three single-machine methods: vCTM (M = 1, P = 1), sequential gCTM (M = 1, P = 1), and parallel gCTM (M = 1, P = 12). Fig. 5(a) shows that both versions of gCTM produce similar or better perplexity compared to vCTM. Moreover, Fig. 5(b) shows that when K is large, the advantage of gCTM becomes salient; e.g., sequential gCTM is about 7.5 times faster than vCTM, and multi-core gCTM achieves almost two orders of magnitude of speed-up compared to vCTM.

In Table 1, we compare the efficiency of vCTM and gCTM on different-sized data sets.

Table 1: Training time of vCTM and gCTM (M = 40) on various datasets.

data set   D     K     vCTM    gCTM
NIPS       1.2K  100   1.9 hr  8.9 min
20NG       11K   200   16 hr   9 min
NYTimes    285K  400   N/A*    0.5 hr
Wiki       6M    1000  N/A*    17 hr

*not finished within 1 week.

It can be observed that vCTM immediately becomes impractical when the data size reaches 285K, while by utilizing additional computing resources, gCTM is able to process larger data sets at considerable speed, making it applicable to real-world problems. Note that gCTM has almost the same training time on the NIPS and 20Newsgroups data sets, due to their small sizes; in such cases, the algorithm is dominated by synchronization rather than computation.

Fig. 5(c)&(d) show the results on the NYTimes corpus, which contains over 285K training documents and cannot be handled well by non-parallel methods. Therefore we concentrate on three parallel methods: single-machine gCTM (M = 1, P = 12), multi-machine gCTM (M = 40, P = 480), and multi-machine Y!LDA (M = 40, P = 480). We can see that: 1) both versions of gCTM obtain comparable perplexity to Y!LDA; and 2) gCTM (M = 40) is over an order of magnitude faster than the single-machine method, achieving considerable speed-up with additional computing resources. These observations suggest that gCTM is able to handle large data sets without sacrificing the quality of inference. Also note that Y!LDA is faster than gCTM because of the model difference: LDA does not learn a correlation structure among topics. As analyzed in Section 4, the time complexity of gCTM is O(K^2 + SK + N_d s(K)) per document, while for LDA it is O(N_d s(K)).

5.3 Sensitivity

Burn-In and Sub-Burn-In: Fig. 6(a)&(b) show the effect of burn-in steps and sub-burn-in steps on the NIPS data set with K = 100. We also include vCTM for comparison; for vCTM, T denotes the number of iterations of its variational EM loop. 
Our main observations are twofold: 1) despite the various S, all versions of gCTM reach a similar level of perplexity that is better than vCTM's; and 2) a moderate number of sub-iterations, e.g., S = 8, leads to the fastest convergence.

This experiment also provides insight for determining the number of outer iterations T that assures convergence for both models. We adopt Cauchy's criterion [15] for convergence: given an \epsilon > 0, an algorithm converges at iteration T if \forall i, j \geq T, |Perp_i - Perp_j| < \epsilon, where Perp_i and Perp_j are the perplexities at iterations i and j, respectively. In practice, we set \epsilon = 20 and run the experiments with a very large number of iterations. As a result, we obtained T = 350 for gCTM and T = 8 for vCTM, as indicated by the corresponding vertical line segments in Fig. 6(a)&(b).

Figure 6: Sensitivity analysis with respect to key hyper-parameters: (a) perplexity at each iteration with different S (S = 1, 2, 4, 8, 16, 32; vCTM included for comparison); (b) convergence speed with different S; (c) perplexity tested with different priors (K = 50 and K = 100).

Prior: Fig. 6(c) shows perplexity under different prior settings. To avoid an expensive search in a huge space, we set (\mu_0, \rho, W, \kappa) = (0, a, aI, a) to test the effect of the NIW prior, where a larger a implies more pseudo-observations of \mu = 0, \Sigma = I. We can see that for both K = 50 and K = 100, the perplexity is invariant under a wide range of prior settings. 
This suggests that gCTM is insensitive to prior values.

5.4 Scalability

Fig. 7 shows the scalability of gCTM on the large Wikipedia data set with K = 500. A practical problem in real-world machine learning is that when computing resources are limited, the running time soon rises to an intolerable level as the data size grows. Ideally, this problem can be solved by adding a proportional number of computing nodes. Our experiment demonstrates that gCTM performs well in this scenario: as we pour in the same proportion of data and machines, the training time is kept almost constant. In fact, the largest difference from the ideal curve is about 1,000 seconds, which is almost unobservable in the figure. This suggests that parallel gCTM enjoys nice scalability.

Figure 7: Scalability analysis. We set M = 8, 16, 24, 32, 40 so that each machine processes 150K documents, as the corpus grows from 1.2M to 6M documents; fixed M = 8, linearly scaling M, and the ideal curve are compared.

6 Conclusions and Discussions

We present a scalable Gibbs sampling algorithm for logistic-normal topic models. Our method builds on a novel data augmentation formulation and addresses the non-conjugacy without making strict mean-field assumptions. The algorithm is naturally parallelizable and can be further boosted by approximate sampling techniques. Empirical results demonstrate significant improvement in time efficiency over existing variational methods, with slightly better perplexity. Our method enjoys good scalability, suggesting the ability to extract large structures from massive data.

In the future, we plan to study the performance of Gibbs CTM on industry-level clusters with thousands of machines. We are also interested in developing scalable sampling algorithms for other logistic-normal topic models, e.g., infinite CTM and dynamic topic models. 
Finally, the fast sampler of Polya-Gamma distributions can be used in relational and supervised topic models [6, 21].

Acknowledgments

This work is supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), the National Natural Science Foundation of China (Nos. 61322308, 61305066), the Tsinghua University Initiative Scientific Research Program (No. 20121088071), and the Tsinghua National Laboratory for Information Science and Technology, China.

References

[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM), 2012.

[2] K. Bache and M. Lichman. UCI machine learning repository, 2013.

[3] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems (NIPS), 2006.

[4] D. Blei and J. Lafferty. Dynamic topic models. In International Conference on Machine Learning (ICML), 2006.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[6] N. Chen, J. Zhu, F. Xia, and B. Zhang. Generalized relational topic models with data augmentation. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.

[7] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2010.

[8] C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168, 2006.

[9] D. Mimno, H. Wallach, and A. McCallum. Gibbs sampling for logistic normal topic models with graph-based priors. In NIPS Workshop on Analyzing Graphs, 2008.

[10] D. Newman, A. Asuncion, P. Smyth, and M. Welling.
Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828, 2009.

[11] J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[12] N. G. Polson and J. G. Scott. Default Bayesian analysis for multi-way tables: a data-augmentation approach. arXiv:1109.4180, 2011.

[13] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv:1205.0310v2, 2013.

[14] C. P. Robert. Simulation of truncated normal variables. Statistics and Computing, 5:121–125, 1995.

[15] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill Book Co., 1964.

[16] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Data Bases (VLDB), 2010.

[17] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.

[18] D. van Dyk and X. Meng. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):1–50, 2001.

[19] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009.

[20] A. Zhang, J. Zhu, and B. Zhang. Sparse online topic models. In International Conference on World Wide Web (WWW), 2013.

[21] J. Zhu, X. Zheng, and B. Zhang. Improved Bayesian supervised topic models with data augmentation.
In Annual Meeting of the Association for Computational Linguistics (ACL), 2013.