{"title": "A Spectral Algorithm for Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 917, "page_last": 925, "abstract": "Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by \\emph{multiple} latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden.  This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (\\emph{i.e.}, third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on $k \\times k$ matrices, where $k$ is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.", "full_text": "A Spectral Algorithm for Latent Dirichlet Allocation\n\nAnima Anandkumar\nUniversity of California\n\nIrvine, CA\n\na.anandkumar@uci.edu\n\nDean P. Foster\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA\n\ndean@foster.net\n\nDaniel Hsu\n\nMicrosoft Research\n\nCambridge, MA\n\ndahsu@microsoft.com\n\nSham M. 
Kakade
Microsoft Research
Cambridge, MA
skakade@microsoft.com

Yi-Kai Liu
National Institute of Standards and Technology*
Gaithersburg, MD
yi-kai.liu@nist.gov

Abstract

Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden.

This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on $k \times k$ matrices, where $k$ is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.

1 Introduction

Topic models use latent variables to explain the observed (co-)occurrences of words in documents. They posit that each document is associated with a (possibly sparse) mixture of active topics, and that each word in the document is accounted for (in fact, generated) by one of these active topics. In Latent Dirichlet Allocation (LDA) [1], a Dirichlet prior gives the distribution of active topics in documents. 
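As a concrete illustration (not part of the original paper), the LDA generative process just described can be sketched in a few lines of NumPy; the vocabulary size, number of topics, document length, and Dirichlet parameters below are arbitrary placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n_words = 3, 10, 8                  # topics, vocabulary size, document length (illustrative)
alpha = np.array([0.5, 0.3, 0.2])         # Dirichlet parameters (illustrative values)
O = rng.dirichlet(np.ones(d), size=k).T   # d x k topic-word matrix; each column is a distribution

def sample_document():
    h = rng.dirichlet(alpha)                       # topic mixture for this document
    topics = rng.choice(k, size=n_words, p=h)      # one active topic per word
    return [rng.choice(d, p=O[:, t]) for t in topics]  # word indices

doc = sample_document()
```

Each word is generated by a single topic drawn from the document's mixture $h$, which is exactly the "multiple latent factors per document" structure that makes estimation harder than ordinary clustering.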
LDA and related models possess a rich representational power because they allow for documents to be comprised of words from several topics, rather than just a single topic. This increased representational power comes at the cost of a more challenging unsupervised estimation problem, when only the words are observed and the corresponding topics are hidden.

In practice, the most common unsupervised estimation procedures for topic models are based on finding maximum likelihood estimates, through either local search or sampling based methods, e.g., Expectation-Maximization [2], Gibbs sampling [3], and variational approaches [4]. Another body of tools is based on matrix factorization [5, 6]. For document modeling, a typical goal is to form a sparse decomposition of a term by document matrix (which represents the word counts in each document) into two parts: one which specifies the active topics in each document and the other which specifies the distributions of words under each topic.

This work provides an alternative approach to parameter recovery based on the method of moments [7], which attempts to match the observed moments with those posited by the model. Our approach does this efficiently through a particular decomposition of the low-order observable moments, which can be extracted using singular value decompositions (SVDs). This method is simple and efficient to implement, and is guaranteed to recover the parameters of a wide class of topic models, including the LDA model. We exploit exchangeability of the observed variables and, more generally, the availability of multiple views drawn independently from the same hidden component.

*Contributions to this work by NIST, an agency of the US government, are not subject to copyright laws.

1.1 Summary of contributions

We present an approach called Excess Correlation Analysis (ECA) based on the low-order (cross) moments of observed variables. 
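To make "low-order (cross) moments" concrete, the following hedged sketch (the data and variable names are invented for illustration) estimates pair and triple moments from one-hot encoded words: each document contributes the outer product of its first two (or three) words, so the empirical moments are just averages of word co-occurrence indicators.

```python
import numpy as np

d = 5  # vocabulary size (illustrative)

def one_hot(j, d):
    e = np.zeros(d)
    e[j] = 1.0
    return e

# each tuple is the first three word indices of a (toy) document
docs = [(0, 2, 1), (2, 2, 3), (1, 0, 4), (0, 1, 2)]

# empirical pair moment E[x1 (x) x2]: a d x d matrix of co-occurrence frequencies
pairs = np.mean([np.outer(one_hot(w1, d), one_hot(w2, d))
                 for w1, w2, _ in docs], axis=0)

# empirical triple moment E[x1 (x) x2 (x) x3]: a d x d x d tensor
triples = np.mean([np.einsum('i,j,k->ijk', one_hot(w1, d), one_hot(w2, d), one_hot(w3, d))
                   for w1, w2, w3 in docs], axis=0)
```

Since only three words per document are needed to form the triple moment, even very short documents contribute usable statistics.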
These observed variables are assumed to be exchangeable (and, more generally, drawn from a multi-view model). ECA differs from Principal Component Analysis and Canonical Correlation Analysis in that it is based on two singular value decompositions: the first SVD whitens the data (based on the correlation between two observed variables) and the second SVD uses higher-order moments (third- or fourth-order moments) to find directions which exhibit non-Gaussianity, i.e., directions where the moments are in excess of those suggested by a Gaussian distribution. The SVDs are performed only on $k \times k$ matrices, where $k$ is the number of latent factors; note that the number of latent factors (topics) $k$ is typically much smaller than the dimension of the observed space $d$ (number of words).

The method is applicable to a wide class of latent variable models, including exchangeable and multi-view models. We first consider the class of exchangeable variables with independent latent factors. We show that the (exact) low-order moments permit a decomposition that recovers the parameters for this model class, and that this decomposition can be computed using two SVD computations. We then consider LDA and show that the same decomposition of a modified third-order moment correctly recovers both the probability distribution of words under each topic, as well as the parameters of the Dirichlet prior. We note that in order to estimate third-order moments in the LDA model, it suffices for each document to contain at least three words.

While the methods described assume exact moments, it is straightforward to write down the analogue "plug-in" estimators based on empirical moments from sampled data. 
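The two-SVD procedure summarized above can be sketched on exact population moments of a synthetic independent-factor model; everything numeric below (dimensions, loadings, skewness values) is invented for illustration, and the whitening is done via an eigendecomposition restricted to the range of the pair moment.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3
O = rng.standard_normal((d, k))      # hypothetical loadings, canonical form (unit factor variances)
mu3 = np.array([1.0, 2.0, -1.5])     # assumed factor skewnesses, all non-zero

# exact population moments under the independent-factor model
pairs = O @ O.T                                          # pair moment
def triples(eta):                                        # skew-weighted projected triple moment
    return O @ np.diag(O.T @ eta) @ np.diag(mu3) @ O.T

# first SVD: reduce to range(pairs) and whiten, so that W^T pairs W = I_k
evals, evecs = np.linalg.eigh(pairs)
top = np.argsort(evals)[::-1][:k]
W = evecs[:, top] / np.sqrt(evals[top])

# second SVD: singular vectors of the whitened k x k matrix for a random unit theta
theta = rng.standard_normal(k)
theta /= np.linalg.norm(theta)
xi, _, _ = np.linalg.svd(W.T @ triples(W @ theta) @ W)

# map singular vectors back: columns recover O up to sign and permutation
O_hat = np.linalg.pinv(W).T @ xi
```

Note that both decompositions act only on $k \times k$ matrices after the initial dimensionality reduction, which is what makes the method scalable in the vocabulary size $d$.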
We provide a simple sample complexity analysis which shows that estimating the third-order moments is not as difficult as it might naively seem, since we only need a $k \times k$ matrix to be accurate.

Finally, we remark that the moment decomposition can also be obtained using other techniques, including tensor decomposition methods and simultaneous matrix diagonalization methods. Some preliminary experiments illustrating the efficacy of one such method are given in the appendix. Omitted proofs, and additional results and discussion, are provided in the full version of the paper [8].

1.2 Related work

Under the assumption that a single active topic occurs in each document, the work of [9] provides the first provable guarantees for recovering the topic distributions (i.e., the distribution of words under each topic), albeit with a rather stringent separation condition (where the words in each topic are essentially non-overlapping). Understanding what separation conditions permit efficient learning is a natural question; in the clustering literature, a line of work has focused on understanding the relationship between the separation of the mixture components and the complexity of learning. For clustering, the first provable learnability result [10] was under a rather strong separation condition; subsequent results relaxed [11-18] or removed these conditions [19-21]; roughly speaking, learning under a weaker separation condition is more challenging, both computationally and statistically. For the topic modeling problem in which only a single active topic is present per document, [22] provides an algorithm for learning topics with no separation requirement, but under a certain full rank assumption on the topic probability matrix.

For the case of LDA (where each document may be about multiple topics), the recent work of [23] provides the first provable result under a natural separation condition. 
The condition requires that each topic be associated with "anchor words" that only occur in documents about that topic. This is a significantly milder assumption than the one in [9]. Under this assumption, [23] provides the first provably correct algorithm for learning the topic distributions. Their work also justifies the use of non-negative matrix factorization (NMF) as a provable procedure for this problem (the original motivation for NMF was as a topic modeling algorithm, though, prior to this work, formal guarantees as such were rather limited). Furthermore, [23] provides results for certain correlated topic models. Our approach makes further progress on this problem by relaxing the need for this separation condition and establishing a much simpler procedure for parameter estimation.

The underlying approach we take is a certain diagonalization technique of the observed moments. We know of at least three different settings which use this idea for parameter estimation.

The work in [24] uses eigenvector methods for parameter estimation in discrete Markov models involving multinomial distributions. The idea has been extended to other discrete mixture models such as discrete hidden Markov models (HMMs) and mixture models with a single active topic in each document (see [22, 25, 26]). For such single topic models, the work in [22] demonstrates the generality of the eigenvector method and the irrelevance of the noise model for the observations, making it applicable to both discrete models like HMMs as well as certain Gaussian mixture models.

Another set of related techniques is the body of algebraic methods used for the problem of blind source separation [27]. These approaches are tailored for independent source separation with additive noise (usually Gaussian) [28]. 
Much of the literature focuses on understanding the effects of measurement noise, which often requires more sophisticated algebraic tools (typically, knowledge of noise statistics or the availability of multiple views of the latent factors is not assumed). These algebraic ideas are also used by [29, 30] for learning a linear transformation (in a noiseless setting), providing a different provably correct algorithm based on a certain ascent algorithm (rather than the joint diagonalization approach, as in [27]); a provably correct algorithm for the noisy case was recently obtained by [31].

The underlying insight exploited by our method is the presence of exchangeable (or multi-view) variables (e.g., multiple words in a document), which are drawn independently conditioned on the same hidden state. This allows us to exploit ideas both from [24] and from [27]. In particular, we show that the "topic" modeling problem exhibits a rather simple algebraic solution, where only two SVDs suffice for parameter estimation.

Furthermore, the exchangeability assumption permits us to have an arbitrary noise model (rather than additive Gaussian noise, which is not appropriate for multinomial and other discrete distributions). A key technical contribution is that we show how the basic diagonalization approach can be adapted for Dirichlet models, through a rather careful construction. This construction bridges the gap between the single topic models (as in [22, 24]) and the independent latent factors model.

More generally, the multi-view approach has been exploited in previous works for semi-supervised learning and for learning mixtures of well-separated distributions (e.g., [16, 18, 32, 33]). These previous works essentially use variants of canonical correlation analysis [34] between the two views. 
This work follows [22] in showing that having a third view of the data permits rather simple estimation procedures with guaranteed parameter recovery.

2 The independent latent factors and LDA models

Let $h = (h_1, h_2, \ldots, h_k) \in \mathbb{R}^k$ be a random vector specifying the latent factors (i.e., the hidden state) of a model, where $h_i$ is the value of the $i$-th factor. Consider a sequence of exchangeable random vectors $x_1, x_2, x_3, x_4, \ldots \in \mathbb{R}^d$, which we take to be the observed variables. Assume throughout that $d \geq k$ and that $x_1, x_2, x_3, x_4, \ldots$ are conditionally independent given $h$. Furthermore, assume there exists a matrix $O \in \mathbb{R}^{d \times k}$ such that
$$E[x_v \mid h] = Oh$$
for each $v \in \{1, 2, 3, \ldots\}$. Throughout, we assume the following condition.

Condition 2.1. $O$ has full column rank.

This is a mild assumption, which allows for identifiability of the columns of $O$. The goal is to estimate the matrix $O$, sometimes referred to as the topic matrix. Note that at this stage, we have not made any assumptions on the noise model; it need not be additive nor even independent of $h$.

2.1 Independent latent factors model

In the independent latent factors model, we assume $h$ has a product distribution, i.e., $h_1, h_2, \ldots, h_k$ are independent. Two important examples of this setting are as follows.

Multiple mixtures of Gaussians: Suppose $x_v = Oh + \eta$, where $\eta$ is Gaussian noise and $h$ is a binary vector (under a product distribution). Here, the $i$-th column $O_i$ can be considered to be the mean of the $i$-th Gaussian component. This generalizes the classic mixture of $k$ Gaussians, as the model now permits any number of Gaussians to be responsible for generating the hidden state (i.e., $h$ is permitted to be any of the $2^k$ vectors on the hypercube, while in the classic mixture problem, only one component is responsible). 
We may also allow $\eta$ to be heteroskedastic (i.e., the noise may depend on $h$, provided the linearity assumption $E[x_v \mid h] = Oh$ holds).

Multiple mixtures of Poissons: Suppose $[Oh]_j$ specifies the Poisson rate of counts for $[x_v]_j$. For example, $x_v$ could be a vector of word counts in the $v$-th sentence of a document. Here, $O$ would be a matrix with positive entries, and $h_i$ would scale the rate at which topic $i$ generates words in a sentence (as specified by the $i$-th column of $O$). The linearity assumption is satisfied as $E[x_v \mid h] = Oh$ (note the noise is not additive in this case). Here, multiple topics may be responsible for generating the words in each sentence. This model provides a natural variant of LDA, where the distribution over $h$ is a product distribution (while in LDA, $h$ is a probability vector).

2.2 The Dirichlet model

Now suppose the hidden state $h$ is a distribution itself, with a density specified by the Dirichlet distribution with parameter $\alpha \in \mathbb{R}^k_{>0}$ ($\alpha$ is a strictly positive real vector). We often think of $h$ as a distribution over topics. Precisely, the density of $h \in \Delta^{k-1}$ (where the probability simplex $\Delta^{k-1}$ denotes the set of possible distributions over $k$ outcomes) is specified by:
$$p_\alpha(h) := \frac{1}{Z(\alpha)} \prod_{i=1}^k h_i^{\alpha_i - 1}$$
where $Z(\alpha) := \frac{\prod_{i=1}^k \Gamma(\alpha_i)}{\Gamma(\alpha_0)}$ and $\alpha_0 := \alpha_1 + \alpha_2 + \cdots + \alpha_k$. Intuitively, $\alpha_0$ (the sum of the "pseudo-counts") characterizes the concentration of the distribution. As $\alpha_0 \to 0$, the distribution degenerates to one over pure topics (i.e., the limiting density is one in which, almost surely, exactly one coordinate of $h$ is 1, and the rest are 0).

Latent Dirichlet Allocation: LDA makes the further assumption that each random variable $x_1, x_2, x_3, \ldots$ 
takes on discrete values out of $d$ outcomes (e.g., $x_v$ represents what the $v$-th word in a document is, so $d$ represents the number of words in the language). The $i$-th column $O_i$ of $O$ is a probability vector representing the distribution over words for the $i$-th topic. The sampling process for a document is as follows. First, the topic mixture $h$ is drawn from the Dirichlet distribution. Then, the $v$-th word in the document (for $v = 1, 2, \ldots$) is generated by: (i) drawing $t \in [k] := \{1, 2, \ldots, k\}$ according to the discrete distribution specified by $h$, then (ii) drawing $x_v$ according to the discrete distribution specified by $O_t$ (the $t$-th column of $O$). Note that $x_v$ is independent of $h$ given $t$. For this model to fit in our setting, we use the "one-hot" encoding for $x_v$ from [22]: $x_v \in \{0, 1\}^d$ with $[x_v]_j = 1$ iff the $v$-th word in the document is the $j$-th word in the vocabulary. Observe that
$$E[x_v \mid h] = \sum_{i=1}^k \Pr[t = i \mid h] \cdot E[x_v \mid t = i, h] = \sum_{i=1}^k h_i \cdot O_i = Oh$$
as required. Again, note that the noise model is not additive.

3 Excess Correlation Analysis (ECA)

We now present efficient algorithms for exactly recovering $O$ from low-order moments of the observed variables. The algorithm is based on two singular value decompositions: the first SVD whitens the data (based on the correlation between two variables), and the second SVD is carried

Algorithm 1 ECA, with skewed factors

Input: vector $\theta \in \mathbb{R}^k$; the moments Pairs and Triples.

1. Dimensionality reduction: Find a matrix $U \in \mathbb{R}^{d \times k}$ such that $\mathrm{range}(U) = \mathrm{range}(\mathrm{Pairs})$. (See Remark 1 for a fast procedure.)

2. Whiten: Find $V \in \mathbb{R}^{k \times k}$ so $V^\top (U^\top \mathrm{Pairs}\, U) V$ is the $k \times k$ identity matrix. Set $W = UV$.

3. 
SVD: Let $\Xi$ be the set of left singular vectors of
$$W^\top \mathrm{Triples}(W\theta)\, W$$
corresponding to non-repeated singular values (i.e., singular values with multiplicity one).

4. Reconstruct: Return the set $\hat{O} := \{ (W^+)^\top \xi : \xi \in \Xi \}$.

out on higher-order moments. We start with the case of independent factors, as these algorithms make the basic diagonalization approach clear. Throughout, we use $A^+$ to denote the Moore-Penrose pseudo-inverse.

3.1 Independent and skewed latent factors

Define the following moments:
$$\mu := E[x_1], \quad \mathrm{Pairs} := E[(x_1 - \mu) \otimes (x_2 - \mu)], \quad \mathrm{Triples} := E[(x_1 - \mu) \otimes (x_2 - \mu) \otimes (x_3 - \mu)]$$
(here $\otimes$ denotes the tensor product, so $\mu \in \mathbb{R}^d$, $\mathrm{Pairs} \in \mathbb{R}^{d \times d}$, and $\mathrm{Triples} \in \mathbb{R}^{d \times d \times d}$). It is convenient to project Triples to matrices as follows:
$$\mathrm{Triples}(\eta) := E[(x_1 - \mu)(x_2 - \mu)^\top \langle \eta, x_3 - \mu \rangle].$$
Roughly speaking, we can think of $\mathrm{Triples}(\eta)$ as a re-weighting of a cross covariance (by $\langle \eta, x_3 - \mu \rangle$).

Note that the matrix $O$ is only identifiable up to permutation and scaling of columns. To see the latter, observe the distribution of any $x_v$ is unaltered if, for any $i \in [k]$, we multiply the $i$-th column of $O$ by a scalar $c \neq 0$ and divide the variable $h_i$ by the same scalar $c$. Without further assumptions, we can only hope to recover a certain canonical form of $O$, defined as follows.

Definition 1 (Canonical form). We say $O$ is in a canonical form (relative to $h$) if, for each $i \in [k]$, $\sigma_i^2 := E[(h_i - E[h_i])^2] = 1$.

The transformation $O \leftarrow O\, \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$ (and a rescaling of $h$) places $O$ in canonical form relative to $h$, and the distribution over $x_1, x_2, x_3, \ldots$ is unaltered. In canonical form, $O$ is unique up
In canonical form, O is unique up\nto a signed column permutation.\nLet \u00b5i,p := E[(hi \u2212 E[hi])p] denote the p-th central moment of hi, so the variance and skewness of\nhi are given by \u03c32\ni . The \ufb01rst result considers the case when the skewness\nis non-zero.\ni > 0 for each i \u2208 [k].\nTheorem 3.1 (Independent and skewed factors). Assume Condition 2.1 and \u03c32\nUnder the independent latent factor model, the following hold.\n\ni := \u00b5i,2 and \u03b3i := \u00b5i,3/\u03c33\n\n\u2022 No False Positives: For all \u03b8 \u2208 Rk, Algorithm 1 returns a subset of the columns of O, in\n\ncanonical form up to sign.\n\n\u2022 Exact Recovery: Assume \u03b3i (cid:54)= 0 for each i \u2208 [k]. If \u03b8 \u2208 Rk is drawn uniformly at random\nfrom the unit sphere S k\u22121, then with probability 1, Algorithm 1 returns all columns of O,\nin canonical form up to sign.\n\n5\n\n\fThe proof of this theorem relies on the following lemma.\nLemma 3.1 (Independent latent factors moments). Under the independent latent factor model,\n\nPairs =\n\nTriples =\n\ni Oi \u2297 Oi = O diag(\u03c32\n\u03c32\n\n1, \u03c32\n\n2, . . . , \u03c32\n\nk)O(cid:62),\n\n\u00b5i,3 Oi \u2297 Oi \u2297 Oi, Triples(\u03b7) = O diag(O(cid:62)\u03b7) diag(\u00b51,3, \u00b52,3, . . . , \u00b5k,3)O(cid:62).\n\nk(cid:88)\nk(cid:88)\n\ni=1\n\ni=1\n\nProof. The model assumption E[xv|h] = Oh implies \u00b5 = OE[h]. Therefore E[(xv \u2212 \u00b5)|h] =\nO(h \u2212 E[h]). Using the conditional independence of x1 and x2 given h, and the fact that h has a\nproduct distribution,\n\nPairs = E[(x1 \u2212 \u00b5) \u2297 (x2 \u2212 \u00b5)] = E[E[(x1 \u2212 \u00b5)|h] \u2297 E[(x2 \u2212 \u00b5)|h]]\n\n= OE[(h \u2212 E[h]) \u2297 (h \u2212 E[h])]O(cid:62) = O diag(\u03c32\nAn analogous argument gives the claims for Triples and Triples(\u03b7).\n\n1, \u03c32\n\n2, . . . , \u03c32\n\nk)O(cid:62).\n\nProof of Theorem 3.1. Assume O is in canonical form with respect to h. 
By Condition 2.1, $U^\top \mathrm{Pairs}\, U \in \mathbb{R}^{k \times k}$ is full rank and hence positive definite. Thus the whitening step is possible, and $M := W^\top O$ is orthogonal. Observe that $W^\top \mathrm{Triples}(W\theta)\, W = M D M^\top$, where $D := \mathrm{diag}(M^\top \theta)\, \mathrm{diag}(\gamma_1, \gamma_2, \ldots, \gamma_k)$. Since $M$ is orthogonal, the above is an eigendecomposition of $W^\top \mathrm{Triples}(W\theta)\, W$, and hence the set of left singular vectors corresponding to non-repeated singular values are uniquely defined up to sign. Each such singular vector $\xi$ is of the form $s_i M e_i = s_i W^\top O e_i = s_i W^\top O_i$ for some $i \in [k]$ and $s_i \in \{\pm 1\}$, so $(W^+)^\top \xi = s_i W (W^\top W)^{-1} W^\top O_i = s_i O_i$ (because $\mathrm{range}(W) = \mathrm{range}(U) = \mathrm{range}(O)$).

If $\theta$ is drawn uniformly at random from $S^{k-1}$, then so is $M^\top \theta$. In this case, almost surely, the diagonal entries of $D$ are unique (provided that each $\gamma_i \neq 0$), and hence every singular value of $W^\top \mathrm{Triples}(W\theta)\, W$ is non-repeated.

Remark 1 (Finding range(Pairs) efficiently). Let $\Theta \in \mathbb{R}^{d \times k}$ be a random matrix with entries sampled independently from the standard normal distribution, and set $U := \mathrm{Pairs}\, \Theta$. Then with probability 1, $\mathrm{range}(U) = \mathrm{range}(\mathrm{Pairs})$.

It is easy to extend Algorithm 1 to kurtotic sources where $\kappa_i := (\mu_{i,4} / \sigma_i^4) - 3 \neq 0$ for each $i \in [k]$, simply by using fourth-order cumulants in place of $\mathrm{Triples}(\eta)$. The details are given in the full version of the paper.

3.2 Latent Dirichlet Allocation

Now we turn to LDA, where $h$ has a Dirichlet density. Even though the distribution on $h$ is proportional to the product $h_1^{\alpha_1 - 1} h_2^{\alpha_2 - 1} \cdots h_k^{\alpha_k - 1}$, the $h_i$ are not independent because $h$ is constrained to live in the simplex. 
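As a quick numerical illustration (not from the paper, using invented parameter values): for $i \neq j$, the Dirichlet coordinates satisfy $\mathrm{Cov}(h_i, h_j) = -\alpha_i \alpha_j / (\alpha_0^2 (\alpha_0 + 1))$, so they are always negatively correlated, which a sampling check confirms.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])   # illustrative Dirichlet parameters
a0 = alpha.sum()

# exact covariance between two distinct Dirichlet coordinates (always negative)
cov_01 = -alpha[0] * alpha[1] / (a0**2 * (a0 + 1))

# empirical covariance from samples
H = rng.dirichlet(alpha, size=200_000)
emp_cov_01 = np.cov(H[:, 0], H[:, 1])[0, 1]
```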
These mild dependencies suggest using a certain correction of the moments with ECA.

We assume $\alpha_0$ is known. Knowledge of $\alpha_0 = \alpha_1 + \alpha_2 + \cdots + \alpha_k$ is significantly weaker than having full knowledge of the entire parameter vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k)$. A common practice is to specify the entire parameter vector $\alpha$ in a homogeneous manner, with each component being identical (see [35]). Here, we need only specify the sum, which allows for arbitrary inhomogeneity in the prior.

Denote the mean and a modified second moment by
$$\mu = E[x_1], \qquad \mathrm{Pairs}_{\alpha_0} := E[x_1 x_2^\top] - \frac{\alpha_0}{\alpha_0 + 1} \mu \mu^\top,$$
and a modified third moment as
$$\mathrm{Triples}_{\alpha_0}(\eta) := E[x_1 x_2^\top \langle \eta, x_3 \rangle] - \frac{\alpha_0}{\alpha_0 + 2} \Big( E[x_1 x_2^\top] \eta \mu^\top + \mu \eta^\top E[x_1 x_2^\top] + \langle \eta, \mu \rangle E[x_1 x_2^\top] \Big) + \frac{2 \alpha_0^2}{(\alpha_0 + 2)(\alpha_0 + 1)} \langle \eta, \mu \rangle \mu \mu^\top.$$

Algorithm 2 ECA for Latent Dirichlet Allocation

Input: vector $\theta \in \mathbb{R}^k$; the modified moments $\mathrm{Pairs}_{\alpha_0}$ and $\mathrm{Triples}_{\alpha_0}$.

1-3. Execute steps 1-3 of Algorithm 1 with $\mathrm{Pairs}_{\alpha_0}$ and $\mathrm{Triples}_{\alpha_0}$ in place of Pairs and Triples.

4. Reconstruct and normalize: Return the set
$$\hat{O} := \left\{ \frac{(W^+)^\top \xi}{\vec{1}^\top (W^+)^\top \xi} : \xi \in \Xi \right\}$$
where $\vec{1} \in \mathbb{R}^d$ is a vector of all ones.

Remark 2 (Central vs. non-central moments). In the limit as $\alpha_0 \to 0$, the Dirichlet model degenerates so that, with probability 1, only one coordinate of $h$ equals 1 and the rest are 0 (i.e., each document is about just one topic). 
In this case, the modified moments tend to the raw (cross) moments:
$$\lim_{\alpha_0 \to 0} \mathrm{Pairs}_{\alpha_0} = E[x_1 \otimes x_2], \qquad \lim_{\alpha_0 \to 0} \mathrm{Triples}_{\alpha_0} = E[x_1 \otimes x_2 \otimes x_3].$$
Note that the one-hot encoding of words in $x_v$ implies that
$$E[x_1 \otimes x_2] = \sum_{1 \leq i,j \leq d} \Pr[x_1 = e_i, x_2 = e_j]\, e_i \otimes e_j = \sum_{1 \leq i,j \leq d} \Pr[\text{1st word} = i, \text{2nd word} = j]\, e_i \otimes e_j$$
(and a similar expression holds for $E[x_1 \otimes x_2 \otimes x_3]$), so these raw moments in the limit $\alpha_0 \to 0$ are precisely the joint probability tables of words across all documents.

At the other extreme $\alpha_0 \to \infty$, the modified moments tend to the central moments:
$$\lim_{\alpha_0 \to \infty} \mathrm{Pairs}_{\alpha_0} = E[(x_1 - \mu) \otimes (x_2 - \mu)], \qquad \lim_{\alpha_0 \to \infty} \mathrm{Triples}_{\alpha_0} = E[(x_1 - \mu) \otimes (x_2 - \mu) \otimes (x_3 - \mu)]$$
(to see this, expand the central moment and use exchangeability: $E[x_1 x_2^\top] = E[x_1 x_3^\top] = E[x_2 x_3^\top]$).

Our main result here shows that ECA recovers both the topic matrix $O$, up to a permutation of the columns (where each column represents a probability distribution over words for a given topic), and the parameter vector $\alpha$, using only knowledge of $\alpha_0$ (which, as discussed earlier, is a significantly less restrictive assumption than tuning the entire parameter vector).

Theorem 3.2 (Latent Dirichlet Allocation). Assume Condition 2.1 holds. 
Under the LDA model, the following hold.

- No False Positives: For all $\theta \in \mathbb{R}^k$, Algorithm 2 returns a subset of the columns of $O$.

- Topic Recovery: If $\theta \in \mathbb{R}^k$ is drawn uniformly at random from the unit sphere $S^{k-1}$, then with probability 1, Algorithm 2 returns all columns of $O$.

- Parameter Recovery: The Dirichlet parameter $\alpha$ satisfies $\alpha = \alpha_0 (\alpha_0 + 1)\, O^+ \mathrm{Pairs}_{\alpha_0} (O^+)^\top \vec{1}$, where $\vec{1} \in \mathbb{R}^k$ is a vector of all ones.

The proof relies on the following lemma.

Lemma 3.2 (LDA moments). Under the LDA model,
$$\mathrm{Pairs}_{\alpha_0} = \frac{1}{(\alpha_0 + 1)\alpha_0}\, O\, \mathrm{diag}(\alpha)\, O^\top,$$
$$\mathrm{Triples}_{\alpha_0}(\eta) = \frac{2}{(\alpha_0 + 2)(\alpha_0 + 1)\alpha_0}\, O\, \mathrm{diag}(O^\top \eta)\, \mathrm{diag}(\alpha)\, O^\top.$$

The proof of Lemma 3.2 is similar to that of Lemma 3.1, except here we must use the specific properties of the Dirichlet distribution to show that the corrections to the raw (cross) moments have the desired effect.

Proof of Theorem 3.2. Note that with the rescaling $\tilde{O} := \frac{1}{\sqrt{(\alpha_0 + 1)\alpha_0}}\, O\, \mathrm{diag}(\sqrt{\alpha_1}, \sqrt{\alpha_2}, \ldots, \sqrt{\alpha_k})$, we have that $\mathrm{Pairs}_{\alpha_0} = \tilde{O} \tilde{O}^\top$. This is akin to $\tilde{O}$ being in canonical form as per the skewed factor model of Theorem 3.1. Now the proof of the first two claims is the same as that of Theorem 3.1; the only modification is that we simply normalize the output of Algorithm 1. Finally, observe that the claim for estimating $\alpha$ holds due to the functional form of $\mathrm{Pairs}_{\alpha_0}$.

Remark 3 (Limiting behaviors). 
ECA seamlessly interpolates between the single topic model ($\alpha_0 \to 0$) of [22] and the skewness-based ECA, Algorithm 1 ($\alpha_0 \to \infty$).

4 Discussion

4.1 Sample complexity

It is straightforward to derive a "plug-in" variant of Algorithm 2 based on empirical moments rather than exact population moments. The empirical moments are formed using the word co-occurrence statistics for documents in a corpus. The following theorem shows that the empirical version of ECA returns accurate estimates of the topics. The details and proof are left to the full version of the paper.

Theorem 4.1 (Sample complexity for LDA). There exist universal constants $C_1, C_2 > 0$ such that the following hold. Let $p_{\min} = \min_i \frac{\alpha_i}{\alpha_0}$ and let $\sigma_k(O)$ denote the smallest (non-zero) singular value of $O$. Suppose that we obtain $N \geq C_1 \cdot \big( (\alpha_0 + 1) / (p_{\min} \sigma_k(O)^2) \big)^2$ independent samples of $x_1, x_2, x_3$ in the LDA model, which are used to form empirical moments $\widehat{\mathrm{Pairs}}_{\alpha_0}$ and $\widehat{\mathrm{Triples}}_{\alpha_0}$. With high probability, the plug-in variant of Algorithm 2 returns a set $\{\hat{O}_1, \hat{O}_2, \ldots, \hat{O}_k\}$ such that, for some permutation $\sigma$ of $[k]$,
$$\| O_i - \hat{O}_{\sigma(i)} \|_2 \leq C_2 \cdot \frac{(\alpha_0 + 1)^2 k^3}{p_{\min}^2\, \sigma_k(O)^3 \sqrt{N}}, \qquad \forall i \in [k].$$

4.2 Alternative decomposition methods

Algorithm 1 is a theoretically efficient and simple-to-state method for obtaining the desired decomposition of the tensor $\mathrm{Triples} = \sum_{i=1}^k \mu_{i,3}\, O_i \otimes O_i \otimes O_i$ (a similar tensor form for $\mathrm{Triples}_{\alpha_0}$ in the case of LDA can also be given). However, in practice the method is not particularly stable, due to the use of internal randomization to guarantee strict separation of singular values. 
It should be noted that there are other methods in the literature for obtaining these decompositions, for instance, methods based on simultaneous diagonalizations of matrices [36] as well as direct tensor decomposition methods [37], and that these methods can be significantly more stable than Algorithm 1. In particular, very recent work in [37] shows that the structure revealed in Lemmas 3.1 and 3.2 can be exploited to derive very efficient estimation algorithms for all the models considered here (and others) based on a tensor power iteration. We have used a simplified version of this tensor power iteration in preliminary experiments for estimating topic models, and found the results (Appendix A) to be very encouraging, especially due to the speed and robustness of the algorithm.

Acknowledgements

We thank Kamalika Chaudhuri, Adam Kalai, Percy Liang, Chris Meek, David Sontag, and Tong Zhang for many invaluable insights. We also give warm thanks to Rong Ge for sharing preliminary results (in [23]) and early insights into this problem with us. Part of this work was completed while all authors were at Microsoft Research New England. AA is supported in part by the NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310 and the ARO Award W911NF-12-1-0404.

References

[1] David M. Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
[2] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239, 1984.
[3] A. Asuncion, P. Smyth, M. Welling, D. Newman, I. Porteous, and S. Triglia. Distributed Gibbs sampling for latent variable models. In Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[4] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[5] Thomas Hofmann. Probabilistic latent semantic analysis. 
In UAI, 1999.
[6] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 1999.
[7] K. Pearson. Contributions to the mathematical theory of evolution. Phil. Trans. of the Royal Society, London, A., 1894.
[8] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic models and latent Dirichlet allocation, 2012. arXiv:1204.6703.
[9] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2), 2000.
[10] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, 1999.
[11] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In UAI, 2000.
[12] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In STOC, 2001.
[13] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, 2002.
[14] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, 2005.
[15] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.
[16] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In COLT, 2008.
[17] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, 2008.
[18] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, 2009.
[19] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.
[20] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.
[21] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, 2010.
[22] A. Anandkumar, D. Hsu, and S. M. Kakade.
A method of moments for mixture models and hidden Markov models. In COLT, 2012.
[23] S. Arora, R. Ge, and A. Moitra. Learning topic models — going beyond SVD. In FOCS, 2012.
[24] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.
[25] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.
[26] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[27] Jean-François Cardoso and Pierre Comon. Independent component analysis, a survey of some algebraic methods. In IEEE International Symposium on Circuits and Systems, pages 93–96, 1996.
[28] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, Elsevier, 2010.
[29] Alan M. Frieze, Mark Jerrum, and Ravi Kannan. Learning linear transformations. In FOCS, 1996.
[30] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. Journal of Cryptology, 22(2):139–160, 2009.
[31] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, and implications for Gaussian mixtures and autoencoders. In NIPS, 2012.
[32] R. Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In ICML, 2007.
[33] Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In COLT, 2007.
[34] H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26(2):139–142, 1935.
[35] Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum, 2006.
[36] A. Bunse-Gerstner, R. Byers, and V.
Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
[37] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and T. Telgarsky. Tensor decompositions for learning latent variable models, 2012. arXiv:1210.7559.