{"title": "A concave regularization technique for sparse mixture models", "book": "Advances in Neural Information Processing Systems", "page_first": 1890, "page_last": 1898, "abstract": "Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difficulty, using the so-called Mountain Pass Theorem to provide easily verifiable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs.", "full_text": "A concave regularization technique for sparse mixture models\n\nSchool of Operations Research and Information Engineering\n\nCenter for Applied Mathematics\n\nJohan Ugander\n\nCornell University\n\njhu5@cornell.edu\n\nMartin Larsson\n\nCornell University\n\nmol23@cornell.edu\n\nAbstract\n\nLatent variable mixture models are a powerful tool for exploring the structure in\nlarge datasets. A common challenge for interpreting such models is a desire to\nimpose sparsity, the natural assumption that each data point only contains few\nlatent features. 
Since mixture distributions are constrained in their L1 norm, typical\nsparsity techniques based on L1 regularization become toothless, and concave\nregularization becomes necessary. Unfortunately concave regularization typically\nresults in EM algorithms that must perform problematic non-concave M-step\nmaximizations. In this work, we introduce a technique for circumventing this difficulty,\nusing the so-called Mountain Pass Theorem to provide easily verifiable conditions\nunder which the M-step is well-behaved despite the lacking concavity. We also\ndevelop a correspondence between logarithmic regularization and what we term\nthe pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet\ndistribution well-suited for inducing sparsity. We demonstrate our approach on a text\ncorpus, inferring a sparse topic mixture model for 2,406 weblogs.\n\n1 Introduction\n\nThe current trend towards \u2018big data\u2019 has created a strong demand for techniques to efficiently extract\nstructure from ever-accumulating unstructured datasets. Specific contexts for this demand include\nlatent semantic models for organizing text corpora, image feature extraction models for navigating\nlarge photo datasets, and community detection in social networks for optimizing content delivery.\nMixture models identify such latent structure, helping to categorize unstructured data.\nMixture models approach datasets as a set D of elements d \u2208 D, for example images or text\ndocuments. Each element consists of a collection of words w \u2208 W drawn with replacement from a\nvocabulary W. Each element-word pair observation is further assumed to be associated with an\nunobserved class z \u2208 Z, where Z is the set of classes. Ordinarily it is assumed that |Z| \u226a |D|,\nnamely that the number of classes is much less than the number of elements. 
In this work we\nexplore an additional sparsity assumption, namely that individual elements only incorporate a small\nsubset of the |Z| classes, so that each element arises as a mixture of only \u2113 \u226a |Z| classes. We\ndevelop a framework to overcome mathematical difficulties in how this assumption can be harnessed\nto improve the performance of mixture models.\nOur primary context for mixture modeling in this work will be latent semantic models of text data,\nwhere elements d are documents, words w are literal words, and classes z are vocabulary topics.\nWe apply our framework to models based on Probabilistic Latent Semantic Analysis (PLSA) [1].\nWhile PLSA is often outperformed within text applications by techniques such as Latent Dirichlet\nAllocation (LDA) [2], it forms the foundation of many mixture model techniques, from computer\nvision [3] to network community detection [4], and we emphasize that our contribution is an\noptimization technique intended for broad application outside merely topic models for text corpora. The\nnear-equivalence between PLSA and Nonnegative Matrix Factorization (NMF) [5, 6] implies that\nour technique is equally applicable to NMF problems as well. Sparse inference as a rule targets point\nestimation, which makes PLSA-style models appropriate since they are inherently frequentist,\nderiving point-estimated models via likelihood maximization. In contrast, fully Bayesian frameworks\nsuch as Latent Dirichlet Allocation (LDA) output a posterior distribution across the model space.\nSparse inference is commonly achieved through two largely equivalent techniques: regularization\nor MAP inference. Regularization modifies ordinary likelihood maximization with a penalty on\nthe magnitudes of the parameters. Maximum a posteriori (MAP) inference employs priors\nconcentrated towards small parameter values. 
MAP PLSA is an established technique [7], but earlier\nwork has been limited to log-concave prior distributions (corresponding to convex regularization\nfunctions) that make a concave contribution to the posterior log-likelihood. While such priors allow\nfor tractable EM algorithms, they have the effect of promoting smoothing rather than sparsity. In\ncontrast, sparsity-inducing priors are invariably convex in their contribution. In this work we resolve\nthis difficulty by showing how, even though concavity fails in general, we are able to derive simple\ncheckable conditions that guarantee a unique stationary point to the M-step objective function that\nserves as the unique global maximum. This rather surprising result, using the so-called Mountain\nPass Theorem, is a noteworthy contribution to the theory of learning algorithms which we expect\nhas many applications outside merely PLSA.\nSection 2 briefly outlines the structure of MAP inference for PLSA. Section 3 discusses priors\nappropriate for inducing sparsity, and introduces a generalization of the Dirichlet distribution which\nwe term the pseudo-Dirichlet distribution. Section 4 contains our main result, a tractable EM\nalgorithm for PLSA under sparse pseudo-Dirichlet priors using the Mountain Pass Theorem. Section 5\npresents empirical results for a corpus of 2,406 weblogs, and Section 6 concludes with a discussion.\n\n2 Background and preliminaries\n\n2.1 Standard PLSA\n\nWithin the PLSA framework, word-document-topic triplets (w, d, z) are assumed to be i.i.d. draws\nfrom a joint distribution on W \u00d7 D \u00d7 Z of the form\n\nP (w, d, z | \u03b8) = P (w | z)P (z | d)P (d),    (1)\n\nwhere \u03b8 consists of the model parameters P (w | z), P (z | d) and P (d) for (w, d, z) ranging over\nW \u00d7 D \u00d7 Z. 
Following [1], the corresponding data log-likelihood can be written\n\n\u21130(\u03b8) = \u2211_{w,d} n(w, d) log [\u2211_z P (w | z)P (z | d)] + \u2211_d n(d) log P (d),    (2)\n\nwhere n(w, d) is the number of occurrences of word w in document d, and n(d) = \u2211_w n(w, d) is\nthe total number of words in d. The goal is to maximize the likelihood over the set of admissible \u03b8.\nThis is accomplished using the EM algorithm, iterating between the following two steps:\nE-step: Find P (z | w, d, \u03b8'), the posterior distribution of the latent variable z, given (w, d) and a\ncurrent parameter estimate \u03b8'.\nM-step: Maximize Q0(\u03b8 | \u03b8') over \u03b8, where\n\nQ0(\u03b8 | \u03b8') = \u2211_d n(d) log P (d) + \u2211_{w,d,z} n(w, d)P (z | w, d, \u03b8') log [P (w | z)P (z | d)].\n\nWe refer to [1] for details on the derivations, as well as extensions using so-called tempered EM.\nThe resulting updates corresponding to the E-step and M-step are, respectively,\n\nP (z | w, d, \u03b8) = P (w | z)P (z | d) / \u2211_{z'} P (w | z')P (z' | d)    (3)\n\nand\n\nP (w | z) = \u2211_d P (z | w, d, \u03b8')n(w, d) / \u2211_{w',d} P (z | w', d, \u03b8')n(w', d),\nP (z | d) = \u2211_w P (z | w, d, \u03b8')n(w, d) / n(d),\nP (d) = n(d) / \u2211_{d'} n(d').    (4)\n\nNote that PLSA has an alternative parameterization, where (1) is replaced by P (w, d, z | \u03b8) =\nP (w|z)P (d|z)P (z). 
This formulation is less interesting in our context, since our sparsity\nassumption is intended as a statement about the vectors (P (z | d) : z \u2208 Z), d \u2208 D.\n\n2.2 MAP PLSA\n\nThe standard MAP extension of PLSA is to introduce a prior density P (\u03b8) on the parameter vector,\nand then maximize the posterior data log-likelihood \u2113(\u03b8) = \u21130(\u03b8) + log P (\u03b8) via the EM algorithm.\nIn order to simplify the optimization problem, we impose the reasonable restriction that the vectors\n(P (w | z) : w \u2208 W) for z \u2208 Z, (P (z | d) : z \u2208 Z) for d \u2208 D, and (P (d) : d \u2208 D) be mutually\nindependent under the prior P (\u03b8). That is,\n\nP (\u03b8) = \u220f_{z\u2208Z} fz(P (w | z) : w \u2208 W) \u00d7 \u220f_{d\u2208D} gd(P (z | d) : z \u2208 Z) \u00d7 h(P (d) : d \u2208 D),\n\nfor densities fz, gd and h on the simplexes in R^|W|, R^|Z| and R^|D|, respectively. With this structure\non P (\u03b8) one readily verifies that the M-step objective function for the MAP likelihood problem,\nQ(\u03b8 | \u03b8') = Q0(\u03b8 | \u03b8') + log P (\u03b8), is given by\n\nQ(\u03b8 | \u03b8') = \u2211_z Fz(\u03b8 | \u03b8') + \u2211_d Gd(\u03b8 | \u03b8') + H(\u03b8 | \u03b8'),\n\nwhere\n\nFz(\u03b8 | \u03b8') = \u2211_{w,d} P (z | w, d, \u03b8')n(w, d) log P (w | z) + log fz(P (w | z) : w \u2208 W),\nGd(\u03b8 | \u03b8') = \u2211_{w,z} P (z | w, d, \u03b8')n(w, d) log P (z | d) + log gd(P (z | d) : z \u2208 Z),\nH(\u03b8 | \u03b8') = \u2211_d n(d) log P (d) + log h(P (d) : d \u2208 D).\n\nAs a comment, notice that if the densities fz, gd, or h are log-concave then Fz, Gd, and H are\nconcave in \u03b8. Furthermore, the functions Fz, Gd, and H can be maximized independently, since\nthe corresponding non-negativity and normalization constraints are decoupled. 
In particular, the\n|Z| + |D| + 1 optimization problems can be solved in parallel.\n\n3 The pseudo-Dirichlet prior\n\nThe parameters for PLSA models consist of |Z| + |D| + 1 probability distributions taking their\nvalues on |Z| + |D| + 1 simplexes. The most well-known family of distributions on the simplex\nis the Dirichlet family, which has many properties that make it useful in Bayesian statistics [8].\nUnfortunately the Dirichlet distribution is not a suitable prior for modeling sparsity for PLSA, as we\nshall see, and to address this we introduce a generalization of the Dirichlet distribution which we\ncall the pseudo-Dirichlet distribution.\nTo illustrate why the Dirichlet distribution is unsuitable in the present context, consider placing a\nsymmetric Dirichlet prior on (P (z | d) : z \u2208 Z) for each document d. That is, for each d \u2208 D,\n\ngd(P (z | d) : z \u2208 Z) \u221d \u220f_{z\u2208Z} P (z | d)^(\u03b1\u22121),\n\nwhere \u03b1 > 0 is the concentration parameter. Let fz and h be constant. The relevant case for sparsity\nis when \u03b1 < 1, which concentrates the density toward the (relative) boundary of the simplex. It is\neasy to see that the distribution is in this case log-convex, which means that the contribution to the\nlog-likelihood and M-step objective function Gd(\u03b8 | \u03b8') will be convex. We address this problem\nin Section 4. A bigger problem, however, is that for \u03b1 < 1 the density of the symmetric Dirichlet\ndistribution is unbounded and the MAP likelihood problem does not have a well-defined solution,\nas the following result shows.\n\nProposition 1 Under the above assumptions on fz, gd and h there are infinitely many sequences\n(\u03b8m)_{m\u22651}, converging to distinct limits, such that lim_{m\u2192\u221e} Q(\u03b8m | \u03b8m) = \u221e. As a consequence,\n\u2113(\u03b8m) tends to infinity as well.\nProof. 
Choose \u03b8m as follows: P (d) = 1/|D| and P (w | z) = 1/|W| for all w, d and z. Fix d0 \u2208 D\nand z0 \u2208 Z, and set P (z0 | d0) = 1/m, P (z | d0) = (1 \u2212 1/m)/(|Z| \u2212 1) for z \u2260 z0, and P (z | d) = 1/|Z| for\nall z and d \u2260 d0. It is then straightforward to verify that Q(\u03b8m | \u03b8m) tends to infinity. The choice of\nd0 and z0 was arbitrary, so by choosing two other points we get a different sequence with a different\nlimit. Taking convex combinations yields the claimed infinity of sequences. The second statement\nfollows from the well-known fact that Q(\u03b8 | \u03b8') \u2264 \u2113(\u03b8) for all \u03b8 and \u03b8'. \u25a1\nThis proposition is a formal statement of the observation that when the Dirichlet prior is unbounded,\nany single zero element in P (z|d) leads to an infinite posterior likelihood, and so the optimization\nproblem is not well-posed. To overcome these unbounded Dirichlet priors while retaining their\nsparsity-inducing properties, we introduce the following class of distributions on the simplex.\nDefinition 1 A random vector confined to the simplex in R^p is said to follow a pseudo-Dirichlet\ndistribution with concentration parameter \u03b1 = (\u03b11, . . . , \u03b1p) \u2208 R^p and perturbation parameter\n\u03b5 = (\u03b51, . . . , \u03b5p) \u2208 R^p_+ if it has a density on the simplex given by\n\nP (x1, . . . , xp | \u03b1, \u03b5) = C \u220f_{i=1}^{p} (\u03b5i + xi)^(\u03b1i\u22121)    (5)\n\nfor a normalizing constant C depending on \u03b1 and \u03b5. If \u03b1i = \u03b1 and \u03b5i = \u03b5 for all i and some fixed\n\u03b1 \u2208 R, \u03b5 \u2265 0, we call the resulting distribution symmetric pseudo-Dirichlet.\nNotice that if \u03b5i > 0 for all i, the pseudo-Dirichlet density is indeed bounded for all \u03b1. If \u03b5i = 0\nand \u03b1i > 0 for all i, we recover the standard Dirichlet distribution. 
If \u03b5i = 0 and \u03b1i \u2264 0 for some\ni then the density is not integrable, but can still be used as an improper prior. Like the Dirichlet\ndistribution, when \u03b1 < 1 the pseudo-Dirichlet distribution is log-convex, and it will make a convex\ncontribution to the M-step objective function of any EM algorithm.\nThe pseudo-Dirichlet distribution can be viewed as a bounded perturbation of the Dirichlet\ndistribution, and for small values of the perturbation parameter \u03b5, many of the properties of the original\nDirichlet distribution hold approximately. In our discussion section we offer a justification for\nallowing \u03b1 \u2264 0, framed within a regularization approach.\n\n4 EM under pseudo-Dirichlet priors\n\nWe now derive an EM algorithm for PLSA under sparse pseudo-Dirichlet priors. The E-step is the\nsame as for standard PLSA, and is given by (3). The M-step consists of optimizing each Fz, Gd and\nH individually. While our M-step will not offer a closed-form maximization, we are able to derive\nsimple checkable conditions under which the M-step has a stationary point that is also the global\nmaximum. Once the conditions are satisfied, the M-step optimum can be found via a practitioner\u2019s\nfavorite root-finding algorithm. For consideration, we propose an iteration scheme that in practice\nwe find converges rapidly and well. Because our sparsity assumption focuses on the parameters\nP (z|d), we perform our main analysis on Gd, but for completeness we state the corresponding\nresult for Fz. The less applicable treatment of H is omitted.\nConsider the problem of maximizing Gd(\u03b8 | \u03b8') over (P (z | d) : z \u2208 Z) subject to \u2211_z P (z |\nd) = 1 and P (z | d) \u2265 0 for all z. We use symmetric pseudo-Dirichlet priors with parameters\n\u03b1d = (\u03b1d, . . . , \u03b1d) and \u03b5d = (\u03b5d, . . . , \u03b5d) for \u03b1d \u2208 R and \u03b5d > 0. 
Since each Gd is treated\nseparately, let us fix d and write\n\nxz = P (z | d),    cz = \u2211_w P (z | w, d, \u03b8')n(w, d),\n\nwhere the dependence on d is suppressed in the notation. For x = (xz : z \u2208 Z) and a fixed \u03b8', we\nwrite Gd(x) = Gd(\u03b8 | \u03b8'), which yields, up to an additive constant,\n\nGd(x) = \u2211_z [(\u03b1d \u2212 1) log(\u03b5d + xz) + cz log xz].\n\nThe task is to maximize Gd, subject to \u2211_z xz = 1 and xz \u2265 0 for all z. Assuming that every word\nw is observed in at least one document d and that all components of \u03b8' are strictly positive, Lemma 1\nbelow implies that any M-step optimizer must have strictly positive components. The non-negativity\nconstraint is therefore never binding, so the appropriate Lagrangian for this problem is\n\nLd(x; \u03bb) = Gd(x) + \u03bb [1 \u2212 \u2211_z xz],\n\nand it suffices to consider its stationary points.\nLemma 1 Assume that every word w has been observed in at least one document d, and that P (z |\nw, d; \u03b8') > 0 for all (w, d, z). If xz \u2192 0 for some z, and the nonnegativity and normalization\nconstraints are maintained, then Gd(x) \u2192 \u2212\u221e.\nProof. The assumption implies that cz > 0, \u2200z. Therefore, since log(\u03b5d + xz) and log xz are\nbounded from above, \u2200z, when \u03b8 stays in the feasible region, xz \u2192 0 leads to Gd(x) \u2192 \u2212\u221e. \u25a1\nThe next lemma establishes a property of the stationary points of the Lagrangian Ld which will be\nthe key to proving our main result.\nLemma 2 Let (x, \u03bb) be any stationary point of Ld such that xz > 0 for all z. Then \u03bb \u2265 n(d) \u2212\n(1 \u2212 \u03b1d)|Z|. 
If in addition to the assumptions of Lemma 1 we have n(d) \u2265 (1 \u2212 \u03b1d)|Z|, then\n\n\u2202\u00b2Gd/\u2202xz\u00b2 (x, \u03bb) < 0\n\nfor all z \u2208 Z.\nProof. We have \u2202Ld/\u2202xz = cz/xz \u2212 (1 \u2212 \u03b1d)/(\u03b5d + xz) \u2212 \u03bb. Since \u2202Ld/\u2202xz (x, \u03bb) = 0 at the stationary point, we get\n\u03bbxz = cz \u2212 (1 \u2212 \u03b1d) xz/(\u03b5d + xz) \u2265 cz \u2212 (1 \u2212 \u03b1d), which, after summing over z and using that \u2211_z xz = 1,\nyields\n\n\u03bb \u2265 \u2211_z cz \u2212 (1 \u2212 \u03b1d)|Z|.\n\nFurthermore, \u2211_z cz = \u2211_w n(w, d) \u2211_z P (z | w, d, \u03b8') = n(d), so \u03bb \u2265 n(d) \u2212 (1 \u2212 \u03b1d)|Z|.\nFor the second assertion, using once again that \u2202Ld/\u2202xz (x, \u03bb) = 0 at the stationary point, a calculation\nshows that\n\n\u2202\u00b2Gd/\u2202xz\u00b2 (x, \u03bb) = \u2212 [xz\u00b2 \u03bb + cz \u03b5d] / (xz\u00b2 (\u03b5d + xz)).\n\nThe assumptions imply that cz > 0, so it suffices to prove that \u03bb \u2265 0. This follows from our\nhypothesis and the first part of the lemma. \u25a1\nThis allows us to obtain our main result concerning the structure of the optimization problem\nassociated with the M-step.\nTheorem 1 Assume that (i) every word w has been observed in at least one document d, (ii) P (z |\nw, d, \u03b8') > 0 for all (w, d, z), and (iii) n(d) > (1 \u2212 \u03b1d)|Z| for each d. Then each Lagrangian\nLd has a unique stationary point, which is the global maximum of the corresponding optimization\nproblem, and whose components are strictly positive.\n\nThe proof relies on the following version of the so-called Mountain Pass Theorem.\nLemma 3 (Mountain Pass Theorem) Let O \u2282 R^n be open, and consider a continuously\ndifferentiable function \u03c6 : O \u2192 R s.t. \u03c6(x) \u2192 \u2212\u221e whenever x tends to the boundary of O. 
If \u03c6 has two\ndistinct strict local maxima, it must have a third stationary point that is not a strict local maximum.\nProof. See p. 223 in [9], or Theorem 5.2 in [10]. \u25a1\nProof of Theorem 1. Consider a fixed d. We first prove that the corresponding Lagrangian Ld\ncan have at most one stationary point. To simplify notation, assume without loss of generality that\nZ = {1, . . . , K}, and define\n\nG\u0303d(x1, . . . , xK\u22121) = Gd(x1, . . . , xK\u22121, 1 \u2212 \u2211_{k=1}^{K\u22121} xk).\n\nThe constrained maximization of Gd is then equivalent to maximizing G\u0303d over the open set O =\n{(x1, . . . , xK\u22121) \u2208 R^{K\u22121}_{++} : \u2211_k xk < 1}. The following facts are readily verified:\n(i) If (x, \u03bb) is a stationary point of Ld, then (x1, . . . , xK\u22121) is a stationary point of G\u0303d.\n(ii) If (x1, . . . , xK\u22121) is a stationary point of G\u0303d, then (x, \u03bb) is a stationary point of Ld, where\nxK = 1 \u2212 \u2211_{k=1}^{K\u22121} xk and \u03bb = cK/xK \u2212 (1 \u2212 \u03b1d)/(\u03b5d + xK).\n(iii) For any y = (y1, . . . , yK\u22121, \u2211_{k=1}^{K\u22121} yk), we have y^T \u2207\u00b2G\u0303d y = \u2211_{k=1}^{K} yk\u00b2 \u2202\u00b2Gd/\u2202xk\u00b2.\nNow, suppose that (x, \u03bb) is a stationary point of Ld. Property (i) and property (iii) in conjunction\nwith Lemma 2 imply that (x1, . . . , xK\u22121) is a stationary point of G\u0303d and that \u2207\u00b2G\u0303d is negative\ndefinite there. Hence it is a strict local maximum.\nNext, suppose for contradiction that there are two distinct such points. By Lemma 1, G\u0303d tends to\n\u2212\u221e near the boundary of O, so we may apply the mountain pass theorem to get the existence of a\nthird point (x\u03031, . . . , x\u0303K\u22121), stationary for G\u0303d, that is not a strict local maximum. But by (ii), this\nyields a corresponding stationary point (x\u0303, \u03bb\u0303) for Ld. 
The same reasoning as above then shows that\n(x\u03031, . . . , x\u0303K\u22121) has to be a strict local max for G\u0303d, which is a contradiction. We deduce that Ld has\nat most one stationary point.\nFinally, the continuity of Gd together with its boundary behavior (Lemma 1) implies that a\nmaximizer exists and has strictly positive components. But the maximizer must be a stationary point of\nLd, so together with the previously established uniqueness, the result follows. \u25a1\nCondition (i) in Theorem 1 is not a real restriction, since a word that does not appear in any\ndocument typically will be removed from the vocabulary. Moreover, if the EM algorithm is initialized\nsuch that P (z | w, d; \u03b8') > 0 for all (w, d, z), Theorem 1 ensures that this will be the case for all\nfuture iterates as well. The critical assumption is Condition (iii). It can be thought of as ensuring that\nthe prior does not drown the data. 
Indeed, sufficiently large negative values of \u03b1d, corresponding to\nstrong prior beliefs, will cause the condition to fail.\nWhile there are various methods available for finding the stationary point of Ld, we have found that\nthe following fixed-point type iterative scheme produces satisfactory results.\n\nxz \u2190 cz / (n(d) + (1 \u2212 \u03b1d) [1/(\u03b5d + xz) \u2212 \u2211_{z'} x_{z'}/(\u03b5d + x_{z'})]),    xz \u2190 xz / \u2211_z xz.    (6)\n\nTo motivate this particular update rule, recall that\n\n\u2202Ld/\u2202xz = cz/xz \u2212 (1 \u2212 \u03b1d)/(\u03b5d + xz) \u2212 \u03bb,    \u2202Ld/\u2202\u03bb = 1 \u2212 \u2211_z xz.\n\nAt the stationary point, \u03bbxz = cz \u2212 (1 \u2212 \u03b1d) xz/(\u03b5d + xz), so by summing over z and using that \u2211_z xz = 1\nand \u2211_z cz = n(d), we get \u03bb = n(d) \u2212 (1 \u2212 \u03b1d) \u2211_z xz/(\u03b5d + xz). Substituting this for \u03bb in \u2202Ld/\u2202xz = 0\nand rearranging terms yields the first part of (6). Notice that Lemma 2 ensures that the denominator\nstays strictly positive. Further, the normalization is a classic technique to restrict x to the simplex.\nNote that (6) reduces to the standard PLSA update (4) if \u03b1d = 1.\nFor completeness we also consider the topic-vocabulary distribution (P (w|z) : w \u2208 W). We\nimpose a symmetric pseudo-Dirichlet prior on the vector (P (w | z) : w \u2208 W) for each z \u2208 Z. The\ncorresponding parameters are denoted by \u03b1z and \u03b5z. Each Fz is optimized individually, so we fix\nz \u2208 Z and write yw = P (w | z). 
The objective function Fz(y) = Fz(\u03b8 | \u03b8') is then given by\n\nFz(y) = \u2211_w [(\u03b1z \u2212 1) log(\u03b5z + yw) + bw log yw],    bw = \u2211_d P (z | w, d, \u03b8')n(w, d).    (7)\n\nThe following is an analog of Theorem 1, whose proof is essentially the same and therefore omitted.\nTheorem 2 Assume conditions (i) and (ii) of Theorem 1 are satisfied, and that for each z \u2208 Z,\n\u2211_w bw \u2265 (1 \u2212 \u03b1z)|W|. Then each Fz has a unique local optimum on the simplex, which is also a\nglobal maximum and whose components are strictly positive.\n\nUnfortunately there is no simple expression for \u2211_w bw in terms of the inputs to the problem. On\nthe other hand, the sum can be evaluated at the beginning of each M-step, which makes it possible\nto verify that \u03b1z is not too negative.\n\n5 Empirical evaluation\n\nTo evaluate our framework for sparse mixture model inference, we develop a MAP PLSA topic\nmodel for a corpus of 2,406 blogger.com blogs, a dataset originally analyzed by Schler et al. [11]\nfor the role of gender in language. Unigram frequencies for the blogs were built using the Python\nNLTK toolkit [12]. Inference was run on the document-word distribution of 2,406 blogs and the 2,000\nmost common words, as determined by the aggregate frequencies across the entire corpus. The\nimplication of Section 4 is that in order to adapt PLSA for sparse MAP inference, we simply need\nto replace equation (4) from PLSA\u2019s ordinary M-step with an iteration of (6).\nThe corpus also contains a user-provided \u2018category\u2019 for each blog, indicating one of 28 categories.\nWe focused our analysis on 8 varied but representative topics, while the complete corpus contained\nover 19,000 blogs. 
The user-provided topic labels are quite noisy, and so in order to have cleaner\nground truth data for evaluating our model we chose to also construct a synthetic, sparse dataset.\nThis synthetic dataset is employed to evaluate parameter choices within the model.\nTo generate our synthetic data, we ran PLSA on our text corpus and extracted the inferred P (w|z)\nand P (d) distributions, while creating 2,406 synthetic P (z|d) distributions where each synthetic\nblog was a uniform mixture of between 1 and 4 topics. These distributions were then used to\nconstruct a ground-truth word-document distribution Q(w, d), which we then sampled N times,\nwhere N is the total number of words in our true corpus. In this way we were able to generate a\nrealistic synthetic dataset with a sparse and known document-topic distribution.\nWe evaluate the quality of each model by calculating the model perplexity of the reconstructed\nword-document distribution as compared to the underlying ground truth distribution used to generate the\nsynthetic data. Here model perplexity is given by\n\nP(P (w, d)) = 2^(\u2212\u2211_{w,d} Q(w, d) log_2 P (w, d)),\n\nwhere Q(w, d) is the true document-word distribution used to generate the synthetic dataset and\nP (w, d) is the reconstructed matrix inferred by the model. Using this synthetic dataset we are able\nto evaluate the roles of \u03b1 and \u03b5 in our algorithm, as seen in Figure 1.\nFrom Figure 1 we can conclude that \u03b1 should in practice be chosen close to the algorithm\u2019s feasible\nlower bound, and \u03b5 can be almost arbitrarily small. Choosing \u03b1 = \u23081 \u2212 max_d n(d)/k\u2309 and \u03b5 = 10^(\u22126),\nwe return to our blog data with its user-provided labels. In Figure 2 we see that sparse inference\nindeed results in P (z|d) distributions with significantly sparser support. Furthermore, we can more\neasily see how certain categories of blogs cluster in their usage of certain topics. 
For example,\na majority of the blogs self-categorized as pertaining to \u2018religion\u2019 employ almost exclusively the\nsecond topic vocabulary of the model. The five most exceptional unigrams for this topic are \u2018prayer\u2019,\n\u2018christ\u2019, \u2018jesus\u2019, \u2018god\u2019, and \u2018church\u2019.\n\n6 Discussion\n\nWe have shown how certain latent variable mixture models can be tractably extended with\nsparsity-inducing priors using what we call the pseudo-Dirichlet distribution. Our main theoretical result\nshows that the resulting M-step maximization problem is well-behaved despite the lack of concavity,\nand empirical findings indicate that the approach is indeed effective. Our use of the Mountain Pass\nTheorem to prove that all local optima coincide is to the best of our knowledge new in the literature,\nand we find it intriguing and surprising that the global properties of maximizers, which are very\nrarely susceptible to analysis in the absence of concavity, can be studied using this tool.\nThe use of log-convex priors (equivalently, concave regularization functions) to encourage sparsity is\nparticularly relevant when the parameters of the model correspond to probability distributions. Since\neach distribution has a fixed L1 norm equal to one, the use of L1-regularization, which otherwise\nwould be the natural choice for inducing sparsity, becomes toothless. The pseudo-Dirichlet prior\nwe introduce corresponds to a concave regularization of the form \u2211_i log(xi + \u03b5). We mention in\npassing that the same sum-log regularization has also been used for sparse signal recovery in [13].\n\nFigure 1: Model perplexity for inferred models with k = 8 topics as a function of the concentration\nparameter \u03b1 of the pseudo-Dirichlet prior, shown from the algorithm\u2019s lower bound \u03b1 = 1 \u2212 n(d)/k\nto the uniform prior case of \u03b1 = 1. Three different choices of \u03b5 are shown, as well as the\nbaseline PLSA perplexity corresponding to a uniform prior. 
The dashed line indicates the perplexity\nP(Q(w, d)), which should be interpreted as a lower bound.\n\nFigure 2: Document-topic distributions P (z|d) for the 8 different categories of blogs studied. All\ndistributions share the same color scale.\n\nIt should be emphasized that the notion of sparsity we discuss in this work is not in the formal sense\nof a small L0 norm. Indeed, Theorem 1 shows that, no different from ordinary PLSA, the estimated\nparameters for MAP PLSA will all be strictly positive. Instead, we seek sparsity in the sense that\nmost parameters should be almost zero.\nNext, let us comment on the possibility to allow the concentration parameter \u03b1d to be negative,\nassuming for simplicity that fz and h are constant. Consider the normalized likelihood, where\nclearly \u2113(\u03b8) may be replaced by \u2113(\u03b8)/N,\n\n\u2113(\u03b8)/N = \u21130(\u03b8)/N \u2212 \u2211_d ((1 \u2212 \u03b1d)/N) \u2211_z log(\u03b5d + P (z | d)),\n\nwhich by (2) we deduce only depends on the data through the normalized quantities n(w, d)/N.\nThis indicates that the quantity (1 \u2212 \u03b1d)/N, which plays the role of a regularization \u2018gain\u2019 in the\nnormalized problem, must be non-negligible in order for the regularization to have an effect. For\nrealistic sizes of N, allowing \u03b1d < 0 therefore becomes crucial.\nFinally, while we have chosen to present our methodology as applied to topic models, we expect the\nsame techniques to be useful in a notably broader context. 
In particular, our methodology is directly\napplicable to problems solved through Nonnegative Matrix Factorization (NMF), a close relative of\nPLSA where matrix columns or rows are often similarly constrained in their L1 norm.\nAcknowledgments: This work is supported in part by NSF grant IIS-0910664.\n\n[Figure 1 plot: model perplexity (\u00d710^5) versus \u03b1, with curves for \u03b5 = 10^(\u22126), \u03b5 = 0.1, \u03b5 = 1, and PLSA.]\n[Figure 2 panels: P(z|d) under PLSA and under Ps-Dir MAP for the categories Religion, Internet, Engineering, Technology, Fashion, Media, Tourism, and Law.]\n\nReferences\n[1] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning,\n42:177\u2013196, 2001.\n[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\nResearch, 3:993\u20131022, 2003.\n[3] A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. In European Conference\non Computer Vision, 2006.\n[4] I. Psorakis and B. Sheldon. Soft Partitioning in Networks via Bayesian Non-negative Matrix\nFactorization. In NIPS, 2010.\n[5] C. Ding, T. Li, and W. Peng. Nonnegative matrix factorization and probabilistic latent semantic\nindexing: Equivalence, chi-square statistic, and a hybrid method. In Proceedings of AAAI \u201906,\nvolume 21, page 342, 2006.\n[6] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In Proceedings\nof ACM SIGIR, pages 601\u2013602. ACM, 2005.\n[7] A. Asuncion, M. Welling, P. Smyth, and Y.W. Teh. On smoothing and inference for topic\nmodels. In Proc. 
of the 25th Conference on Uncertainty in Artificial Intelligence, pages 27\u201334,\n2009.\n[8] A. Gelman. Bayesian data analysis. CRC Press, 2004.\n[9] R. Courant. Dirichlet\u2019s principle, conformal mapping, and minimal surfaces. Interscience,\nNew York, 1950.\n[10] Y. Jabri. The Mountain Pass Theorem: Variants, Generalizations and Some Applications.\nCambridge University Press, 2003.\n[11] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and gender on blogging.\nIn Proc. of the AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs,\npages 191\u2013197, 2006.\n[12] S. Bird, E. Klein, and E. Loper. Natural language processing with Python. O\u2019Reilly Media,\n2009.\n[13] E.J. Cand\u00e8s, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted \u21131 minimization.\nJournal of Fourier Analysis and Applications, 14:877\u2013905, 2008.\n", "award": [], "sourceid": 1067, "authors": [{"given_name": "Martin", "family_name": "Larsson", "institution": null}, {"given_name": "Johan", "family_name": "Ugander", "institution": null}]}