{"title": "On Algorithms for Sparse Multi-factor NMF", "book": "Advances in Neural Information Processing Systems", "page_first": 602, "page_last": 610, "abstract": "Nonnegative matrix factorization (NMF) is a popular data analysis method, the objective of which is to decompose a matrix with all nonnegative components into the product of two other nonnegative matrices.  In this work, we describe a new simple and efficient algorithm for multi-factor nonnegative matrix factorization problem ({mfNMF}), which generalizes the original NMF problem to more than two factors.  Furthermore, we extend the mfNMF algorithm to incorporate a regularizer based on Dirichlet distribution over normalized columns to encourage sparsity in the obtained factors.  Our sparse NMF algorithm affords a closed form and an intuitive interpretation, and is more efficient in comparison with previous works that use fix point iterations. We demonstrate the effectiveness and efficiency of our algorithms on both synthetic and real data sets.", "full_text": "On Algorithms for Sparse Multi-factor NMF\n\nXin Wang\nSiwei Lyu\nComputer Science Department\nUniversity at Albany, SUNY\n\nAlbany, NY 12222\n\n{slyu,xwang26}@albany.edu\n\nAbstract\n\nNonnegative matrix factorization (NMF) is a popular data analysis method, the\nobjective of which is to approximate a matrix with all nonnegative components\nby the product of two nonnegative matrices. In this work, we describe a new\nsimple and efficient algorithm for the multi-factor nonnegative matrix factorization\n(mfNMF) problem, which generalizes the original NMF problem to more than two\nfactors. Furthermore, we extend the mfNMF algorithm to incorporate a regularizer\nbased on the Dirichlet distribution to encourage the sparsity of the components of\nthe obtained factors. Our sparse mfNMF algorithm affords a closed form and an\nintuitive interpretation, and is more efficient in comparison with previous works\nthat use fixed-point iterations. 
We demonstrate the effectiveness and efficiency of\nour algorithms on both synthetic and real data sets.\n\n1 Introduction\n\nThe goal of nonnegative matrix factorization (NMF) is to approximate a nonnegative matrix V\nwith the product of two nonnegative matrices, as V \u2248 W1W2. Since the seminal work of [1] that\nintroduced simple and efficient multiplicative update algorithms for solving the NMF problem, it has\nbecome a popular data analysis tool for applications where nonnegativity constraints are natural [2].\nIn this work, we address the multi-factor NMF (mfNMF) problem, where a nonnegative matrix V is\napproximated with the product of K \u2265 2 nonnegative matrices, V \u2248 \u220f_{k=1}^{K} Wk. It has been argued\nthat using more factors in NMF can improve the algorithm\u2019s stability, especially for ill-conditioned\nand badly scaled data [2]. Introducing multiple factors into the NMF formulation can also find\npractical applications, for instance, extracting hierarchies of topics representing different levels of\nabstract concepts in document analysis or image representations [2].\nMany practical applications also require the obtained nonnegative factors to be sparse, i.e., having\nmany zero components. Most early works focus on the matrix \u21131 norm [6], but it has been pointed\nout that the \u21131 norm becomes completely toothless for factors that have constant \u21131 norms, as in the case\nof stochastic matrices [7, 8]. Regularizers based on the entropic prior [7] or the Dirichlet prior [8]\nhave been shown to be more effective but do not afford closed-form solutions.\nThe main contribution of this work is therefore two-fold. First, we describe a new algorithm for\nthe mfNMF problem. Our solution to mfNMF seeks optimal factors that minimize the total difference\nbetween V and \u220f_{k=1}^{K} Wk, and it is based on the solution of a special matrix optimization\nproblem that we term the stochastic matrix sandwich (SMS) problem. 
We show that the SMS\nproblem affords a simple and efficient algorithm consisting of only multiplicative update and\nnormalization (Lemma 2). The second contribution of this work is a new algorithm to incorporate the\nDirichlet sparsity regularizer in mfNMF. Our formulation of the sparse mfNMF problem leads to\na new closed-form solution, and the resulting algorithm naturally embeds sparsity control into the\nmfNMF algorithm without iteration (Lemma 3). We further show that the update steps of our sparse\nmfNMF algorithm afford a simple and intuitive interpretation. We demonstrate the effectiveness and\nefficiency of our sparse mfNMF algorithm on both synthetic and real data.\n\n2 Related Works\n\nMost existing works generalizing the original NMF problem to more than two factors are based on\nthe multi-layer NMF formulation, in which we solve a sequence of two-factor NMF problems, as\nV \u2248 W1H1, H1 \u2248 W2H2,\u00b7\u00b7\u00b7 , and HK\u22121 \u2248 WK\u22121WK [3\u20135]. Though simple and efficient,\nsuch greedy algorithms are not associated with a consistent objective function involving all factors\nsimultaneously. Because of this, the obtained factors may be suboptimal when measured by the difference\nbetween the target matrix V and their product. On the other hand, there exist far fewer works\ndirectly solving the mfNMF problem; one example is a multiplicative algorithm based on the general\nBregman divergence [9]. 
In this work, we focus on the generalized Kullback-Leibler divergence (a\nspecial case of the Bregman divergence), and use its decomposing property to simplify the mfNMF\nobjective and remove the undesirable multitude of equivalent solutions in the general formulation.\nThese changes lead to a more efficient algorithm that usually converges to a better local minimum\nof the objective function in comparison with the work in [9] (see Section 6).\nAs a common means of encouraging sparsity in machine learning, the \u21131 norm has been incorporated\ninto the NMF objective function [6] as a sparsity regularizer. However, the \u21131 norm may be\nless effective for nonnegative matrices, for which it reduces to the sum of all elements and can be\ndecreased trivially by scaling down all factors without affecting the number of zero components.\nFurthermore, the \u21131 norm becomes completely toothless in cases when the nonnegative factors are\nconstrained to have constant column or row sums (as in the case of stochastic matrices).\nAn alternative solution is to use the Shannon entropy of each column of the matrix factor as sparsity\nregularizer [7], since a vector with unit \u21131 norm and low entropy implies that only a few of its components\nare significant. However, the entropic prior based regularizer does not afford a closed form\nsolution, and an iterative fixed point algorithm is described based on Lambert\u2019s W-function in [7].\nAnother regularizer is based on the symmetric Dirichlet distribution with concentration parameter\n\u03b1 < 1, as such a model allocates more probability weights to sparse vectors on a probability simplex\n[8, 10, 11]. However, using the Dirichlet regularizer has a practical problem, as it can become\nunbounded when there is a zero element in the factor. Simply ignoring such cases as in [11] can\nlead to an unstable algorithm (see Section 5.2). 
Two approaches have been described to solve this problem:\none is based on the constrained concave-convex procedure (CCCP) [10]; the other uses the\npseudo-Dirichlet regularizer [8], which is a bounded perturbation of the original Dirichlet model. A\ndrawback common to these methods is that they require extra iterations for the fixed-point algorithm.\nFurthermore, the effects of the updating steps on the sparsity of the resulting factors are obscured by\nthe iterative steps. In contrast, our sparse mfNMF algorithm uses the original Dirichlet model and\ndoes not require extra fixed-point iterations. More interestingly, the update steps of our sparse mfNMF\nalgorithm afford a simple and intuitive interpretation.\n\n3 Basic Definitions\n\nWe denote 1_m as the all-one column vector of dimension m and I_m as the m-dimensional identity\nmatrix, and use V \u2265 0 or v \u2265 0 to indicate that all elements of a matrix V or a vector v are\nnonnegative. Throughout the paper, we assume a matrix does not have all-zero columns or rows. An\nm \u00d7 n nonnegative matrix V is stochastic if V^T 1_m = 1_n, i.e., each column has a total sum of one.\nAlso, for stochastic matrices W1 and W2, their product W1W2 is also stochastic. Furthermore, an\nm \u00d7 n nonnegative matrix V can be uniquely represented as V = SD, with an n \u00d7 n nonnegative\ndiagonal matrix D = diag(V^T 1_m) and an m \u00d7 n stochastic matrix S = V D^{\u22121}.\nFor nonnegative matrices V and W, their generalized Kullback-Leibler (KL) divergence [1] is\ndefined as\n\nd(V, W) = \u2211_{i=1}^{m} \u2211_{j=1}^{n} ( V_ij log( V_ij / W_ij ) \u2212 V_ij + W_ij ).   (1)\n\nWe have d(V, W) \u2265 0 and the equality holds if and only if V = W (footnote 1). 
We emphasize the following\ndecomposition of the generalized KL divergence: representing V and W as products of stochastic\nmatrices and diagonal matrices, V = S(V)D(V) and W = S(W)D(W), we can decompose d(V, W)\ninto two terms involving only stochastic matrices or diagonal matrices, as\n\nd(V, W) = \u2211_{i=1}^{m} \u2211_{j=1}^{n} V_ij log( S(V)_ij / S(W)_ij ) + d(D(V), D(W)) = d(V, S(W)D(V)) + d(D(V), D(W)).   (2)\n\nDue to space limit, the proof of Eq.(2) is deferred to the supplementary materials.\n\n4 Multi-Factor NMF\n\nIn this work, we study the multi-factor NMF problem based on the generalized KL divergence.\nSpecifically, given an m \u00d7 n nonnegative matrix V, we seek K \u2265 2 matrices Wk of dimensions\nl_{k\u22121} \u00d7 l_k for k = 1,\u00b7\u00b7\u00b7,K, with l_0 = m and l_K = n, that minimize d(V, \u220f_{k=1}^{K} Wk), s.t. Wk \u2265 0.\nThis simple formulation has a drawback as it is invariant to relative scalings between the factors:\nfor any \u03b3 > 0, we have d(V, W1 \u00b7\u00b7\u00b7 Wi \u00b7\u00b7\u00b7 Wj \u00b7\u00b7\u00b7 WK) = d(V, W1 \u00b7\u00b7\u00b7 (\u03b3Wi) \u00b7\u00b7\u00b7 ((1/\u03b3)Wj) \u00b7\u00b7\u00b7 WK).\nIn other words, there exist an infinite number of equivalent solutions, which gives rise to an intrinsic\nill-posed nature of the mfNMF problem.\nTo alleviate this problem, we constrain the first K \u2212 1 factors, W1,\u00b7\u00b7\u00b7,WK\u22121, to be stochastic\nmatrices, and differentiate the notations with X1,\u00b7\u00b7\u00b7,XK\u22121. Using the property of nonnegative\nmatrices, we represent the last nonnegative factor WK as the product of a stochastic matrix XK\nand a nonnegative diagonal matrix D(W). As such, we represent the nonnegative matrix \u220f_{k=1}^{K} Wk\nas the product of a stochastic matrix S(W) = \u220f_{k=1}^{K} Xk and a diagonal matrix D(W). Similarly,\nwe also decompose the target nonnegative matrix V as the product of a stochastic matrix S(V)\nand a nonnegative diagonal matrix D(V). 
It is not difficult to see that any solution from this more\nconstrained formulation leads to a solution to the original problem and vice versa.\nApplying the decomposition in Eq.(2), the mfNMF optimization problem can be re-expressed as\n\nmin_{X1,\u00b7\u00b7\u00b7,XK, D(W)} d(V, (\u220f_{k=1}^{K} Xk) D(V)) + d(D(V), D(W))\ns.t. Xk^T 1_{l_{k\u22121}} = 1_{l_k}, Xk \u2265 0, k = 1,\u00b7\u00b7\u00b7,K, D(W) \u2265 0.\n\nAs such, the original problem is solved with two sub-problems, each for a different set of variables.\nThe first sub-problem solves for the diagonal matrix D(W), as:\n\nmin_{D(W)} d(D(V), D(W)), s.t. D(W) \u2265 0.\n\nPer the property of the generalized KL divergence, its solution is trivially given by D(W) = D(V).\nThe second sub-problem optimizes the K stochastic factors, X1,\u00b7\u00b7\u00b7,XK, which, after dropping\nirrelevant constants and rearranging terms, becomes\n\nmax_{X1,\u00b7\u00b7\u00b7,XK} \u2211_{i=1}^{m} \u2211_{j=1}^{n} V_ij log (\u220f_{k=1}^{K} Xk)_ij, s.t. Xk^T 1_{l_{k\u22121}} = 1_{l_k}, Xk \u2265 0, k = 1,\u00b7\u00b7\u00b7,K.   (3)\n\nNote that Eq.(3) is essentially the maximization of the similarity between the stochastic part of V,\nS(V), and the stochastic matrix formed as the product of the K stochastic matrices X1,\u00b7\u00b7\u00b7,XK,\nweighted by D(V).\n\n4.1 Stochastic Matrix Sandwich Problem\n\nBefore describing the algorithm solving (3), we first derive the solution to another related problem\nthat we term the stochastic matrix sandwich (SMS) problem, from which we can construct a\nsolution to (3). 
Specifically, in an SMS problem one maximizes the following objective function\nwith regard to an m' \u00d7 n' stochastic matrix X, as\n\nmax_X \u2211_{i=1}^{m} \u2211_{j=1}^{n} C_ij log (AXB)_ij, s.t. X^T 1_{m'} = 1_{n'}, X \u2265 0,   (4)\n\nwhere A and B are two known stochastic matrices of dimensions m \u00d7 m' and n' \u00d7 n, respectively,\nand C is an m \u00d7 n nonnegative matrix.\n\nFootnote 1: In computing the generalized KL divergence, we define 0 log 0 = 0 and 0/0 = 0.\n\nWe note that (4) is a convex optimization problem [12], which can be solved with general numerical\nprocedures such as interior-point methods. However, we present here a new simple solution based\non multiplicative updates and normalization, which completely obviates control parameters such as\nthe step sizes. We first show that there exists an \u201cauxiliary function\u201d to log (AXB)_ij.\nLemma 1. Let us define\n\nF_ij(X; X\u0303) = \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} [ A_ik X\u0303_kl B_lj / (A X\u0303 B)_ij ] log( X_kl / X\u0303_kl ) + log (A X\u0303 B)_ij,\n\nthen we have F_ij(X; X\u0303) \u2264 log (AXB)_ij and F_ij(X; X) = log (AXB)_ij.\nProof of Lemma 1 can be found in the supplementary materials.\nBased on Lemma 1 we can develop an EM-style iterative algorithm to optimize (4), in which, starting\nwith an initial value X = X_0, we iteratively solve the following optimization problem,\n\nX_{t+1} \u2190 argmax_{X: X^T 1_{m'} = 1_{n'}, X \u2265 0} \u2211_{i=1}^{m} \u2211_{j=1}^{n} C_ij F_ij(X; X_t) and t \u2190 t + 1.   (5)\n\nUsing the relations given in Lemma 1, we have:\n\n\u2211_{i,j} C_ij log (A X_t B)_ij = \u2211_{i,j} C_ij F_ij(X_t; X_t) \u2264 \u2211_{i,j} C_ij F_ij(X_{t+1}; X_t) \u2264 \u2211_{i,j} C_ij log (A X_{t+1} B)_ij,\n\nwhich shows that each iteration of (5) leads to feasible X and does not decrease the objective function of (4). 
Rearranging terms and expressing results using matrix operations, we can simplify the\nobjective function of (5) as\n\n\u2211_{i,j} C_ij F_ij(X; X\u0303) = \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} M_kl log X_kl + const,   (6)\n\nwith\n\nM = X\u0303 \u2297 [ A^T ( C \u2298 (A X\u0303 B) ) B^T ],   (7)\n\nwhere \u2297 and \u2298 correspond to element-wise matrix multiplication and division, respectively. A\ndetailed derivation of (6) and (7) is given in the supplemental materials. The following result shows\nthat the resulting optimization has a closed-form solution.\nLemma 2. The global optimal solution to the following optimization problem,\n\nmax_X \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} M_kl log X_kl, s.t. X^T 1_{m'} = 1_{n'}, X \u2265 0,   (8)\n\nis given by\n\nX_kl = M_kl / \u2211_{k'} M_{k'l}.\n\nProof of Lemma 2 can be found in the supplementary materials.\nNext, we can construct a coordinate-wise optimization solution to the mfNMF problem (3) that\niteratively optimizes each Xk while fixing the others, based on the solution to the SMS problem\ngiven in Lemma 2. 
In particular, it is easy to see that for C = V,\n\n\u2022 solving for X1 with fixed X2,\u00b7\u00b7\u00b7,XK is an SMS problem with A = I_m, X = X1 and B = \u220f_{k=2}^{K} Xk;\n\u2022 solving for Xk, k = 2,\u00b7\u00b7\u00b7,K \u2212 1, with fixed X1,\u00b7\u00b7\u00b7,Xk\u22121, Xk+1,\u00b7\u00b7\u00b7,XK is an SMS\nproblem with A = \u220f_{k'=1}^{k\u22121} Xk', X = Xk, and B = \u220f_{k'=k+1}^{K} Xk';\n\u2022 and solving for XK with fixed X1,\u00b7\u00b7\u00b7,XK\u22121 is an SMS problem with A = \u220f_{k=1}^{K\u22121} Xk,\nX = XK and B = I_n.\n\nIn practice, we do not need to run each SMS optimization to convergence, and the algorithm can be\nimplemented with a few fixed steps updating each factor in order.\nIt should be pointed out that even though SMS is a convex optimization problem guaranteed to have\na global optimal solution, this is not the case for the general mfNMF problem (3), the objective\nfunction of which is non-convex (it is an example of the multi-convex function [13]).\n\n5 Sparse Multi-Factor NMF\n\nNext, we describe incorporating sparsity regularization in the mfNMF formulation. We assume that\nthe sparsity requirement is applied to each individual factor in the mfNMF objective function (3), as\n\nmax_{X1,\u00b7\u00b7\u00b7,XK} \u2211_{i,j} V_ij log (\u220f_{k=1}^{K} Xk)_ij + \u2211_{k=1}^{K} \u2113(Xk), s.t. Xk^T 1_{l_{k\u22121}} = 1_{l_k}, Xk \u2265 0,   (9)\n\nwhere \u2113(X) is the sparsity regularizer that is larger for stochastic matrix X with more zero entries.\nAs the overall mfNMF can be solved by optimizing each individual factor in an SMS problem, we\nfocus here on the case where the sparsity regularizer of each factor is introduced in (4), to solve\n\nmax_X \u2211_{i,j} C_ij log (AXB)_ij + \u2113(X), s.t. X^T 1_{m'} = 1_{n'}, X \u2265 0.   (10)\n\n5.1 Dirichlet Sparsity Regularizer\nAs we have mentioned, the typical choice of \u2113(\u00b7) as the matrix \u21131 norm is problematic in (10), as\n\u2016X\u20161 = \u2211_{ij} X_ij = n' is a constant. On the other hand, if we treat each column of X as a point on\na probability simplex, as their elements are nonnegative and sum to one, then we can induce a sparse\nregularizer from the Dirichlet distribution. Specifically, a Dirichlet distribution of d-dimensional\nvectors v : v \u2265 0, 1^T v = 1 is defined as Dir(v; \u03b1) = [\u0393(d\u03b1) / \u0393(\u03b1)^d] \u220f_{k=1}^{d} v_k^{\u03b1\u22121}, where \u0393(\u00b7) is the standard\nGamma function (footnote 2). The parameter \u03b1 \u2208 [0, 1] controls the sparsity of samples:\nsmaller \u03b1 corresponds to higher likelihood of sparse v in Dir(v; \u03b1).\nIncorporating a Dirichlet regularizer with parameter \u03b1_l to each column of X and dropping irrelevant\nconstant terms, (10) reduces to (footnote 3)\n\nmax_X \u2211_{i=1}^{m} \u2211_{j=1}^{n} C_ij log (AXB)_ij + \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} (\u03b1_l \u2212 1) log X_kl, s.t. X^T 1_{m'} = 1_{n'}, X \u2265 0.   (11)\n\nAs in the case of mfNMF, we introduce the auxiliary function of log (AXB)_ij to form a lower\nbound of (11) and use an EM-style algorithm to optimize (11). Using the result given in Eqs.(6) and\n(7), the optimization problem can be further simplified as:\n\nmax_X \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} (M_kl + \u03b1_l \u2212 1) log X_kl, s.t. X^T 1_{m'} = 1_{n'}, X \u2265 0.   (12)\n\n5.2 Solution to SMS with Dirichlet Sparse Regularizer\nHowever, a direct optimization of (12) is problematic when \u03b1_l < 1: if there exists M_kl < 1 \u2212 \u03b1_l, the\nobjective function of (12) becomes non-convex and unbounded, since the term (M_kl + \u03b1_l \u2212 1) log X_kl\napproaches \u221e as X_kl \u2192 0. This problem is addressed in [8] by modifying the definition of the\nDirichlet regularizer in (11) to (\u03b1_l \u2212 1) log(X_kl + \u03b5), where \u03b5 > 0 is a predefined parameter. This\navoids the problem of taking logarithm of zero, but it leads to a less efficient algorithm based on an\niterative fixed-point procedure. In addition, the fixed-point algorithm is difficult to interpret as its effect\non the sparsity of the obtained factors is obscured by the iterative steps.\nOn the other hand, notice that if we tighten the nonnegativity constraint to X_kl \u2265 \u03b5, the objective\nfunction of (12) will always be finite. Therefore, we can simply modify the optimization of (12) to\nprevent the objective function from becoming infinite, as:\n\nmax_X \u2211_{k=1}^{m'} \u2211_{l=1}^{n'} (M_kl + \u03b1_l \u2212 1) log X_kl, s.t. X^T 1_{m'} = 1_{n'}, X_kl \u2265 \u03b5, \u2200k, l.   (13)\n\nThe following result shows that with a sufficiently small \u03b5, the constrained optimization problem in\n(13) has a unique global optimal solution that affords a closed-form and intuitive interpretation.\n\nFootnote 2: For simplicity, we only discuss the symmetric Dirichlet model, but the method can be easily extended to\nthe non-symmetric Dirichlet model with different \u03b1 value for different dimension.\n\nFootnote 3: Alternatively, this special case of NMF can be formulated as C = AXB + E, where E contains independent\nPoisson samples [14], and (11) can be viewed as a (log) maximum a posteriori estimation of column\nvectors of X with a Poisson likelihood and symmetric Dirichlet prior.\n\nFigure 1: Sparsification effects on the updated vectors before (left) and after (right) applying the algorithm\ngiven in Lemma 3, with each column illustrating one of the three cases (panels: case 1, case 2, case 3).\n\nLemma 3. Without loss of generality, we assume M_kl \u2260 1 \u2212 \u03b1_l, \u2200k, l (footnote 4). If we choose a constant\n\u03b5 \u2208 (0, min_{kl}{|M_kl + \u03b1_l \u2212 1|} / (m' max_{kl}{|M_kl + \u03b1_l \u2212 1|})), and for each column l define N_l = {k | M_kl < 1 \u2212 \u03b1_l} as the set of\nelements with M_kl + \u03b1_l \u2212 1 < 0, then the following is the global optimal solution to (13):\n\n\u2022 case 1. |N_l| = 0, i.e., all constant coefficients of (13) are positive,\n\nX\u0302_kl = (M_kl + \u03b1_l \u2212 1) / \u2211_{k'} [M_{k'l} + \u03b1_l \u2212 1],   (14)\n\n\u2022 case 2. 0 < |N_l| < m', i.e., the constant coefficients of (13) have mixed signs,\n\nX\u0302_kl = \u03b5 \u00b7 \u03b4[k \u2208 N_l] + ( (1 \u2212 |N_l|\u03b5) [M_kl + \u03b1_l \u2212 1] / \u2211_{k'} {[M_{k'l} + \u03b1_l \u2212 1] \u00b7 \u03b4[k' \u2209 N_l]} ) \u00b7 \u03b4[k \u2209 N_l],   (15)\n\nwhere \u03b4(c) is the Kronecker function that takes 1 if c is true and 0 otherwise.\n\u2022 case 3. |N_l| = m', i.e., all constant coefficients of (13) are negative,\n\nX\u0302_kl = (1 \u2212 (m' \u2212 1)\u03b5) \u00b7 \u03b4[k = argmax_{k' \u2208 {1,\u00b7\u00b7\u00b7,m'}} M_{k'l}] + \u03b5 \u00b7 \u03b4[k \u2260 argmax_{k' \u2208 {1,\u00b7\u00b7\u00b7,m'}} M_{k'l}].   (16)\n\nProof of Lemma 3 can be found in the supplementary materials. Note that the algorithm provided in\nLemma 3 is still valid when \u03b5 = 0, but the theoretical result of it attaining the global optimum of a\nfinite optimization problem only holds for \u03b5 satisfying the condition in Lemma 3.\nWe can provide an intuitive interpretation to Lemma 3, which is schematically illustrated in Fig.1\nfor a toy example. For the first case (first column of Fig.1) when all constant coefficients of (13) are\npositive, the update simply reduces to first decreasing every M_kl by 1 \u2212 \u03b1_l and then renormalizing each column to\nsum to one, Eq.(14). This operation of reducing the same amount from all elements in one column\nof M has the effect of making \u201cthe rich get richer and the poor get poorer\u201d (known as Dalton\u2019s 3rd\nlaw), which increases the imbalance of the elements and improves the chances of small elements\nto be reduced to zero in the subsequent steps [15] (footnote 5). 
In the second case (second column of Fig.1),\nwhen the coefficients of (13) have mixed signs, the effect of the updating step in (15) is two-fold.\nElements with M_kl < 1 \u2212 \u03b1_l (first term in Eq.(15)) are all reduced to \u03b5, which is the de facto zero. In\nother words, components below the threshold 1 \u2212 \u03b1_l are eliminated to zero. On the other hand,\nterms with M_kl > 1 \u2212 \u03b1_l (second term in Eq.(15)) are redistributed with the operation of reduction of\n1 \u2212 \u03b1_l followed by renormalization. In the last case when all coefficients of (13) are negative (third\ncolumn of Fig.1), only the element corresponding to M_kl that is closest to the threshold 1 \u2212 \u03b1_l, or\nequivalently, the largest of all M_kl, will survive with a non-zero value that is essentially 1 (first term\nin Eq.(16)), while the rest of the elements all become extinct (second term in Eq.(16)), analogous to\na scenario of \u201csurvival of the fittest\u201d. Note that it is the last two cases that actually generate zero entries\nin the factors, but the first case makes more entries suitable for being set to zero. The thresholding\nand renormalization steps resemble algorithms in sparse coding [16].\n\nFootnote 4: It is easy to show that the optimal solution in this case is X_kl = 0, i.e., setting the corresponding component\nin X to zero. So we can technically ignore such elements for each column index l.\n\nFootnote 5: Some early works (e.g., [11]) obtain a simpler solution by setting negative M_{k'l} + \u03b1_l \u2212 1 to zero followed\nby normalization. Our result shows that such a solution is not optimal.\n\n6 Experimental Evaluations\n\nWe perform experimental evaluations of the sparse multi-factor NMF algorithm using synthetic and\nreal data sets. In the first set of experiments, we study empirically the convergence of the multiplicative\nalgorithm for the SMS problem (Lemma 2). 
Specifically, with several different choices of\n(m, n, m', n'), we randomly generate stochastic matrices A (m \u00d7 m') and B (n' \u00d7 n), and a nonnegative\nmatrix C (m \u00d7 n). We then apply the SMS algorithm to solve for the optimal X. We\ncompare our algorithm with a projected gradient ascent optimization of the SMS problem, which\nupdates X using the gradient of the SMS objective function and chooses a step size to satisfy the\nnonnegative and normalization constraints. We do not consider methods that use the Hessian matrix\nof the objective function, as constructing a general Hessian matrix in this case has a prohibitive\nmemory requirement even for a medium sized problem. Shown in Fig.2 are several runs of the two\nalgorithms starting at the same initial values, plotted as the objective function of SMS vs. the number of\nupdates of X. Because of the convex nature of the SMS problem, both algorithms converge to the\nsame optimal value regardless of the initial values. On the other hand, the multiplicative updates for\nSMS usually achieve a speedup of two orders of magnitude in terms of the number of iterations and are typically about 10\ntimes faster in running time when compared to the gradient based algorithm.\n\nFigure 2: Convergences of the SMS objective function with multiplicative update algorithm (mult solid curve)\nand the projected gradient ascent method (pgd dashed curve) for different problem sizes.\n\nIn the second set of experiments, we evaluate the performance of the coordinate-update mfNMF\nalgorithm based on the multiplicative updating algorithm of the SMS problem (Section 4.1). Specifically,\nwe consider the mfNMF problem that approximates a randomly generated target nonnegative\nmatrix V of dimension m \u00d7 n with the product of three stochastic factors, W1 (m \u00d7 m'), W2 (m' \u00d7 n'),\nand W3 (n' \u00d7 n). 
The performance of the algorithm is evaluated by the logarithm of the generalized\nKL divergence between V and W1W2W3, of which lower numerical values suggest better\nperformance. As a comparison, we also implemented a multi-layer NMF algorithm [5], which solves\ntwo NMF problems in sequence, as: V \u2248 W1 V\u0303 and V\u0303 \u2248 W2W3, and the multiplicative update\nalgorithm of mfNMF of [9], both of which are based on the generalized KL divergence. To make\nthe comparison fair, we start all three algorithms with the same initial values.\n\nm, n, m', n' | 50,40,30,10 | 200,100,60,30 | 1000,400,200,50 | 5000,2000,100,20\nmulti-layer NMF [5] | 1.733 | 2.595 | 70.526 | 183.617\nmulti-factor NMF [9] | 1.431 | 2.478 | 66.614 | 174.291\nmulti-factor NMF (this work) | 1.325 | 2.340 | 62.086 | 161.338\n\nTable 1: Comparison of the multi-layer NMF method and two mfNMF methods for a three-factor problem with different\nproblem sizes. The values correspond to the logarithm of generalized KL divergence, log d(V, W1W2W3).\nLower numerical values indicate better performance.\n\nThe results of several runs of these algorithms for different problem sizes are summarized in Table\n1, which show that in general, mfNMF algorithms lead to better solutions corresponding to lower\ngeneralized KL divergences between the target matrix and the product of the three estimated factors.\nThis is likely due to the fact that these algorithms optimize the generalized KL divergence directly,\nwhile multi-layer NMF is a greedy algorithm with sub-optimal solutions. On the other hand, our\nmfNMF algorithm consistently outperforms the method of [9] by a significant margin, with on\naverage 40% fewer iterations. 
We think the improved performance and running efficiency are due\nto our formulation of the mfNMF problem based on stochastic matrices, which reduces the solution\nspace and encourages convergence to a better local minimum of the objective function.\n\n[Figure 2 panels: m=20, n=20, m'=10, n'=10; m=90, n=50, m'=20, n'=5; m=200, n=100, m'=50, n'=25; m=1000, n=200, m'=35, n'=15. Axes: iteration number (itr #) vs. objective function (obj. fun); curves: pgd, mult.]\n\nFigure 3: Sparse mfNMF algorithm on the handwritten digit images (columns: W1, W2, W1W2, W3, W1W2W3). The three rows correspond to three cases:\n(\u03b11 = 1, \u03b12 = 1, \u03b13 = 1), (\u03b11 = 1, \u03b12 = 1, \u03b13 = 0.99), and (\u03b11 = 1, \u03b12 = 0.99, \u03b13 = 0.99), respectively. See\ntext for more details.\n\nWe apply the sparse mfNMF algorithm to data converted from grayscale images from the MNIST\nHandwritten Digits data set [17] that are vectorized to column vectors and normalized to have total\nsum of one. All vectorized and normalized images are collected to form the target stochastic matrix\nV, which is decomposed into the product of three factors W1W2W3. We also incorporate the\nDirichlet sparsity regularizers with different configurations. For simplicity, we use the same parameter\nfor all column vectors in one factor. The threshold is set as \u03b5 = 10^{\u22128}/n where n is the total\nnumber of images. Shown in Fig.3 are the decomposition results corresponding to 500 vectorized\n20 \u00d7 20 images of handwritten digit 3, which are decomposed into three factors of size 400 \u00d7 196,\n196 \u00d7 100, and 100 \u00d7 500. The columns of the factors are reshaped and shown as images, where the\nbrightness of each pixel in the figure is proportional to the nonnegative values in the corresponding\nfactors. Due to space limit, we only show the first 25 columns in each factor. 
All three factorization\nresults can reconstruct the target matrix (last column), but they put different constraints on the\nobtained factors. The factors are also visually meaningful: factor W1 contains low-level components\nof the images that, when combined with factor W2, form more complex structures. The first row\ncorresponds to running mfNMF without the sparsity regularizer. The two rows below correspond\nto the cases when the Dirichlet sparsity regularizer is applied to the third factor and to the second\nand third factors simultaneously. Compared with the corresponding results in the non-sparse case,\nthe obtained factors contain more zeros. As a comparison, we also implement an mfNMF algorithm\nusing a pseudo-Dirichlet sparsity regularizer [8]. With similar decomposition results, our algorithm\nis typically 3 to 5 times faster, as it does not require the extra iterations of a fixed-point algorithm.\n\n7 Conclusion\n\nWe describe in this work a simple and efficient algorithm for the sparse multi-factor nonnegative\nmatrix factorization (mfNMF) problem, involving only multiplicative updates and normalization. Our\napproach to incorporating the Dirichlet sparsity regularizer leads to a closed-form solution, and the resulting\nalgorithm is more efficient than previous works based on fixed-point iterations. The effectiveness and\nefficiency of our algorithms are demonstrated on both synthetic and real data sets.\nThere are several directions we would like to further explore. First, we are studying whether similar\nmultiplicative update algorithms also exist for mfNMF with more general similarity measures, such\nas Csisz\u00e1r's divergence [18], the Itakura-Saito divergence [19], the \u03b1-\u03b2 divergence [20], or the Bregman\ndivergence [9]. We will also study incorporating other constraints (e.g., value ranges) over the\nfactors into the mfNMF algorithm. 
Last, we would like to further study applications of mfNMF\nin problems such as co-clustering or hierarchical document topic analysis, exploiting its ability to\nrecover hierarchical decompositions of nonnegative matrices.\n\nAcknowledgement\n\nThis work is supported by the National Science Foundation under Grant Nos. IIS-0953373,\nIIS-1208463 and CCF-1319800.\n\nReferences\n\n[1] Daniel D. Lee and H. Sebastian Seung. Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems (NIPS 13), 2001.\n\n[2] A. Cichocki, R. Zdunek, A.H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.\n\n[3] Seungjin Choi, Jong-Hoon Ahn, and Jong-Hoon Oh. A multiplicative up-propagation algorithm. In ICML, 2004.\n\n[4] Nicolas Gillis and Fran\u00e7ois Glineur. A multilevel approach for nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 236(7):1708\u20131723, 2012.\n\n[5] A. Cichocki and R. Zdunek. Multilayer nonnegative matrix factorisation. Electronics Letters, 42(16):947\u2013948, 2006.\n\n[6] Patrik O. Hoyer and Peter Dayan. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457\u20131469, 2004.\n\n[7] Bhiksha Raj, Madhusudana Shashanka, and Paris Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In NIPS, 2007.\n\n[8] Martin Larsson and Johan Ugander. A concave regularization technique for sparse mixture models. In NIPS, 2011.\n\n[9] Suvrit Sra and Inderjit S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Computer Science Department, University of Texas at Austin, 2006.\n\n[10] Jussi Kujala. Sparse topic modeling with concave-convex procedure: EMish algorithm for latent Dirichlet allocation. In Technical Report, 2004. 
\n\n[11] Jagannadan Varadarajan, R\u00e9mi Emonet, and Jean-Marc Odobez. A sequential topic model for mining recurrent activities from long term video logs. International Journal of Computer Vision, 103(1):100\u2013126, 2013.\n\n[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2005.\n\n[13] P. Gahinet, P. Apkarian, and M. Chilali. Affine parameter-dependent Lyapunov functions and real parametric uncertainty. IEEE Transactions on Automatic Control, 41(3):436\u2013442, 1996.\n\n[14] Wray Buntine and Aleks Jakulin. Discrete component analysis. In Subspace, Latent Structure and Feature Selection Techniques. Springer-Verlag, 2006.\n\n[15] N. Hurley and Scott Rickard. Comparing measures of sparsity. Information Theory, IEEE Transactions on, 55(10):4723\u20134741, 2009.\n\n[16] Misha Denil and Nando de Freitas. Recklessly approximate sparse coding. CoRR, abs/1208.0959, 2012.\n\n[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[18] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Csisz\u00e1r's divergences for non-negative matrix factorization: Family of new algorithms. In Independent Component Analysis and Blind Signal Separation, pages 32\u201339. Springer, 2006.\n\n[19] C\u00e9dric F\u00e9votte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793\u2013830, 2009.\n\n[20] Andrzej Cichocki, Rafal Zdunek, Seungjin Choi, Robert Plemmons, and Shun-Ichi Amari. Non-negative tensor factorization using alpha and beta divergences. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 3, pages III\u20131393. IEEE, 2007.\n\n[21] V. Chv\u00e1tal. Linear Programming. W. H. 
Freeman and Company, New York, 1983.\n", "award": [], "sourceid": 364, "authors": [{"given_name": "Siwei", "family_name": "Lyu", "institution": "SUNY at Albany"}, {"given_name": "Xin", "family_name": "Wang", "institution": "SUNY at Albany"}]}