{"title": "Clustering by Nonnegative Matrix Factorization Using Graph Random Walk", "book": "Advances in Neural Information Processing Systems", "page_first": 1079, "page_last": 1087, "abstract": "Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis.  However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples.  Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using random walk.  Our method can thus accommodate farther relationships between data samples.  Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering.  The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix.  Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity.", "full_text": "Clustering by Nonnegative Matrix Factorization\n\nUsing Graph Random Walk\n\nZhirong Yang, Tele Hao, Onur Dikmen, Xi Chen and Erkki Oja\n\nDepartment of Information and Computer Science\n\n{zhirong.yang,tele.hao,onur.dikmen,xi.chen,erkki.oja}@aalto.fi\n\nAalto University, 00076, Finland\n\nAbstract\n\nNonnegative Matrix Factorization (NMF) is a promising relaxation technique for\nclustering analysis. However, conventional NMF methods that directly approx-\nimate the pairwise similarities using the least square error often yield mediocre\nperformance for data in curved manifolds because they can capture only the imme-\ndiate similarities between data samples. Here we propose a new NMF clustering\nmethod which replaces the approximated matrix with its smoothed version using\nrandom walk. 
Our method can thus accommodate farther relationships between\ndata samples. Furthermore, we introduce a novel regularization in the proposed\nobjective function in order to improve over spectral clustering. The new learning\nobjective is optimized by a multiplicative Majorization-Minimization algorithm\nwith a scalable implementation for learning the factorizing matrix. Extensive ex-\nperimental results on real-world datasets show that our method has strong perfor-\nmance in terms of cluster purity.\n\n1\n\nIntroduction\n\nClustering analysis as a discrete optimization problem is usually NP-hard. Nonnegative Matrix Fac-\ntorization (NMF) as a relaxation technique for clustering has shown remarkable progress in the past\ndecade (see e.g. [9, 4, 2, 26]). In general, NMF \ufb01nds a low-rank approximating matrix to the input\nnonnegative data matrix, where the most popular approximation criterion or divergence in NMF is\nthe Least Square Error (LSE). It has been shown that certain NMF variants with this divergence\nmeasure are equivalent to k-means, kernel k-means, or spectral graph cuts [7]. In addition, NMF\nwith LSE can be implemented ef\ufb01ciently by existing optimization methods (see e.g. [16]).\nAlthough popularly used, previous NMF methods based on LSE often yield mediocre performance\nfor clustering, especially for data that lie in a curved manifold. In clustering analysis, the clus-\nter assignment is often inferred from pairwise similarities between data samples. Commonly the\nsimilarities are calculated based on Euclidean distances. For data in a curved manifold, only local\nEuclidean distances are reliable and similarities between non-neighboring samples are usually set\nto zero, which yields a sparse input matrix to NMF. If the LSE is directly used in approximation\nto such a similarity matrix, a lot of learning effort will be wasted due to the large majority of zero\nentries. 
The same problem occurs for clustering nodes of a sparse network.\nIn this paper we propose a new NMF method for clustering such manifold data or sparse network\ndata. Previous NMF clustering methods based on LSE used an approximated matrix that takes only\nsimilarities within immediate neighborhood into account. Here we consider multi-step similarities\nbetween data samples using graph random walk, which has shown to be an effective smoothing\napproach for \ufb01nding global data structures such as clusters. In NMF the smoothing can reduce the\nsparsity gap in the approximation and thus ease cluster analysis. We name the new method NMF\nusing graph Random walk (NMFR).\n\n1\n\n\fIn implementation, we face two obstacles when the input matrix is replaced by its random walk\nversion: (1) the performance of unconstrained NMFR remains similar to classical spectral clustering\nbecause smoothing that manipulates eigenvalues of Laplacian of the similarity graph does not change\nthe eigensubspace; (2) The similarities by random walk require inverting an n \u00d7 n matrix for n data\nsamples. Explicit matrix inversion is infeasible for large datasets. To overcome the above obstacles,\nwe employ (1) a regularization technique that supplements the orthogonality constraint for better\nclustering, and (2) a more scalable \ufb01xed-point algorithm to calculate the product of the inverted\nmatrix and the factorizing matrix.\nWe have conducted extensive experiments for evaluating the new method. The proposed algorithm\nis compared with nine other state-of-the-art clustering approaches on a large variety of real-world\ndatasets. Experimental results show that with only simple initialization NMFR performs pretty\nrobust across 46 clustering tasks. The new method achieves the best clustering purity for 36 of the\nselected datasets, and nearly the best for the rest. 
In particular, NMFR is remarkably superior to the other methods for large-scale manifold data from various domains.\nIn the remainder of the paper, we briefly review related work on clustering by NMF in Section 2. In Section 3 we point out a major drawback of previous NMF methods based on the least square error and present our solution. Experimental settings and results are given in Section 4. Section 5 concludes the paper and discusses potential future work.\n\n2 Pairwise Clustering by NMF\n\nCluster analysis or clustering is the task of assigning a set of data samples into groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. Denote R+ = R \u222a {0}. The pairwise similarities between n data samples can be encoded in an undirected graph with adjacency matrix $S \\in R_+^{n \\times n}$. Because clustered data tend to have higher similarities within clusters and lower similarities between clusters, the visualized similarity matrix looks nearly block diagonal if we sort the rows and columns by clusters. Such structure motivated approximative low-rank factorization of S by the cluster indicator matrix $U \\in \\{0, 1\\}^{n \\times r}$ for r clusters: $S \\approx U U^T$, where $U_{ik} = 1$ if the i-th sample is assigned to the k-th cluster and 0 otherwise. Moreover, clusters of balanced sizes are desired in most clustering tasks. This can be achieved by suitable normalization of the approximating matrix. A common way is to normalize $U_{ik}$ by $M_{ik} = U_{ik} / \\sqrt{\\sum_j U_{jk}}$ such that $M^T M = I$ and $\\sum_i (M M^T)_{ij} = 1$ (see e.g. [6, 7, 27]).\nHowever, directly optimizing over U or M is difficult due to the discrete solution space, which usually leads to an NP-hard problem. Continuous relaxation is thus needed to ease the optimization. One of the popular choices is the combination of nonnegativity and orthogonality constraints [11, 23]. 
That is, we replace M with W where $W_{ik} \\ge 0$ and $W^T W = I$. In this way, each row of W has only one non-zero entry because the non-zero parts of two nonnegative and orthogonal vectors do not overlap. Some other Nonnegative Matrix Factorization (NMF) relaxations exist, for example, the kernel Convex NMF [9] and its special case Projective NMF [23], as well as the relaxation by using a left-stochastic matrix [2].\nA commonly used divergence that measures the approximation error is the squared Euclidean distance or Frobenius norm [15, 13]. The NMF objective to be minimized thus becomes\n\n$$\\|S - W W^T\\|_F^2 = \\sum_{ij} \\left[ S_{ij} - \\left( W W^T \\right)_{ij} \\right]^2. \\qquad (1)$$\n\nThe above least square error objective is widely used because its algebraic and geometric properties are well understood. For example, He et al. [13] showed that the multiplicative optimization algorithm for the above Symmetric NMF (SNMF) problem is guaranteed to converge to a local minimum if S is positive semi-definite. Furthermore, SNMF with orthogonality has a tight connection to classical objectives such as kernel k-means and normalized cuts [7, 23]. In this paper, we choose this divergence also because it is the sole one in the \u03b1\u03b2-divergence family [5] that involves only the product SW instead of S itself in the gradient. As we shall see in Section 3.2, this property enables a scalable implementation of gradient-based optimization algorithms.\n\nFigure 1: Illustration of clustering the SEMEION handwritten digit dataset by NMF based on LSE: (a) the symmetrized 5-NN graph, (b) the correct clusters to be found, (c) the ideally assumed data that suits the least square error, (d) the smoothed input by using graph random walk. The matrix entries are visualized as image pixels. Darker pixels represent higher similarities. For clarity we show only the subset of digits \u201c2\u201d and \u201c3\u201d. 
In this paper we show that because (d) is \u201ccloser\u201d to (c) than (a), it is easier to find correct clusters using (d)\u2248(b) instead of (a)\u2248(b) by NMF with LSE.\n\n3 NMF Using Graph Random Walk\n\nThere is a serious drawback in previous NMF clustering methods using least square errors. When minimizing $\\|S - \\widehat{S}\\|_F^2$ for given S, the approximating matrix $\\widehat{S}$ should be block diagonal for clustering analysis, as shown in Figure 1 (b). Correspondingly, the ideal input S for LSE should look like Figure 1 (c) because the underlying distribution of LSE is Gaussian.\nHowever, the similarity matrix of real-world data often differs from the ideal case. In many clustering tasks, the raw features of data are usually weak. That is, the given distance measure between data points, such as the Euclidean distance, is only valid in a small neighborhood. The similarities calculated from such distances are thus sparse, where the similarities between non-neighboring samples are usually set to zero. For example, the symmetrized K-nearest-neighbor (K-NN) graph is a popularly used similarity input. Therefore, similarity matrices in real-world clustering tasks often look like Figure 1 (a), where the non-zero entries are much sparser than in the ideal case.\nIt is a mismatch to approximate a sparse similarity matrix by a dense block diagonal matrix using LSE. Because squared Euclidean distance is a symmetric metric, the learning objective can be dominated by the approximation to the majority of zero entries, which is undesired for finding correct cluster assignments. Although various matrix factorization schemes and factorizing matrix constraints have been proposed for NMF, little research effort has been made to overcome the above mismatch.\nIn this work we present a different way to formalize NMF for clustering to reduce the sparsity gap between input and output matrices. Instead of approximating the sparse input S, which only encodes the immediate similarities between data samples, we propose to approximate a smoothed version of S which takes farther relationships between data samples into account. Graph random walk is a common way to implement multi-step similarities. Denote by $Q = D^{-1/2} S D^{-1/2}$ the normalized similarity matrix, where D is a diagonal matrix with $D_{ii} = \\sum_j S_{ij}$. The similarities between data nodes using j steps are given by $(\\alpha Q)^j$, where $\\alpha \\in (0, 1)$ is a decay parameter controlling the random walk extent. Summing over all possible numbers of steps gives $\\sum_{j=0}^{\\infty} (\\alpha Q)^j = (I - \\alpha Q)^{-1}$. We thus propose to replace S in Eq. (1) with\n\n$$A = c^{-1} (I - \\alpha Q)^{-1}, \\qquad (2)$$\n\nwhere $c = \\sum_{ij} \\left[ (I - \\alpha Q)^{-1} \\right]_{ij}$ is a normalizing factor. Here the parameter \u03b1 controls the smoothness: a larger \u03b1 tends to produce a smoother A while a smaller one makes A concentrate on its diagonal. A smoothed approximated matrix A is shown in Figure 1 (d), from which we can see that the sparsity gap to the approximating matrix is reduced.\nJust smoothing the input matrix by random walk is not enough, as we are presented with two difficulties. First, random walk only alters the spectrum of Q, while the eigensubspaces of A and Q are the same. Smoothing therefore does not change the result of clustering algorithms that operate on the eigenvectors (e.g. [20]). If we simply replace S by A in Eq. (1), the resulting W is often the same as the leading eigenvectors of Q up to an r \u00d7 r rotation. That is, smoothing by random walk itself can bring little improvement unless we impose extra constraints or regularization. Second, explicitly calculating A is infeasible because when S is large and sparse, A is also large but dense. This requires a more careful design of a scalable optimization algorithm. Below we present solutions to overcome these two difficulties in Sections 3.1 and 3.2, respectively.\n\n3.1 Learning Objective\n\nMinimizing $\\|A - W W^T\\|_F^2$ over W subject to $W^T W = I$ is equivalent to maximizing $\\mathrm{Tr}(W^T A W)$. To improve over spectral clustering, we propose to regularize the trace maximization by an extra penalty term on W. The new optimization problem for pairwise clustering is:\n\n$$\\operatorname*{minimize}_{W \\ge 0} \\quad \\mathcal{J}(W) = -\\mathrm{Tr}\\left( W^T A W \\right) + \\lambda \\sum_i \\left( \\sum_k W_{ik}^2 \\right)^2 \\qquad (3)$$\n$$\\text{subject to} \\quad W^T W = I, \\qquad (4)$$\n\nwhere $\\lambda > 0$ is the tradeoff parameter. We find that $\\lambda = \\frac{1}{2r}$ works well in this work.\nThe extra penalty term collaborates with the orthogonality constraint for pairwise clustering, which is justified by two interpretations.\n\n\u2022 It emphasizes off-diagonal correlation in the trace. Because $\\sum_i \\left( \\sum_k W_{ik}^2 \\right)^2 = \\sum_i \\left( W W^T \\right)_{ii}^2$, the minimization tends to reduce the diagonal magnitude in the approximating matrix. This is desired because self-similarities usually give little information for grouping data samples. Given the constraints W \u2265 0 and $W^T W = I$, it is beneficial to push the magnitudes in $W W^T$ off-diagonal for maximizing the correlation to similarities between different data samples.\n\n\u2022 It tends to equalize the norms of W rows. To see this, let us write $a_i \\equiv \\sum_k W_{ik}^2$ for brevity. Because $\\sum_i a_i = r$ is constant, minimizing $\\sum_i a_i^2$ is actually maximizing $\\sum_{ij: i \\ne j} a_i a_j$. 
The maximum is achieved when $\\{a_i\\}_{i=1}^n$ are equal. Originally, the nonnegativity and orthogonality constraint combination only guarantees that each row of W has one non-zero entry, though norms of different W rows can be diverse. The equalization by the proposed penalty term thus well supplements the nonnegativity and orthogonality constraints and, as a whole, provides a closer relaxation to the normalized cluster indicator matrix M.\n\nAlgorithm 1 Large-Scale Relaxed Majorization and Minimization Algorithm for W\n\nInput: similarity matrix S, random walk extent \u03b1 \u2208 (0, 1), number of clusters r, nonnegative initial guess of W.\nrepeat\n    Calculate c = IterativeTracer(Q, \u03b1, e).\n    Calculate G = IterativeSolver(Q, \u03b1, W).\n    Update W by Eq. (5), using $c^{-1} G$ in place of AW.\nuntil W converges\nDiscretize W to cluster indicator matrix U.\nOutput: U.\n\nfunction ITERATIVETRACER(Q, \u03b1, W)\n    F = IterativeSolver(Q, \u03b1, W)\n    return $\\mathrm{Tr}(W^T F)$\nend function\n\nfunction ITERATIVESOLVER(Q, \u03b1, W)\n    Initialize F = W\n    repeat\n        Update F \u2190 \u03b1QF + (1 \u2212 \u03b1)W\n    until F converges\n    return F/(1 \u2212 \u03b1)\nend function\n\n3.2 Optimization\n\nThe optimization algorithm is developed by following the procedure in [24, 26]. Introducing Lagrangian multipliers $\\{\\Lambda_{kl}\\}$ for the orthogonality constraint, we have the augmented objective $\\mathcal{L}(W, \\Lambda) = \\mathcal{J}(W) + \\mathrm{Tr}\\left[ \\Lambda \\left( W^T W - I \\right) \\right]$. Using the Majorization-Minimization development procedure in [24, 26], we can obtain the preliminary multiplicative update rule. We then use the orthogonality constraint to solve for the multipliers. Substituting the multipliers in the preliminary update rule, we obtain an optimization algorithm which iterates the following multiplicative update rule:\n\n$$W_{ik}^{\\text{new}} = W_{ik} \\left[ \\frac{\\left( A W + 2 \\lambda W W^T V W \\right)_{ik}}{\\left( 2 \\lambda V W + W W^T A W \\right)_{ik}} \\right]^{1/4}, \\qquad (5)$$\n\nwhere V is a diagonal matrix with $V_{ii} = \\sum_l W_{il}^2$.\nTheorem 1. $\\mathcal{L}(W^{\\text{new}}, \\Lambda) \\le \\mathcal{L}(W, \\Lambda)$ for $\\Lambda = \\frac{1}{2} W^T \\left( \\frac{\\partial \\mathcal{J}}{\\partial W} \\right)$.\nThe proof is given in the appendix. Note that $\\mathcal{J}(W)$ does not necessarily decrease after each iteration. Instead, the monotonicity stated in the theorem justifies that the above algorithm jointly minimizes $\\mathcal{J}(W)$ and drives W towards the manifold defined by the orthogonality constraint. After W converges, we discretize it and obtain the cluster indicator matrix U.\nIt is a crucial observation that the update rule Eq. (5) requires only the product of $(I - \\alpha Q)^{-1}$ with a low-rank matrix instead of A itself. We can thus avoid expensive computation and storage of the large smoothed similarity matrix. There is an iterative and more scalable way to calculate $F = (I - \\alpha Q)^{-1} W$ [29]. See the IterativeSolver function in Algorithm 1. In practice, the calculation for F usually converges nicely within 100 iterations. The same technique can be applied to calculating the normalizing factor c in Eq. (2), using e = [1, 1, . . . , 1] instead of W. The resulting algorithm for optimization w.r.t. W is summarized in Algorithm 1. Matlab codes can be found in [1].\n\n3.3 Initialization\n\nMost state-of-the-art clustering methods involve non-convex optimization objectives and thus only return local optima in general. This is also the case for our algorithm. To achieve a better local optimum, a clustering algorithm should start from one or more relatively well-chosen initial guesses. Different strategies for choosing the starting point can be classified into the following levels, sorted by their computational cost:\n\nLevel-0: (random-init) The starting relaxed indicator matrix is filled by randomly generated numbers.\n\nLevel-1: (simple-init) The starting matrix is the result of a cheap clustering method, e.g. 
Normalized Cut or k-means, plus a small perturbation.\n\nLevel-2: (family-init) The initial guesses are results of the methods in a parameterized family. Typical examples include various regularization extents or Bayesian priors with different hyperparameters (see e.g. [25]).\n\nLevel-3: (meta-init) The initial guesses can come from methods of various principles. Each initialization method runs only once.\n\nLevel-4: (meta-co-init) Same as Level-3 except that clustering methods provide initialization for each other. A method can serve as initialization multiple times if it finds a better local minimum. The whole procedure stops when each of the involved methods fails to find a better local optimum (see e.g. [10]).\n\nSome methods are not sensitive to initialization but tend to return less accurate clusterings. On the other hand, some other methods can find more accurate results but require comprehensive initialization. A preferable clustering method should achieve high accuracy with cheap initialization. As we shall see, the proposed NMFR algorithm can attain satisfactory clustering accuracy with only simple initialization (Level-1).\n\n4 Experiments\n\nWe have compared our method against a variety of state-of-the-art clustering methods, including Projective NMF [23], Nonnegative Spectral Cut (NSC) [8], (symmetric) Orthogonal NMF (ONMF) [11], Left-Stochastic matrix Decomposition (LSD) [2], Data-Cluster-Data decomposition (DCD) [25], as well as classical Normalized Cut (Ncut) [21]. We also selected two recent clustering methods beyond NMF: 1-Spectral (1Spec) [14] which uses balanced graph cut, and Interaction Component Model (ICM) [22] which is the symmetric version of topic model [3].\nWe used default settings in the compared methods. For 1Spec, we used ratio Cheeger cut. 
For ICM, the hyper-parameters for the Dirichlet process prior are updated by Minka\u2019s learning method [19]. The other NMF-type methods that use multiplicative updates were run with 10,000 iterations to guarantee convergence. For our method, we trained W by using Algorithm 1 for each candidate \u03b1 \u2208 {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99} when n \u2264 8000. The best \u03b1 and the corresponding clustering result were then obtained by minimizing $\\|A - b W W^T\\|_F^2$ with a suitable positive scalar b. Here we set b = 2\u03bb using the heuristic that the penalty term in the gradient can be interpreted as removal of the diagonal effect of the approximating matrix. When $\\lambda = \\frac{1}{2r}$, we obtain b = 1/r. The new clustering method is not very sensitive to the choice of \u03b1 for large-scale datasets. We simply used \u03b1 = 0.8 in experiments when n > 8000. All methods except Ncut, 1Spec, and ICM were initialized by Normalized Cut. That is, their starting point was the Ncut cluster indicator matrix plus a small constant 0.2 added to all entries.\nWe have compared the above methods on clustering various datasets. The domains of the datasets include network, text, biology, and image data. All datasets are publicly available on the Internet. The data sources and statistics are given in the supplemental document. We constructed symmetrized K-NN graphs from the multivariate data, where K = 5 for the 30 smallest datasets, text datasets, PROTEIN and SEISMIC datasets, while K = 10 for the remaining datasets. Following [25], we extract the scattering features [18] for images before calculating the K-NN graph. We used Tf-Idf features for text data. The adjacency matrices of network data were symmetrized. The clustering performance is evaluated by cluster purity $= \\frac{1}{n} \\sum_{k=1}^{r} \\max_{1 \\le l \\le r} n_k^l$, where $n_k^l$ is the number of data samples in the cluster k that belong to ground-truth class l. 
A larger purity in general corre-\nsponds to a better clustering result.\nThe resulting purities are shown in Table 1, where the rows are ordered by dataset size. We can\nsee that our method has much better performance than the other methods. NMFR wins 36 out of 46\n\nk=1 max1\u2264l\u2264r nl\n\nk, where nl\n\n(cid:80)r\n\n6\n\n\fTable 1: Clustering purities for the compared methods on various datasets. Boldface numbers indi-\ncate the best in each row.\n\nDataset\n\nSize Ncut PNMF NSC ONMF PLSI LSD 1Spec\n1.00\n0.96\n24\nSTRIKE\n1.00\n0.71\n35\nKOREA\n0.92\n0.92\n38\nAMLALL\n0.52\n0.52\nDUKE\n44\n0.83\n0.82\nHIGHSCHOOL 60\n0.57\nKHAN\n83\n0.58\n0.83\n0.78\n105\nPOLBOOKS\n0.93\n0.90\n115\nFOOTBALL\n0.90\n0.91\n150\nIRIS\n0.53\n0.51\n198\nCANCER\n0.79\n0.79\nSPECT\n267\n0.77\n0.77\n300\nROSETTA\n0.83\n0.79\n327\nECOLI\n0.69\n0.69\n351\nIONOSPHERE\n0.81\n0.80\n400\nORL\n0.74\n0.68\nUMIST\n575\n0.65\n0.65\n683\nWDBC\n0.65\n0.65\n768\nDIABETES\n0.20\n1.0K 0.36\nVOWEL\n0.50\n1.0K 0.53\nMED\nPIE\n1.2K 0.67\n0.64\n0.37\n1.3K 0.45\nYALEB\n0.44\n1.3K 0.45\nTERROR\n0.48\n1.4K 0.49\nALPHADIGS\n0.77\n1.4K 0.79\nCOIL-20\nYEAST\n1.5K 0.53\n0.54\n0.82\n1.6K 0.83\nSEMEION\n0.38\n1.9K 0.40\nFAULTS\n0.55\n2.3K 0.61\nSEG\n0.84\n2.4K 0.84\nADS\nCORA\n2.7K 0.38\n0.36\n0.12\n3.1K 0.41\nMIREX\n0.22\n3.3K 0.24\nCITESEER\n0.39\n4.2K 0.40\nWEBKB4\n0.25\n4.6K 0.25\n7SECTORS\nSPAM\n4.6K 0.61\n0.61\n0.22\n5.6K 0.26\nCURETGREY\n0.87\n5.6K 0.92\nOPTDIGITS\n0.93\n7.0K 0.90\nGISETTE\n8.3K 0.77\n0.63\nREUTERS\nRCV1\n9.6K 0.33\n0.31\n0.82\n11K 0.80\nPENDIGITS\n0.46\n18K 0.46\nPROTEIN\n0.07\n20K 0.25\n20NEWS\n0.88\n70K 0.77\nMNIST\nSEISMIC\n99K 
0.52\n0.51\n\n1.00\n1.00\n0.92\n0.52\n0.82\n0.60\n0.77\n0.93\n0.92\n0.53\n0.79\n0.77\n0.78\n0.69\n0.82\n0.66\n0.65\n0.65\n0.30\n0.54\n0.66\n0.41\n0.46\n0.44\n0.65\n0.52\n0.85\n0.39\n0.53\n0.84\n0.37\n0.38\n0.31\n0.39\n0.25\n0.61\n0.21\n0.90\n0.51\n0.72\n0.31\n0.77\n0.46\n0.31\n0.73\n0.50\n\n1.00\n1.00\n0.92\n0.70\n0.82\n0.52\n0.78\n0.93\n0.75\n0.53\n0.79\n0.77\n0.68\n0.64\n0.81\n0.68\n0.65\n0.65\n0.34\n0.55\n0.69\n0.50\n0.45\n0.49\n0.75\n0.52\n0.89\n0.40\n0.64\n0.84\n0.46\n0.38\n0.36\n0.51\n0.26\n0.68\n0.21\n0.92\n0.93\n0.75\n0.48\n0.86\n0.46\n0.32\n0.76\n0.54\n\n0.96\n1.00\n0.92\n0.70\n0.83\n0.55\n0.78\n0.93\n0.91\n0.54\n0.79\n0.77\n0.80\n0.69\n0.83\n0.69\n0.65\n0.65\n0.36\n0.54\n0.68\n0.51\n0.46\n0.49\n0.79\n0.53\n0.85\n0.40\n0.61\n0.84\n0.44\n0.41\n0.36\n0.49\n0.29\n0.65\n0.26\n0.93\n0.93\n0.76\n0.37\n0.80\n0.46\n0.31\n0.79\n0.52\n\n1.00\n0.94\n0.92\n0.52\n0.82\n0.60\n0.78\n0.93\n0.93\n0.54\n0.79\n0.77\n0.78\n0.69\n0.82\n0.64\n0.65\n0.65\n0.35\n0.54\n0.66\n0.42\n0.45\n0.45\n0.71\n0.53\n0.87\n0.39\n0.51\n0.84\n0.37\n0.40\n0.31\n0.39\n0.27\n0.61\n0.22\n0.90\n0.52\n0.74\n0.35\n0.82\n0.46\n0.33\n0.87\n0.50\n\n0.96\n0.71\n0.92\n0.52\n0.83\n0.55\n0.81\n0.93\n0.90\n0.53\n0.79\n0.77\n0.79\n0.70\n0.82\n0.68\n0.65\n0.65\n0.36\n0.54\n0.68\n0.46\n0.46\n0.49\n0.79\n0.54\n0.83\n0.40\n0.61\n0.84\n0.37\n0.42\n0.23\n0.40\n0.25\n0.61\n0.26\n0.92\n0.93\n0.76\n0.32\n0.80\n0.46\n0.21\n0.79\n0.51\n\nICM DCD 
NMFR\n0.96\n0.58\n1.00\n0.66\n0.89\n0.50\n0.70\n0.52\n0.95\n0.82\n0.49\n0.51\n0.79\n0.78\n0.93\n0.93\n0.91\n0.53\n0.52\n0.53\n0.79\n0.79\n0.77\n0.77\n0.79\n0.78\n0.68\n0.69\n0.83\n0.19\n0.15\n0.72\n0.65\n0.65\n0.65\n0.65\n0.37\n0.15\n0.56\n0.33\n0.74\n0.12\n0.51\n0.10\n0.49\n0.34\n0.51\n0.10\n0.81\n0.11\n0.55\n0.34\n0.94\n0.13\n0.39\n0.38\n0.73\n0.32\n0.84\n0.84\n0.47\n0.30\n0.43\n0.27\n0.44\n0.41\n0.63\n0.48\n0.34\n0.28\n0.69\n0.61\n0.28\n0.11\n0.98\n0.90\n0.94\n0.62\n0.77\n0.71\n0.54\n0.38\n0.87\n0.52\n0.50\n0.46\n0.63\n0.23\n0.97\n0.95\n0.59\n0.50\n\n0.96\n0.97\n0.92\n0.52\n0.83\n0.55\n0.79\n0.93\n0.91\n0.54\n0.79\n0.77\n0.80\n0.69\n0.83\n0.69\n0.65\n0.65\n0.36\n0.55\n0.68\n0.51\n0.45\n0.50\n0.79\n0.52\n0.85\n0.41\n0.61\n0.84\n0.44\n0.18\n0.35\n0.51\n0.28\n0.67\n0.27\n0.92\n0.93\n0.76\n0.36\n0.80\n0.46\n0.31\n0.82\n0.52\n\nselected clustering tasks. Our method is especially superior for large-scale data in a curved manifold,\nfor example, OPTDIGITS and MNIST. Note that cluster purity can be regarded as classi\ufb01cation\naccuracy if we have a few labeled data samples to remove ambiguity between clusters and classes.\nIn this sense, the resulting purities for such manifold data are even comparable to the state-of-the-\nart supervised classi\ufb01cation results. Compared with the DCD results which require Level-2 family\ninitialization (see [25]), NMFR only needs Level-1 simple initialization. In addition, NMFR also\nbrings remarkable improvement for datasets beyond digit or letter recognition, for example, the text\ndata RCV1, 20NEWS, protein data PROTEIN and sensor data SEISMIC. Furthermore, it is worth\nto notice that our method has more robust performance over various datasets compared with other\napproaches. Even for some small datasets where NMFR is not the winner, its cluster purities are\nstill close to the best.\n\n7\n\n\f5 Conclusions\n\nWe have presented a new NMF method using random walk for clustering. 
Our work includes two\nmajor contributions: (1) we have shown that NMF approximation using least square error should be\napplied on smoothed similarities; the smoothing accompanied with a novel regularization can often\nsigni\ufb01cantly outperform spectral clustering; (2) the smoothing is realized in an implicit and scalable\nway. Extensive empirical study has shown that our method can often improve clustering accuracy\nremarkably given simple initialization.\nSome issues could be included in the future work. Here we only discuss a certain type of smoothing\nby random walk, while the proposed method could be extended by using other types of smoothing,\ne.g. diffusion kernels, where scalable optimization could also be developed by using a similar iter-\native subroutine. Moreover, the smoothing brings improved clustering accuracy but at the cost of\nincreased running time. Algorithms that are more ef\ufb01cient in both time and space should be further\ninvestigated. In addition, the approximated matrix could also be learnable. In current experiments,\nwe used constant K-NN graphs as input for fair comparison, which could be replaced by a more\ncomprehensive graph construction method (e.g. [28, 12, 17]).\n\n6 Acknowledgement\n\nThis work was \ufb01nancially supported by the Academy of Finland (Finnish Center of Excellence in\nComputational Inference Research COIN, grant no 251170; Zhirong Yang additionally by decision\nnumber 140398).\n\nAppendix: proof of Theorem 1\n\nThe proof follows the Majorization-Minimization development procedure in [26]. 
We use W and\n\n(cid:102)W to distinguish the current estimate and the variable, respectively.\nGiven a real-valued matrix B, we can always decompose it into two nonnegative parts such that\nB = B+ \u2212 B\u2212, where B+\ndecompose \u039b = \u039b+ \u2212 \u039b\u2212 and \u2202J ((cid:102)W )\nIn this way we\n\u2202(cid:102)W\n\u2261 \u2207 = \u2207+ \u2212 \u2207\u2212, where \u2207+ = 4\u03bbV W and\n\u2207\u2212 = 2AW .\n(cid:101)J ((cid:102)W , \u039b)\n(Majorization) Up to some additive constant,\n\nij = (|Bij| + Bij)/2 and B\u2212\n\nij = (|Bij| \u2212 Bij)/2.\n\n(cid:12)(cid:12)(cid:12)(cid:102)W =W\n(cid:33) (cid:102)W 4\n(cid:33) (cid:102)W 4\n\nW 2\nik\n\n8\n\n(cid:102)W 2\n\nik\nWik\n\n(cid:0)\u039b+W(cid:1)\n\n(cid:88)\n(cid:88)\n\nik\n\n(cid:16)(cid:102)W T \u039b\u2212W\n(cid:33)4\n\n(cid:17)\n(cid:16)(cid:102)W T \u039b\u2212W\n\n\u2212 2Tr\n\n(cid:17)\n\nik \u2212 2Tr\n\n(cid:32)(cid:102)Wik\n\nWik (\u039b+W )ik\n\n2\n\nWik\n\n\u2264 \u2212 2Tr\n\n+ \u03bb\n\nW 2\nil\n\n+\n\n(cid:16)(cid:102)W T AW\n(cid:16)(cid:102)W T AW\n\n(cid:17)\n(cid:17)\n\n(cid:32)(cid:88)\n(cid:32)(cid:88)\n\nl\n\n(cid:88)\n(cid:88)\n\nik\n\n\u2264 \u2212 2Tr\n\nW 2\nil\n\n+\n\n+ \u03bb\n\nl\n\nik\n\nW 2\nik\n\n\u2261G((cid:102)W , W ),\ndue to the inequality za \u2212 1\n(Minimization) Setting \u2202G((cid:102)W , \u039b)/\u2202(cid:102)Wik = 0 gives\n\n\u2264 zb \u2212 1\n\na\n\nik\n\nb\n\nfor z > 0 and a < b.\n\nwhere the \ufb01rst inequality is by standard convex-concave procedure, and the second upper bound is\n\n(cid:34)(cid:0)\u2207\u2212 + 2W\u039b+(cid:1)\n(cid:0)\u2207+ + 2W\u039b\u2212(cid:1)\n\nik\n\n(cid:35)1/4\n\nik\n\n.\n\n(6)\n\nW new\n\nik = Wik\n\nZeroing \u2202L(W, \u039b)/\u2202W gives 2W \u039b = \u2207+ \u2212 \u2207\u2212. Using W T W = I, we obtain \u039b = 1\n2 W T (\u2207+ \u2212\n\u2207\u2212), i.e. 2W \u039b+ = W W T\u2207+ and 2W \u039b\u2212 = W W T\u2207\u2212. Inserting these into Eq. (6), we obtain\nupdate rule in Eq. 
(5).\n\n\fReferences\n[1] http://users.ics.aalto.fi/rozyang/nmfr/index.shtml.\n[2] R. Arora, M. Gupta, A. Kapila, and M. Fazel. Clustering by left-stochastic matrix factorization. In ICML,\n\n2011.\n\n[3] D. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research,\n\n3:993\u20131022, 2001.\n\n[4] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized non-negative matrix fac-\nIEEE Transactions on Pattern Analysis and Machine Intelligence,\n\ntorization for data representation.\n33(8):1548\u20131560, 2011.\n\n[5] A. Cichocki, S. Cruces, and S. Amari. Generalized alpha-beta divergences and their application to robust\n\nnonnegative matrix factorization. Entropy, 13:134\u2013170, 2011.\n\n[6] I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In KDD,\n\n2004.\n\n[7] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral\n\nclustering. In ICDM, 2005.\n\n[8] C. Ding, T. Li, and M. I. Jordan. Nonnegative matrix factorization for combinatorial optimization: Spec-\n\ntral clustering, graph matching, and clique \ufb01nding. In ICDM, 2008.\n\n[9] C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions\n\non Pattern Analysis and Machine Intelligence, 32(1):45\u201355, 2010.\n\n[10] C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and proba-\n\nbilistic laten semantic indexing. Computational Statistics and Data Analysis, 52(8):3913\u20133927, 2008.\n\n[11] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In\n\nSIGKDD, 2006.\n\n[12] E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In NIPS, 2011.\n[13] Z. He, S. Xie, R. Zdunek, G. Zhou, and A. Cichocki. Symmetric nonnegative matrix factorization: Algo-\nrithms and applications to probabilistic clustering. 
IEEE Transactions on Neural Networks, 22(12):2117\u20132131, 2011.\n[14] M. Hein and T. B\u00fchler. An inverse power method for nonlinear eigenproblems with applications in 1-Spectral clustering and sparse PCA. In NIPS, 2010.\n[15] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.\n[16] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19:2756\u20132779, 2007.\n[17] M. Maier, U. von Luxburg, and M. Hein. How the result of graph clustering methods depends on the construction of the graph. ESAIM: Probability & Statistics, 2012. In press.\n[18] S. Mallat. Group invariant scattering. ArXiv e-prints, 2011.\n[19] T. Minka. Estimating a Dirichlet distribution, 2000.\n[20] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.\n[21] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n[22] J. Sinkkonen, J. Aukia, and S. Kaski. Component models for large networks. ArXiv e-prints, 2008.\n[23] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734\u2013749, 2010.\n[24] Z. Yang and E. Oja. Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22(12):1878\u20131891, 2011.\n[25] Z. Yang and E. Oja. Clustering by low-rank doubly stochastic matrix decomposition. In ICML, 2012.\n[26] Z. Yang and E. Oja. Quadratic nonnegative matrix factorization. Pattern Recognition, 45(4):1500\u20131510, 2012.\n[27] R. Zass and A. Shashua. A unifying approach to hard and probabilistic clustering. In ICCV, 2005.\n[28] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.\n[29] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. 
Sch\u00f6lkopf. Learning with local and global consistency. In NIPS, 2003.\n", "award": [], "sourceid": 524, "authors": [{"given_name": "Zhirong", "family_name": "Yang", "institution": null}, {"given_name": "Tele", "family_name": "Hao", "institution": null}, {"given_name": "Onur", "family_name": "Dikmen", "institution": null}, {"given_name": "Xi", "family_name": "Chen", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}