{"title": "Low-rank matrix reconstruction and clustering via approximate message passing", "book": "Advances in Neural Information Processing Systems", "page_first": 917, "page_last": 925, "abstract": "We study the problem of reconstructing low-rank matrices from their noisy observations. We formulate the problem in the Bayesian framework, which allows us to exploit structural properties of matrices in addition to low-rankedness, such as sparsity. We propose an efficient approximate message passing algorithm, derived from the belief propagation algorithm, to perform the Bayesian inference for matrix reconstruction. We have also successfully applied the proposed algorithm to a clustering problem, by formulating the problem of clustering as a low-rank matrix reconstruction problem with an additional structural property. Numerical experiments show that the proposed algorithm outperforms Lloyd's K-means algorithm.", "full_text": "Low-rank matrix reconstruction and clustering via\n\napproximate message passing\n\nRyosuke Matsushita\n\nNTT DATA Mathematical Systems Inc.\n\n1F Shinanomachi Rengakan, 35,\n\nShinanomachi, Shinjuku-ku, Tokyo,\n\n160-0016, Japan\n\nToshiyuki Tanaka\n\nDepartment of Systems Science,\n\nGraduate School of Informatics, Kyoto University\n\nYoshida Hon-machi, Sakyo-ku, Kyoto-shi,\n\n606-8501 Japan\n\nmatsur8@gmail.com\n\ntt@i.kyoto-u.ac.jp\n\nAbstract\n\nWe study the problem of reconstructing low-rank matrices from their noisy ob-\nservations. We formulate the problem in the Bayesian framework, which allows\nus to exploit structural properties of matrices in addition to low-rankedness, such\nas sparsity. We propose an ef\ufb01cient approximate message passing algorithm, de-\nrived from the belief propagation algorithm, to perform the Bayesian inference for\nmatrix reconstruction. 
We have also successfully applied the proposed algorithm to a clustering problem, by reformulating it as a low-rank matrix reconstruction problem with an additional structural property. Numerical experiments show that the proposed algorithm outperforms Lloyd's K-means algorithm.

1 Introduction

Low-rankedness of matrices has frequently been exploited when one reconstructs a matrix from its noisy observations. In such problems, there are often demands to incorporate additional structural properties of matrices beyond the low-rankedness. In this paper, we consider the case where a matrix $A_0 \in \mathbb{R}^{m \times N}$ to be reconstructed is factored as $A_0 = U_0 V_0^\top$, $U_0 \in \mathbb{R}^{m \times r}$, $V_0 \in \mathbb{R}^{N \times r}$ ($r \ll m, N$), and where one knows structural properties of the factors $U_0$ and $V_0$ a priori. Sparseness and non-negativity of the factors are popular examples of such structural properties [1, 2].
Since the properties of the factors to be exploited vary according to the problem, it is desirable that a reconstruction method be flexible enough to incorporate a wide variety of properties. The Bayesian approach achieves such flexibility by allowing us to select prior distributions of $U_0$ and $V_0$ reflecting a priori knowledge of the structural properties. The Bayesian approach, however, often involves computationally expensive processes such as high-dimensional integrations, thereby requiring approximate inference methods in practical implementations. Monte Carlo sampling methods and variational Bayes methods have been proposed for low-rank matrix reconstruction to meet this requirement [3-5].
We present in this paper an approximate message passing (AMP) based algorithm for Bayesian low-rank matrix reconstruction. Developed in the context of compressed sensing, the AMP algorithm reconstructs sparse vectors from their linear measurements at low computational cost and achieves a certain theoretical limit [6].
AMP algorithms can also be used to approximate Bayesian inference with a large class of prior distributions of signal vectors and noise distributions [7]. These successes of AMP algorithms motivate the use of the same idea for low-rank matrix reconstruction. The IterFac algorithm for the rank-one case [8] has been derived as an AMP algorithm. An AMP algorithm for the general-rank case is proposed in [9], which, however, can only treat estimation of posterior means. The first contribution of this paper is to extend their algorithm so that one can deal with other estimations, such as the maximum a posteriori (MAP) estimation.
As the second contribution, we apply the derived AMP algorithm to K-means type clustering to obtain a novel efficient clustering algorithm. It is based on the observation that our formulation of the low-rank matrix reconstruction problem includes the clustering problem as a special case. Although the idea of applying low-rank matrix reconstruction to clustering is not new [10, 11], our proposed algorithm is, to our knowledge, the first that directly deals with the constraint that each datum should be assigned to exactly one cluster in the framework of low-rank matrix reconstruction. We present results of numerical experiments, which show that the proposed algorithm outperforms Lloyd's K-means algorithm [12] when data are high-dimensional.
Recently, AMP algorithms for dictionary learning and blind calibration [13] and for matrix reconstruction with a generalized observation model [14] were proposed. Although our work has some similarities to these studies, it differs in that we fix the rank $r$ rather than the ratio $r/m$ when taking the limit $m, N \to \infty$ in the derivation of the algorithm. Another difference is that our formulation, explained in the next section, does not assume statistical independence among the components of each row of $U_0$ and $V_0$.
A detailed comparison among these algorithms remains to be made.

2 Problem setting

2.1 Low-rank matrix reconstruction

We consider the following problem setting. A matrix $A_0 \in \mathbb{R}^{m \times N}$ to be estimated is defined by two matrices $U_0 := (u_{0,1}, \ldots, u_{0,m})^\top \in \mathbb{R}^{m \times r}$ and $V_0 := (v_{0,1}, \ldots, v_{0,N})^\top \in \mathbb{R}^{N \times r}$ as $A_0 := U_0 V_0^\top$, where $u_{0,i}, v_{0,j} \in \mathbb{R}^r$. We consider the case where $r \ll m, N$. Observations of $A_0$ are corrupted by additive noise $W \in \mathbb{R}^{m \times N}$, whose components $W_{i,j}$ are i.i.d. Gaussian random variables following $\mathcal{N}(0, m\tau)$. Here $\tau > 0$ is a noise variance parameter and $\mathcal{N}(a, \sigma^2)$ denotes the Gaussian distribution with mean $a$ and variance $\sigma^2$. The factor $m$ in the noise variance is introduced to allow a proper scaling in the limit where $m$ and $N$ go to infinity in the same order, which is employed in deriving the algorithm. An observed matrix $A \in \mathbb{R}^{m \times N}$ is given by $A := A_0 + W$. Reconstructing $A_0$ and $(U_0, V_0)$ from $A$ is the problem considered in this paper.
We take the Bayesian approach to address this problem, in which one requires prior distributions of variables to be estimated, as well as conditional distributions relating observations with variables to be estimated. These distributions need not be the true ones: in some cases the true distributions are not available, so that one has to assume them arbitrarily, and in other cases one expects advantages from assuming them in some specific manner in view of computational efficiency. In this paper, we suppose that one uses the true conditional distribution

$p(A|U_0, V_0) = \frac{1}{(2\pi m\tau)^{mN/2}} \exp\left(-\frac{1}{2m\tau} \|A - U_0 V_0^\top\|_F^2\right),$ (1)

where $\|\cdot\|_F$ denotes the Frobenius norm. Meanwhile, we suppose that the assumed prior distributions of $U_0$ and $V_0$, denoted by $\hat{p}_U$ and $\hat{p}_V$, respectively, may be different from the true distributions $p_U$ and $p_V$, respectively.
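As a concrete illustration, the observation model above can be simulated in a few lines. This is a minimal sketch: the sizes $m$, $N$, $r$, the Gaussian factors, and the use of numpy are our own choices, not prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, r, tau = 200, 400, 3, 0.1

# Low-rank signal A0 = U0 V0^T; here the factors are drawn from N(0, 1)
# purely for illustration.
U0 = rng.standard_normal((m, r))
V0 = rng.standard_normal((N, r))
A0 = U0 @ V0.T

# Additive noise with per-component variance m * tau, the scaling used
# when m and N are taken to infinity at the same rate.
W = np.sqrt(m * tau) * rng.standard_normal((m, N))
A = A0 + W  # observed matrix
```

The task is then to recover $A_0$ (or the factor pair) from the single noisy matrix `A`.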
We restrict $\hat{p}_U$ and $\hat{p}_V$ to distributions of the form $\hat{p}_U(U_0) = \prod_i \hat{p}_u(u_{0,i})$ and $\hat{p}_V(V_0) = \prod_j \hat{p}_v(v_{0,j})$, respectively, which allows us to construct computationally efficient algorithms. When $U \sim \hat{p}_U(U)$ and $V \sim \hat{p}_V(V)$, the posterior distribution of $(U, V)$ given $A$ is

$\hat{p}(U, V|A) \propto \exp\left(-\frac{1}{2m\tau} \|A - UV^\top\|_F^2\right) \hat{p}_U(U) \hat{p}_V(V).$ (2)

Prior probability density functions (p.d.f.s) $\hat{p}_u$ and $\hat{p}_v$ can be improper, that is, they can integrate to infinity, as long as the posterior p.d.f. (2) is proper. We also consider cases where the assumed rank $\hat{r}$ may be different from the true rank $r$. We thus suppose that estimates $U$ and $V$ are of size $m \times \hat{r}$ and $N \times \hat{r}$, respectively.
We consider two problems appearing in the Bayesian approach. The first problem, which we call the marginalization problem, is to calculate the marginal posterior distributions given $A$,

$\hat{p}_{i,j}(u_i, v_j|A) := \int \hat{p}(U, V|A) \prod_{k \neq i} du_k \prod_{l \neq j} dv_l.$ (3)

These are used to calculate the posterior mean $\mathrm{E}[UV^\top|A]$ and the marginal MAP estimates $u_i^{\mathrm{MMAP}} := \arg\max_u \int \hat{p}_{i,j}(u, v|A)\,dv$ and $v_j^{\mathrm{MMAP}} := \arg\max_v \int \hat{p}_{i,j}(u, v|A)\,du$. Because calculation of $\hat{p}_{i,j}(u_i, v_j|A)$ typically involves high-dimensional integrations requiring high computational cost, approximation methods are needed.
The second problem, which we call the MAP problem, is to calculate the MAP estimate $\arg\max_{U,V} \hat{p}(U, V|A)$.
It is formulated as the following optimization problem:

$\min_{U,V} C^{\mathrm{MAP}}(U, V),$ (4)

where $C^{\mathrm{MAP}}(U, V)$ is the negative logarithm of (2):

$C^{\mathrm{MAP}}(U, V) := \frac{1}{2m\tau} \|A - UV^\top\|_F^2 - \sum_{i=1}^m \log \hat{p}_u(u_i) - \sum_{j=1}^N \log \hat{p}_v(v_j).$ (5)

Because $\|A - UV^\top\|_F^2$ is a non-convex function of $(U, V)$, it is generally hard to find the global optimal solutions of (4), and therefore approximation methods are needed in this problem as well.

2.2 Clustering as low-rank matrix reconstruction

A clustering problem can be formulated as a problem of low-rank matrix reconstruction [11]. Suppose that $v_{0,j} \in \{e_1, \ldots, e_r\}$, $j = 1, \ldots, N$, where $e_l \in \{0, 1\}^r$ is the vector whose $l$th component is 1 and the others are 0. When $V_0$ and $U_0$ are fixed, $a_j$ follows one of the $r$ Gaussian distributions $\mathcal{N}(\tilde{u}_{0,l}, m\tau I)$, $l = 1, \ldots, r$, where $\tilde{u}_{0,l}$ is the $l$th column of $U_0$. We regard each Gaussian distribution as defining a cluster, $\tilde{u}_{0,l}$ being the center of cluster $l$ and $v_{0,j}$ representing the cluster assignment of the datum $a_j$.
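The reason this one-hot encoding of assignments turns clustering into matrix factorization is the elementary identity $\|A - UV^\top\|_F^2 = \sum_j \|a_j - \tilde{u}_{l_j}\|_2^2$ when each row of $V$ is a standard basis vector $e_{l_j}$. A small numerical check of this identity (sizes, variable names, and numpy usage are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, r = 50, 120, 4

U = rng.standard_normal((m, r))        # columns are cluster centers
labels = rng.integers(0, r, size=N)    # cluster assignment per datum
V = np.eye(r)[labels]                  # rows v_j in {e_1, ..., e_r}
A = rng.standard_normal((m, N))        # any data matrix

# ||A - U V^T||_F^2 equals the sum of squared distances between each
# datum a_j and the center of the cluster it is assigned to.
loss_matrix = np.linalg.norm(A - U @ V.T, "fro") ** 2
loss_kmeans = sum(np.linalg.norm(A[:, j] - U[:, labels[j]]) ** 2
                  for j in range(N))
```

The identity holds for any `A` and `U`; it is a property of the one-hot structure of `V`, not of the noise model.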
One can then perform clustering on the dataset $\{a_1, \ldots, a_N\}$ by reconstructing $U_0$ and $V_0$ from $A = (a_1, \ldots, a_N)$ under the structural constraint that every row of $V_0$ should belong to $\{e_1, \ldots, e_{\hat{r}}\}$, where $\hat{r}$ is an assumed number of clusters.
Let us consider maximum likelihood estimation $\arg\max_{U,V} p(A|U, V)$, or equivalently, MAP estimation with the (improper) uniform prior distributions $\hat{p}_u(u) = 1$ and $\hat{p}_v(v) = \hat{r}^{-1} \sum_{l=1}^{\hat{r}} \delta(v - e_l)$. The corresponding MAP problem is

$\min_{U \in \mathbb{R}^{m \times \hat{r}},\, V \in \{0,1\}^{N \times \hat{r}}} \|A - UV^\top\|_F^2 \quad \text{subject to } v_j \in \{e_1, \ldots, e_{\hat{r}}\}.$ (6)

When $V$ satisfies the constraints, the objective function $\|A - UV^\top\|_F^2 = \sum_{j=1}^N \sum_{l=1}^{\hat{r}} \|a_j - \tilde{u}_l\|_2^2 I(v_j = e_l)$ is the sum of squared distances, each of which is between a datum and the center of the cluster that the datum is assigned to. The optimization problem (6), its objective function, and clustering based on it are called in this paper the K-means problem, the K-means loss function, and the K-means clustering, respectively.
One can also use the marginal MAP estimation for clustering. If $U_0$ and $V_0$ follow $\hat{p}_U$ and $\hat{p}_V$, respectively, the marginal MAP estimation is optimal in the sense that it maximizes the expectation of accuracy with respect to $\hat{p}(V_0|A)$. Here, accuracy is defined as the fraction of correctly assigned data among all data. We call clustering using approximate marginal MAP estimation the maximum accuracy clustering, even when incorrect prior distributions are used.

3 Previous work

Existing methods for approximately solving the marginalization problem and the MAP problem are divided into stochastic methods, such as Markov chain Monte Carlo methods, and deterministic ones. A popular deterministic method is to use the variational Bayesian formalism.
The variational Bayes matrix factorization [4, 5] approximates the posterior distribution $p(U, V|A)$ as the product of two functions $p_U^{\mathrm{VB}}(U)$ and $p_V^{\mathrm{VB}}(V)$, which are determined so that the Kullback-Leibler (KL) divergence from $p_U^{\mathrm{VB}}(U) p_V^{\mathrm{VB}}(V)$ to $p(U, V|A)$ is minimized. Global minimization of the KL divergence is difficult except for some special cases [15], so that an iterative method to obtain a local minimum is usually adopted. Applying the variational Bayes matrix factorization to the MAP problem, one obtains the iterated conditional modes (ICM) algorithm, which alternates minimization of $C^{\mathrm{MAP}}(U, V)$ over $U$ for fixed $V$ and minimization over $V$ for fixed $U$.
The representative algorithm for solving the K-means problem approximately is Lloyd's K-means algorithm [12]. Lloyd's K-means algorithm is regarded as the ICM algorithm: it alternates minimization of the K-means loss function over $U$ for fixed $V$ and minimization over $V$ for fixed $U$ iteratively.

Algorithm 1 (Lloyd's K-means algorithm).

$n_l^t = \sum_{j=1}^N I(v_j^t = e_l), \qquad \tilde{u}_l^t = \frac{1}{n_l^t} \sum_{j=1}^N a_j I(v_j^t = e_l),$ (7a)
$l_j^{t+1} = \arg\min_{l \in \{1, \ldots, \hat{r}\}} \|a_j - \tilde{u}_l^t\|_2^2, \qquad v_j^{t+1} = e_{l_j^{t+1}}.$ (7b)

Throughout this paper, we represent an algorithm by a set of equations as in the above. This representation means that the algorithm begins with a set of initial values and repeats the update of the variables using the equations presented until some stopping criterion is satisfied. Lloyd's K-means algorithm begins with a set of initial assignments $V^0 \in \{e_1, \ldots, e_{\hat{r}}\}^N$.
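The two updates (7a)-(7b) can be sketched directly in code. This is a minimal sketch: the function name, the empty-cluster guard, and the fixed-assignment stopping rule are our own choices.

```python
import numpy as np

def lloyd_kmeans(A, labels0, n_iter=100):
    """Lloyd's K-means on the columns of A (m x N), following (7a)-(7b).

    labels0: initial assignment l_j in {0, ..., r_hat - 1} for each datum.
    """
    m, N = A.shape
    labels = labels0.copy()
    r_hat = labels.max() + 1
    centers = np.zeros((m, r_hat))
    for _ in range(n_iter):
        # (7a): cluster sizes and centers (keep the old center if a cluster empties).
        for l in range(r_hat):
            idx = labels == l
            if idx.any():
                centers[:, l] = A[:, idx].mean(axis=1)
        # (7b): reassign each datum to its nearest center.
        d2 = ((A[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)  # N x r_hat
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments fixed: converged
            break
        labels = new_labels
    return labels, centers
```

Each pass alternates the two minimizations described above: centers given assignments, then assignments given centers.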
This algorithm easily gets stuck in local minima, and its performance heavily depends on the initial values. Some initialization methods for obtaining a better local minimum have been proposed [16].
Maximum accuracy clustering can be solved approximately by using the variational Bayes matrix factorization, since it gives an approximation to the marginal posterior distribution of $v_j$ given $A$.

4 Proposed algorithm

4.1 Approximate message passing algorithm for low-rank matrix reconstruction

We first discuss the general idea of the AMP algorithm and its advantages over the variational Bayes matrix factorization. The AMP algorithm is derived by approximating the belief propagation message passing algorithm in a way thought to be asymptotically exact for large-scale problems with appropriate randomness. Fixed points of the belief propagation message passing algorithm correspond to local minima of the KL divergence between a kind of trial function and the posterior distribution [17]. Therefore, the belief propagation message passing algorithm can be regarded as an iterative algorithm based on an approximation of the posterior distribution, called the Bethe approximation. The Bethe approximation can reflect dependence of random variables (dependence between $U$ and $V$ in $\hat{p}(U, V|A)$ in our problem) to some extent. Therefore, one can intuitively expect the AMP algorithm to perform better than the variational Bayes matrix factorization, which treats $U$ and $V$ as if they were independent in $\hat{p}(U, V|A)$.
An important property of the AMP algorithm, aside from its efficiency and effectiveness, is that one can predict its performance accurately for large-scale problems by using a set of equations called the state evolution [6]. Analysis with the state evolution also shows that the required number of iterations is $O(1)$ even when the problem size is large.
Although we can present the state evolution for the algorithm proposed in this paper and give a proof of its validity like [8, 18], we do not discuss the state evolution here due to the limited space available.
We introduce a one-parameter extension of the posterior distribution $\hat{p}(U, V|A)$ to treat the marginalization problem and the MAP problem in a unified manner. It is defined as follows:

$\hat{p}(U, V|A; \beta) \propto \left(\exp\left(-\frac{1}{2m\tau} \|A - UV^\top\|_F^2\right) \hat{p}_U(U) \hat{p}_V(V)\right)^{\beta},$ (8)

which is proportional to $\hat{p}(U, V|A)^\beta$, where $\beta > 0$ is the parameter. When $\beta = 1$, $\hat{p}(U, V|A; \beta)$ reduces to $\hat{p}(U, V|A)$. In the limit $\beta \to \infty$, the distribution $\hat{p}(U, V|A; \beta)$ concentrates on the maxima of $\hat{p}(U, V|A)$. An algorithm for the marginalization problem on $\hat{p}(U, V|A; \beta)$ is therefore particularized to the algorithms for the marginalization problem and for the MAP problem for the original posterior distribution $\hat{p}(U, V|A)$ by letting $\beta = 1$ and $\beta \to \infty$, respectively. The AMP algorithm for the marginalization problem on $\hat{p}(U, V|A; \beta)$ is derived in a way similar to that described in [9], as detailed in the Supplementary Material.
In the derived algorithm, the values of the variables $B_u^t = (b_{u,1}^t, \ldots, b_{u,m}^t)^\top \in \mathbb{R}^{m \times \hat{r}}$, $B_v^t = (b_{v,1}^t, \ldots, b_{v,N}^t)^\top \in \mathbb{R}^{N \times \hat{r}}$, $\Lambda_u^t \in \mathbb{R}^{\hat{r} \times \hat{r}}$, $\Lambda_v^t \in \mathbb{R}^{\hat{r} \times \hat{r}}$, $U^t = (u_1^t, \ldots, u_m^t)^\top \in \mathbb{R}^{m \times \hat{r}}$, $V^t = (v_1^t, \ldots, v_N^t)^\top \in \mathbb{R}^{N \times \hat{r}}$, $S_1^t, \ldots, S_m^t \in \mathbb{R}^{\hat{r} \times \hat{r}}$, and $T_1^t, \ldots, T_N^t \in \mathbb{R}^{\hat{r} \times \hat{r}}$ are calculated iteratively, where the superscript $t \in \mathbb{N} \cup \{0\}$ represents the iteration number. Variables with a negative iteration number are defined as 0.
The algorithm is as follows:

Algorithm 2.

$B_u^t = \frac{1}{m\tau} A V^t - \frac{1}{m\tau} U^{t-1} \sum_{j=1}^N T_j^t, \qquad \Lambda_u^t = \frac{1}{m\tau} (V^t)^\top V^t + \frac{1}{\beta m\tau} \sum_{j=1}^N T_j^t - \frac{1}{m\tau} \sum_{j=1}^N T_j^t,$ (9a)
$u_i^t = f(b_{u,i}^t, \Lambda_u^t; \hat{p}_u), \qquad S_i^t = G(b_{u,i}^t, \Lambda_u^t; \hat{p}_u),$ (9b)
$B_v^t = \frac{1}{m\tau} A^\top U^t - \frac{1}{m\tau} V^t \sum_{i=1}^m S_i^t, \qquad \Lambda_v^t = \frac{1}{m\tau} (U^t)^\top U^t + \frac{1}{\beta m\tau} \sum_{i=1}^m S_i^t - \frac{1}{m\tau} \sum_{i=1}^m S_i^t,$ (9c)
$v_j^{t+1} = f(b_{v,j}^t, \Lambda_v^t; \hat{p}_v), \qquad T_j^{t+1} = G(b_{v,j}^t, \Lambda_v^t; \hat{p}_v).$ (9d)

Algorithm 2 is almost symmetric in $U$ and $V$. Equations (9a)-(9b) and (9c)-(9d) update quantities related to the estimates of $U_0$ and $V_0$, respectively. The algorithm requires an initial value $V^0$ and begins with $T_j^0 = O$. The functions $f(\cdot, \cdot; \hat{p}) : \mathbb{R}^{\hat{r}} \times \mathbb{R}^{\hat{r} \times \hat{r}} \to \mathbb{R}^{\hat{r}}$ and $G(\cdot, \cdot; \hat{p}) : \mathbb{R}^{\hat{r}} \times \mathbb{R}^{\hat{r} \times \hat{r}} \to \mathbb{R}^{\hat{r} \times \hat{r}}$, which have a p.d.f. $\hat{p} : \mathbb{R}^{\hat{r}} \to \mathbb{R}$ as a parameter, are defined by

$f(b, \Lambda; \hat{p}) := \int u \, \hat{q}(u; b, \Lambda, \hat{p}) \, du, \qquad G(b, \Lambda; \hat{p}) := \frac{\partial f(b, \Lambda; \hat{p})}{\partial b},$ (10)

where $\hat{q}(u; b, \Lambda, \hat{p})$ is the normalized p.d.f.
of $u$ defined by

$\hat{q}(u; b, \Lambda, \hat{p}) \propto \exp\left(-\beta\left(\frac{1}{2} u^\top \Lambda u - b^\top u - \log \hat{p}(u)\right)\right).$ (11)

One can see that $f(b, \Lambda; \hat{p})$ is the mean of the distribution $\hat{q}(u; b, \Lambda, \hat{p})$ and that $G(b, \Lambda; \hat{p})$ is its covariance matrix scaled by $\beta$. The function $f(b, \Lambda; \hat{p})$ need not be differentiable everywhere; Algorithm 2 works if $f(b, \Lambda; \hat{p})$ is differentiable at every $b$ for which one needs to calculate $G(b, \Lambda; \hat{p})$ in running the algorithm.
We assume in the rest of this section the convergence of Algorithm 2, although convergence is not guaranteed in general. Let $B_u^\infty$, $B_v^\infty$, $\Lambda_u^\infty$, $\Lambda_v^\infty$, $S_i^\infty$, $T_j^\infty$, $U^\infty$, and $V^\infty$ be the converged values of the respective variables. First, consider running Algorithm 2 with $\beta = 1$. The marginal posterior distribution is then approximated as

$\hat{p}_{i,j}(u_i, v_j|A) \approx \hat{q}(u_i; b_{u,i}^\infty, \Lambda_u^\infty, \hat{p}_u) \, \hat{q}(v_j; b_{v,j}^\infty, \Lambda_v^\infty, \hat{p}_v).$ (12)

Since $u_i^\infty$ and $v_j^\infty$ are the means of $\hat{q}(u; b_{u,i}^\infty, \Lambda_u^\infty, \hat{p}_u)$ and $\hat{q}(v; b_{v,j}^\infty, \Lambda_v^\infty, \hat{p}_v)$, respectively, the posterior mean $\mathrm{E}[UV^\top|A] = \int UV^\top \hat{p}(U, V|A) \, dU \, dV$ is approximated as

$\mathrm{E}[UV^\top|A] \approx U^\infty (V^\infty)^\top.$ (13)

The marginal MAP estimates $u_i^{\mathrm{MMAP}}$ and $v_j^{\mathrm{MMAP}}$ are approximated as

$u_i^{\mathrm{MMAP}} \approx \arg\max_u \hat{q}(u; b_{u,i}^\infty, \Lambda_u^\infty, \hat{p}_u), \qquad v_j^{\mathrm{MMAP}} \approx \arg\max_v \hat{q}(v; b_{v,j}^\infty, \Lambda_v^\infty, \hat{p}_v).$ (14)

Taking the limit $\beta \to \infty$ in Algorithm 2 yields an algorithm for the MAP problem (4).
In this case, the functions $f$ and $G$ are replaced with

$f_\infty(b, \Lambda; \hat{p}) := \arg\min_u \left[\frac{1}{2} u^\top \Lambda u - b^\top u - \log \hat{p}(u)\right], \qquad G_\infty(b, \Lambda; \hat{p}) := \frac{\partial f_\infty(b, \Lambda; \hat{p})}{\partial b}.$ (15)

One may calculate $G_\infty(b, \Lambda; \hat{p})$ from the Hessian of $\log \hat{p}(u)$ at $u = f_\infty(b, \Lambda; \hat{p})$, denoted by $H$, via the identity $G_\infty(b, \Lambda; \hat{p}) = (\Lambda - H)^{-1}$. This identity follows from the implicit function theorem under some additional assumptions and helps in the case where the explicit form of $f_\infty(b, \Lambda; \hat{p})$ is not available. The MAP estimate is approximated by $(U^\infty, V^\infty)$.

4.2 Properties of the algorithm

Algorithm 2 has several plausible properties. First, it has a low computational cost. The computational cost per iteration is $O(mN)$, which is linear in the number of components of the matrix $A$. Calculation of $f(\cdot, \cdot; \hat{p})$ and $G(\cdot, \cdot; \hat{p})$ is performed $O(N + m)$ times per iteration. The constant factor depends on $\hat{p}$ and $\beta$. Calculation of $f$ for $\beta < \infty$ generally involves an $\hat{r}$-dimensional numerical integration, although it is not needed in cases where an analytic expression of the integral is available or where the variables take only discrete values. Calculation of $f_\infty$ involves minimization over an $\hat{r}$-dimensional vector. When $-\log \hat{p}$ is a convex function and $\Lambda$ is positive semidefinite, this minimization problem is convex and can be solved at relatively low cost.
Second, Algorithm 2 has a form similar to that of an algorithm based on the variational Bayesian matrix factorization.
In fact, if the last terms on the right-hand sides of the four equations in (9a) and (9c) are removed, the resulting algorithm is the same as an algorithm based on the variational Bayesian matrix factorization proposed in [4] and, in particular, the same as the ICM algorithm when $\beta \to \infty$. (Note, however, that [4] only treats the case where the priors $\hat{p}_u$ and $\hat{p}_v$ are multivariate Gaussian distributions.) Note that the additional computational cost of these extra terms is $O(m + N)$, which is insignificant compared with the cost of the whole algorithm, which is $O(mN)$.
Third, when one deals with the MAP problem, the value of $C^{\mathrm{MAP}}(U, V)$ may increase during iterations of Algorithm 2. The following proposition, however, guarantees optimality of the output of Algorithm 2 in a certain sense, if it has converged.

Proposition 1. Let $(U^\infty, V^\infty, S_1^\infty, \ldots, S_m^\infty, T_1^\infty, \ldots, T_N^\infty)$ be a fixed point of the AMP algorithm for the MAP problem and suppose that $\sum_{i=1}^m S_i^\infty$ and $\sum_{j=1}^N T_j^\infty$ are positive semidefinite. Then $U^\infty$ is a global minimum of $C^{\mathrm{MAP}}(U, V^\infty)$ and $V^\infty$ is a global minimum of $C^{\mathrm{MAP}}(U^\infty, V)$.

The proof is in the Supplementary Material. The key to the proof is the following reformulation:

$U^t = \arg\min_U \left[ C^{\mathrm{MAP}}(U, V^t) - \mathrm{tr}\left( \left( \frac{1}{2m\tau} \sum_{j=1}^N T_j^t \right) (U - U^{t-1})^\top (U - U^{t-1}) \right) \right].$ (16)

If $\sum_{j=1}^N T_j^t$ is positive semidefinite, the second term of the minimand is the negative squared pseudometric between $U$ and $U^{t-1}$, which is interpreted as a penalty on nearness to the temporary estimate. Positive semidefiniteness of $\sum_{i=1}^m S_i^t$ and $\sum_{j=1}^N T_j^t$ holds in almost all cases.
In fact, we only have to assume $\lim_{\beta \to \infty} G(b, \Lambda; \hat{p}) = G_\infty(b, \Lambda; \hat{p})$, since $G(b, \Lambda; \hat{p})$ is a scaled covariance matrix of $\hat{q}(u; b, \Lambda, \hat{p})$, which is positive semidefinite. It follows from Proposition 1 that any fixed point of the AMP algorithm is also a fixed point of the ICM algorithm. This has two implications: (i) execution of the ICM algorithm initialized with the converged values of the AMP algorithm does not improve $C^{\mathrm{MAP}}(U^t, V^t)$; (ii) the AMP algorithm has no more fixed points than the ICM algorithm. The second implication may help the AMP algorithm avoid getting stuck in bad local minima.

4.3 Clustering via AMP algorithm

One can use the AMP algorithm for the MAP problem to perform the K-means clustering by letting $\hat{p}_u(u) = 1$ and $\hat{p}_v(v) = \hat{r}^{-1} \sum_{l=1}^{\hat{r}} \delta(v - e_l)$. Noting that $f_\infty(b, \Lambda; \hat{p}_v)$ is piecewise constant with respect to $b$, and hence $G_\infty(b, \Lambda; \hat{p}_v)$ is $O$ almost everywhere, we obtain the following algorithm:

Algorithm 3 (AMP algorithm for the K-means clustering).

$B_u^t = \frac{1}{m\tau} A V^t, \qquad \Lambda_u^t = \frac{1}{m\tau} (V^t)^\top V^t, \qquad U^t = B_u^t (\Lambda_u^t)^{-1}, \qquad S^t = (\Lambda_u^t)^{-1},$ (17a)
$B_v^t = \frac{1}{m\tau} A^\top U^t - \frac{1}{\tau} V^t S^t, \qquad \Lambda_v^t = \frac{1}{m\tau} (U^t)^\top U^t - \frac{1}{\tau} S^t,$ (17b)
$v_j^{t+1} = \arg\min_{v \in \{e_1, \ldots, e_{\hat{r}}\}} \left[ \frac{1}{2} v^\top \Lambda_v^t v - v^\top b_{v,j}^t \right].$ (17c)

It is initialized with an assignment $V^0 \in \{e_1, \ldots, e_{\hat{r}}\}^N$.
Algorithm 3 is rewritten as follows:

$n_l^t = \sum_{j=1}^N I(v_j^t = e_l), \qquad \tilde{u}_l^t = \frac{1}{n_l^t} \sum_{j=1}^N a_j I(v_j^t = e_l),$ (18a)
$l_j^{t+1} = \arg\min_{l \in \{1, \ldots, \hat{r}\}} \left[ \frac{1}{m\tau} \|a_j - \tilde{u}_l^t\|_2^2 + \frac{2m}{n_l^t} I(v_j^t = e_l) - \frac{m}{n_l^t} \right], \qquad v_j^{t+1} = e_{l_j^{t+1}}.$ (18b)

The parameter $\tau$ appearing in the algorithm does not exist in the K-means clustering problem. In fact, $\tau$ appears because $m^{-2} \sum_{i=1}^m A_{ij}^2 S_i^t$ was estimated by $\tau m^{-1} \sum_{i=1}^m S_i^t$ in deriving Algorithm 2, which can be justified for large-sized problems. In practice, we propose using $m^{-2} N^{-1} \|A - U^t (V^t)^\top\|_F^2$ as a temporary estimate of $\tau$ at the $t$th iteration. While the AMP algorithm for the K-means clustering updates the value of $U$ in the same way as Lloyd's K-means algorithm, it performs assignments of data to clusters in a different way. In the AMP algorithm, in addition to the distances from data to centers of clusters, the current assignment is taken into consideration in two ways: (i) a datum is less likely to be assigned to the cluster that it is assigned to at present; (ii) data are more likely to be assigned to a cluster whose current size is smaller. The former can intuitively be understood by observing that if $v_j^t = e_l$, one should take account of the fact that the cluster center $\tilde{u}_l^t$ is biased toward $a_j$. The term $2m(n_l^t)^{-1} I(v_j^t = e_l)$ in (18b) corrects this bias, which, as it should be, is inversely proportional to the cluster size.
The AMP algorithm for maximum accuracy clustering is obtained by letting $\beta = 1$ and $\hat{p}_v(v)$ be a discrete distribution on $\{e_1, \ldots, e_{\hat{r}}\}$.
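One sweep of the rewritten K-means form (18a)-(18b), together with the temporary estimate of $\tau$ from Subsection 4.3, can be sketched as follows. This is a minimal illustration under our own naming conventions; ties and empty clusters are not handled.

```python
import numpy as np

def amp_km_step(A, labels, tau):
    """One iteration of the rewritten Algorithm 3, eqs. (18a)-(18b)."""
    m, N = A.shape
    r_hat = labels.max() + 1
    # (18a): cluster sizes and centers, exactly as in Lloyd's algorithm.
    n = np.array([(labels == l).sum() for l in range(r_hat)])
    centers = np.stack([A[:, labels == l].mean(axis=1) for l in range(r_hat)],
                       axis=1)
    # (18b): squared distances corrected by two assignment-dependent terms:
    # +2m/n_l if the datum currently sits in cluster l (removes the bias of
    # that center toward the datum), and -m/n_l favoring smaller clusters.
    d2 = ((A[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)  # N x r_hat
    current = labels[:, None] == np.arange(r_hat)[None, :]
    cost = d2 / (m * tau) + (2.0 * m / n)[None, :] * current - (m / n)[None, :]
    return cost.argmin(axis=1), centers

def estimate_tau(A, centers, labels):
    """Temporary estimate of tau: m^-2 N^-1 ||A - U V^T||_F^2."""
    m, N = A.shape
    resid = A - centers[:, labels]
    return (resid ** 2).sum() / (m ** 2 * N)
```

In practice one would alternate `amp_km_step` with `estimate_tau` until the assignments stop changing (or oscillate with period two, as observed in the experiments below).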
After the algorithm converges, $\arg\max_v \hat{q}(v; b_{v,j}^\infty, \Lambda_v^\infty, \hat{p}_v)$ gives the final cluster assignment of the $j$th datum, and $U^\infty$ gives the estimate of the cluster centers.

5 Numerical experiments

We conducted numerical experiments on both artificial and real data sets to evaluate the performance of the proposed algorithms for clustering. In the experiment on artificial data sets, we set $m = 800$ and $N = 1600$ and let $\hat{r} = r$. Cluster centers $\tilde{u}_{0,l}$, $l = 1, \ldots, r$, were generated according to the multivariate Gaussian distribution $\mathcal{N}(0, I)$. Cluster assignments $v_{0,j}$, $j = 1, \ldots, N$, were generated according to the uniform distribution on $\{e_1, \ldots, e_r\}$. For fixed $\tau = 0.1$ and $r$, we generated 500 problem instances and solved them with five algorithms: Lloyd's K-means algorithm (K-means), the AMP algorithm for the K-means clustering (AMP-KM), the variational Bayes matrix factorization [4] for maximum accuracy clustering (VBMF-MA), the AMP algorithm for maximum accuracy clustering (AMP-MA), and the K-means++ [16]. The K-means++ updates the variables in the same way as Lloyd's K-means algorithm, with an initial value chosen in a sophisticated manner. For the other algorithms, initial values $v_j^0$, $j = 1, \ldots, N$, were randomly generated from the same distribution as $v_{0,j}$. We used the true prior distributions of $U$ and $V$ for maximum accuracy clustering.
We ran Lloyd's K-means algorithm and the K-means++ until no change was observed. We ran the AMP algorithm for the K-means clustering until either $V^t = V^{t-1}$ or $V^t = V^{t-2}$ was satisfied; this is because we observed oscillations of the assignments of a small number of data. For the other two algorithms, we terminated the iteration when $\|U^t - U^{t-1}\|_F^2 < 10^{-15} \|U^{t-1}\|_F^2$ and $\|V^t - V^{t-1}\|_F^2 < 10^{-15} \|V^{t-1}\|_F^2$ were met or the number of iterations exceeded 3000.
We then evaluated the following performance measures for the obtained solution $(U^*, V^*)$:

- Normalized K-means loss $\|A - U^* (V^*)^\top\|_F^2 \big/ \big(\sum_{j=1}^N \|a_j - \bar{a}\|_2^2\big)$, where $\bar{a} := \frac{1}{N} \sum_{j=1}^N a_j$.
- Accuracy $\max_P N^{-1} \sum_{j=1}^N I(P v_j^* = v_{0,j})$, where the maximization is taken over all $r$-by-$r$ permutation matrices. We used the Hungarian algorithm [19] to solve this maximization problem efficiently.
- Number of iterations needed to converge.

We calculated the averages and the standard deviations of these performance measures over the 500 instances. We conducted the above experiments for various values of $r$.
Figure 1 shows the results. The AMP algorithm for the K-means clustering achieves the smallest K-means loss among the five algorithms, while Lloyd's K-means algorithm and the K-means++ show large K-means losses for $r \geq 5$. We emphasize that all three of these algorithms aim to minimize the same K-means loss; the differences lie in the minimization procedures. The AMP algorithm for maximum accuracy clustering achieves the highest accuracy among the five algorithms. It also shows fast convergence. In particular, the convergence speed of the AMP algorithm for maximum accuracy clustering is comparable to that of the AMP algorithm for the K-means clustering when the two algorithms show similar accuracy ($r < 9$). This is in contrast to the common observation that the variational Bayes method often shows slower convergence than the ICM algorithm.

Figure 1: (a)-(c) Performance for different $r$: (a) Normalized K-means loss. (b) Accuracy. (c) Number of iterations needed to converge. (d) Dynamics for $r = 5$.
Average accuracy at each\niteration is shown. Error bars represent standard deviations.\n\n(a)\n\n(b)\n\nFigure 2: Performance measures in real-data experiments. (a) Normalized K-means loss. (b) Accu-\nracy. The results for the 50 trials are shown in the descending order of performance for AMP-KM.\nThe worst two results for AMP-KM are out of the range.\n\nIn the experiment on real data, we used the ORL Database of Faces [20], which contains 400 images\nof human faces, ten different images of each of 40 distinct subjects. Each image consists of 112 (cid:2)\n92 = 10304 pixels whose value ranges from 0 to 255. We divided N = 400 images into ^r = 40\nclusters with the K-means++ and the AMP algorithm for the K-means clustering. We adopted the\ninitialization method of the K-means++ also for the AMP algorithm, because random initialization\noften yielded empty clusters and almost all data were assigned to only one cluster. The parameter (cid:28)\nwas estimated in the way proposed in Subsection 4.3. We ran 50 trials with different initial values,\nand Figure 2 summarizes the results.\nThe AMP algorithm for the K-means clustering outperformed the standard K-means++ algorithm\nin 48 out of the 50 trials in terms of the K-means loss and in 47 trials in terms of the accuracy.\nThe AMP algorithm yielded just one cluster with all data assigned to it in two trials. The attained\nminimum value of K-means loss is 0.412 with the K-means++ and 0.400 with the AMP algorithm.\nThe accuracies at these trials are 0.635 with the K-means++ and 0.690 with the AMP algorithm. The\naverage number of iterations was 6.6 with the K-means++ and 8.8 with the AMP algorithm. 
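The two performance measures can be computed as in the following sketch, which assumes the data vectors a_j are the columns of A and that cluster assignments are encoded as integer labels. SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm [19]; the function names are mine:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian-style assignment solver

def normalized_kmeans_loss(A, U, V):
    """||A - U V^T||_F^2 / sum_j ||a_j - abar||_2^2, with abar the mean column of A."""
    abar = A.mean(axis=1, keepdims=True)
    return np.linalg.norm(A - U @ V.T, "fro") ** 2 / np.linalg.norm(A - abar, "fro") ** 2

def clustering_accuracy(est_labels, true_labels, r):
    """max_P (1/N) sum_j I(P v*_j = v_{0,j}) over r-by-r permutation matrices P."""
    N = len(true_labels)
    confusion = np.zeros((r, r), dtype=int)  # counts of (estimated, true) label pairs
    for k, l in zip(est_labels, true_labels):
        confusion[k, l] += 1
    rows, cols = linear_sum_assignment(-confusion)  # maximize total matched counts
    return confusion[rows, cols].sum() / N
```

Maximizing over permutation matrices is exactly a linear assignment problem on the confusion matrix, which is why the Hungarian algorithm solves it efficiently.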
These results demonstrate the efficiency of the proposed algorithm on real data.

References

[1] P. Paatero, "Least squares formulation of robust non-negative factor analysis," Chemometrics and Intelligent Laboratory Systems, vol. 37, no. 1, pp. 23–35, May 1997.

[2] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, Dec. 2004.

[3] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proceedings of the 25th International Conference on Machine Learning, New York, NY, Jul. 5–Aug. 9, 2008, pp. 880–887.

[4] Y. J. Lim and Y. W. Teh, "Variational Bayesian approach to movie rating prediction," in Proceedings of KDD Cup and Workshop, San Jose, CA, Aug. 12, 2007.

[5] T. Raiko, A. Ilin, and J. Karhunen, "Principal component analysis for large scale problems with lots of missing values," in Machine Learning: ECML 2007, ser. Lecture Notes in Computer Science, J. N. Kok, J. Koronacki, R. L. de Mantaras, S. Matwin, D. Mladenić, and A. Skowron, Eds. Springer Berlin Heidelberg, 2007, vol. 4701, pp. 691–698.

[6] D. L. Donoho, A. Maleki, and A.
Montanari, "Message-passing algorithms for compressed sensing," Proceedings of the National Academy of Sciences USA, vol. 106, no. 45, pp. 18914–18919, Nov. 2009.

[7] S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in Proceedings of 2011 IEEE International Symposium on Information Theory, St. Petersburg, Russia, Jul. 31–Aug. 5, 2011, pp. 2168–2172.

[8] S. Rangan and A. K. Fletcher, "Iterative estimation of constrained rank-one matrices in noise," in Proceedings of 2012 IEEE International Symposium on Information Theory, Cambridge, MA, Jul. 1–6, 2012, pp. 1246–1250.

[9] R. Matsushita and T. Tanaka, "Approximate message passing algorithm for low-rank matrix reconstruction," in Proceedings of the 35th Symposium on Information Theory and its Applications, Oita, Japan, Dec. 11–14, 2012, pp. 314–319.

[10] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, Jul. 28–Aug. 1, 2003, pp. 267–273.

[11] C. Ding, T. Li, and M. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, Jan. 2010.

[12] S. P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.

[13] F. Krzakala, M. Mézard, and L. Zdeborová, "Phase diagram and approximate message passing for blind calibration and dictionary learning," preprint, Jan. 2013, arXiv:1301.5898v1 [cs.IT].

[14] J. T. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing," preprint, Oct.
2013, arXiv:1310.2632v1 [cs.IT].

[15] S. Nakajima and M. Sugiyama, "Theoretical analysis of Bayesian matrix factorization," Journal of Machine Learning Research, vol. 12, pp. 2583–2648, Sep. 2011.

[16] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), New Orleans, Louisiana, Jan. 7–9, 2007, pp. 1027–1035.

[17] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.

[18] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Transactions on Information Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.

[19] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, Mar. 1955.

[20] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, Dec. 1994, pp. 138–142. [Online]. Available: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html