{"title": "RTRMC: A Riemannian trust-region method for low-rank matrix completion", "book": "Advances in Neural Information Processing Systems", "page_first": 406, "page_last": 414, "abstract": "We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances.", "full_text": "RTRMC: A Riemannian trust-region method for\n\nlow-rank matrix completion\n\nNicolas Boumal\u2217\nICTEAM Institute\n\nUniversit\u00b4e catholique de Louvain\n\nB-1348 Louvain-la-Neuve\n\nnicolas.boumal@uclouvain.be\n\nP.-A. Absil\n\nICTEAM Institute\n\nUniversit\u00b4e catholique de Louvain\n\nB-1348 Louvain-la-Neuve\n\nabsil@inma.ucl.ac.be\n\nAbstract\n\nWe consider large matrices of low rank. We address the problem of recovering such matrices\nwhen most of the entries are unknown. Matrix completion \ufb01nds applications in recommender\nsystems. In this setting, the rows of the matrix may correspond to items and the columns may\ncorrespond to users. The known entries are the ratings given by users to some items. The\naim is to predict the unobserved ratings. This problem is commonly stated in a constrained\noptimization framework. 
We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances.

1 Introduction

We address the problem of recovering a low-rank m-by-n matrix X of which a few entries are observed, possibly with noise. Throughout, we assume that r = rank(X) ≪ m ≤ n and denote by Ω ⊂ {1...m}×{1...n} the set of indices of the observed entries of X, i.e., X_ij is known iff (i,j) ∈ Ω. Solving this problem is notably useful in recommender systems, where one tries to predict the ratings users would give to items they have not purchased.

1.1 Related work

In the noiseless case, one could state the minimum rank matrix recovery problem as follows:

min_{X̂ ∈ R^{m×n}} rank(X̂), such that X̂_ij = X_ij ∀(i,j) ∈ Ω. (1)

This problem, however, is NP-hard [CR09]. A possible convex relaxation of (1) introduced by Candès and Recht [CR09] is to use the nuclear norm of X̂ as objective function, i.e., the sum of its singular values, noted ‖X̂‖_*. The SVT method [CCS08] attempts to solve such a convex problem using tools from compressed sensing and the ADMiRA method [LB10] does so using matching pursuit-like techniques.
Alternatively, one may minimize the discrepancy between X̂ and X at entries Ω under the constraint that rank(X̂) ≤ r for some small constant r. Since any matrix X̂ of rank at most r may be written in the form UW
with U ∈ R^{m×r} and W ∈ R^{r×n}, a reasonable formulation of the problem reads:

min_{U ∈ R^{m×r}} min_{W ∈ R^{r×n}} Σ_{(i,j)∈Ω} ((UW)_ij − X_ij)². (2)

*Web: http://perso.uclouvain.be/nicolas.boumal/

The LMaFit method [WYZ10] does a good job at solving this problem by alternately fixing either of the variables and solving the resulting least-squares problem efficiently.
One drawback of the latter formulation is that the factorization of a matrix X̂ into the product UW is not unique. Indeed, for any r-by-r invertible matrix M, we have UW = (UM)(M⁻¹W). All the matrices UM share the same column space. Hence, the optimal value of the inner optimization problem in (2) is a function of col(U)—the column space of U—rather than U specifically. Dai et al. [DMK11, DKM10] exploit this to recast (2) on the Grassmann manifold G(m,r), i.e., the set of r-dimensional vector subspaces of R^m (see Section 2):

min_{𝒰 ∈ G(m,r)} min_{W ∈ R^{r×n}} Σ_{(i,j)∈Ω} ((UW)_ij − X_ij)², (3)

where U ∈ R^{m×r} is any matrix such that col(U) = 𝒰 and is often chosen to be orthonormal. Unfortunately, the objective function of the outer minimization in (3) may be discontinuous at points 𝒰 for which the least-squares problem in W does not have a unique solution. Dai et al. proposed ingenious ways to deal with the discontinuity. Their focus, though, was on deriving theoretical performance guarantees rather than developing fast algorithms.
Keshavan et al.
[KO09, KM10] state the problem on the Grassmannian too, but propose to simultaneously optimize on the row and column spaces, yielding a smaller least-squares problem which is unlikely to lack a unique solution, resulting in a smooth objective function. In one of their recent papers [KM10], they solve:

min_{𝒰 ∈ G(m,r), 𝒱 ∈ G(n,r)} min_{S ∈ R^{r×r}} Σ_{(i,j)∈Ω} ((USV⊤)_ij − X_ij)² + λ²‖USV⊤‖²_F, (4)

where U and V are any orthonormal bases of 𝒰 and 𝒱, respectively, and λ is a regularization parameter. The authors propose an efficient SVD-based initial guess for 𝒰 and 𝒱 which they refine using a steepest-descent method, along with strong theoretical guarantees.
Meyer et al. [MBS11] proposed a Riemannian approach to linear regression on fixed-rank matrices. Their regression framework encompasses matrix completion problems. Likewise, Balzano et al. [BNR10] introduced GROUSE for subspace identification on the Grassmannian, applicable to matrix completion. Finally, in the preprint [Van11], which became public while we were preparing the camera-ready version of this paper, Vandereycken proposes an approach based on the submanifold geometry of the sets of fixed-rank matrices.

1.2 Our contribution and outline of the paper

Dai et al.'s initial formulation (3) has a discontinuous objective function on the Grassmannian. The OptSpace formulation (4), on the other hand, has a continuous objective and comes with a smart initial guess, but optimizes on a higher-dimensional search space, while it is arguably preferable to keep the dimension of the manifold search space low, even at the expense of a larger least-squares problem. Furthermore, the OptSpace regularization term is efficiently computable since ‖USV⊤‖_F = ‖S‖_F, but it penalizes all entries instead of just the entries (i,j) ∉ Ω.
In an effort to combine the best of both worlds, we equip (3) with a regularization term weighted by λ > 0, which yields a smooth objective function defined over an appropriate search space:

min_{𝒰 ∈ G(m,r)} min_{W ∈ R^{r×n}} (1/2) Σ_{(i,j)∈Ω} C²_ij ((UW)_ij − X_ij)² + (λ²/2) Σ_{(i,j)∉Ω} (UW)²_ij. (5)

Here, we introduced a confidence index C_ij > 0 for each observation X_ij, which may be useful in applications. As we will see, introducing a regularization term is essential to ensure smoothness of the objective and hence obtain good convergence properties. It may not be critical for practical problem instances though.
We further innovate on previous works by using a Riemannian trust-region method, GenRTR [ABG07], as optimization algorithm to minimize (5) on the Grassmannian. GenRTR is readily available as a free Matlab package and comes with strong convergence results that are naturally inherited by our algorithms.
In Section 2, we rapidly cover the essential tools on the Grassmann manifold. In Section 3, we derive expressions for the gradient and the Hessian of our objective function while paying special attention to complexity. Section 4 sums up the main properties of the Riemannian trust-region method. Section 5 shows a few results of numerical experiments demonstrating the effectiveness of our approach.

2 Geometry of the Grassmann manifold

Our objective function f (10) is defined over the Grassmann manifold G(m,r), i.e., the set of r-dimensional vector subspaces of R^m. Absil et al. [AMS08] give a computation-oriented description of the geometry of this manifold.
Here, we only give a summary of the important tools we use.
Each point 𝒰 ∈ G(m,r) is a vector subspace we may represent numerically as the column space of a full-rank matrix U ∈ R^{m×r}: 𝒰 = col(U). For numerical reasons, we will only use orthonormal matrices U ∈ U(m,r) = {U ∈ R^{m×r} : U⊤U = I_r}. The set U(m,r) is the Stiefel manifold.
The Grassmannian is a Riemannian manifold, and as such we can define a tangent space to G(m,r) at each point 𝒰, noted T_𝒰 G(m,r). The latter is a vector space of dimension dim G(m,r) = r(m−r). A tangent vector ℋ ∈ T_𝒰 G(m,r), where we represent 𝒰 as the orthonormal matrix U, is represented by a unique matrix H ∈ R^{m×r} verifying U⊤H = 0 and (d/dt) col(U + tH)|_{t=0} = ℋ. For practical purposes we may, with a slight abuse of notation we often commit hereafter, write ℋ = H—assuming U is known from the context—and T_U G(m,r) = {H ∈ R^{m×r} : U⊤H = 0}. Each tangent space is endowed with an inner product, the Riemannian metric, that varies smoothly from point to point. It is inherited from the embedding space of the matrix representation of tangent vectors R^{m×r}: ∀H₁, H₂ ∈ T_U G(m,r) : ⟨H₁, H₂⟩_U = Trace(H₂⊤H₁). The orthogonal projector from R^{m×r} onto the tangent space T_U G(m,r) is given by:

P_U : R^{m×r} → T_U G(m,r) : H ↦ P_U H = (I − UU⊤)H.

One can also project a vector onto the tangent space of the Stiefel manifold:

P^St_U : R^{m×r} → T_U U(m,r) : H ↦ P^St_U H = (I − UU⊤)H + U skew(U⊤H),

where skew(X) = (X − X⊤)/2 extracts the skew-symmetric part of X. This is useful for the computation of grad f(𝒰) ∈ T_U G(m,r). Indeed, according to [AMS08, eqs. (3.37) and (3.39)], considering f̄ : R^{m×r} → R, its restriction f̄|_{U(m,r)} to the Stiefel manifold and f : G(m,r) → R such that f(col(U)) = f̄|_{U(m,r)}(U) is well-defined, as will be the case in Section 3, we have (with a slight abuse of notation):

grad f(𝒰) = grad f̄|_{U(m,r)}(U) = P^St_U grad f̄(U). (6)

Similarly, since P_U ∘ P^St_U = P_U, the Hessian of f at 𝒰 along ℋ is given by [AMS08, eqs. (5.14) and (5.18)]:

Hess f(𝒰)[ℋ] = P_U(D(U ↦ P^St_U grad f̄(U))(U)[H]), (7)

where Dg(X)[H] is the directional derivative of g at X along H, in the classical sense. For our optimization algorithms, it is important to be able to move along the manifold from some initial point 𝒰 in some prescribed direction specified by a tangent vector H. To this end, we use the retraction:

R_U(H) = qf(U + H), (8)

where qf(X) ∈ U(m,r) designates the m-by-r Q-factor of the QR decomposition of X ∈ R^{m×r}.

3 Computation of the objective function and its derivatives

We seek an m-by-n matrix X̂ of rank not more than r such that X̂ is as close as possible to a given matrix X at the entries in the observation set Ω. Furthermore, we are given a weight matrix C ∈ R^{m×n} indicating the confidence we have in each observed entry of X. The matrix C is positive at entries in Ω and zero elsewhere. To this end, we consider the following function, where (X_Ω)_ij equals X_ij if (i,j) ∈ Ω and is zero otherwise:

f̂ : R^{m×r} × R^{r×n} → R : (U, W) ↦ f̂(U, W) = (1/2)‖C ⊙ (UW − X_Ω)‖²_Ω + (λ²/2)‖UW‖²_Ω̄, (9)

where ⊙ is the entry-wise product, λ > 0 is a regularization parameter, Ω̄ is the complement of the set Ω and ‖M‖²_Ω ≜ Σ_{(i,j)∈Ω} M²_ij.
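As a concrete illustration of (9), the cost can be evaluated densely on a tiny instance. This is a didactic numpy sketch only: the variable names and sizes are our own choices, and a practical implementation should never form UW in full, as Section 3.1 explains.

```python
import numpy as np

# Tiny dense illustration of the cost (9); all names and sizes are our own choices.
rng = np.random.default_rng(0)
m, n, r, lam = 6, 8, 2, 1e-6

mask = rng.random((m, n)) < 0.5               # Omega: True where an entry is observed
C = mask * rng.uniform(1.0, 2.0, (m, n))      # confidence weights C_ij > 0 on Omega, 0 elsewhere
X_omega = mask * rng.standard_normal((m, n))  # observed entries, zero-filled off Omega

def f_hat(U, W):
    """0.5*||C . (UW - X_Omega)||_Omega^2 + 0.5*lam^2*||UW||_Omegabar^2, formed densely."""
    UW = U @ W
    fit = 0.5 * np.sum((C * (UW - X_omega))[mask] ** 2)
    reg = 0.5 * lam ** 2 * np.sum(UW[~mask] ** 2)
    return fit + reg

U = np.linalg.qr(rng.standard_normal((m, r)))[0]  # orthonormal representative of a point of G(m, r)
W = rng.standard_normal((r, n))
cost = f_hat(U, W)
```

At W = 0 the regularization term vanishes and the cost reduces to half the squared weighted norm of the observed data, which gives a quick consistency check on such an implementation.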
Picking a small but positive λ will ensure that the objective function f (10) is smooth.
For a fixed U, computing the matrix W that minimizes f̂ is a least-squares problem. The mapping between U and this (unique) optimal W, noted W_U,

U ↦ W_U = argmin_{W ∈ R^{r×n}} f̂(U, W),

is smooth and easily computable—see Section 3.3.
By virtue of the discussion in Section 1, we know that the mapping U ↦ f̂(U, W_U), with U ∈ R^{m×r}, is constant over sets of full-rank matrices U spanning the same column space. Hence, considering these sets as equivalence classes 𝒰, the following function f over the Grassmann manifold is well-defined:

f : G(m,r) → R : 𝒰 ↦ f(𝒰) = f̂(U, W_U), (10)

with any full-rank U ∈ R^{m×r} such that col(U) = 𝒰. The interpretation is as follows: we are looking for an optimal matrix X̂ = UW of rank at most r; we have confidence C_ij that X̂_ij should equal X_ij for (i,j) ∈ Ω and (very small) confidence λ that X̂_ij should equal 0 for (i,j) ∉ Ω.

3.1 Rearranging the objective

Considering (9), it looks like evaluating f̂(U, W) will require the computation of the product UW at the entries in Ω and Ω̄, i.e., we would need to compute the whole matrix UW, which cannot cost much less than O(mnr). Since applications typically involve very large values of the product mn, this is not acceptable.
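In code terms, the first ingredient of the remedy is that, with Ω stored as index pairs, the product UW can be formed at the observed entries only, in O(|Ω|r) flops. A hedged numpy sketch with our own variable names (not the paper's C-Mex implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 500, 700, 4
U = rng.standard_normal((m, r))
W = rng.standard_normal((r, n))
# Omega as index pairs (i, j); |Omega| << m*n
I = rng.integers(0, m, size=2000)
J = rng.integers(0, n, size=2000)

# (UW)_ij for (i, j) in Omega via row-column dot products: O(|Omega| r) flops,
# never forming the dense m-by-n product UW.
vals = np.einsum('kr,rk->k', U[I, :], W[:, J])

# check against the dense product on the sampled entries (feasible only at this toy size)
dense = U @ W
assert np.allclose(vals, dense[I, J])
```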
Alternatively, if we restrict ourselves—without loss of generality—to orthonormal matrices U, we observe that ‖UW‖²_Ω + ‖UW‖²_Ω̄ = ‖UW‖²_F = ‖W‖²_F. Consequently, for all U in U(m,r), we have f̂(U, W_U) = f̌(U, W_U), where

f̌(U, W) = (1/2)‖C ⊙ (UW − X_Ω)‖²_Ω + (λ²/2)‖W‖²_F − (λ²/2)‖UW‖²_Ω. (11)

This only requires the computation of UW at entries in Ω, at a cost of O(|Ω|r). Finally, let f̄ : R^{m×r} → R : U ↦ f̌(U, W_U), and observe that f(col(U)) = f̄|_{U(m,r)}(U) for all U in U(m,r), as in the setting of Section 2.

3.2 Gradient and Hessian of the objective

We now derive formulas for the first- and second-order derivatives of f. In deriving these formulas, it is useful to note that, for a suitably smooth mapping g,

grad(X ↦ (1/2)‖g(X)‖²_F)(X) = (Dg(X))*[g(X)], (12)

where (Dg(X))* is the adjoint of the differential of g at X. For ease of notation, let us define the following m-by-n matrix with the sparsity structure induced by Ω:

Ĉ_ij = C²_ij − λ² if (i,j) ∈ Ω, and Ĉ_ij = 0 otherwise. (13)

We also introduce a sparse residue matrix R_U that will come up in various formulas:

R_U = Ĉ ⊙ (UW_U − X_Ω) − λ²X_Ω. (14)

Successively using the chain rule, the optimality of W_U and (12), we obtain:

grad f̄(U) = (d/dU) f̌(U, W_U) = (∂/∂U) f̌(U, W_U) + (∂/∂W) f̌(U, W_U) · (d/dU) W_U = (∂/∂U) f̌(U, W_U) = R_U W_U⊤.

Indeed, since W_U is optimal, (∂/∂W) f̌(U, W_U) = U⊤R_U + λ²W_U = 0. Then, according to the identity (6) and since U⊤R_U = −λ²W_U, the gradient of f at 𝒰 = col(U) on the Grassmannian is given by:

grad f(𝒰) = grad f̄|_{U(m,r)}(U) = P^St_U grad f̄(U) = (I − UU⊤)R_U W_U⊤ + U skew(U⊤R_U W_U⊤) = (I − UU⊤)R_U W_U⊤ − λ²U skew(W_U W_U⊤) = R_U W_U⊤ + λ²U(W_U W_U⊤), (15)

where the skew term vanishes because W_U W_U⊤ is symmetric.
We now differentiate (15) according to the identity (7) to get a matrix representation of the Hessian of f at 𝒰 along ℋ. We note H a matrix representation of the tangent vector ℋ chosen in accordance with U and

W_{U,H} ≜ D(U ↦ W_U)(U)[H]

the derivative of the mapping U ↦ W_U at U along the tangent direction H. Then:

Hess f(𝒰)[ℋ] = (I − UU⊤) D(grad f)(U)[H] = (I − UU⊤)[Ĉ ⊙ (HW_U + UW_{U,H})]W_U⊤ + R_U W_{U,H}⊤ + λ²H(W_U W_U⊤) + λ²U(W_U W_{U,H}⊤). (16)

3.3 W_U and its derivative W_{U,H}

We still need to provide an explicit formula for W_U and W_{U,H}. We assume U ∈ U(m,r) since we use orthonormal matrices to represent points on the Grassmannian and U⊤H = 0 since H is a tangent vector at 𝒰. We use the vectorization operator, vec, that transforms matrices into vectors by stacking their columns—in Matlab notation, vec(A) = A(:).
Denoting the Kronecker product of two matrices by ⊗, we will use the well-known identity for matrices A, Y, B of appropriate sizes [Bro05]: vec(AYB) = (B⊤ ⊗ A)vec(Y).
We also write I_Ω for the orthonormal |Ω|-by-mn matrix such that vec_Ω(M) = I_Ω vec(M) is a vector of length |Ω| corresponding to the entries M_ij for (i,j) ∈ Ω, taken in order from vec(M).
Computing W_U comes down to minimizing the least-squares objective f̌(U, W) (11) with respect to W. We first manipulate f̌ to reach a standard form for least-squares, with S = I_Ω diag(vec(C)):

f̌(U, W) = (1/2)‖C ⊙ (UW − X_Ω)‖²_Ω + (λ²/2)‖W‖²_F − (λ²/2)‖UW‖²_Ω
         = (1/2)‖S vec(UW) − vec_Ω(C ⊙ X_Ω)‖²₂ + (λ²/2)‖vec(W)‖²₂ − (λ²/2)‖vec_Ω(UW)‖²₂
         = (1/2)‖S(I_n ⊗ U)vec(W) − vec_Ω(C ⊙ X_Ω)‖²₂ + (1/2)‖λI_{rn} vec(W)‖²₂ − (1/2)‖λI_Ω(I_n ⊗ U)vec(W)‖²₂
         = (1/2)‖[S(I_n ⊗ U); λI_{rn}] vec(W) − [vec_Ω(C ⊙ X_Ω); 0_{rn}]‖²₂ − (1/2)‖[λI_Ω(I_n ⊗ U)] vec(W)‖²₂
         = (1/2)‖A₁w − b₁‖²₂ − (1/2)‖A₂w‖²₂,

where w = vec(W) ∈ R^{rn}, 0_{rn} ∈ R^{rn} is the zero vector, [·;·] stacks its arguments vertically, and the definitions for A₁, A₂ and b₁ are obvious.
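The vectorization identity invoked above is easy to sanity-check numerically (illustrative sketch; the shapes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order='F')  # column stacking, Matlab's M(:)

# vec(A Y B) == (B^T kron A) vec(Y)
lhs = vec(A @ Y @ B)
rhs = np.kron(B.T, A) @ vec(Y)
assert np.allclose(lhs, rhs)
```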
If A₁⊤A₁ − A₂⊤A₂ is positive definite, there is a unique minimizing vector vec(W_U), given by:

vec(W_U) = (A₁⊤A₁ − A₂⊤A₂)⁻¹ A₁⊤b₁.

It is easy to compute the following:

A₁⊤A₁ = (I_n ⊗ U⊤)(S⊤S)(I_n ⊗ U) + λ²I_{rn},
A₂⊤A₂ = (I_n ⊗ U⊤)(λ²I_Ω⊤I_Ω)(I_n ⊗ U),
A₁⊤b₁ = (I_n ⊗ U⊤)S⊤vec_Ω(C ⊙ X_Ω) = (I_n ⊗ U⊤)vec(C^(2) ⊙ X_Ω).

Throughout the text, we use the notation M^(n) for entry-wise exponentiation, i.e., (M^(n))_ij = (M_ij)^n. Note that S⊤S − λ²I_Ω⊤I_Ω = diag(vec(Ĉ)). We then define A ∈ R^{rn×rn} as:

A ≜ A₁⊤A₁ − A₂⊤A₂ = (I_n ⊗ U⊤) diag(vec(Ĉ)) (I_n ⊗ U) + λ²I_{rn}. (17)

Observe that the matrix A is block-diagonal, with n symmetric blocks of size r. Each block is indeed positive definite provided λ > 0 (making A positive definite too). Thanks to the sparsity of Ĉ, we can compute these n blocks with O(|Ω|r²) flops. To solve systems in A, we compute the Cholesky factorization of each block, at a total cost of O(nr³). Once these factorizations are computed, each system only costs O(nr²) to solve [TB97].
Collecting all equations in this subsection, we obtain a closed-form formula for W_U:

vec(W_U) = A⁻¹ vec(U⊤[C^(2) ⊙ X_Ω]), (18)

where A is a function of U. We would like to differentiate W_U with respect to U. Using bilinearity and associativity of ⊗ as well as the formula D(X ↦ X⁻¹)(X)[H] = −X⁻¹HX⁻¹ [Bro05], some algebra yields:

vec(W_{U,H}) = −A⁻¹ vec(H⊤R_U + U⊤(Ĉ ⊙ (HW_U))). (19)

The most expensive operation involved in computing W_{U,H} ought to be the resolution of a linear system in A. Fortunately, we already factored the n small diagonal blocks of A in Cholesky form to compute W_U. Consequently, after computing W_U, computing W_{U,H} is cheaper than computing W_{U'} for a new U'. This means that we can benefit from computing this information before we move on to a new candidate on the Grassmannian, i.e., it is worth trying second-order methods. We summarize the complexities in the next subsection.

3.4 Numerical complexities

By exploiting the sparsity of many of the matrices involved and the special structure of the matrix A appearing in the computation of W_U and W_{U,H}, it is possible to compute the objective f as well as its gradient and its Hessian on the Grassmannian in time essentially linear in the size of the data |Ω|. Memory complexities are also linear in |Ω|. We summarize the computational complexities in Table 1.
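In code terms, (18) reduces to n independent r-by-r positive-definite solves, one per diagonal block of A, i.e., one per column of W_U. A hedged numpy sketch with our own variable names (plain solves stand in for the Cholesky factorizations kept by the actual implementation), together with the optimality check U⊤R_U + λ²W_U = 0 from Section 3.2:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r, lam = 60, 80, 3, 1e-6
mask = rng.random((m, n)) < 0.3                 # Omega
C = mask * rng.uniform(1.0, 2.0, (m, n))        # confidence weights
X_omega = mask * rng.standard_normal((m, n))    # observed data, zero-filled off Omega
U = np.linalg.qr(rng.standard_normal((m, r)))[0]  # orthonormal basis of the current subspace

Chat = C ** 2 - lam ** 2 * mask                 # C-hat (13): C_ij^2 - lam^2 on Omega, 0 elsewhere
B = U.T @ (C ** 2 * X_omega)                    # right-hand sides of (18), r-by-n
W = np.empty((r, n))
for j in range(n):                              # n independent r-by-r SPD blocks of A (17)
    Aj = (U * Chat[:, j:j + 1]).T @ U + lam ** 2 * np.eye(r)
    W[:, j] = np.linalg.solve(Aj, B[:, j])      # a cached Cholesky solve in a real implementation

# optimality check: the residue matrix R_U of (14) satisfies U^T R_U + lam^2 W_U = 0
R = Chat * (U @ W - X_omega) - lam ** 2 * X_omega
```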
Please note that most computations are easily parallelizable, but we do not take advantage of it here.

Table 1: All complexities are essentially linear in |Ω|, the number of observed entries.

Computation               | Complexity         | By-products        | Formulas
W_U and f(𝒰)              | O(|Ω|r² + nr³)     | Cholesky form of A | (9), (10), (17), (18)
grad f(𝒰)                 | O(|Ω|r + (m+n)r²)  | R_U and W_U W_U⊤   | (13), (14), (15)
W_{U,H} and Hess f(𝒰)[ℋ]  | O(|Ω|r + (m+n)r²)  |                    | (16), (19)

4 Riemannian trust-region method

We use a Riemannian trust-region (RTR) method [ABG07] to minimize (10), via the freely available Matlab package GenRTR (version 0.3.0) with its default parameter values. The package is available at this address: http://www.math.fsu.edu/~cbaker/GenRTR/?page=download.
At the current iterate 𝒰 = col(U), the RTR method uses the retraction R_U (8) to build a quadratic model m_U : T_U G(m,r) → R of the lifted objective function f ∘ R_U (lift). It then classically minimizes the model inside a trust region on this vector space (solve), and retracts the resulting tangent vector H to a candidate U⁺ = R_U(H) on the Grassmannian (retract). The quality of 𝒰⁺ = col(U⁺) is assessed using f and the step is accepted or rejected accordingly. Likewise, the radius of the trust region is adapted based on the observed quality of the model.
The model m_U of f ∘ R_U has the form:

m_U(H) = f(𝒰) + ⟨grad f(𝒰), H⟩_U + (1/2)⟨A(𝒰)[H], H⟩_U,

where A(𝒰) is some symmetric linear operator on T_U G(m,r). Typically, the faster one can compute A(𝒰)[H], the faster one can minimize m_U(H) in the trust region.
A powerful property of the RTR method is that global convergence of the algorithm toward critical points—local minimizers in practice since it is a descent method—is guaranteed independently of A(𝒰) [ABG07, Thm 4.24, Cor. 4.6]. We take advantage of this and first set it to the identity.
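For intuition, one iteration of the generic trust-region scheme (lift, solve, retract, accept or reject, adapt the radius) can be sketched on a flat space, where the retraction is simply x + h. This toy version is our own simplification, not GenRTR: it uses the inexpensive Cauchy point as model minimizer, with A(𝒰) set to the identity replaced here by an arbitrary symmetric operator Bv.

```python
import numpy as np

def cauchy_step(g, Bv, radius):
    """Minimize the model m(h) = <g, h> + 0.5 <B h, h> along -g, inside ||h|| <= radius."""
    gBg = g @ Bv(g)
    gnorm = np.linalg.norm(g)
    t = radius / gnorm if gBg <= 0 else min(gnorm ** 2 / gBg, radius / gnorm)
    return -t * g

def trust_region(f, grad, Bv, x, radius=1.0, iters=60):
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:
            break
        h = cauchy_step(g, lambda v: Bv(x, v), radius)   # 'solve'
        pred = -(g @ h + 0.5 * h @ Bv(x, h))             # predicted model decrease
        rho = (f(x) - f(x + h)) / pred if pred > 0 else -1.0
        if rho > 0.1:                                    # accept; 'retract' is x + h on a flat space
            x = x + h
        # adapt the trust-region radius from the observed model quality
        radius = 2.0 * radius if rho > 0.75 else (0.25 * radius if rho < 0.25 else radius)
    return x

# sanity run on a strictly convex quadratic f(x) = 0.5 x' P x
P = np.diag([1.0, 2.0, 4.0])
f = lambda x: 0.5 * x @ P @ x
grad = lambda x: P @ x
Bv = lambda x, v: P @ v                                  # exact Hessian-vector product
x_star = trust_region(f, grad, Bv, np.array([1.0, 1.0, 1.0]))
```

On the Grassmannian, x + h is replaced by the retraction (8) and the inner products by the Riemannian metric; GenRTR additionally solves the model with a truncated conjugate-gradient scheme rather than the Cauchy point alone.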
This yields a steepest-descent algorithm we later refer to as RTRMC 1. Additionally, if we take A(𝒰) to be the Hessian of f at 𝒰 (16), we get a quadratic convergence rate, even if we only approximately minimize m_U within the trust region using a few steps of a well-chosen iterative method [ABG07, Thm 4.14]. This means that the RTR method only requires a few computations of the Hessian along specific directions. We call our method using the Hessian RTRMC 2.

5 Numerical experiments

We test our algorithms on both synthetic and real data and compare their performances against OptSpace, ADMiRA, SVT, LMaFit and Balanced Factorization in terms of accuracy and computation time. All algorithms are run sequentially by Matlab on the same personal computer (Intel Core i5 670 @ 3.60GHz (4), 8 GB RAM, Matlab 7.10 (R2010a)). Table 2 specifies a few implementation details.

Table 2: All Matlab implementations call subroutines in non-Matlab code to efficiently deal with the sparsity of the matrices involved. PROPACK [Lar05] is a free package for large and sparse SVD computations.

Method                 | Environment         | Comment
RTRMC 1                | Matlab + some C-Mex | Our method with “approximate Hessian” set to identity, i.e., no second-order information. λ = 10⁻⁶. For the initial guess U₀, we use the OptSpace trimmed SVD.
RTRMC 2                | Matlab + some C-Mex | Same as RTRMC 1 but with exact Hessian.
OptSpace               | C code              | [KO09] with λ = 0. Trimmed SVD + descent on the Grassmannian.
ADMiRA                 | Matlab with PROPACK | [LB10] Matching pursuit based.
SVT                    | Matlab with PROPACK | [CCS08] default τ and δ. Nuclear norm minimization.
LMaFit                 | Matlab + some C-Mex | [WYZ10] Alternating minimization.
Balanced Factorization | Matlab + some C-Mex | [MBS11] One of their Riemannian regression methods.

Our methods (RTRMC 1 and 2) and Balanced Factorization require knowledge of the target rank r.
OptSpace, ADMiRA and LMaFit include a mechanism to guess the rank, but benefit from knowing it, hence we provide the true rank to these methods too. As is, the SVT code does not permit the user to specify the rank.
We use the root mean square error (RMSE) criterion to assess the quality of reconstruction of X with X̂:

RMSE(X, X̂) = ‖X − X̂‖_F / √(mn).

Scenario 1. We first compare the convergence behavior of the different methods on synthetic data. We pick m = n = 10 000 and r = 10. The dimension of the manifold of m-by-n matrices of rank r is d = r(m + n − r). We generate A ∈ R^{m×r} and B ∈ R^{r×n} with i.i.d. normal entries of zero mean and unit variance. The target matrix is X = AB. We sample 2.5d entries uniformly at random, which yields a sampling ratio of 0.5%. Figure 1 is typical and shows the evolution of the RMSE as a function of time (left) and iteration count (right). For X̂ = UV with U ∈ R^{m×r}, V ∈ R^{r×n}, we compute the RMSE in O((m+n)r²) flops using:

(mn) RMSE(AB, UV)² = Trace((A⊤A)(BB⊤)) + Trace((U⊤U)(VV⊤)) − 2 Trace((U⊤A)(BV⊤)).

Be wary though that this formula is numerically inaccurate when the RMSE is much smaller than the norm of either AB or UV, owing to the computation of the difference of close large numbers.

Scenario 2. In this second test, we repeat the previous experiment with rectangular matrices: m = 1 000, n = 30 000, r = 5 and a sampling ratio of 2.6% (5d known entries). We expect RTRMC to perform well on rectangular matrices since the dimension of the Grassmann manifold we optimize on only grows linearly with min(m, n), whereas it is the (simple) least-squares problem dimension that grows linearly in max(m, n). Figure 2 is typical and shows indeed that RTRMC is the fastest tested algorithm on this test.

Scenario 3.
Following the protocol in [KMO09], we test our method on the Jester dataset 1 [GRGP01] of ratings of a hundred jokes by 24 983 users. We randomly select 4 000 users and the corresponding continuous ratings in the range [−10, 10]. For each user, we extract two ratings at random as test data. We run the different matrix completion algorithms with a prescribed rank on the remaining training data, N = 100 times for each rank. Table 3 reports the average Normalized Mean Absolute Error (NMAE) on the test data along with a confidence interval computed as the standard deviation of the NMAEs obtained for the different runs divided by √N. All methods but ADMiRA minimize a similar cost function and consequently perform the same.

6 Conclusion

Our contribution is an efficient numerical method to solve large low-rank matrix completion problems. RTRMC competes with the state of the art and enjoys proven global and local convergence to local optima, with a quadratic convergence rate for RTRMC 2. Our methods are particularly efficient on rectangular matrices. To obtain such results, we exploited the geometry of the low-rank constraint and applied techniques from the field of optimization on manifolds. Matlab code for RTRMC 1 and 2 is available at: http://www.inma.ucl.ac.be/~absil/RTRMC/.

Table 3: NMAEs on the Jester dataset 1 (Scenario 3). All algorithms solve the problem in well under a minute for rank 7. All but ADMiRA reach similar results. As a reference, consider that a random guesser would obtain a score of 0.33. Goldberg et al.
[GRGP01] report a score of 0.187 but use a different protocol.

rank | RTRMC 2          | LMaFit           | OptSpace         | Bal. Fac.        | ADMiRA
1    | 0.1799 ± 2·10⁻⁴  | 0.1799 ± 2·10⁻⁴  | 0.1799 ± 2·10⁻⁴  | 0.1799 ± 2·10⁻⁴  | 0.1836 ± 2·10⁻⁴
3    | 0.1624 ± 2·10⁻⁴  | 0.1624 ± 2·10⁻⁴  | 0.1625 ± 2·10⁻⁴  | 0.1626 ± 2·10⁻⁴  | 0.1681 ± 2·10⁻⁴
5    | 0.1584 ± 2·10⁻⁴  | 0.1584 ± 2·10⁻⁴  | 0.1584 ± 2·10⁻⁴  | 0.1584 ± 2·10⁻⁴  | 0.1635 ± 2·10⁻⁴
7    | 0.1578 ± 2·10⁻⁴  | 0.1578 ± 2·10⁻⁴  | 0.1581 ± 2·10⁻⁴  | 0.1580 ± 2·10⁻⁴  | 0.1618 ± 2·10⁻⁴

Figure 1: Evolution of the RMSE for the six methods under Scenario 1 (m = n = 10 000, r = 10, |Ω|/(mn) = 0.5%, i.e., 99.5% of the entries are unknown). For RTRMC 2, we count the number of inner iterations, i.e., the number of parallelizable steps. ADMiRA stagnates and SVT diverges. All other methods eventually find the exact solution.

Figure 2: Evolution of the RMSE for the six methods under Scenario 2 (m = 1 000, n = 30 000, r = 5, |Ω|/(mn) = 2.6%). For rectangular matrices, RTRMC is especially efficient owing to the linear growth of the dimension of the search space in min(m, n), whereas for most methods the growth is linear in m + n.

Acknowledgments

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. NB is an FNRS research fellow (Aspirant).
The scientific responsibility rests with its authors.

References

[ABG07] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7(3):303–330, July 2007.
[AMS08] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
[BNR10] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 704–711. IEEE, 2010.
[Bro05] M. Brookes. The matrix reference manual. Imperial College London, 2005.
[CCS08] J.F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. Arxiv preprint arXiv:0810.3286, 2008.
[CR09] E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[DKM10] W. Dai, E. Kerman, and O. Milenkovic. A geometric approach to low-rank matrix completion. Arxiv preprint arXiv:1006.2086, 2010.
[DMK11] W. Dai, O. Milenkovic, and E. Kerman. Subspace evolution and transfer (SET) for low-rank matrix completion. Signal Processing, IEEE Transactions on, PP(99):1, 2011.
[GRGP01] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[KM10] R.H. Keshavan and A. Montanari. Regularization for matrix completion. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1503–1507. IEEE, 2010.
[KMO09] R.H. Keshavan, A. Montanari, and S. Oh. Low-rank matrix completion with noisy observations: a quantitative comparison. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pages 1216–1222. IEEE, 2009.
[KO09] R.H. Keshavan and S. Oh. OptSpace: A gradient descent algorithm on the Grassman manifold for matrix completion. Arxiv preprint arXiv:0910.5260 v2, 2009.
[Lar05] R.M. Larsen. PROPACK: Software for large and sparse SVD calculations. Available online, http://sun.stanford.edu/rmunk/PROPACK, 2005.
[LB10] K. Lee and Y. Bresler. ADMiRA: Atomic decomposition for minimum rank approximation. Information Theory, IEEE Transactions on, 56(9):4402–4416, 2010.
[MBS11] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: a Riemannian approach. In 28th International Conference on Machine Learning. ICML, 2011.
[TB97] L.N. Trefethen and D. Bau. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
[Van11] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. Technical report, ANCHP-MATHICSE, Mathematics Section, École Polytechnique Fédérale de Lausanne, 2011.
[WYZ10] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Technical report, Rice University, 2010. CAAM Technical Report TR10-07.
", "award": [], "sourceid": 304, "authors": [{"given_name": "Nicolas", "family_name": "Boumal", "institution": null}, {"given_name": "Pierre-antoine", "family_name": "Absil", "institution": null}]}