{"title": "Matrix Completion from Noisy Entries", "book": "Advances in Neural Information Processing Systems", "page_first": 952, "page_last": 960, "abstract": "Given a matrix M of low rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here OPTSPACE. We prove performance guarantees that are order-optimal in a number of circumstances.", "full_text": "Matrix Completion from Noisy Entries\n\nRaghunandan H. Keshavan∗, Andrea Montanari∗†, and Sewoong Oh∗\n\nAbstract\n\nGiven a matrix M of low rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here OPTSPACE. We prove performance guarantees that are order-optimal in a number of circumstances.\n\n1 Introduction\n\nSpectral techniques are an authentic workhorse in machine learning, statistics, numerical analysis, and signal processing. Given a matrix M, its largest singular values – and the associated singular vectors – ‘explain’ the most significant correlations in the underlying data source. A low-rank approximation of M can further be used for low-complexity implementations of a number of linear algebra algorithms [2].\n\nIn many practical circumstances we have access only to a sparse subset of the entries of an m × n matrix M. 
It has recently been discovered that, if the matrix M has rank r, and unless it is too ‘structured’, a small random subset of its entries allows it to be reconstructed exactly. This result was first proved by Candès and Recht [3] by analyzing a convex relaxation introduced by Fazel [4]. A tighter analysis of the same convex relaxation was carried out in [5]. A number of iterative schemes to solve the convex optimization problem appeared soon thereafter [6, 7, 8] (see also [9] for a generalization).\n\nIn an alternative line of work, the authors of [1] attacked the same problem using a combination of spectral techniques and manifold optimization; we will refer to their algorithm as OPTSPACE. OPTSPACE is intrinsically of low complexity, the most complex operation being the computation of r singular values and the corresponding singular vectors of a sparse m × n matrix. The performance guarantees proved in [1] are comparable with the information-theoretic lower bound: roughly nr max{r, log n} random entries are needed to reconstruct M exactly (here we assume m of order n). A related approach was also developed in [10], although without performance guarantees for matrix completion.\n\nThe above results crucially rely on the assumption that M is exactly a rank-r matrix. For many applications of interest, this assumption is unrealistic and it is therefore important to investigate their robustness. Can the above approaches be generalized when the underlying data is ‘well approximated’ by a rank-r matrix? This question was addressed in [11] within the convex relaxation approach of [3]. The present paper proves a similar robustness result for OPTSPACE. 
Remarkably, the guarantees we obtain are order-optimal in a variety of circumstances, and improve over the analogous results of [11].\n\n∗Department of Electrical Engineering, Stanford University\n†Department of Statistics, Stanford University\n\n1.1 Model definition\n\nLet M be an m × n matrix of rank r, that is\n\nM = UΣV^T, (1)\n\nwhere U has dimensions m × r, V has dimensions n × r, and Σ is a diagonal r × r matrix. We assume that each entry of M is perturbed, thus producing an ‘approximately’ low-rank matrix N, with\n\nN_ij = M_ij + Z_ij, (2)\n\nwhere the matrix Z will be assumed to be ‘small’ in an appropriate sense.\n\nOut of the m × n entries of N, a subset E ⊆ [m] × [n] is revealed. We let N^E be the m × n matrix that contains the revealed entries of N, and is filled with 0’s in the other positions:\n\nN^E_ij = N_ij if (i, j) ∈ E, and N^E_ij = 0 otherwise. (3)\n\nThe set E will be uniformly random given its size |E|.\n\n1.2 Algorithm\n\nFor the reader’s convenience, we recall the algorithm introduced in [1], which we will analyze here. The basic idea is to minimize the cost function F(X, Y), defined by\n\nF(X, Y) ≡ min_{S ∈ R^{r×r}} F(X, Y, S), (4)\n\nF(X, Y, S) ≡ (1/2) ∑_{(i,j) ∈ E} (N_ij − (XSY^T)_ij)^2. (5)\n\nHere X ∈ R^{m×r}, Y ∈ R^{n×r} are orthogonal matrices, normalized by X^T X = m·1, Y^T Y = n·1. Minimizing F(X, Y) is an a priori difficult task, since F is a non-convex function. 
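As a concrete side illustration (ours, not part of the paper), the cost in Eqs. (4)-(5) can be evaluated directly: for fixed (X, Y), the inner minimization over S is a linear least-squares problem in the r × r entries of S, since (XSY^T)_ij is linear in S. A minimal numpy sketch, with all function names our own:

```python
import numpy as np

def cost_F(X, Y, N, mask):
    # F(X, Y) = min_S (1/2) * sum over observed (i, j) of (N_ij - (X S Y^T)_ij)^2.
    # X: (m, r), Y: (n, r); mask is a boolean (m, n) array, True on the revealed set E.
    # (X S Y^T)_ij = x_i^T S y_j is linear in S, so the inner minimum is least squares.
    m, r = X.shape
    ii, jj = np.nonzero(mask)
    # One row per observed entry; columns index vec(S) in row-major order.
    A = np.einsum('ka,kb->kab', X[ii], Y[jj]).reshape(len(ii), r * r)
    b = N[ii, jj]
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ s
    return 0.5 * resid @ resid, s.reshape(r, r)
```

On a fully observed rank-r matrix this returns cost 0 and recovers the optimal S exactly; OPTSPACE of course only ever sums over the sparse revealed set E.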
The key insight is that the singular value decomposition (SVD) of N^E provides an excellent initial guess, and that the minimum can be found with high probability by standard gradient descent after this initialization. Two caveats must be added to this description: (1) in general the matrix N^E must be ‘trimmed’ to eliminate over-represented rows and columns; (2) for technical reasons, we consider a slightly modified cost function, denoted by F̃(X, Y).\n\nOPTSPACE( matrix N^E ):\n1: Trim N^E, and let Ñ^E be the output;\n2: Compute the rank-r projection of Ñ^E, T_r(Ñ^E) = X_0 S_0 Y_0^T;\n3: Minimize F̃(X, Y) through gradient descent, with initial condition (X_0, Y_0).\n\nWe may note here that the rank of the matrix M, if not known, can be reliably estimated from Ñ^E. The various steps of the above algorithm are defined as follows.\n\nTrimming. We say that a row is ‘over-represented’ if it contains more than 2|E|/m revealed entries (i.e. more than twice the average number of revealed entries). Analogously, a column is over-represented if it contains more than 2|E|/n revealed entries. The trimmed matrix Ñ^E is obtained from N^E by setting over-represented rows and columns to 0. The matrices M̃^E and Z̃^E are defined similarly; hence Ñ^E = M̃^E + Z̃^E. We refer to the journal version of this paper for further details.\n\nRank-r projection. Let\n\nÑ^E = ∑_{i=1}^{min(m,n)} σ_i x_i y_i^T (6)\n\nbe the singular value decomposition of Ñ^E, with singular values σ_1 ≥ σ_2 ≥ ... . We then define\n\nT_r(Ñ^E) = (mn/|E|) ∑_{i=1}^{r} σ_i x_i y_i^T. (7)\n\nApart from an overall normalization, T_r(Ñ^E) is the best rank-r approximation to Ñ^E in Frobenius norm.\n\nMinimization. 
The modified cost function F̃ is defined as\n\nF̃(X, Y) = F(X, Y) + ρ G(X, Y) (8)\n≡ F(X, Y) + ρ ∑_{i=1}^{m} G_1(||X^(i)||^2/(3µ_0 r)) + ρ ∑_{j=1}^{n} G_1(||Y^(j)||^2/(3µ_0 r)), (9)\n\nwhere X^(i) denotes the i-th row of X, and Y^(j) the j-th row of Y. See Section 1.3 below for the definition of µ_0. The function G_1 : R_+ → R is such that G_1(z) = 0 if z ≤ 1 and G_1(z) = e^{(z−1)^2} − 1 otherwise. Further, we can choose ρ = Θ(nε).\n\nLet us stress that the regularization term is mainly introduced for our proof technique to work (and a broad family of functions G_1 would work as well). In numerical experiments we did not find any performance loss in setting ρ = 0.\n\nOne important feature of OPTSPACE is that F(X, Y) and F̃(X, Y) are regarded as functions of the r-dimensional subspaces of R^m and R^n generated (respectively) by the columns of X and Y. This interpretation is justified by the fact that F(X, Y) = F(XA, YB) for any two orthogonal matrices A, B ∈ R^{r×r} (the same property holds for F̃). The set of r-dimensional subspaces of R^m is a differentiable Riemannian manifold G(m, r) (the Grassmann manifold). The gradient descent algorithm is applied to the function F̃ : M(m, n) ≡ G(m, r) × G(n, r) → R. For further details on optimization by gradient descent on matrix manifolds we refer to [12, 13].\n\n1.3 Main results\n\nOur first result shows that, in great generality, the rank-r projection of Ñ^E provides a reasonable approximation of M. Throughout this paper, without loss of generality, we assume α ≡ m/n ≥ 1.\n\nTheorem 1.1. 
Let N = M + Z, where M has rank r and |M_ij| ≤ M_max for all (i, j) ∈ [m] × [n], and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. Then there exist numerical constants C and C′ such that\n\n(1/√(mn)) ||M − T_r(Ñ^E)||_F ≤ C M_max (nr α^{3/2}/|E|)^{1/2} + C′ (n√(rα)/|E|) ||Z̃^E||_2, (10)\n\nwith probability larger than 1 − 1/n^3.\n\nProjection onto rank-r matrices through SVD is pretty standard (although trimming is crucial for achieving the above guarantee). The key point here is that a much better approximation is obtained by minimizing the cost F̃(X, Y) (step 3 in the pseudocode above), provided M satisfies an appropriate incoherence condition. Let M = UΣV^T be a low-rank matrix, and assume, without loss of generality, U^T U = m·1 and V^T V = n·1. We say that M is (µ_0, µ_1)-incoherent if the following conditions hold.\n\nA1. For all i ∈ [m], j ∈ [n] we have ∑_{k=1}^{r} U_ik^2 ≤ µ_0 r and ∑_{k=1}^{r} V_jk^2 ≤ µ_0 r.\nA2. There exists µ_1 such that |∑_{k=1}^{r} U_ik (Σ_k/Σ_1) V_jk| ≤ µ_1 r^{1/2}.\n\nTheorem 1.2. Let N = M + Z, where M is a (µ_0, µ_1)-incoherent matrix of rank r, and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. Further, let Σ_min = Σ_r ≤ ··· ≤ Σ_1 = Σ_max with Σ_max/Σ_min ≡ κ. Let M̂ be the output of OPTSPACE on input N^E. 
Then there exist numerical constants C and C′ such that if\n\n|E| ≥ C n √α κ^2 max{ µ_0 r √α log n ; µ_0^2 r^2 α κ^4 ; µ_1^2 r^2 α κ^4 }, (11)\n\nthen, with probability at least 1 − 1/n^3,\n\n(1/√(mn)) ||M̂ − M||_F ≤ C′ κ^2 (n√(αr)/|E|) ||Z^E||_2, (12)\n\nprovided that the right-hand side is smaller than Σ_min.\n\nApart from capturing the effect of additive noise, these two theorems improve over the work of [1] even in the noiseless case. Indeed they provide quantitative bounds in finite dimensions, while the results of [1] were only asymptotic.\n\n1.4 Noise models\n\nIn order to make sense of the above results, it is convenient to consider a couple of simple models for the noise matrix Z.\n\nIndependent entries model. We assume that Z’s entries are independent random variables, with zero mean E{Z_ij} = 0 and sub-Gaussian tails. The latter means that\n\nP{|Z_ij| ≥ x} ≤ 2 e^{−x^2/(2σ^2)}, (13)\n\nfor some bounded constant σ^2.\n\nWorst case model. In this model Z is arbitrary, but we have a uniform bound on the size of its entries: |Z_ij| ≤ Z_max.\n\nThe basic parameter entering our main results is the operator norm of Z̃^E, which is bounded as follows.\n\nTheorem 1.3. If Z is a random matrix drawn according to the independent entries model, then there is a constant C such that\n\n||Z̃^E||_2 ≤ C σ (√α |E| log |E| / n)^{1/2}, (14)\n\nwith probability at least 1 − 1/n^3. If Z is a matrix from the worst case model, then\n\n||Z̃^E||_2 ≤ (2|E|/(n√α)) Z_max, (15)\n\nfor any realization of E.\n\nNote that for |E| = Ω(n log n), no row or column is over-represented with high probability. It follows that in the regime of |E| for which the conditions of Theorem 1.2 are satisfied, we have Z^E = Z̃^E. 
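To make the trimming step and the rank-r projection concrete, here is a short numpy sketch (our own code, with hypothetical names; Eq. (7) fixes the mn/|E| normalization):

```python
import numpy as np

def trim(NE, mask):
    # Zero out over-represented rows/columns (Section 1.2): a row holding more
    # than 2|E|/m revealed entries, or a column holding more than 2|E|/n.
    m, n = mask.shape
    E = mask.sum()
    out, kept = NE.copy(), mask.copy()
    heavy_rows = mask.sum(axis=1) > 2 * E / m
    heavy_cols = mask.sum(axis=0) > 2 * E / n
    out[heavy_rows, :] = 0.0
    out[:, heavy_cols] = 0.0
    kept[heavy_rows, :] = False
    kept[:, heavy_cols] = False
    return out, kept

def rank_r_projection(NE_trimmed, r, num_obs):
    # T_r of Eq. (7): best rank-r approximation, rescaled by mn/|E|.
    m, n = NE_trimmed.shape
    U, s, Vt = np.linalg.svd(NE_trimmed, full_matrices=False)
    return (m * n / num_obs) * (U[:, :r] * s[:r]) @ Vt[:r]
```

Trimming only matters in the sparse regime: once |E| = Ω(n log n), no row or column is over-represented with high probability and the trimmed matrix coincides with N^E itself.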
Then, among other things, this result implies that for the independent entries model the right-hand side of our error estimate, Eq. (12), is with high probability smaller than Σ_min if |E| ≥ C r α^{3/2} n log n κ^4 (σ/Σ_min)^2. For the worst case model, the same statement is true if Z_max ≤ Σ_min/(C√r κ^2).\n\nDue to space constraints, the proof of Theorem 1.3 will be given in the journal version of this paper.\n\n1.5 Comparison with related work\n\nLet us begin by mentioning that a statement analogous to our preliminary Theorem 1.1 was proved in [14]. Our result however applies to any number of revealed entries, while the one of [14] requires |E| ≥ (8 log n)^4 n (which for n ≤ 5 · 10^8 is larger than n^2).\n\nAs for Theorem 1.2, we will mainly compare our algorithm with the convex relaxation approach recently analyzed in [11]. Our basic setting is indeed the same, while the algorithms are rather different.\n\nFigure 1: Root mean square error achieved by OPTSPACE for reconstructing a random rank-2 matrix, as a function of the number of observed entries |E|, and of the number of line minimizations. The performance of nuclear norm minimization and an information-theoretic lower bound are also shown. (The plot shows RMSE versus |E|/n for the convex relaxation, the lower bound, the rank-r projection, and OPTSPACE after 1, 2, 3, and 10 iterations.)\n\nFigure 1 compares the average root mean square error for the two algorithms as a function of |E|. Here M is a random rank r = 2 matrix of dimension m = n = 600, generated by letting M = Ũ Ṽ^T with Ũ_ij, Ṽ_ij i.i.d. N(0, 20/√n). The noise is distributed according to the independent entries model with Z_ij ∼ N(0, 1). 
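The synthetic instance underlying Figure 1 can be generated along the following lines (a sketch under our reading that N(0, 20/√n) specifies the variance of the factor entries; all names are ours):

```python
import numpy as np

def make_instance(n=600, r=2, num_obs=6000, seed=0):
    # M = U~ V~^T with i.i.d. N(0, 20/sqrt(n)) factor entries (variance 20/sqrt(n)
    # is our assumption), additive noise Z_ij ~ N(0, 1), and a uniformly random
    # revealed set E of the requested size.
    rng = np.random.default_rng(seed)
    std = np.sqrt(20.0 / np.sqrt(n))
    U = rng.normal(0.0, std, size=(n, r))
    V = rng.normal(0.0, std, size=(n, r))
    M = U @ V.T
    N = M + rng.normal(size=(n, n))
    mask = np.zeros(n * n, dtype=bool)
    mask[rng.choice(n * n, size=num_obs, replace=False)] = True
    mask = mask.reshape(n, n)
    return M, np.where(mask, N, 0.0), mask
```

The reported RMSE for an estimate M̂ is then ||M̂ − M||_F/n.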
This example is taken from Figure 2 of [11], from which we took the data for the convex relaxation approach, as well as the information-theoretic lower bound. After one iteration, OPTSPACE has a smaller root mean square error than [11], and in about 10 iterations it becomes indistinguishable from the information-theoretic lower bound.\n\nNext let us compare our main result with the performance guarantee in [11], Theorem 7. Let us stress that we require some bound on the condition number κ, while the analysis of [11, 5] requires a stronger incoherence assumption. As far as the error bound is concerned, [11] proved\n\n(1/√(mn)) ||M̂ − M||_F ≤ 7 √(n/|E|) ||Z^E||_F + (2/(n√α)) ||Z^E||_F. (16)\n\n(The constant in front of the first term is in fact slightly smaller than 7 in [11], but in any case larger than 4√2.)\n\nTheorem 1.2 improves over this result in several respects: (1) we do not have the second term on the right-hand side of (16), which actually increases with the number of observed entries; (2) our error decreases as n/|E| rather than (n/|E|)^{1/2}; (3) the noise enters Theorem 1.2 through the operator norm ||Z^E||_2 instead of its Frobenius norm ||Z^E||_F ≥ ||Z^E||_2. For E uniformly random, one expects ||Z^E||_F to be roughly of order ||Z^E||_2 √n. For instance, within the independent entries model with bounded variance σ, ||Z^E||_F = Θ(√|E|) while ||Z^E||_2 is of order √(|E|/n) (up to logarithmic terms).\n\n2 Some notations\n\nThe matrix M to be reconstructed takes the form (1) where U ∈ R^{m×r}, V ∈ R^{n×r}. We write U = [u_1, u_2, . . . , u_r] and V = [v_1, v_2, . . . , v_r] for the columns of the two factors, with ||u_i|| = √m, ||v_i|| = √n, and u_i^T u_j = 0, v_i^T v_j = 0 for i ≠ j (there is no loss of generality in this, since normalizations can be absorbed by redefining Σ).\n\nWe shall write Σ = diag(Σ_1, . . . 
, Σ_r) with Σ_1 ≥ Σ_2 ≥ ··· ≥ Σ_r > 0. The maximum and minimum singular values will also be denoted by Σ_max = Σ_1 and Σ_min = Σ_r. Further, the maximum size of an entry of M is M_max ≡ max_ij |M_ij|.\n\nProbability is taken with respect to the uniformly random subset E ⊆ [m] × [n] given |E| and (eventually) the noise matrix Z. Define ε ≡ |E|/√(mn). In the case when m = n, ε corresponds to the average number of revealed entries per row or column. It is then convenient to work with a model in which each entry is revealed independently with probability ε/√(mn). Since, with high probability, |E| ∈ [ε√α n − A√n log n, ε√α n + A√n log n], any guarantee on the algorithm performance that holds within one model holds within the other model as well, if we allow for a vanishing shift in ε. We will use C, C′, etc. to denote universal numerical constants.\n\nGiven a vector x ∈ R^n, ||x|| will denote its Euclidean norm. For a matrix X ∈ R^{n×n′}, ||X||_F is its Frobenius norm, and ||X||_2 its operator norm (i.e. ||X||_2 = sup_{u≠0} ||Xu||/||u||). The standard scalar product between vectors or matrices will sometimes be indicated by ⟨x, y⟩ or ⟨X, Y⟩, respectively. Finally, we use the standard combinatorics notation [N] = {1, 2, . . . , N} to denote the set of the first N integers.\n\n3 Proof of Theorem 1.1\n\nAs explained in the introduction, the crucial idea is to consider the singular value decomposition of the trimmed matrix Ñ^E instead of the original matrix N^E. Apart from a trivial rescaling, its singular values are close to those of the original matrix M.\n\nLemma 3.1. 
There exists a numerical constant C such that, with probability greater than 1 − 1/n^3,\n\n|σ_q/ε − Σ_q| ≤ C M_max √(α/ε) + (1/ε) ||Z̃^E||_2, (17)\n\nwhere it is understood that Σ_q = 0 for q > r.\n\nProof. For any matrix A, let σ_q(A) denote the q-th singular value of A. Then σ_q(A + B) ≤ σ_q(A) + σ_1(B), whence\n\n|σ_q/ε − Σ_q| ≤ |σ_q(M̃^E)/ε − Σ_q| + σ_1(Z̃^E)/ε ≤ C M_max √(α/ε) + (1/ε) ||Z̃^E||_2. (18)\n\nWe will now prove Theorem 1.1.\n\nProof. (Theorem 1.1) For any matrix A of rank at most 2r, ||A||_F ≤ √(2r) ||A||_2, whence\n\n(1/√(mn)) ||M − T_r(Ñ^E)||_F ≤ (√(2r)/√(mn)) ||M − (√(mn)/ε)(Ñ^E − ∑_{i≥r+1} σ_i x_i y_i^T)||_2\n≤ (√(2r)/√(mn)) ( ||M − (√(mn)/ε) M̃^E||_2 + (√(mn)/ε) ||Z̃^E||_2 + (√(mn)/ε) σ_{r+1} )\n≤ 2C M_max √(2αr/ε) + (2√(2r)/ε) ||Z̃^E||_2\n≤ C′ M_max (nr α^{3/2}/|E|)^{1/2} + 2√2 (n√(rα)/|E|) ||Z̃^E||_2,\n\nwhere the second inequality follows from the following lemma, as shown in [1]. This proves our claim.\n\nLemma 3.2 (Keshavan, Montanari, Oh, 2009 [1]). 
There exists a numerical constant C such that, with probability larger than 1 − 1/n^3,\n\n|σ_q(M̃^E)/ε − Σ_q| ≤ C M_max √(α/ε), and (1/√(mn)) ||M − (√(mn)/ε) M̃^E||_2 ≤ C M_max √(α/ε).\n\n4 Proof of Theorem 1.2\n\nRecall that the cost function is defined over the Riemannian manifold M(m, n) ≡ G(m, r) × G(n, r). The proof of Theorem 1.2 consists in controlling the behavior of F in a neighborhood of u = (U, V) (the point corresponding to the matrix M to be reconstructed). Throughout the proof we let K(µ) be the set of matrix couples (X, Y) ∈ R^{m×r} × R^{n×r} such that ||X^(i)||^2 ≤ µr and ||Y^(j)||^2 ≤ µr for all i, j.\n\n4.1 Preliminary remarks and definitions\n\nGiven x_1 = (X_1, Y_1) and x_2 = (X_2, Y_2) ∈ M(m, n), two points on this manifold, their distance is defined as d(x_1, x_2) = √(d(X_1, X_2)^2 + d(Y_1, Y_2)^2), where, letting (cos θ_1, . . . , cos θ_r) be the singular values of X_1^T X_2/m,\n\nd(X_1, X_2) = ||θ||_2. (19)\n\nGiven S achieving the minimum in Eq. (4), it is also convenient to introduce the notations\n\nd−(x, u) ≡ √( Σ_min^2 d(x, u)^2 + ||S − Σ||_F^2 ), (20)\nd+(x, u) ≡ √( Σ_max^2 d(x, u)^2 + ||S − Σ||_F^2 ). (21)\n\n4.2 Auxiliary lemmas and proof of Theorem 1.2\n\nThe proof is based on the following two lemmas, which generalize and sharpen analogous bounds in [1] (for proofs we refer to the journal version of this paper).\n\nLemma 4.1. There exist numerical constants C_0, C_1, C_2 such that the following happens. 
Assume ε ≥ C_0 µ_0 r √α max{ log n ; µ_0 r √α (Σ_min/Σ_max)^4 } and δ ≤ Σ_min/(C_0 Σ_max). Then\n\nF(x) − F(u) ≥ C_1 n ε √α d−(x, u)^2 − C_1 n √(rα) ||Z^E||_2 d+(x, u), (22)\nF(x) − F(u) ≤ C_2 n ε √α Σ_max^2 d(x, u)^2 + C_2 n √(rα) ||Z^E||_2 d+(x, u), (23)\n\nfor all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. Here S ∈ R^{r×r} is the matrix realizing the minimum in Eq. (4).\n\nCorollary 4.2. There exists a constant C such that, under the hypotheses of Lemma 4.1,\n\n||S − Σ||_F ≤ C Σ_max d(x, u) + C (√r/ε) ||Z^E||_2. (24)\n\nFurther, for an appropriate choice of the constants in Lemma 4.1, we have\n\nσ_max(S) ≤ 2 Σ_max + C (√r/ε) ||Z^E||_2, (25)\nσ_min(S) ≥ (1/2) Σ_min − C (√r/ε) ||Z^E||_2. (26)\n\nLemma 4.3. There exist numerical constants C_0, C_1, C_2 such that the following happens. Assume ε ≥ C_0 µ_0 r √α (Σ_max/Σ_min)^2 max{ log n ; µ_0 r √α (Σ_max/Σ_min)^4 } and δ ≤ Σ_min/(C_0 Σ_max). Then\n\n||grad F̃(x)||^2 ≥ C_1 n ε^2 Σ_min^4 [ d(x, u) − C_2 (√r Σ_max/(ε Σ_min)) ||Z^E||_2/Σ_min ]_+^2, (27)\n\nfor all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. (Here [a]_+ ≡ max(a, 0).)\n\nWe can now turn to the proof of our main theorem.\n\nProof. (Theorem 1.2) Let δ = Σ_min/(C_0 Σ_max) with C_0 large enough so that the hypotheses of Lemmas 4.1 and 4.3 are verified. Call {x_k}_{k≥0} the sequence of pairs (X_k, Y_k) ∈ M(m, n) generated by gradient descent. 
By assumption, the following is true with a large enough constant C:\n\n||Z^E||_2 ≤ (ε/(C√r)) (Σ_min/Σ_max)^2 Σ_min. (28)\n\nFurther, by using Corollary 4.2 in Eqs. (22) and (23) we get\n\nF(x) − F(u) ≥ C_1 n ε √α Σ_min^2 { d(x, u)^2 − δ_{0,−}^2 }, (29)\nF(x) − F(u) ≤ C_2 n ε √α Σ_max^2 { d(x, u)^2 + δ_{0,+}^2 }, (30)\n\nwhere\n\nδ_{0,−} ≡ C (√r Σ_max/(ε Σ_min)) ||Z^E||_2/Σ_min, δ_{0,+} ≡ C (√r Σ_max/(ε Σ_min)) ||Z^E||_2/Σ_max. (31)\n\nBy Eq. (28), we can assume δ_{0,+} ≤ δ_{0,−} ≤ δ/10.\n\nFor ε ≥ C α µ_1^2 r^2 (Σ_max/Σ_min)^4 as per our assumptions, using Eq. (28) in Theorem 1.1, together with the bound d(u, x_0) ≤ ||M − X_0 S_0 Y_0^T||_F/(n√α Σ_min), we get\n\nd(u, x_0) ≤ δ/10. (32)\n\nWe make the following claims:\n\n1. x_k ∈ K(4µ_0) for all k. Indeed, without loss of generality we can assume x_0 ∈ K(3µ_0) (because otherwise we can rescale those rows of X_0, Y_0 that violate the constraint). Therefore F̃(x_0) = F(x_0) ≤ 4 C_2 n ε √α Σ_max^2 δ^2/100. On the other hand, F̃(x) ≥ ρ(e^{1/9} − 1) for x ∉ K(4µ_0). Since F̃(x_k) is a non-increasing sequence, the claim follows provided we take ρ ≥ C_2 n ε √α Σ_min^2.\n\n2. d(x_k, u) ≤ δ/10 for all k. Assuming ε ≥ C α µ_1^2 r^2 (Σ_max/Σ_min)^6, we have d(x_0, u)^2 ≤ (Σ_min^2/C′ Σ_max^2)(δ/10)^2. Also, assuming Eq. (28) with a large enough C, we can show the following: for all x_k such that d(x_k, u) ∈ [δ/10, δ], we have F̃(x) ≥ F(x) ≥ F(x_0). 
This contradicts the monotonicity of F̃(x), and thus proves the claim.\n\nSince the cost function is twice differentiable, and because of the above, the sequence {x_k} converges to\n\nΩ = { x ∈ K(4µ_0) ∩ M(m, n) : d(x, u) ≤ δ, grad F̃(x) = 0 }. (33)\n\nBy Lemma 4.3, for any x ∈ Ω,\n\nd(x, u) ≤ C (√r Σ_max/(ε Σ_min)) ||Z^E||_2/Σ_min, (34)\n\nwhich implies the thesis using Corollary 4.2.\n\nAcknowledgements\n\nThis work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.\n\nReferences\n\n[1] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. arXiv:0901.3150, January 2009.\n[2] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025–1041, 2004.\n[3] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. arXiv:0805.4471, 2008.\n[4] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.\n[5] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. arXiv:0903.1476, 2009.\n[6] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. arXiv:0810.3286, 2008.\n[7] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. arXiv:0905.1643, 2009.\n[8] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. http://www.math.nus.edu.sg/∼matys, 2009.\n[9] J. Wright, A. Ganesh, S. Rao, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices. arXiv:0905.0233, 2009.\n[10] K. Lee and Y. Bresler. ADMiRA: Atomic decomposition for minimum rank approximation. arXiv:0905.0044, 2009.\n[11] E. J. 
Candès and Y. Plan. Matrix completion with noise. arXiv:0903.3131, 2009.\n[12] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matr. Anal. Appl., 20:303–353, 1999.\n[13] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.\n[14] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. J. ACM, 54(2):9, 2007.\n", "award": [], "sourceid": 358, "authors": [{"given_name": "Raghunandan", "family_name": "Keshavan", "institution": null}, {"given_name": "Andrea", "family_name": "Montanari", "institution": null}, {"given_name": "Sewoong", "family_name": "Oh", "institution": null}]}