{"title": "A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements", "book": "Advances in Neural Information Processing Systems", "page_first": 109, "page_last": 117, "abstract": "We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex objective for the rank minimization problem and a closely related family of semidefinite programs.  With $O(r^3 \\kappa^2 n \\log n)$ random measurements of a positive semidefinite $n\\times n$ matrix of rank $r$ and condition number $\\kappa$, our method is guaranteed to converge linearly to the global optimum.", "full_text": "A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming\nfrom Random Linear Measurements\n\nQinqing Zheng\nUniversity of Chicago\nqinqing@cs.uchicago.edu\n\nJohn Lafferty\nUniversity of Chicago\nlafferty@galton.uchicago.edu\n\nAbstract\n\nWe propose a simple, scalable, and fast gradient descent algorithm to optimize\na nonconvex objective for the rank minimization problem and a closely related\nfamily of semidefinite programs. With $O(r^3 \\kappa^2 n \\log n)$ random measurements of\na positive semidefinite $n \\times n$ matrix of rank $r$ and condition number $\\kappa$, our method\nis guaranteed to converge linearly to the global optimum.\n\n1 Introduction\n\nSemidefinite programming has become a key optimization tool in many areas of applied mathematics,\nsignal processing and machine learning. SDPs often arise naturally from the problem structure,\nor are derived as surrogate optimizations that are relaxations of difficult combinatorial problems\n[7, 1, 8]. In spite of the importance of SDPs in principle (promising efficient algorithms with\npolynomial runtime guarantees), it is widely recognized that current optimization algorithms based on\ninterior point methods can handle only relatively small problems.\n
Thus, a considerable gap exists\nbetween the theory and applicability of SDP formulations. Scalable algorithms for semidefinite programming,\nand closely related families of nonconvex programs more generally, are greatly needed.\nA parallel development is the surprising effectiveness of simple classical procedures such as gradient\ndescent for large scale problems, as explored in the recent machine learning literature. In many areas\nof machine learning and signal processing such as classification, deep learning, and phase retrieval,\ngradient descent methods, in particular first order stochastic optimization, have led to remarkably\nefficient algorithms that can attack very large scale problems [3, 2, 10, 6]. In this paper we build on\nthis work to develop first-order algorithms for solving the rank minimization problem under random\nmeasurements and a closely related family of semidefinite programs. Our algorithms are efficient\nand scalable, and we prove that they attain linear convergence to the global optimum under natural\nassumptions.\nThe affine rank minimization problem is to find a matrix $X^\\star \\in \\mathbb{R}^{n\\times p}$ of minimum rank satisfying\nconstraints $\\mathcal{A}(X^\\star) = b$, where $\\mathcal{A} : \\mathbb{R}^{n\\times p} \\to \\mathbb{R}^m$ is an affine transformation. The underdetermined\ncase where $m \\ll np$ is of particular interest, and can be formulated as the optimization\n\n$$\\min_{X \\in \\mathbb{R}^{n\\times p}} \\ \\mathrm{rank}(X) \\quad \\text{subject to} \\quad \\mathcal{A}(X) = b. \\tag{1}$$\n\nThis problem is a direct generalization of compressed sensing, and subsumes many machine learning\nproblems such as image compression, low rank matrix completion and low-dimensional metric\nembedding [18, 12]. While the problem is natural and has many applications, the optimization is\nnonconvex and challenging to solve.\n
Without conditions on the transformation $\\mathcal{A}$ or the minimum\nrank solution $X^\\star$, it is generally NP-hard [15].\n\nExisting methods, such as nuclear norm relaxation [18], singular value projection (SVP) [11], and\nalternating least squares (AltMinSense) [12], assume that a certain restricted isometry property\n(RIP) holds for $\\mathcal{A}$. In the random measurement setting, this essentially means that at least\n$O(r(n + p) \\log(n + p))$ measurements are available, where $r = \\mathrm{rank}(X^\\star)$ [18]. In this work, we assume that\n(i) $X^\\star$ is positive semidefinite and (ii) $\\mathcal{A} : \\mathbb{R}^{n\\times n} \\to \\mathbb{R}^m$ is defined as $\\mathcal{A}(X)_i = \\mathrm{tr}(A_i X)$, where\neach $A_i$ is a random $n \\times n$ symmetric matrix from the Gaussian Orthogonal Ensemble (GOE), with\n$(A_i)_{jj} \\sim N(0, 2)$ and $(A_i)_{jk} \\sim N(0, 1)$ for $j \\neq k$. Our goal is thus to solve the optimization\n\n$$\\min_{X \\succeq 0} \\ \\mathrm{rank}(X) \\quad \\text{subject to} \\quad \\mathrm{tr}(A_i X) = b_i, \\quad i = 1, \\ldots, m. \\tag{2}$$\n\nIn addition to the wide applicability of affine rank minimization, the problem is also closely connected\nto a class of semidefinite programs. In Section 2, we show that the minimizer of a particular\nclass of SDP can be obtained by a linear transformation of $X^\\star$. Thus, efficient algorithms for\nproblem (2) can be applied in this setting as well.\nNoting that a rank-$r$ solution $X^\\star$ to (2) can be decomposed as $X^\\star = Z^\\star Z^{\\star\\top}$ where $Z^\\star \\in \\mathbb{R}^{n\\times r}$,\nour approach is based on minimizing the squared residual\n\n$$f(Z) = \\frac{1}{4m} \\left\\| \\mathcal{A}(ZZ^\\top) - b \\right\\|^2 = \\frac{1}{4m} \\sum_{i=1}^{m} \\left( \\mathrm{tr}(Z^\\top A_i Z) - b_i \\right)^2.$$\n\nWhile this is a nonconvex function, we take motivation from recent work for phase retrieval by\nCand\u00e8s et al. [6], and develop a gradient descent algorithm for optimizing $f(Z)$, using a carefully\nconstructed initialization and step size.\n
Our main contributions concerning this algorithm are as\nfollows.\n\n\u2022 We prove that with $O(r^3 n \\log n)$ constraints our gradient descent scheme can exactly recover\n$X^\\star$ with high probability. Empirical experiments show that this bound may potentially\nbe improved to $O(rn \\log n)$.\n\u2022 We show that our method converges linearly, and has lower computational cost compared\nwith previous methods.\n\u2022 We carry out a detailed comparison of rank minimization algorithms, and demonstrate that\nwhen the measurement matrices $A_i$ are sparse, our gradient method significantly outperforms\nalternative approaches.\n\nIn Section 3 we briefly review related work. In Section 4 we discuss the gradient scheme in detail.\nOur main analytical results are presented in Section 5, with detailed proofs contained in the supplementary\nmaterial. Our experimental results are presented in Section 6, and we conclude with a brief\ndiscussion of future work in Section 7.\n\n2 Semidefinite Programming and Rank Minimization\n\nBefore reviewing related work and presenting our algorithm, we pause to explain the connection\nbetween semidefinite programming and rank minimization. This connection enables our scalable\ngradient descent algorithm to be applied and analyzed for certain classes of SDPs.\nConsider a standard form semidefinite program\n\n$$\\min_{\\widetilde{X} \\succeq 0} \\ \\mathrm{tr}(\\widetilde{C}\\widetilde{X}) \\quad \\text{subject to} \\quad \\mathrm{tr}(\\widetilde{A}_i \\widetilde{X}) = b_i, \\quad i = 1, \\ldots, m \\tag{3}$$\n\nwhere $\\widetilde{C}, \\widetilde{A}_1, \\ldots, \\widetilde{A}_m \\in \\mathbb{S}^n$. If $\\widetilde{C}$ is positive definite, then we can write $\\widetilde{C} = LL^\\top$ where $L \\in \\mathbb{R}^{n\\times n}$\nis invertible. It follows that the minimum of problem (3) is the same as\n\n$$\\min_{X \\succeq 0} \\ \\mathrm{tr}(X) \\quad \\text{subject to} \\quad \\mathrm{tr}(A_i X) = b_i, \\quad i = 1, \\ldots, m \\tag{4}$$\n\nwhere $A_i = L^{-1} \\widetilde{A}_i L^{-1\\top}$. In particular, minimizers $\\widetilde{X}^*$ of (3) are obtained from minimizers $X^*$ of\n(4) via the transformation\n\n$$\\widetilde{X}^* = L^{-1\\top} X^* L^{-1}.$$\n\nSince $X$ is positive semidefinite, $\\mathrm{tr}(X)$ is equal to $\\|X\\|_*$. Hence, problem (4) is the nuclear norm\nrelaxation of problem (2). Next, we characterize the specific cases where $X^* = X^\\star$, so that the SDP\nand rank minimization solutions coincide. The following result is from Recht et al. [18].\nTheorem 1. Let $\\mathcal{A} : \\mathbb{R}^{n\\times n} \\to \\mathbb{R}^m$ be a linear map. For every integer $k$ with $1 \\leq k \\leq n$, define\nthe $k$-restricted isometry constant to be the smallest value $\\delta_k$ such that\n\n$$(1 - \\delta_k) \\|X\\|_F \\leq \\|\\mathcal{A}(X)\\| \\leq (1 + \\delta_k) \\|X\\|_F$$\n\nholds for any matrix $X$ of rank at most $k$. Suppose that there exists a rank $r$ matrix $X^\\star$ such that\n$\\mathcal{A}(X^\\star) = b$.\nIf $\\delta_{2r} < 1$, then $X^\\star$ is the only matrix of rank at most $r$ satisfying $\\mathcal{A}(X) = b$.\nFurthermore, if $\\delta_{5r} < 1/10$, then $X^\\star$ can be attained by minimizing $\\|X\\|_*$ over the affine subset.\nIn other words, since $\\delta_{2r} \\leq \\delta_{5r}$, if $\\delta_{5r} < 1/10$ holds for the transformation $\\mathcal{A}$ and one finds a matrix\n$X$ of rank $r$ satisfying the affine constraint, then $X$ must be positive semidefinite. Hence, one can\nignore the semidefinite constraint $X \\succeq 0$ when solving the rank minimization (2). The resulting\nproblem then can be exactly solved by nuclear norm relaxation.\n
Since the minimum rank solution\nis positive semidefinite, it then coincides with the solution of the SDP (4), which is a constrained\nnuclear norm optimization.\nThe observation that one can ignore the semidefinite constraint justifies our experimental comparison\nwith methods such as nuclear norm relaxation, SVP, and AltMinSense, described in the following\nsection.\n\n3 Related Work\n\nBurer and Monteiro [4] proposed a general approach for solving semidefinite programs using factored,\nnonconvex optimization, giving mostly experimental support for the convergence of the algorithms.\nThe first nontrivial guarantee for solving the affine rank minimization problem is given by\nRecht et al. [18], based on replacing the rank function by the convex surrogate nuclear norm, as\nalready mentioned in the previous section. While this is a convex problem, solving it in practice is\nnontrivial, and a variety of methods have been developed for efficient nuclear norm minimization.\nThe most popular algorithms are proximal methods that perform singular value thresholding [5] at\nevery iteration. While effective for small problem instances, the computational expense of the SVD\nprevents the method from being useful for large scale problems.\nRecently, Jain et al. [11] proposed a projected gradient descent algorithm SVP (Singular Value\nProjection) that solves\n\n$$\\min_{X \\in \\mathbb{R}^{n\\times p}} \\ \\|\\mathcal{A}(X) - b\\|^2 \\quad \\text{subject to} \\quad \\mathrm{rank}(X) \\leq r,$$\n\nwhere $\\|\\cdot\\|$ is the $\\ell_2$ vector norm and $r$ is the input rank. In the $(t+1)$th iteration, SVP updates $X^{t+1}$ as\nthe best rank $r$ approximation to the gradient update $X^t - \\mu \\mathcal{A}^\\top(\\mathcal{A}(X^t) - b)$, which is constructed\nfrom the SVD. If $\\mathrm{rank}(X^\\star) = r$, then SVP can recover $X^\\star$ under a similar RIP condition as the\nnuclear norm heuristic, and enjoys a linear numerical rate of convergence.\n
Yet SVP suffers from the\nexpensive per-iteration SVD for large problem instances.\nSubsequent work of Jain et al. [12] proposes an alternating least squares algorithm AltMinSense\nthat avoids the per-iteration SVD. AltMinSense factorizes $X$ into two factors $U \\in \\mathbb{R}^{n\\times r}$, $V \\in \\mathbb{R}^{p\\times r}$\nsuch that $X = UV^\\top$ and minimizes the squared residual $\\|\\mathcal{A}(UV^\\top) - b\\|^2$ by updating $U$ and\n$V$ alternately. Each update is a least squares problem. The authors show that the iterates obtained\nby AltMinSense converge to $X^\\star$ linearly under a RIP condition. However, because the least squares\nproblems are often ill-conditioned, it is difficult to observe AltMinSense converging to $X^\\star$ in\npractice.\nAs described above, considerable progress has been made on algorithms for rank minimization and\ncertain semidefinite programming problems. Yet truly efficient, scalable and provably convergent\nalgorithms have not yet been obtained. In the specific setting that $X^\\star$ is positive semidefinite, our\nalgorithm exploits this structure to achieve these goals. We note that recent and independent work of\nTu et al. [21] proposes a hybrid algorithm called Procrustes Flow (PF), which uses a few iterations\nof SVP as initialization, and then applies gradient descent.\n\n4 A Gradient Descent Algorithm for Rank Minimization\n\nOur method is described in Algorithm 1. It parallels the Wirtinger Flow (WF) algorithm for\nphase retrieval [6], which aims to recover a complex vector $x \\in \\mathbb{C}^n$ given the squared magnitudes of its linear\nmeasurements $b_i = |\\langle a_i, x \\rangle|^2$, $i \\in [m]$, where $a_1, \\ldots, a_m \\in \\mathbb{C}^n$. Cand\u00e8s et al.\n
[6] propose a\nfirst-order method to minimize the sum of squared residuals\n\n$$f_{\\mathrm{WF}}(z) = \\sum_{i=1}^{m} \\left( |\\langle a_i, z \\rangle|^2 - b_i \\right)^2. \\tag{5}$$\n\nThe authors establish the convergence of WF to the global optimum: given sufficient measurements,\nthe iterates of WF converge linearly to $x$ up to a global phase, with high probability.\nIf $z$ and the $a_i$ are real-valued, the function $f_{\\mathrm{WF}}(z)$ can be expressed as\n\n$$f_{\\mathrm{WF}}(z) = \\sum_{i=1}^{m} \\left( z^\\top a_i a_i^\\top z - x^\\top a_i a_i^\\top x \\right)^2,$$\n\nwhich is a special case of $f(Z)$ where $A_i = a_i a_i^\\top$ and each of $Z$ and $X^\\star$ are rank one. See Figure\n1a for an illustration; Figure 1b shows the convergence rate of our method. Our methods and results\nare thus generalizations of Wirtinger flow for phase retrieval.\nBefore turning to the presentation of our technical results in the following section, we present some\nintuition and remarks about how and why this algorithm works. For simplicity, let us assume that\nthe rank is specified correctly.\nInitialization is of course crucial in nonconvex optimization, as many local minima may be present.\nTo obtain a sufficiently accurate initialization, we use a spectral method, similar to those used in\n[17, 6]. The starting point is the observation that a linear combination of the constraint values and\nmatrices yields an unbiased estimate of the solution.\nLemma 1. Let $M = \\frac{1}{m} \\sum_{i=1}^{m} b_i A_i$. Then $\\frac{1}{2} \\mathbb{E}(M) = X^\\star$, where the expectation is with respect to\nthe randomness in the measurement matrices $A_i$.\nBased on this fact, let $X^\\star = U^\\star \\Sigma U^{\\star\\top}$ be the eigenvalue decomposition of $X^\\star$, where\n$U^\\star = [u^\\star_1, \\ldots, u^\\star_r]$ and $\\Sigma = \\mathrm{diag}(\\sigma_1, \\ldots, \\sigma_r)$ such that $\\sigma_1 \\geq \\ldots \\geq \\sigma_r$ are the nonzero eigenvalues of\n$X^\\star$. Let $Z^\\star = U^\\star \\Sigma^{1/2}$. Clearly, $u^\\star_s = z^\\star_s / \\|z^\\star_s\\|$ is the top $s$th eigenvector of $\\mathbb{E}(M)$ associated with\neigenvalue $2\\|z^\\star_s\\|^2$. Therefore, we initialize according to $z^0_s = \\sqrt{|\\lambda_s|/2}\\, v_s$ where $(v_s, \\lambda_s)$ is the top\n$s$th eigenpair of $M$. For sufficiently large $m$, it is reasonable to expect that $Z^0$ is close to $Z^\\star$; this is\nconfirmed by concentration of measure arguments.\nCertain key properties of $f(Z)$ will be seen to yield a linear rate of convergence. In the analysis\nof convex functions, Nesterov [16] shows that for unconstrained optimization, the gradient descent\nscheme with sufficiently small step size will converge linearly to the optimum if the objective function\nis strongly convex and has a Lipschitz continuous gradient. However, these two properties are\nglobal and do not hold for our objective function $f(Z)$. Nevertheless, we expect that similar conditions\nhold for the local area near $Z^\\star$. If so, then if we start close enough to $Z^\\star$, we can achieve the\nglobal optimum.\nIn our subsequent analysis, we establish the convergence of Algorithm 1 with a constant step size of\nthe form $\\mu / \\|Z^\\star\\|_F^2$, where $\\mu$ is a small constant. Since $\\|Z^\\star\\|_F$ is unknown, we replace it by $\\|Z^0\\|_F$.\n\n5 Convergence Analysis\n\nIn this section we present our main result analyzing the gradient descent algorithm, and give a\nsketch of the proof. To begin, note that the symmetric decomposition of $X^\\star$ is not unique, since\n\nFigure 1: (a) An instance of $f(Z)$ where $X^\\star \\in \\mathbb{R}^{2\\times 2}$ is rank-1 and $Z \\in \\mathbb{R}^2$. The underlying truth\nis $Z^\\star = [1, 1]^\\top$. Both $Z^\\star$ and $-Z^\\star$ are minimizers. (b) Linear convergence of the gradient scheme,\nfor $n = 200$, $m = 1000$ and $r = 2$.\n
The distance metric is given in Definition 1.\n\nAlgorithm 1: Gradient descent for rank minimization\ninput: $\\{A_i, b_i\\}_{i=1}^m$, $r$, $\\mu$\ninitialization\n  Set $(v_1, \\lambda_1), \\ldots, (v_r, \\lambda_r)$ to the top $r$ eigenpairs of $\\frac{1}{m} \\sum_{i=1}^m b_i A_i$ s.t. $|\\lambda_1| \\geq \\cdots \\geq |\\lambda_r|$\n  $Z^0 = [z^0_1, \\ldots, z^0_r]$ where $z^0_s = \\sqrt{|\\lambda_s|/2} \\cdot v_s$, $s \\in [r]$\n  $k \\leftarrow 0$\nrepeat\n  $\\nabla f(Z^k) = \\frac{1}{m} \\sum_{i=1}^m \\left( \\mathrm{tr}(Z^{k\\top} A_i Z^k) - b_i \\right) A_i Z^k$\n  $Z^{k+1} = Z^k - \\frac{\\mu}{\\sum_{s=1}^r |\\lambda_s|/2} \\nabla f(Z^k)$\n  $k \\leftarrow k + 1$\nuntil convergence;\noutput: $\\widehat{X} = Z^k Z^{k\\top}$\n\n$X^\\star = (Z^\\star U)(Z^\\star U)^\\top$ for any $r \\times r$ orthonormal matrix $U$. Thus, the solution set is\n\n$$S = \\left\\{ \\widetilde{Z} \\in \\mathbb{R}^{n\\times r} \\mid \\widetilde{Z} = Z^\\star U \\text{ for some } U \\text{ with } UU^\\top = U^\\top U = I \\right\\}.$$\n\nNote that $\\|\\widetilde{Z}\\|_F^2 = \\|X^\\star\\|_*$ for any $\\widetilde{Z} \\in S$. We define the distance to the optimal solution in terms of\nthis set.\nDefinition 1. Define the distance between $Z$ and $Z^\\star$ as\n\n$$d(Z, Z^\\star) = \\min_{UU^\\top = U^\\top U = I} \\|Z - Z^\\star U\\|_F = \\min_{\\widetilde{Z} \\in S} \\|Z - \\widetilde{Z}\\|_F.$$\n\nOur main result for exact recovery is stated below, assuming that the rank is correctly specified.\nSince the true rank is typically unknown in practice, one can start from a very low rank and gradually\nincrease it.\nTheorem 2. Let the condition number $\\kappa = \\sigma_1/\\sigma_r$ denote the ratio of the largest to the smallest\nnonzero eigenvalues of $X^\\star$.\n
There exists a universal constant $c_0$ such that if $m \\geq c_0 \\kappa^2 r^3 n \\log n$,\nwith high probability the initialization $Z^0$ satisfies\n\n$$d(Z^0, Z^\\star) \\leq \\sqrt{\\tfrac{3}{16} \\sigma_r}. \\tag{6}$$\n\nMoreover, there exists a universal constant $c_1$ such that when using constant step size $\\mu / \\|Z^\\star\\|_F^2$\nwith $\\mu \\leq \\frac{c_1}{\\kappa n}$ and initial value $Z^0$ obeying (6), the $k$th step of Algorithm 1 satisfies\n\n$$d(Z^k, Z^\\star) \\leq \\sqrt{\\tfrac{3}{16} \\sigma_r} \\left( 1 - \\frac{\\mu}{12 \\kappa r} \\right)^{k/2}$$\n\nwith high probability.\n\n[Figure 1 plots. (a): $f(Z)$ over $(Z_1, Z_2)$. (b): $\\mathrm{dist}(Z, Z^\\star)/\\|Z^\\star\\|_F$ versus iteration.]\n\nWe now outline the proof, giving full details in the supplementary material. The proof has four main\nsteps. The first step is to give a regularity condition under which the algorithm converges linearly if\nwe start close enough to $Z^\\star$. This provides a local regularity property that is similar to the Nesterov\n[16] criteria that the objective function is strongly convex and has a Lipschitz continuous gradient.\nDefinition 2. Let $\\bar{Z} = \\arg\\min_{\\widetilde{Z} \\in S} \\|Z - \\widetilde{Z}\\|_F$ denote the matrix closest to $Z$ in the solution set.\nWe say that $f$ satisfies the regularity condition $RC(\\varepsilon, \\alpha, \\beta)$ if there exist constants $\\alpha, \\beta$ such that\nfor any $Z$ satisfying $d(Z, Z^\\star) \\leq \\varepsilon$, we have\n\n$$\\langle \\nabla f(Z), Z - \\bar{Z} \\rangle \\geq \\frac{1}{\\alpha} \\sigma_r \\|Z - \\bar{Z}\\|_F^2 + \\frac{1}{\\beta \\|Z^\\star\\|_F^2} \\|\\nabla f(Z)\\|_F^2.$$\n\nUsing this regularity condition, we show that the iterative step of the algorithm moves closer to the\noptimum, if the current iterate is sufficiently close.\nTheorem 3. Consider the update $Z^{k+1} = Z^k - \\frac{\\mu}{\\|Z^\\star\\|_F^2} \\nabla f(Z^k)$. If $f$ satisfies $RC(\\varepsilon, \\alpha, \\beta)$,\n$d(Z^k, Z^\\star) \\leq \\varepsilon$, and $0 < \\mu < \\min(\\alpha/2, 2/\\beta)$, then\n\n$$d(Z^{k+1}, Z^\\star) \\leq \\sqrt{1 - \\frac{2\\mu}{\\alpha \\kappa r}} \\; d(Z^k, Z^\\star).$$\n\nIn the next step of the proof, we condition on two events that will be shown to hold with high\nprobability using concentration results. Let $\\delta$ denote a small value to be specified later.\n\nA1. For any $u \\in \\mathbb{R}^n$ such that $\\|u\\| \\leq \\sqrt{\\sigma_1}$,\n\n$$\\left\\| \\frac{1}{m} \\sum_{i=1}^m (u^\\top A_i u) A_i - 2uu^\\top \\right\\| \\leq \\frac{\\delta}{r}.$$\n\nA2. For any $\\widetilde{Z} \\in S$,\n\n$$\\left\\| \\frac{\\partial^2 f(\\widetilde{Z})}{\\partial \\widetilde{z}_s \\partial \\widetilde{z}_k^\\top} - \\mathbb{E}\\left[ \\frac{\\partial^2 f(\\widetilde{Z})}{\\partial \\widetilde{z}_s \\partial \\widetilde{z}_k^\\top} \\right] \\right\\| \\leq \\frac{\\delta}{r}, \\quad \\text{for all } s, k \\in [r].$$\n\nHere the expectations are with respect to the random measurement matrices. Under these assumptions,\nwe can show that the objective satisfies the regularity condition with high probability.\nTheorem 4. Suppose that A1 and A2 hold. If $\\delta \\leq \\frac{1}{16} \\sigma_r$, then $f$ satisfies the regularity condition\n$RC(\\sqrt{\\tfrac{3}{16} \\sigma_r}, 24, 513 \\kappa n)$ with probability at least $1 - mCe^{-\\rho n}$, where $C, \\rho$ are universal constants.\nNext we show that under A1, a good initialization can be found.\nTheorem 5. Suppose that A1 holds. Let $\\{v_s, \\lambda_s\\}_{s=1}^r$ be the top $r$ eigenpairs of $M = \\frac{1}{m} \\sum_{i=1}^m b_i A_i$\nsuch that $|\\lambda_1| \\geq \\cdots \\geq |\\lambda_r|$. Let $Z^0 = [z_1, \\ldots, z_r]$ where $z_s = \\sqrt{|\\lambda_s|/2} \\cdot v_s$, $s \\in [r]$. If $\\delta \\leq \\frac{\\sigma_r}{4\\sqrt{r}}$, then\n\n$$d(Z^0, Z^\\star) \\leq \\sqrt{3\\sigma_r/16}.$$\n\nFinally, we show that conditioning on A1 and A2 is valid since these events have high probability\nas long as $m$ is sufficiently large.\nTheorem 6. If the number of samples $m \\geq \\frac{42}{\\min(\\delta^2/r^2\\sigma_1^2, \\; \\delta/r\\sigma_1)} n \\log n$, then for any $u \\in \\mathbb{R}^n$\nsatisfying $\\|u\\| \\leq \\sqrt{\\sigma_1}$,\n\n$$\\left\\| \\frac{1}{m} \\sum_{i=1}^m (u^\\top A_i u) A_i - 2uu^\\top \\right\\| \\leq \\frac{\\delta}{r}$$\n\nholds with probability at least $1 - mCe^{-\\rho n} - \\frac{2}{n^2}$, where $C$ and $\\rho$ are universal constants.\nTheorem 7. For any $x \\in \\mathbb{R}^n$, if $m \\geq \\frac{128}{\\min(\\delta^2/4r^2\\sigma_1^2, \\; \\delta/2r\\sigma_1)} n \\log n$, then for any $\\widetilde{Z} \\in S$\n\n$$\\left\\| \\frac{\\partial^2 f(\\widetilde{Z})}{\\partial \\widetilde{z}_s \\partial \\widetilde{z}_k^\\top} - \\mathbb{E}\\left[ \\frac{\\partial^2 f(\\widetilde{Z})}{\\partial \\widetilde{z}_s \\partial \\widetilde{z}_k^\\top} \\right] \\right\\| \\leq \\frac{\\delta}{r}, \\quad \\text{for all } s, k \\in [r],$$\n\nwith probability at least $1 - 6me^{-n} - \\frac{4}{n^2}$.\n\nNote that since we need $\\delta \\leq \\min\\left( \\frac{1}{16}, \\frac{1}{4\\sqrt{r}} \\right) \\sigma_r$, we have $\\frac{\\delta}{r\\sigma_1} \\leq 1$, and the number of measurements\nrequired by our algorithm scales as $O(r^3 \\kappa^2 n \\log n)$, while only $O(r^2 \\kappa^2 n \\log n)$ samples are\nrequired by the regularity condition. We conjecture this bound could be further improved to be\n$O(rn \\log n)$; this is supported by the experimental results presented below.\nRecently, Tu et al. [21] establish a tighter $O(r^2 \\kappa^2 n)$ bound overall. Specifically, when only one SVP\nstep is used in preprocessing, the initialization of PF is also the spectral decomposition of $\\frac{1}{2} M$.\n
The\nauthors show that $O(r^2 \\kappa^2 n)$ measurements are sufficient for $Z^0$ to satisfy $d(Z^0, Z^\\star) \\leq O(\\sqrt{\\sigma_r})$\nwith high probability, and demonstrate an $O(rn)$ sample complexity for the regularity condition.\n\n6 Experiments\n\nIn this section we report the results of experiments on synthetic datasets. We compare our gradient\ndescent algorithm with nuclear norm relaxation, SVP and AltMinSense for which we drop the\npositive semidefiniteness constraint, as justified by the observation in Section 2. We use ADMM\nfor the nuclear norm minimization, based on the algorithm for the mixture approach in Tomioka\net al. [19]; see Appendix G. For simplicity, we assume that AltMinSense, SVP and the gradient\nscheme know the true rank. Krylov subspace techniques such as the Lanczos method could be used\nto compute the partial eigendecomposition; we use the randomized algorithm of Halko et al. [9] to\ncompute the low rank SVD. All methods are implemented in MATLAB and the experiments were\nrun on a MacBook Pro with a 2.5GHz Intel Core i7 processor and 16 GB memory.\n\n6.1 Computational Complexity\n\nIt is instructive to compare the per-iteration cost of the different approaches; see Table 1. Suppose\nthat the density (fraction of nonzero entries) of each $A_i$ is $\\rho$. For AltMinSense, the cost of solving\nthe least squares problem is $O(mn^2r^2 + n^3r^3 + mn^2r\\rho)$. The other three methods have $O(mn^2\\rho)$\ncost to compute the affine transformation. For the nuclear norm approach, the $O(n^3)$ cost is from\nthe SVD and the $O(m^2)$ cost is due to the update of the dual variables. The gradient scheme requires\n$2n^2r$ operations to compute $Z^k Z^{k\\top}$ and to multiply $Z^k$ by an $n \\times n$ matrix to obtain the gradient. SVP\nneeds $O(n^2r)$ operations to compute the top $r$ singular vectors.\n
However, in practice this partial\nSVD is more expensive than the $2n^2r$ cost required for the matrix multiplies in the gradient scheme.\n\nMethod                                  Complexity\nnuclear norm minimization via ADMM      $O(mn^2\\rho + m^2 + n^3)$\ngradient descent                        $O(mn^2\\rho) + 2n^2r$\nSVP                                     $O(mn^2\\rho + n^2r)$\nAltMinSense                             $O(mn^2r^2 + n^3r^3 + mn^2r\\rho)$\n\nTable 1: Per-iteration computational complexities of different methods.\n\nClearly, AltMinSense is the least efficient. For the other approaches, in the dense case ($\\rho$ large),\nthe affine transformation dominates the computation. Our method removes the overhead caused by\nthe SVD. In the sparse case ($\\rho$ small), the other parts dominate and our method enjoys a low cost.\n\n6.2 Runtime Comparison\n\nWe conduct experiments for both dense and sparse measurement matrices. AltMinSense is indeed\nslow, so we do not include it here.\nIn the first scenario, we randomly generate a $400 \\times 400$ rank-2 matrix $X^\\star = xx^\\top + yy^\\top$ where $x, y \\sim N(0, I)$.\nWe also generate $m = 6n$ matrices $A_1, \\ldots, A_m$ from the GOE, and then take $b = \\mathcal{A}(X^\\star)$.\nWe report the relative error measured in the Frobenius norm defined as $\\|\\widehat{X} - X^\\star\\|_F / \\|X^\\star\\|_F$. For\nthe nuclear norm approach, we set the regularization parameter to $\\lambda = 10^{-5}$. We test three values\n$\\eta = 10, 100, 200$ for the penalty parameter and select $\\eta = 100$ as it leads to the fastest convergence.\nSimilarly, for SVP we evaluate the three values $5 \\times 10^{-5}, 10^{-4}, 2 \\times 10^{-4}$ for the step size, and select\n$10^{-4}$ as the largest for which SVP converges. For our approach, we test the three values $0.6, 0.8, 1.0$\nfor $\\mu$ and select $0.8$ in the same way.\n\nFigure 2: (a) Runtime comparison where $X^\\star \\in \\mathbb{R}^{400 \\times 400}$ is rank-2 and the $A_i$ are dense.\n
(b) Runtime\ncomparison where $X^\\star \\in \\mathbb{R}^{600 \\times 600}$ is rank-2 and the $A_i$ are sparse. (c) Sample complexity comparison.\n\nIn the second scenario, we use a more general and practical setting. We randomly generate a rank-2\nmatrix $X^\\star \\in \\mathbb{R}^{600 \\times 600}$ as before. We generate $m = 7n$ sparse matrices $A_i$ whose entries are i.i.d. Bernoulli:\n\n$$(A_i)_{jk} = \\begin{cases} 1 & \\text{with probability } \\rho, \\\\ 0 & \\text{with probability } 1 - \\rho, \\end{cases}$$\n\nwhere we use $\\rho = 0.001$. For all the methods we use the same strategies as before to select parameters.\nFor the nuclear norm approach, we try three values $\\eta = 10, 100, 200$ and select $\\eta = 100$. For\nSVP, we test the three values $5 \\times 10^{-3}, 2 \\times 10^{-3}, 10^{-3}$ for the step size and select $10^{-3}$. For the\ngradient algorithm, we check the three values $0.8, 1, 1.5$ for $\\mu$ and choose $1$.\nThe results are shown in Figures 2a and 2b. In the dense case, our method is faster than the nuclear\nnorm approach and slightly outperforms SVP. In the sparse case, it is significantly faster than the\nother approaches.\n\n6.3 Sample Complexity\n\nWe also evaluate the number of measurements required by each method to exactly recover $X^\\star$,\nwhich we refer to as the sample complexity. We randomly generate the true matrix $X^\\star \\in \\mathbb{R}^{n\\times n}$ and\ncompute the solutions of each method given $m$ measurements, where the $A_i$ are randomly drawn\nfrom the GOE. A solution with relative error below $10^{-5}$ is considered to be successful. We run 40\ntrials and compute the empirical probability of successful recovery.\nWe consider cases where $n = 60$ or $100$ and $X^\\star$ is of rank one or two. The results are shown in\nFigure 2c. For SVP and our approach, the phase transitions happen around $m = 1.5n$ when $X^\\star$ is\nrank-1 and $m = 2.5n$ when $X^\\star$ is rank-2. This scaling is close to the number of degrees of freedom\nin each case; this confirms that the sample complexity scales linearly with the rank $r$.\n
The phase\ntransition for the nuclear norm approach occurs later. The results suggest that the sample complexity\nof our method should also scale as $O(rn \\log n)$ as for SVP and the nuclear norm approach [11, 18].\n\n7 Conclusion\nWe connect a special case of affine rank minimization to a class of semidefinite programs with\nrandom constraints. Building on a recently proposed first-order algorithm for phase retrieval [6],\nwe develop a gradient descent procedure for rank minimization and establish convergence to the\noptimal solution with $O(r^3 n \\log n)$ measurements. We conjecture that $O(rn \\log n)$ measurements\nare sufficient for the method to converge, and that the conditions on the sampling matrices $A_i$ can be\nsignificantly weakened. More broadly, the technique used in this paper (factoring the semidefinite\nmatrix variable, recasting the convex optimization as a nonconvex optimization, and applying first-order\nalgorithms), first proposed by Burer and Monteiro [4], may be effective for a much wider class\nof SDPs, and deserves further study.\n\nAcknowledgements\nResearch supported in part by NSF grant IIS-1116730 and ONR grant N00014-12-1-0762.\n\n[Figure 2 plots. (a), (b): relative error $\\|\\widehat{X} - X^\\star\\|_F / \\|X^\\star\\|_F$ versus time (seconds); legend: nuclear norm, SVP, gradient descent. (c): probability of successful recovery versus $m/n$ for gradient, SVP, and nuclear, with rank 1 or 2 and $n = 60$ or $100$.]\n\nReferences\n[1] Arash A. Amini and Martin J. Wainwright. High-dimensional analysis of semidefinite relaxations\nfor sparse principal components. The Annals of Statistics, 37(5):2877\u20132921, 2009.\n\n[2] Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for\nlogistic regression.\n
The Journal of Machine Learning Research, 15(1):595\u2013627, 2014.\n\n[3] Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms\nfor machine learning. In Advances in Neural Information Processing Systems (NIPS),\n2011.\n\n[4] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving\nsemidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329\u2013357,\n2003.\n\n[5] Jian-Feng Cai, Emmanuel J Cand\u00e8s, and Zuowei Shen. A singular value thresholding algorithm\nfor matrix completion. SIAM Journal on Optimization, 20(4):1956\u20131982, 2010.\n\n[6] Emmanuel Cand\u00e8s, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow:\nTheory and algorithms. arXiv preprint arXiv:1407.1065, 2014.\n\n[7] A. d\u2019Aspremont, L. El Ghaoui, M. I. Jordan, and G. Lanckriet. A direct formulation for\nsparse PCA using semidefinite programming. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.),\nAdvances in Neural Information Processing Systems (NIPS), 2004.\n\n[8] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum\ncut and satisfiability problems using semidefinite programming. Journal of the ACM,\n42(6):1115\u20131145, November 1995. ISSN 0004-5411.\n\n[9] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:\nProbabilistic algorithms for constructing approximate matrix decompositions. SIAM Review,\n53(2):217\u2013288, 2011.\n\n[10] Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference.\nThe Journal of Machine Learning Research, 14, 2013.\n\n[11] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular\nvalue projection. In Advances in Neural Information Processing Systems, pages 937\u2013945,\n2010.\n\n[12] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi.\n
Low-rank matrix completion using\nalternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory\nof computing, pages 665\u2013674. ACM, 2013.\n\n[13] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model\nselection. Annals of Statistics, pages 1302\u20131338, 2000.\n\n[14] Michel Ledoux and Brian Rider. Small deviations for beta ensembles. Electron. J. Probab.,\n15:no. 41, 1319\u20131343, 2010. ISSN 1083-6489. doi: 10.1214/EJP.v15-798. URL\nhttp://ejp.ejpecp.org/article/view/798.\n\n[15] Raghu Meka, Prateek Jain, Constantine Caramanis, and Inderjit S Dhillon. Rank minimization\nvia online learning. In Proceedings of the 25th International Conference on Machine learning,\npages 656\u2013663. ACM, 2008.\n\n[16] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science &\nBusiness Media, 2004.\n\n[17] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization.\nIn Advances in Neural Information Processing Systems, pages 2796\u20132804, 2013.\n\n[18] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of\nlinear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[19] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via\nconvex optimization. arXiv preprint arXiv:1010.0789, 2010.\n\n[20] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint\narXiv:1501.01571, 2015.\n\n[21] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of\nlinear matrix equations via Procrustes flow.\n
arXiv preprint arXiv:1507.03566, 2015.\n", "award": [], "sourceid": 72, "authors": [{"given_name": "Qinqing", "family_name": "Zheng", "institution": "University of Chicago"}, {"given_name": "John", "family_name": "Lafferty", "institution": "University of Chicago"}]}