{"title": "On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 710, "page_last": 718, "abstract": "Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately.  Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence.  In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear.  Our result is established without any strong convexity assumption on the loss function.  A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm-regularized problem, which may be of independent interest.", "full_text": "On the Linear Convergence of the Proximal Gradient\n\nMethod for Trace Norm Regularization\n\nKe Hou, Zirui Zhou, Anthony Man\u2013Cho So\n\nDepartment of Systems Engineering & Engineering Management\n\nThe Chinese University of Hong Kong\n\nShatin, N. T., Hong Kong\n\n{khou,zrzhou,manchoso}@se.cuhk.edu.hk\n\nDepartment of Electrical & Computer Engineering\n\nZhi\u2013Quan Luo\n\nUniversity of Minnesota\n\nMinneapolis, MN 55455, USA\n\nluozq@ece.umn.edu\n\nAbstract\n\nMotivated by various applications in machine learning, the problem of minimiz-\ning a convex smooth loss function with trace norm regularization has received\nmuch attention lately. Currently, a popular method for solving such problem is\nthe proximal gradient method (PGM), which is known to have a sublinear rate of\nconvergence. In this paper, we show that for a large class of loss functions, the\nconvergence rate of the PGM is in fact linear. Our result is established without any\nstrong convexity assumption on the loss function. 
A key ingredient in our proof\nis a new Lipschitzian error bound for the aforementioned trace norm\u2013regularized\nproblem, which may be of independent interest.\n\n1 Introduction\n\nThe problem of \ufb01nding a low\u2013rank matrix that (approximately) satis\ufb01es a given set of conditions\nhas recently generated a lot of interest in many communities. Indeed, such a problem arises in a\nwide variety of applications, including approximation algorithms [17], automatic control [5], matrix\nclassi\ufb01cation [20], matrix completion [6], multi\u2013label classi\ufb01cation [1], multi\u2013task learning [2],\nnetwork localization [7], subspace learning [24], and trace regression [9], just to name a few. Due to\nthe combinatorial nature of the rank function, the task of recovering a matrix with the desired rank\nand properties is generally intractable. To circumvent this, a popular approach is to use the trace\nnorm1 (also known as the nuclear norm) as a surrogate for the rank function. Such an approach is\nquite natural, as the trace norm is the tightest convex lower bound of the rank function over the set\nof matrices with spectral norm at most one [13]. In the context of machine learning, the trace norm\nis typically used as a regularizer in the minimization of certain convex loss function. This gives rise\nto convex optimization problems of the form\n\nmin\n\nX\u2208Rm\u00d7n\n\n{F (X) = f (X) + \u03c4 kXk\u2217} ,\n\n(1)\n\nwhere f : Rm\u00d7n \u2192 R is the convex loss function, kXk\u2217 denotes the trace norm of X, and \u03c4 > 0\nis a regularization parameter. By standard results in convex optimization [4], the above formulation\nis tractable (i.e., polynomial\u2013time solvable) for many choices of the loss function f . 
In practice,\n\n1Recall that the trace norm of a matrix is de\ufb01ned as the sum of its singular values.\n\n1\n\n\fhowever, one is often interested in settings where the decision variable X is of high dimension.\nThus, there has been much research effort in developing fast algorithms for solving (1) lately.\n\nCurrently, a popular method for solving (1) is the proximal gradient method (PGM), which exploits\nthe composite nature of the objective function F and certain smoothness properties of the loss func-\ntion f [8, 19, 11]. The attractiveness of PGM lies not only in its excellent numerical performance,\nbut also in its strong theoretical convergence rate guarantees. Indeed, for the trace norm\u2013regularized\nproblem (1) with f being convex and continuously differentiable and \u2207f being Lipschitz continu-\nous, the standard PGM will achieve an additive error of O(1/k) in the optimal value after k itera-\ntions. Moreover, this error can be reduced to O(1/k2) using acceleration techniques; see, e.g., [19].\nThe sublinear O(1/k2) convergence rate is known to be optimal if f is simply given by a \ufb01rst\u2013order\noracle [12]. On the other hand, if f is strongly convex, then the convergence rate can be improved to\nO(ck) for some c \u2208 (0, 1) (i.e., a linear convergence rate) [16]. However, in machine learning, the\nloss functions of interest are often highly structured and hence not just given by an oracle, but they\nare not necessarily strongly convex either. For instance, in matrix completion, a commonly used loss\n2/2, where A : Rm\u00d7n \u2192 Rp is a linear measurement\nfunction is the square loss f (\u00b7) = kA(\u00b7) \u2212 bk2\noperator and b \u2208 Rp is a given set of observations. Clearly, f is not strongly convex when A has a\nnon\u2013trivial nullspace (or equivalently, when A is not injective). 
In view of this, it is natural to ask\nwhether linear convergence of the PGM can be established for a larger class of loss functions.\n\nIn this paper, we take a \ufb01rst step towards answering this question. Speci\ufb01cally, we show that when\nthe loss function f takes the form f (X) = h(A(X)), where A : Rm\u00d7n \u2192 Rp is an arbitrary\nlinear operator and h : Rp \u2192 R is strictly convex with certain smoothness and curvature properties,\nthe PGM for solving (1) has an asymptotic linear rate of convergence. Note that f need not be\nstrictly convex even if h is, as A is arbitrary. Our result covers a wide range of loss functions used\nin the literature, such as square loss and logistic loss. Moreover, to the best of our knowledge, it\nis the \ufb01rst linear convergence result concerning the application of a \ufb01rst\u2013order method to the trace\nnorm\u2013regularized problem (1) that does not require the strong convexity of f .\n\nThe key to our convergence analysis is a new Lipschitzian error bound for problem (1). Roughly,\nit says that the distance between a point X \u2208 Rm\u00d7n and the optimal solution set of (1) is on the\norder of the residual norm kprox\u03c4 (X \u2212 \u2207f (X)) \u2212 XkF , where prox\u03c4 is the proximity operator\nassociated with the regularization term \u03c4 kXk\u2217. Once we have such a bound, a routine applica-\ntion of the powerful analysis framework developed by Luo and Tseng [10] will yield the desired\nlinear convergence result. Prior to this work, Lipschitzian error bounds for composite function min-\nimization are available for cases where the non\u2013smooth part either has a polyhedral epigraph (such\nas the \u21131\u2013norm) [23] or is the (sparse) group LASSO regularization [22, 25]. However, the ques-\ntion of whether a similar bound holds for trace norm regularization has remained open, despite its\napparent similarity to \u21131\u2013norm regularization. 
Indeed, unlike the \u21131\u2013norm, the trace norm has a non\u2013\npolyhedral epigraph; see, e.g., [18]. Moreover, the existing approach for establishing error bounds\nfor \u21131\u2013norm or (sparse) group LASSO regularization is based on splitting the decision variables into\ngroups, where variables from different groups do not interfere with one another, so that each group\ncan be analyzed separately. However, the trace norm of a matrix is determined by its singular values,\nand each of them depends on every single entry of the matrix. Thus, we cannot use the same split-\nting approach to analyze the entries of the matrix. To overcome the above dif\ufb01culties, we make the\ncrucial observation that if \u00afX is an optimal solution to (1), then both \u00afX and \u2212\u2207f ( \u00afX) have the same\nset of left and right singular vectors; see Proposition 4.2. As a result, we can use matrix perturbation\ntheory to get hold of the spectral structure of the points that are close to the optimal solution set. This\nin turn allows us to establish a Lipschitzian error bound for the trace norm\u2013regularized problem (1),\nthereby resolving the aforementioned open question in the af\ufb01rmative.\n\n2 Preliminaries\n\n2.1 Basic Setup\n\nWe consider the trace norm\u2013regularized optimization problem (1), in which the loss function f :\nRm\u00d7n \u2192 R takes the form\n\n(2)\nwhere A : Rm\u00d7n \u2192 Rp is a linear operator and h : Rp \u2192 R is a function satisfying the following\nassumptions:\n\nf (X) = h(A(X)),\n\n2\n\n\fAssumption 2.1\n\n(a) The effective domain of h, denoted by dom(h), is open and non\u2013empty.\n\n(b) The function h is continuously differentiable with Lipschitz\u2013continuous gradient on dom(h)\n\nand is strongly convex on any convex compact subset of dom(h).\n\nNote that Assumption 2.1(b) implies the strict convexity of h on dom(h) and the Lipschitz continuity\nof \u2207f . 
Now, let X denote the set of optimal solutions to problem (1). We make the following\nassumption concerning X:\n\nAssumption 2.2 The optimal solution set X is non\u2013empty.\n\nThe above assumptions can be justified in various applications. For instance, in matrix completion,\nthe square loss f(\u00b7) = \u2016A(\u00b7) \u2212 b\u2016_2^2/2 induced by the linear measurement operator A and the set\nof observations b \u2208 R^p is of the form (2), with h(\u00b7) = \u2016(\u00b7) \u2212 b\u2016_2^2/2. Moreover, it is clear that\nsuch an h satisfies Assumptions 2.1 and 2.2. In multi\u2013task learning, the loss function takes the form\nf(\u00b7) = \u2211_{t=1}^T \u2113(A_t(\u00b7), y_t), where T is the number of learning tasks, A_t : R^{m\u00d7n} \u2192 R^p is the linear\noperator defined by the input data for the t\u2013th task, y_t \u2208 R^p is the output data for the t\u2013th task, and\n\u2113 : R^p \u00d7 R^p \u2192 R measures the learning error. Note that f can be put into the form (2), where\nA : R^{m\u00d7n} \u2192 R^{Tp} is given by A(X) = (A_1(X), A_2(X), . . . , A_T(X)), and h : R^{Tp} \u2192 R is\ngiven by h(z) = \u2211_{t=1}^T \u2113(z_t, y_t) with z_t \u2208 R^p for t = 1, . . . , T and z = (z_1, . . . , z_T). Moreover,\nin the case where \u2113 is, say, the square loss (i.e., \u2113(z_t, y_t) = \u2016z_t \u2212 y_t\u2016_2^2/2) or the logistic loss (i.e.,\n\u2113(z_t, y_t) = \u2211_{i=1}^p log(1 + exp(\u2212z_{ti} y_{ti}))), it can be verified that Assumptions 2.1 and 2.2 hold.\n\n2.2 Some Facts about the Optimal Solution Set\n\nSince f(\u00b7) = h(A(\u00b7)) by (2) and h(\u00b7) is strictly convex on dom(h) by Assumption 2.1(b), it is easy\nto verify that the map X \u21a6 A(X) is invariant over the optimal solution set X. In other words, there\nexists a z\u0304 \u2208 dom(h) such that for any X^\u2217 \u2208 X, we have A(X^\u2217) = z\u0304. 
Thus, we can express X as\n\nX = {X \u2208 R^{m\u00d7n} : \u03c4\u2016X\u2016_\u2217 = v^\u2217 \u2212 h(z\u0304), A(X) = z\u0304},\n\nwhere v^\u2217 > \u2212\u221e is the optimal value of (1). In particular, X is a non\u2013empty convex compact set.\nThis implies that every X \u2208 R^{m\u00d7n} has a unique projection X\u0304 \u2208 X onto X, which is given by the\nsolution to the following optimization problem:\n\ndist(X, X) = min_{Y \u2208 X} \u2016X \u2212 Y\u2016_F.\n\nIn addition, since X is bounded and F is convex, it follows from [14, Corollary 8.7.1] that the level\nset {X \u2208 R^{m\u00d7n} : F(X) \u2264 \u03b6} is bounded for any \u03b6 \u2208 R.\n\n2.3 Proximal Gradient Method and the Residual Map\n\nTo motivate the PGM for solving (1), we recall an alternative characterization of the optimal solution\nset X. Consider the proximity operator prox_\u03c4 : R^{m\u00d7n} \u2192 R^{m\u00d7n}, which is defined as\n\nprox_\u03c4(X) = arg min_{Z \u2208 R^{m\u00d7n}} {\u03c4\u2016Z\u2016_\u2217 + (1/2)\u2016X \u2212 Z\u2016_F^2}.    (3)\n\nBy comparing the optimality conditions for (1) and (3), it is immediate that a solution X^\u2217 \u2208 R^{m\u00d7n}\nis optimal for (1) if and only if it satisfies the following fixed\u2013point equation:\n\nX^\u2217 = prox_\u03c4(X^\u2217 \u2212 \u2207f(X^\u2217)).    (4)\n\nThis naturally leads to the following PGM for solving (1):\n\nY^{k+1} = X^k \u2212 \u03b1_k \u2207f(X^k),\nX^{k+1} = prox_{\u03c4\u03b1_k}(Y^{k+1}),    (5)\n\nwhere \u03b1_k > 0 is the step size in the k\u2013th iteration, for k = 0, 1, . . .; see, e.g., [8, 19, 11]. As is\nwell\u2013known, the proximity operator defined above can be expressed in terms of the so\u2013called matrix\nshrinkage operator. To describe this result, we introduce some definitions. Let \u00b5 > 0 be given. The\nnon\u2013negative vector shrinkage operator s_\u00b5 : R^p_+ \u2192 R^p_+ is defined as (s_\u00b5(z))_i = max{0, z_i \u2212 \u00b5},\nwhere i = 1, . . . , p. 
The matrix shrinkage operator S\u00b5 : Rm\u00d7n \u2192 Rm\u00d7n is de\ufb01ned as S\u00b5(X) =\nU \u03a3\u00b5V T , where X = U \u03a3V T is the singular value decomposition of X with \u03a3 = Diag(\u03c3(X)) and\n\u03c3(X) being the vector of singular values of X, and \u03a3\u00b5 = Diag(s\u00b5(\u03c3(X))). Then, it can be shown\nthat\n\n+ \u2192 Rp\n\nprox\u03c4 (X) = S\u03c4 (X);\n\n(6)\n\nsee, e.g., [11, Theorem 3].\n\nOur goal in this paper is to study the convergence rate of the PGM (5). Towards that end, we need a\nmeasure to quantify its progress towards optimality. One natural candidate would be dist(\u00b7, X ), the\ndistance to the optimal solution set X . Despite its intuitive appeal, such a measure is hard to compute\nor analyze. In view of (4) and (6), a reasonable alternative would be the norm of the residual map\nR : Rm\u00d7n \u2192 Rm\u00d7n, which is de\ufb01ned as\n\nR(X) = S\u03c4 (X \u2212 \u2207f (X)) \u2212 X.\n\n(7)\nIntuitively, the residual map measures how much a solution X \u2208 Rm\u00d7n violates the optimality\ncondition (4). In particular, X is an optimal solution to (1) if and only if R(X) = 0. However, since\nkR(\u00b7)kF is only a surrogate of dist(\u00b7, X ), we need to establish a relationship between them. This\nmotivates the development of a so\u2013called error bound for problem (1).\n\n3 Main Results\n\nKey to our convergence analysis of the PGM (5) is the following error bound for problem (1), which\nconstitutes the main contribution of this paper:\n\nTheorem 3.1 (Error Bound for Trace Norm Regularization) Suppose that in problem (1), f is of\nthe form (2), and Assumptions 2.1 and 2.2 are satis\ufb01ed. 
Then, for any \u03b6 \u2265 v\u2217, there exist constants\n\u03b7 > 0 and \u01eb > 0 such that\n\ndist(X, X ) \u2264 \u03b7kR(X)kF whenever F (X) \u2264 \u03b6, kR(X)kF \u2264 \u01eb.\n\n(8)\n\nArmed with Theorem 3.1 and some standard properties of the PGM (5), we can apply the con-\nvergence analysis framework developed by Luo and Tseng [10] to establish the linear conver-\ngence of (5). Recall that a sequence of vectors {wk}k\u22650 is said to converge Q\u2013linearly (resp. R\u2013\nlinearly) to a vector w\u221e if there exist an index K \u2265 0 and a constant \u03c1 \u2208 (0, 1) such that\nkwk+1 \u2212 w\u221ek2/kwk \u2212 w\u221ek2 \u2264 \u03c1 for all k \u2265 K (resp. if there exist constants \u03b3 > 0 and \u03c1 \u2208 (0, 1)\nsuch that kwk \u2212 w\u221ek2 \u2264 \u03b3 \u00b7 \u03c1k for all k \u2265 0).\n\nTheorem 3.2 (Linear Convergence of the Proximal Gradient Method) Suppose that in problem\n(1), f is of the form (2), and Assumptions 2.1 and 2.2 are satis\ufb01ed. Moreover, suppose that the step\nsize \u03b1k in the PGM (5) satis\ufb01es 0 < \u03b1 < \u03b1k < \u00af\u03b1 < 1/Lf for k = 0, 1, 2, . . ., where Lf is the\nLipschitz constant of \u2207f , and \u03b1, \u00af\u03b1 are given constants. Then, the sequence of solutions {X k}k\u22650\ngenerated by the PGM (5) converges R\u2013linearly to an element in the optimal solution set X , and the\nassociated sequence of objective values {F (X k)}k\u22650 converges Q\u2013linearly to the optimal value v\u2217.\n\nProof. Under the given setting, it can be shown that there exist scalars \u03ba1, \u03ba2, \u03ba3 > 0, which depend\non \u03b1, \u00af\u03b1, and Lf , such that\n\nF (X k) \u2212 F (X k+1) \u2265 \u03ba1kX k \u2212 X k+1k2\nF ,\n\nF (X k+1) \u2212 v\u2217 \u2264 \u03ba2(cid:2)(dist(X k, X ))2 + kX k+1 \u2212 X kk2\nF(cid:3) ,\n\n(9)\n\n(10)\n\nkR(X k)kF \u2264 \u03ba3kX k \u2212 X k+1kF ;\n\n(11)\nsee the supplementary material. 
Since {F (X k)}k\u22650 is a monotonically decreasing sequence by (9)\nand F (X k) \u2265 v\u2217 for all k \u2265 0, we conclude, again by (9), that X k \u2212 X k+1 \u2192 0. This, together\nwith (11), implies that R(X k) \u2192 0. Thus, by (9), (10) and Theorem 3.1, there exist an index K \u2265 0\nand a constant \u03ba4 > 0 such that for all k \u2265 K,\n\nF (X k+1) \u2212 v\u2217 \u2264 \u03ba4kX k \u2212 X k+1k2\n\nF \u2264\n\n\u03ba4\n\u03ba1\n\n(F (X k) \u2212 F (X k+1)).\n\n4\n\n\fIt follows that\n\nF (X k+1) \u2212 v\u2217 \u2264\n\n\u03ba4\n\n\u03ba1 + \u03ba4\n\n(F (X k) \u2212 v\u2217),\n\n(12)\n\nwhich establishes the Q\u2013linear convergence of {F (X k)}k\u22650 to v\u2217. Using (9) and (12), we can\nshow that {kX k+1 \u2212 X kk2\nF }k\u22650 converges R\u2013linearly to 0, which, together with (11), implies that\n{X k}k\u22650 converges R\u2013linearly to a point in X ; see the supplementary material.\n(cid:3)\n\n4 Proof of the Error Bound\n\nThe structure of our proof of Theorem 3.1 largely follows that laid out in [22, Section 6]. However,\nas explained in Section 1, some new ingredients are needed in order to analyze the spectral properties\nof a point that is close to the optimal solution set X . Before we proceed, let us set up the notation\nthat will be used in the proof. Let L > 0 denote the Lipschitz constant of \u2207h and \u2202k \u00b7 k\u2217 denote the\nsubdifferential of k \u00b7 k\u2217. Given a sequence {X k}k\u22650 \u2208 Rm\u00d7n\\X , de\ufb01ne\n\nRk = R(X k),\n\n\u00afX k = arg minY \u2208X kX k \u2212 Y kF ,\nzk = A(X k), Gk = \u2207f (X k) = A\u2217(\u2207h(zk)),\n\n(13)\nwhere A\u2217 is the adjoint operator of A. 
The crux of the proof of Theorem 3.1 is the following lemma:\n\n\u03b4k = kX k \u2212 \u00afX kkF ,\n\u00afG = A\u2217(\u2207h(\u00afz)),\n\nLemma 4.1 Under the setting of Theorem 3.1, suppose that there exists a convergent sequence\n{X k}k\u22650 \u2208 Rm\u00d7n\\X satisfying\n\nF (X k) \u2264 \u03b6 for all k \u2265 0, Rk \u2192 0,\n\nRk\n\u03b4k\n\n\u2192 0.\n\n(14)\n\nThen, the following hold:\n\n(a) (Asymptotic Optimality) The limit point \u00afX of {X k}k\u22650 belongs to X .\n\n(b) (Bounded Iterates) There exists a convex compact subset Z of dom(h) such that zk, \u00afz \u2208 Z\n\nfor all k \u2265 0. Consequently, there exists a constant \u03c3 \u2208 (0, L] such that for all k \u2265 0,\n\n(\u2207h(zk) \u2212 \u2207h(\u00afz))T (zk \u2212 \u00afz) \u2265 \u03c3kzk \u2212 \u00afzk2\n2.\n\n(15)\n\n(c) (Restricted Invertibility) There exists a constant \u03ba > 0 such that\nkX k \u2212 \u00afX kkF \u2264 \u03bakzk \u2212 \u00afzk2 = \u03bakA(X k \u2212 \u00afX k)k2\n\nfor all k \u2265 0.\n\n(16)\n\nIt is clear that kA(X k \u2212 \u00afX k)k2 \u2264 kAk \u00b7 kX k \u2212 \u00afX kkF , where kAk = supkY kF =1 kA(Y )k2 is\nthe spectral norm of A. Thus, the key element in Lemma 4.1 is the restricted invertibility property\n(16). For the sake of continuity, let us proceed to prove Theorem 3.1 by assuming the validity of\nLemma 4.1.\n\nProof. [Theorem 3.1] We argue by contradiction. Suppose that there exists \u03b6 \u2265 v\u2217 such that (8) fails\nto hold for all \u03b7 > 0 and \u01eb > 0. Then, there exists a sequence {X k}k\u22650 \u2208 Rm\u00d7n \\X satisfying (14).\nSince {X \u2208 Rm\u00d7n : F (X) \u2264 \u03b6} is bounded (see Section 2.2), by passing to a subsequence if\nnecessary, we may assume that {X k}k\u22650 converges to some \u00afX. Hence, the premises of Lemma 4.1\nare satis\ufb01ed. 
Now, by Fermat\u2019s rule [15, Theorem 10.1], for each k \u2265 0,\n\nHence, we have\n\nRk \u2208 arg min\n\nD (cid:8)hGk + Rk, Di + \u03c4 kX k + Dk\u2217(cid:9) .\n\n(17)\n\nhGk + Rk, Rki + \u03c4 kX k + Rkk\u2217 \u2264 hGk + Rk, \u00afX k \u2212 X ki + \u03c4 k \u00afX kk\u2217.\nSince \u00afX k \u2208 X and \u2207f ( \u00afX k) = \u00afG, we also have \u2212 \u00afG \u2208 \u03c4 \u2202k \u00afX kk\u2217, which implies that\n\n\u03c4 k \u00afX kk\u2217 \u2264 h \u00afG, X k + Rk \u2212 \u00afX ki + \u03c4 kX k + Rkk\u2217.\n\nAdding the two inequalities above and simplifying yield\n\nhGk \u2212 \u00afG, X k \u2212 \u00afX ki + kRkk2\n\nF \u2264 h \u00afG \u2212 Gk, Rki + hRk, \u00afX k \u2212 X ki.\n\n(18)\n\n5\n\n\fSince zk = A(X k) and \u00afz = A( \u00afX k), by Lemma 4.1(b,c),\n\nhGk \u2212 \u00afG, X k \u2212 \u00afX ki = (\u2207h(zk) \u2212 \u2207h(\u00afz))T (zk \u2212 \u00afz) \u2265 \u03c3kzk \u2212 \u00afzk2\n\n2 \u2265\n\n\u03c3\n\u03ba2 kX k \u2212 \u00afX kk2\nF .\n\n(19)\n\nHence, it follows from (15), (18), (19) and the Lipschitz continuity of \u2207h that\n\n\u03c3\n\u03ba2 kX k \u2212 \u00afX kk2\n\nF + kRkk2\n\nF \u2264 (\u2207h(\u00afz) \u2212 \u2207h(zk))T A(Rk) + hRk, \u00afX k \u2212 X ki\n\n\u2264 LkAk2kX k \u2212 \u00afX kkF kRkkF + kX k \u2212 \u00afX kkF kRkkF .\n\nIn particular, this implies that\n\n\u03c3\n\u03ba2 kX k \u2212 \u00afX kk2\n\nF \u2264 (LkAk2 + 1)kX k \u2212 \u00afX kkF kRkkF\n\nfor all k \u2265 0, which, upon dividing both sides by kX k \u2212 \u00afX kkF , yields a contradiction to (14). (cid:3)\n\n4.1 Proof of Lemma 4.1\n\nWe now return to the proof of Lemma 4.1. Since Rk \u2192 0 by (14) and R is continuous, we have\nR( \u00afX) = 0, which implies that \u00afX \u2208 X . This establishes (a). To prove (b), observe that due to (a), the\nsequence {X k}k\u22650 is bounded. 
Hence, the sequence {A(X^k)}_{k\u22650} is also bounded, which implies\nthat the points z^k = A(X^k) and z\u0304 = A(X\u0304) lie in a convex compact subset Z of dom(h) for all\nk \u2265 0. The inequality (15) then follows from Assumption 2.1(b). Note that we have \u03c3 \u2264 L, as \u2207h\nis Lipschitz continuous with parameter L.\n\nTo prove (c), we argue by contradiction. Suppose that (16) is false. Then, by further passing to a\nsubsequence if necessary, we may assume that\n\n\u2016A(X^k) \u2212 z\u0304\u2016_2 / \u2016X^k \u2212 X\u0304^k\u2016_F \u2192 0.    (20)\n\nIn the sequel, we will also assume without loss of generality that m \u2264 n. The following proposition\nestablishes a property of the optimal solution set X that will play a crucial role in our proof.\n\nProposition 4.2 Consider a fixed X\u0304 \u2208 X. Let X\u0304 \u2212 G\u0304 = U\u0304 [Diag(\u03c3\u0304) 0] V\u0304^T be the singular\nvalue decomposition of X\u0304 \u2212 G\u0304, where U\u0304 \u2208 R^{m\u00d7m}, V\u0304 \u2208 R^{n\u00d7n} are orthogonal matrices and \u03c3\u0304\nis the vector of singular values of X\u0304 \u2212 G\u0304. Then, the matrices X\u0304 and \u2212G\u0304 can be simultaneously\nsingular\u2013value\u2013decomposed by U\u0304 and V\u0304. Moreover, the set X_c \u2282 X, which is defined as\n\nX_c = {X \u2208 X : X = U\u0304 [Diag(\u03c3(X)) 0] V\u0304^T},\n\nis a non\u2013empty convex compact set.\n\nBy Proposition 4.2, for every k \u2265 0, the point X^k has a unique projection X\u0303^k \u2208 X_c onto X_c. Let\n\n\u03b3_k = \u2016X^k \u2212 X\u0303^k\u2016_F = min_{Y \u2208 X_c} \u2016X^k \u2212 Y\u2016_F.    (21)\n\nSince X_c \u2282 X, we have \u03b3_k = \u2016X^k \u2212 X\u0303^k\u2016_F \u2265 \u2016X^k \u2212 X\u0304^k\u2016_F = \u03b4_k. It follows from (20) that\n\u2016A(X^k) \u2212 z\u0304\u2016_2 / \u2016X^k \u2212 X\u0303^k\u2016_F \u2192 0. This is equivalent to A(Q^k) \u2192 0, where\n\nQ^k = (X^k \u2212 X\u0303^k) / \u03b3_k for all k \u2265 0.    (22)\n\nIn particular, we have \u2016Q^k\u2016_F = 1 for all k \u2265 0. 
By further passing to a subsequence if necessary,\nwe will assume that {Q^k}_{k\u22650} converges to some Q\u0304. Clearly, we have A(Q\u0304) = 0 and \u2016Q\u0304\u2016_F = 1.\n\n4.1.1 Decomposing Q\u0304\n\nOur goal now is to show that for k sufficiently large and \u03f5 > 0 sufficiently small, the point X\u0302 =\nX\u0303^k + \u03f5Q\u0304 belongs to X_c and is closer to X^k than X\u0303^k is to X^k. This would then contradict the\nfact that X\u0303^k is the projection of X^k onto X_c. To begin, let \u03c3^k be the vector of singular values of\nX^k \u2212 G^k. Since X^k \u2212 G^k \u2192 X\u0304 \u2212 G\u0304, the sequence {\u03c3^k}_{k\u22650} is bounded. Hence, for i = 1, . . . , m,\nby passing to a subsequence if necessary, we can classify the sequence {\u03c3_i^k}_{k\u22650} into one of the\nfollowing three cases: (A) \u03c3_i^k \u2264 \u03c4 for all k \u2265 0; (B) \u03c3_i^k > \u03c4 and \u03c3_i(X\u0303^k) > 0 for all k \u2265 0; (C)\n\u03c3_i^k > \u03c4 and \u03c3_i(X\u0303^k) = 0 for all k \u2265 0. The following proposition gives the key structural properties\nof Q\u0304 that will lead to the desired contradiction:\n\nProposition 4.3 The matrix Q\u0304 admits the decomposition Q\u0304 = U\u0304 [Diag(\u03bb) 0] V\u0304^T, where\n\n\u03bb_i = \u2212lim_{k\u2192\u221e} \u03c3_i(X\u0303^k)/\u03b3_k \u2264 0 in Case (A),\n\u03bb_i \u2208 R in Case (B),\n\u03bb_i > 0 in Case (C),\n\nfor i = 1, . . . , m.\n\nIt should be noted that the decomposition given in Proposition 4.3 is not necessarily the singular\nvalue decomposition of Q\u0304, as \u03bb could have negative components. A proof of Proposition 4.3 can be\nfound in the supplementary material.\n\n4.1.2 Completing the Proof\n\nArmed with Proposition 4.3, we are now ready to complete the proof of Lemma 4.1(c). Since\nQ^k \u2260 0 for all k \u2265 0, it follows from (22) that \u27e8X^k \u2212 X\u0303^k, Q\u0304\u27e9 > 0 for all k sufficiently large. 
Fix\nany such k and let \u02c6X = \u02dcX k + \u01eb \u00afQ, where \u01eb > 0 is a parameter to be determined. Since A( \u00afQ) = 0,\nit follows from (13) that \u2207f ( \u02c6X) = \u2207f ( \u02dcX k) = \u00afG. Moreover, since \u02dcX k \u2208 Xc, by the optimality\ncondition (4) and Proposition 4.2, we have\n\nmaxn0, \u03c3i( \u02dcX k) + \u03c3i(\u2212 \u00afG) \u2212 \u03c4o = \u03c3i( \u02dcX k)\n\nNow, we claim that for \u01eb > 0 suf\ufb01ciently small, \u02c6X satis\ufb01es\n\nfor i = 1, . . . , m.\n\nS\u03c4 ( \u02c6X \u2212 \u00afG)\u00afvi = \u02c6X \u00afvi\ni S\u03c4 ( \u02c6X \u2212 \u00afG) = \u00afuT\n\u02c6X\n\u00afuT\n\ni\n\nfor i = 1, . . . , n,\n\nfor i = 1, . . . , m,\n\n(23)\n\n(24)\n\nwhere \u00afui (resp. \u00afvi) is the i\u2013th column of \u00afU (resp. \u00afV ). This would then imply that \u02c6X \u2208 Xc. To prove\nthe claim, observe that for i = m + 1, . . . , n, both sides of (24) are equal to 0. Moreover, since\n\u02dcX k \u2208 Xc, Propositions 4.2 and 4.3 give\n\nThus, it suf\ufb01ces to show that for \u01eb > 0 suf\ufb01ciently small,\n\n\u02c6X \u2212 \u00afG = \u00afU (cid:2)Diag(\u03c3( \u02dcX k) + \u01eb\u03bb + \u03c3(\u2212 \u00afG)) 0(cid:3) \u00afV T .\n\n\u03c3i( \u02dcX k) + \u01eb\u03bbi + \u03c3i(\u2212 \u00afG) \u2265 0\n\ns\u03c4 (\u03c3i( \u02dcX k) + \u01eb\u03bbi + \u03c3i(\u2212 \u00afG)) = \u03c3i( \u02dcX k) + \u01eb\u03bbi\n\nfor i = 1, . . . , m,\n\nfor i = 1, . . . , m.\n\n(25)\n\n(26)\n\nTowards that end, \ufb01x an index i = 1, . . . , m and consider the three cases de\ufb01ned in Section 4.1.1:\nCase (A). If \u03c3i( \u02dcX k) = 0 for all k suf\ufb01ciently large, then Proposition 4.3 gives \u03bbi = 0. Moreover,\nwe have \u03c3i(\u2212 \u00afG) \u2264 \u03c4 by (23). This implies that both (25) and (26) are satis\ufb01ed for any choice of\n\u01eb > 0. On the other hand, if \u03c3i( \u02dcX k) > 0 for all k suf\ufb01ciently large, then Proposition 4.3 gives\n\u03bbi < 0. 
Moreover, we have \u03c3i(\u2212 \u00afG) = \u03c4 by (23). By choosing \u01eb > 0 so that \u03c3i( \u02dcX k) + \u01eb\u03bbi \u2265 0, we\ncan guarantee that both (25) and (26) are satis\ufb01ed.\nCase (B). Since \u03c3i( \u02dcX k) > 0 for all k \u2265 0, we have \u03c3i(\u2212 \u00afG) = \u03c4 by (23). Hence, both (25) and (26)\ncan be satis\ufb01ed by choosing \u01eb > 0 so that \u03c3i( \u02dcX k) + \u01eb\u03bbi \u2265 0.\nCase (C). By Proposition 4.2, we have \u00afX \u2208 Xc. Since X k \u2192 \u00afX and \u03b3k = kX k \u2212 \u02dcX kkF \u2264\nkX k \u2212 \u00afXkF , we have \u02dcX k \u2192 \u00afX as well. It follows that \u03c3i( \u00afX) = 0, as \u03c3i( \u02dcX k) = 0 for all k \u2265 0 by\nassumption. Now, since X k \u2212 Gk \u2192 \u00afX \u2212 \u00afG and \u03c3k\ni > \u03c4 , we have \u00af\u03c3i \u2265 \u03c4 . Thus, Proposition 4.2\nimplies that \u03c4 \u2264 \u00af\u03c3i = \u03c3i( \u00afX \u2212 \u00afG) = \u03c3i( \u00afX) + \u03c3i(\u2212 \u00afG) = \u03c3i(\u2212 \u00afG). This, together with (23), yields\n\u03c3i(\u2212 \u00afG) = \u03c4 . Since \u03bbi > 0 by Proposition 4.3, we conclude that both (25) and (26) can be satis\ufb01ed\nby any choice of \u01eb > 0.\nThus, in all three cases, the claim is established. In particular, we have \u02c6X \u2208 Xc. This, together with\nhX k \u2212 \u02dcX k, \u00afQi > 0 and k \u00afQkF = 1, yields\nF \u2212 2\u01ebhX k \u2212 \u02dcX k, \u00afQi + \u01eb2 < kX k \u2212 \u02dcX kk2\nkX k \u2212 \u02c6Xk2\nfor \u01eb > 0 suf\ufb01ciently small, which contradicts the fact that \u02dcX k is the projection of X k onto Xc. 
This\ncompletes the proof of Lemma 4.1(c).\n\nF = kX k \u2212 \u02dcX k \u2212 \u01eb \u00afQk2\n\nF = kX k \u2212 \u02dcX kk2\n\nF\n\n7\n\n\f5 Numerical Experiments\n\nIn this section, we complement our theoretical results by testing the numerical performance of the\nPGM (5) on two problems: matrix completion and matrix classi\ufb01cation.\n\nMatrix Completion: We randomly generate an n \u00d7 n matrix M with a prescribed rank r. Then,\nwe \ufb01x a sampling ratio \u03b8 \u2208 (0, 1] and sample p = \u230a\u03b8n2\u230b entries of M uniformly at random. This\ninduces a sampling operator P : Rm\u00d7n \u2192 Rp and an observation vector b \u2208 Rp. In our experiments,\nwe \ufb01x the rank r = 3 and use the square loss f (\u00b7) = kP(\u00b7) \u2212 bk2\n2/2 with regularization parameter\n\u00b5 = 1 in problem (1). We then solve the resulting problem for different values of n and \u03b8 using the\nPGM (5) with a \ufb01xed step size \u03b1 = 1. We stop the algorithm when F (X k) \u2212 F (X k+1) < 10\u22128.\nFigure 1 shows the semi\u2013log plots of the error in objective value and the error in solution against the\nnumber of iterations. 
It can be seen that as long as the iterates are close enough to the optimal set,\nboth the objective values and the solutions converge linearly.\n\nFigure 1: Matrix Completion. Panels: Convergence Performance of Objective Value and Convergence Performance of Solution, for n = 1000 (\u03b8 = 0.1, 0.3, 0.5) and for \u03b8 = 0.3 (n = 100, 500, 1000). Axes: Log(error of objective value) or Log(error of solution) against Iterations.\n\nFigure 2: Matrix Classification. Panels: Convergence Performance of Objective Value and Convergence Performance of Solution, for n = 40 (\u03b8 = 0.2, 0.5, 0.8). Axes: Log(error of objective value) or Log(error of solution) against Iterations.\n\nMatrix Classification: We consider a matrix 
classification problem under the setting described in [21]. Specifically, we first randomly generate a low-rank matrix classifier X^*, which is an n \u00d7 n symmetric matrix of rank r. Then, we specify a sampling ratio \u03b8 \u2208 (0, 1] and sample p = \u230a\u03b8n^2\u230b/2 independent n \u00d7 n symmetric matrices W_1, ..., W_p from the standard Wishart distribution with n degrees of freedom. The label of W_i, denoted by y_i, is given by sgn(\u27e8X^*, W_i\u27e9). In our experiments, we fix the rank r = 3, the dimension n = 40, and use the logistic loss f(\u00b7) = \u2211_{i=1}^p log(1 + exp(\u2212y_i \u27e8\u00b7, W_i\u27e9)) with regularization parameter \u00b5 = 1 in problem (1). Since a good lower bound on the Lipschitz constant L_f of \u2207f is not readily available in this case, a backtracking line search was adopted at each iteration to achieve an acceptable step size; see, e.g., [3]. We stop the algorithm when F(X^k) \u2212 F(X^{k+1}) < 10^{\u22126}. Figure 2 shows the convergence performance of the PGM (5) as \u03b8 varies. Again, it can be seen that both the objective values and the solutions converge linearly.\n\n6 Conclusion\n\nIn this paper, we have established the linear convergence of the PGM for solving a class of trace norm\u2013regularized problems. Our convergence result does not require the objective function to be strongly convex and is applicable to many settings in machine learning. The key technical tool in the proof is a Lipschitzian error bound for trace norm\u2013regularized problems, which could be of independent interest. A future direction is to study error bounds for more general matrix norm\u2013regularized problems and their implications for the convergence rates of first\u2013order methods.\n\nAcknowledgments The authors would like to thank the anonymous reviewers for their careful reading of the manuscript and insightful comments. The research of A. M.\u2013C. 
So is supported in part by a gift grant from Microsoft Research Asia.\n\nReferences\n\n[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering Shared Structures in Multiclass Classification. In Proc. 24th ICML, pages 17\u201324, 2007.\n\n[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex Multi\u2013Task Feature Learning. Mach. Learn., 73(3):243\u2013272, 2008.\n\n[3] A. Beck and M. Teboulle. A Fast Iterative Shrinkage\u2013Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sci., 2(1):183\u2013202, 2009.\n\n[4] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. MPS\u2013SIAM Series on Optimization. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 2001.\n\n[5] M. Fazel, H. Hindi, and S. P. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. In Proc. 2001 ACC, pages 4734\u20134739, 2001.\n\n[6] D. Gross. Recovering Low\u2013Rank Matrices from Few Coefficients in Any Basis. IEEE Trans. Inf. Theory, 57(3):1548\u20131566, 2011.\n\n[7] S. Ji, K.-F. Sze, Z. Zhou, A. M.-C. So, and Y. Ye. Beyond Convex Relaxation: A Polynomial\u2013Time Non\u2013Convex Optimization Approach to Network Localization. In Proc. 32nd IEEE INFOCOM, pages 2499\u20132507, 2013.\n\n[8] S. Ji and J. Ye. An Accelerated Gradient Method for Trace Norm Minimization. In Proc. 26th ICML, pages 457\u2013464, 2009.\n\n[9] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear\u2013Norm Penalization and Optimal Rates for Noisy Low\u2013Rank Matrix Completion. Ann. Stat., 39(5):2302\u20132329, 2011.\n\n[10] Z.-Q. Luo and P. Tseng. Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach. Ann. Oper. Res., 46(1):157\u2013178, 1993.\n\n[11] S. Ma, D. Goldfarb, and L. Chen. Fixed Point and Bregman Iterative Methods for Matrix Rank Minimization. Math. 
Program., 128(1\u20132):321\u2013353, 2011.\n\n[12] Yu. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004.\n\n[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed Minimum\u2013Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Rev., 52(3):471\u2013501, 2010.\n\n[14] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, Princeton, New Jersey, 1997.\n\n[15] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren der mathematischen Wissenschaften. Springer\u2013Verlag, Berlin Heidelberg, second edition, 2004.\n\n[16] M. Schmidt, N. Le Roux, and F. Bach. Convergence Rates of Inexact Proximal\u2013Gradient Methods for Convex Optimization. In Proc. NIPS 2011, pages 1458\u20131466, 2011.\n\n[17] A. M.-C. So, Y. Ye, and J. Zhang. A Unified Theorem on SDP Rank Reduction. Math. Oper. Res., 33(4):910\u2013920, 2008.\n\n[18] W. So. Facial Structures of Schatten p\u2013Norms. Linear and Multilinear Algebra, 27(3):207\u2013212, 1990.\n\n[19] K.-C. Toh and S. Yun. An Accelerated Proximal Gradient Algorithm for Nuclear Norm Regularized Linear Least Squares Problems. Pac. J. Optim., 6(3):615\u2013640, 2010.\n\n[20] R. Tomioka and K. Aihara. Classifying Matrices with a Spectral Regularization. In Proc. of the 24th ICML, pages 895\u2013902, 2007.\n\n[21] R. Tomioka, T. Suzuki, M. Sugiyama, and H. Kashima. A Fast Augmented Lagrangian Algorithm for Learning Low\u2013Rank Matrices. In Proc. 27th ICML, pages 1087\u20131094, 2010.\n\n[22] P. Tseng. Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization. Math. Program., 125(2):263\u2013295, 2010.\n\n[23] P. Tseng and S. Yun. A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization. Math. Program., 117(1\u20132):387\u2013423, 2009.\n\n[24] M. White, Y. Yu, X. 
Zhang, and D. Schuurmans. Convex Multi\u2013View Subspace Learning. In Proc. NIPS 2012, pages 1682\u20131690, 2012.\n\n[25] H. Zhang, J. Jiang, and Z.-Q. Luo. On the Linear Convergence of a Proximal Gradient Method for a Class of Nonsmooth Convex Minimization Problems. J. Oper. Res. Soc. China, 1(2):163\u2013186, 2013.\n", "award": [], "sourceid": 417, "authors": [{"given_name": "Ke", "family_name": "Hou", "institution": "CUHK"}, {"given_name": "Zirui", "family_name": "Zhou", "institution": "CUHK"}, {"given_name": "Anthony Man-Cho", "family_name": "So", "institution": "CUHK"}, {"given_name": "Zhi-Quan", "family_name": "Luo", "institution": "University of Minnesota, Twin Cites"}]}