{"title": "Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima", "book": "Advances in Neural Information Processing Systems", "page_first": 476, "page_last": 484, "abstract": "We establish theoretical results concerning all local optima of various regularized M-estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss function satisfies restricted strong convexity and the penalty function satisfies suitable regularity conditions, any local optimum of the composite objective function lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models; regression in generalized linear models using nonconvex regularizers such as SCAD and MCP; and graph and inverse covariance matrix estimation. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision epsilon in log(1/epsilon) iterations, which is the fastest possible rate of any first-order method. We provide a variety of simulations to illustrate the sharpness of our theoretical predictions.", "full_text": "Regularized M-estimators with nonconvexity:\n\nStatistical and algorithmic theory for local optima\n\nPo-Ling Loh\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nploh@berkeley.edu\n\nMartin J. 
Wainwright\n\nDepartments of Statistics and EECS\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nwainwrig@stat.berkeley.edu\n\nAbstract\n\nWe establish theoretical results concerning local optima of regularized M-estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision \u03b5_stat in log(1/\u03b5_stat) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions.\n\n1 Introduction\n\nOptimization of nonconvex functions is known to be computationally intractable in general [11, 12]. Unlike convex functions, nonconvex functions may possess local optima that are not global optima, and standard iterative methods such as gradient descent and coordinate descent are only guaranteed to converge to local optima. Although statistical results regarding nonconvex M-estimation often only provide guarantees about the accuracy of global optima, it is observed empirically that the local optima obtained by various estimation algorithms seem to be well-behaved.\n\nIn this paper, we study the question of whether it is possible to certify \u201cgood\u201d behavior, in both a statistical and computational sense, for various nonconvex M-estimators. 
On the statistical level, we provide an abstract result, applicable to a broad class of (potentially nonconvex) M-estimators, which bounds the distance between any local optimum and the unique minimum of the population risk. Although local optima of nonconvex objectives may not coincide with global optima, our theory shows that any local optimum is essentially as good as a global optimum from a statistical perspective. The class of M-estimators covered by our theory includes the modified Lasso as a special case, but our results are much stronger than those implied by previous work [6].\n\nIn addition to nonconvex loss functions, our theory also applies to nonconvex regularizers, shedding new light on a long line of recent work involving the nonconvex SCAD and MCP regularizers [3, 2, 13, 14]. Various methods have been proposed for optimizing convex loss functions with nonconvex penalties [3, 4, 15], but these methods are only guaranteed to generate local optima of the composite objective, which have not been proven to be well-behaved. In contrast, our work provides a set of regularity conditions under which all local optima are guaranteed to lie within a small ball of the population-level minimum, ensuring that standard methods such as projected and composite gradient descent [10] are sufficient for obtaining estimators that lie within statistical error of the truth. In fact, we establish that under suitable conditions, a modified form of composite gradient descent only requires log(1/\u03b5_stat) iterations to obtain a solution that is accurate up to the statistical precision \u03b5_stat.\n\nNotation. For functions f(n) and g(n), we write f(n) \u2272 g(n) to mean that f(n) \u2264 c g(n) for some universal constant c \u2208 (0, \u221e), and similarly, f(n) \u2273 g(n) when f(n) \u2265 c\u2032 g(n) for some universal constant c\u2032 \u2208 (0, \u221e). 
We write f(n) \u224d g(n) when f(n) \u2272 g(n) and f(n) \u2273 g(n) hold simultaneously. For a function h : R^p \u2192 R, we write \u2207h to denote a gradient or subgradient, if it exists. Finally, for q, r > 0, let B_q(r) denote the \u2113_q-ball of radius r centered around 0.\n\n2 Problem formulation\n\nIn this section, we develop some general theory for regularized M-estimators. We first establish notation, then discuss assumptions for nonconvex regularizers and losses studied in our paper.\n\n2.1 Background\n\nGiven a collection of n samples Z_1^n = {Z_1, ..., Z_n}, drawn from a marginal distribution P over a space Z, consider a loss function L_n : R^p \u00d7 Z^n \u2192 R. The value L_n(\u03b2; Z_1^n) serves as a measure of the \u201cfit\u201d between a parameter vector \u03b2 \u2208 R^p and the observed data. This empirical loss function should be viewed as a surrogate to the population risk function L : R^p \u2192 R, given by\n\nL(\u03b2) := E_Z[L_n(\u03b2; Z_1^n)].\n\nOur goal is to estimate the parameter vector \u03b2\u2217 := arg min_{\u03b2 \u2208 R^p} L(\u03b2) that minimizes the population risk, assumed to be unique. To this end, we consider a regularized M-estimator of the form\n\n\u03b2\u0302 \u2208 arg min_{g(\u03b2) \u2264 R} { L_n(\u03b2; Z_1^n) + \u03c1_\u03bb(\u03b2) },    (1)\n\nwhere \u03c1_\u03bb : R^p \u2192 R is a regularizer, depending on a tuning parameter \u03bb > 0, which serves to enforce a certain type of structure on the solution. In all cases, we consider regularizers that are separable across coordinates, and with a slight abuse of notation, we write \u03c1_\u03bb(\u03b2) = \u2211_{j=1}^p \u03c1_\u03bb(\u03b2_j). Our theory allows for possible nonconvexity in both the loss function L_n and the regularizer \u03c1_\u03bb. Due to this potential nonconvexity, our M-estimator also includes a side constraint g : R^p \u2192 R_+, which we require to be a convex function satisfying the lower bound g(\u03b2) \u2265 \u2016\u03b2\u2016_1, for all \u03b2 \u2208 R^p. Consequently, any feasible point for the optimization problem (1) satisfies the constraint \u2016\u03b2\u2016_1 \u2264 R, and as long as the empirical loss and regularizer are continuous, the Weierstrass extreme value theorem guarantees that a global minimum \u03b2\u0302 exists.\n\n2.2 Nonconvex regularizers\n\nWe now state and discuss conditions on the regularizer, defined in terms of \u03c1_\u03bb : R \u2192 R.\n\nAssumption 1.\n\n(i) The function \u03c1_\u03bb satisfies \u03c1_\u03bb(0) = 0 and is symmetric around zero (i.e., \u03c1_\u03bb(t) = \u03c1_\u03bb(\u2212t) for all t \u2208 R).\n\n(ii) On the nonnegative real line, the function \u03c1_\u03bb is nondecreasing.\n\n(iii) For t > 0, the function t \u21a6 \u03c1_\u03bb(t)/t is nonincreasing in t.\n\n(iv) The function \u03c1_\u03bb is differentiable for all t \u2260 0 and subdifferentiable at t = 0, with nonzero subgradients at t = 0 bounded by \u03bbL.\n\n(v) There exists \u00b5 > 0 such that \u03c1_{\u03bb,\u00b5}(t) := \u03c1_\u03bb(t) + \u00b5t^2 is convex.\n\nMany regularizers that are commonly used in practice satisfy Assumption 1, including the \u2113_1-norm, \u03c1_\u03bb(\u03b2) = \u03bb\u2016\u03b2\u2016_1, and the following commonly used nonconvex regularizers:\n\nSCAD penalty: This penalty, due to Fan and Li [3], takes the form\n\n\u03c1_\u03bb(t) :=\n  \u03bb|t|,  for |t| \u2264 \u03bb,\n  \u2212(t^2 \u2212 2a\u03bb|t| + \u03bb^2)/(2(a \u2212 1)),  for \u03bb < |t| \u2264 a\u03bb,\n  (a + 1)\u03bb^2/2,  for |t| > a\u03bb,    (2)\n\nwhere a > 2 is a fixed parameter. Assumption 1 holds with L = 1 and \u00b5 = 1/(a \u2212 1).\n\nMCP regularizer: This penalty, due to Zhang [13], takes the form\n\n\u03c1_\u03bb(t) := sign(t) \u03bb \u00b7 \u222b_0^{|t|} (1 \u2212 z/(\u03bbb))_+ dz,    (3)\n\nwhere b > 0 is a fixed parameter. Assumption 1 holds with L = 1 and \u00b5 = 1/b.\n\n2.3 Nonconvex loss functions and restricted strong convexity\n\nThroughout this paper, we require the loss function L_n to be differentiable, but we do not require it to be convex. Instead, we impose a weaker condition known as restricted strong convexity (RSC). Such conditions have been discussed in previous literature [9, 1], and involve a lower bound on the remainder in the first-order Taylor expansion of L_n. In particular, our main statistical result is based on the following RSC condition:\n\n\u27e8\u2207L_n(\u03b2\u2217 + \u2206) \u2212 \u2207L_n(\u03b2\u2217), \u2206\u27e9 \u2265 \u03b1_1 \u2016\u2206\u2016_2^2 \u2212 \u03c4_1 (log p / n) \u2016\u2206\u2016_1^2,  \u2200 \u2016\u2206\u2016_2 \u2264 1,    (4a)\n\n\u27e8\u2207L_n(\u03b2\u2217 + \u2206) \u2212 \u2207L_n(\u03b2\u2217), \u2206\u27e9 \u2265 \u03b1_2 \u2016\u2206\u2016_2 \u2212 \u03c4_2 \u221a(log p / n) \u2016\u2206\u2016_1,  \u2200 \u2016\u2206\u2016_2 \u2265 1,    (4b)\n\nwhere the \u03b1_j\u2019s are strictly positive constants and the \u03c4_j\u2019s are nonnegative constants.\n\nTo understand this condition, note that if L_n were actually strongly convex, then both these RSC inequalities would hold with \u03b1_1 = \u03b1_2 > 0 and \u03c4_1 = \u03c4_2 = 0. However, in the high-dimensional setting (p \u226b n), the empirical loss L_n can never be strongly convex, but the RSC condition may still hold with strictly positive (\u03b1_j, \u03c4_j). 
On the other hand, if L_n is convex (but not strongly convex), the left-hand expression in inequality (4) is always nonnegative, so inequalities (4a) and (4b) hold trivially for \u2016\u2206\u2016_1/\u2016\u2206\u2016_2 \u2265 \u221a(\u03b1_1 n/(\u03c4_1 log p)) and \u2016\u2206\u2016_1/\u2016\u2206\u2016_2 \u2265 (\u03b1_2/\u03c4_2) \u221a(n / log p), respectively. Hence, the RSC inequalities only enforce a type of strong convexity condition over a cone set of the form { \u2016\u2206\u2016_1/\u2016\u2206\u2016_2 \u2264 c \u221a(n / log p) }.\n\n3 Statistical guarantees and consequences\n\nWe now turn to our main statistical guarantees and some consequences for various statistical models. Our theory applies to any vector \u03b2\u0303 \u2208 R^p that satisfies the first-order necessary conditions to be a local minimum of the program (1):\n\n\u27e8\u2207L_n(\u03b2\u0303) + \u2207\u03c1_\u03bb(\u03b2\u0303), \u03b2 \u2212 \u03b2\u0303\u27e9 \u2265 0,  for all feasible \u03b2 \u2208 R^p.    (5)\n\nWhen \u03b2\u0303 lies in the interior of the constraint set, condition (5) is the usual zero-subgradient condition.\n\n3.1 Main statistical results\n\nOur main theorem is deterministic in nature, and specifies conditions on the regularizer, loss function, and parameters, which guarantee that any local optimum \u03b2\u0303 lies close to the target vector \u03b2\u2217 = arg min_{\u03b2 \u2208 R^p} L(\u03b2). Corresponding probabilistic results will be derived in subsequent sections. For proofs and more detailed discussion of the results contained in this paper, see the technical report [7].\n\nTheorem 1. Suppose the regularizer \u03c1_\u03bb satisfies Assumption 1, L_n satisfies the RSC conditions (4) with \u03b1_1 > \u00b5, and \u03b2\u2217 is feasible for the objective. 
Consider any choice of \u03bb such that\n\n(2/L) \u00b7 max{ \u2016\u2207L_n(\u03b2\u2217)\u2016_\u221e, \u03b1_2 \u221a(log p / n) } \u2264 \u03bb \u2264 \u03b1_2/(6RL),    (6)\n\nand suppose n \u2265 (16 R^2 max(\u03c4_1^2, \u03c4_2^2)/\u03b1_2^2) log p. Then any vector \u03b2\u0303 satisfying the first-order necessary conditions (5) satisfies the error bounds\n\n\u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_2 \u2264 7\u03bbL\u221ak/(4(\u03b1_1 \u2212 \u00b5)),  and  \u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_1 \u2264 56\u03bbLk/(4(\u03b1_1 \u2212 \u00b5)),    (7)\n\nwhere k = \u2016\u03b2\u2217\u2016_0.\n\nFrom the bound (7), note that the squared \u2113_2-error grows proportionally with k, the number of nonzeros in the target parameter, and with \u03bb^2. As will be clarified in the following sections, choosing \u03bb proportional to \u221a(log p / n) and R proportional to 1/\u03bb will satisfy the requirements of Theorem 1 with high probability for many statistical models, in which case we have a squared \u2113_2-error that scales as k log p / n, as expected.\n\nRemark 1. It is worthwhile to discuss the quantity \u03b1_1 \u2212 \u00b5 appearing in the denominator of the bound in Theorem 1. Recall that \u03b1_1 measures the level of curvature of the loss function L_n, while \u00b5 measures the level of nonconvexity of the penalty \u03c1_\u03bb. 
Intuitively, the two quantities should play opposing roles in our result: Larger values of \u00b5 correspond to more severe nonconvexity of the penalty, resulting in worse behavior of the overall objective (1), whereas larger values of \u03b1_1 correspond to more (restricted) curvature of the loss, leading to better behavior.\n\nWe now develop corollaries for various nonconvex loss functions and regularizers of interest.\n\n3.2 Corrected linear regression\n\nWe begin by considering the case of high-dimensional linear regression with systematically corrupted observations. Recall that in the framework of ordinary linear regression, we have the model\n\ny_i = \u27e8\u03b2\u2217, x_i\u27e9 + \u03b5_i = \u2211_{j=1}^p \u03b2\u2217_j x_ij + \u03b5_i,  for i = 1, ..., n,    (8)\n\nwhere \u03b2\u2217 \u2208 R^p is the unknown parameter vector and {(x_i, y_i)}_{i=1}^n are observations. Following Loh and Wainwright [6], assume we instead observe pairs {(z_i, y_i)}_{i=1}^n, where the z_i\u2019s are systematically corrupted versions of the corresponding x_i\u2019s. Some examples include the following:\n\n(a) Additive noise: Observe z_i = x_i + w_i, where w_i \u22a5\u22a5 x_i, E[w_i] = 0, and cov[w_i] = \u03a3_w.\n\n(b) Missing data: For \u03d1 \u2208 [0, 1), observe z_i \u2208 R^p such that for each component j, we independently observe z_ij = x_ij with probability 1 \u2212 \u03d1, and z_ij = \u2217 with probability \u03d1.\n\nWe use the population and empirical loss functions\n\nL(\u03b2) = (1/2) \u03b2^T \u03a3_x \u03b2 \u2212 (\u03b2\u2217)^T \u03a3_x \u03b2,  and  L_n(\u03b2) = (1/2) \u03b2^T \u0393\u0302 \u03b2 \u2212 \u03b3\u0302^T \u03b2,    (9)\n\nwhere (\u0393\u0302, \u03b3\u0302) are estimators for (\u03a3_x, \u03a3_x \u03b2\u2217) depending on {(z_i, y_i)}_{i=1}^n. Then \u03b2\u2217 = arg min_\u03b2 L(\u03b2). From the formulation (1), the corrected linear regression estimator is given by\n\n\u03b2\u0302 \u2208 arg min_{g(\u03b2) \u2264 R} { (1/2) \u03b2^T \u0393\u0302 \u03b2 \u2212 \u03b3\u0302^T \u03b2 + \u03c1_\u03bb(\u03b2) }.    (10)\n\nWe now state a corollary in the case of additive noise (model (a)), where we take\n\n\u0393\u0302 = Z^T Z/n \u2212 \u03a3_w,  and  \u03b3\u0302 = Z^T y/n.    (11)\n\nWhen p \u226b n, the matrix Z^T Z/n has rank at most n, so for positive definite \u03a3_w the matrix \u0393\u0302 in equation (11) always has negative eigenvalues. It is therefore not positive semidefinite, and the empirical loss function L_n defined in (9) is nonconvex. Other choices of \u0393\u0302 are applicable to missing data (model (b)), and also lead to nonconvex programs [6].\n\nCorollary 1. Suppose we have i.i.d. observations {(z_i, y_i)}_{i=1}^n from a corrupted linear model with sub-Gaussian additive noise. Suppose (\u03bb, R) are chosen such that \u03b2\u2217 is feasible and\n\nc \u221a(log p / n) \u2264 \u03bb \u2264 c\u2032/R.\n\nThen given a sample size n \u2265 C max{R^2, k} log p, any local optimum \u03b2\u0303 of the nonconvex program (10) satisfies the estimation error bounds\n\n\u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_2 \u2264 c_0 \u03bb\u221ak/(\u03bb_min(\u03a3_x) \u2212 2\u00b5),  and  \u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_1 \u2264 c\u2032_0 \u03bbk/(\u03bb_min(\u03a3_x) \u2212 2\u00b5),\n\nwith probability at least 1 \u2212 c_1 exp(\u2212c_2 log p), where \u2016\u03b2\u2217\u2016_0 = k.\n\nRemark 2. When \u03c1_\u03bb(\u03b2) = \u03bb\u2016\u03b2\u2016_1 and g(\u03b2) = \u2016\u03b2\u2016_1, taking \u03bb \u224d \u221a(log p / n) and R = b_0 \u221ak for some constant b_0 \u2265 \u2016\u03b2\u2217\u2016_2 yields the required scaling n \u2273 k log p. Hence, the bounds in Corollary 1 agree with bounds in Theorem 1 of Loh and Wainwright [6]. Note, however, that the latter results are stated only for a global minimum \u03b2\u0302 of the program (10), whereas Corollary 1 is a much stronger result holding for any local minimum \u03b2\u0303. Theorem 2 of our earlier paper [6] provides an indirect route for establishing similar bounds on \u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_1 and \u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_2, since the projected gradient descent algorithm may become stuck in local minima. In contrast, our argument here does not rely on an algorithmic proof and applies to a more general class of (possibly nonconvex) penalties.\n\nCorollary 1 also has important consequences in the case where pairs {(x_i, y_i)}_{i=1}^n from the linear model (8) are observed without corruption and \u03c1_\u03bb is nonconvex. Then the empirical loss L_n is equivalent to the least-squares loss, modulo a constant factor. Much existing work [3, 14] only establishes statistical consistency of global minima and then provides specialized algorithms for obtaining specific local optima that are provably close to global optima. In contrast, our results demonstrate that any optimization algorithm converging to a local optimum suffices.\n\n3.3 Generalized linear models\n\nMoving beyond linear regression, we now consider the case where observations are drawn from a generalized linear model (GLM). Recall that a GLM is characterized by the conditional distribution\n\nP(y_i | x_i, \u03b2, \u03c3) = exp{ (y_i \u27e8\u03b2, x_i\u27e9 \u2212 \u03c8(x_i^T \u03b2))/c(\u03c3) },\n\nwhere \u03c3 > 0 is a scale parameter and \u03c8 is the cumulant function. By standard properties of exponential families [8, 5], we have\n\n\u03c8\u2032(x_i^T \u03b2) = E[y_i | x_i, \u03b2, \u03c3].\n\nIn our analysis, we assume there exists \u03b1_u > 0 such that \u03c8\u2033(t) \u2264 \u03b1_u for all t \u2208 R. This boundedness assumption holds in various settings, including linear regression, logistic regression, and multinomial regression. The bound is required to establish both statistical consistency results in the present section and fast global convergence guarantees for our optimization algorithms in Section 4.\n\nWe will assume that \u03b2\u2217 is sparse and optimize the penalized maximum likelihood program\n\n\u03b2\u0302 \u2208 arg min_{g(\u03b2) \u2264 R} { (1/n) \u2211_{i=1}^n (\u03c8(x_i^T \u03b2) \u2212 y_i x_i^T \u03b2) + \u03c1_\u03bb(\u03b2) }.    (12)\n\nWe then have the following corollary:\n\nCorollary 2. Suppose we have i.i.d. observations {(x_i, y_i)}_{i=1}^n from a GLM, where the x_i\u2019s are sub-Gaussian. Suppose (\u03bb, R) are chosen such that \u03b2\u2217 is feasible and\n\nc \u221a(log p / n) \u2264 \u03bb \u2264 c\u2032/R.\n\nGiven a sample size n \u2265 C R^2 log p, any local optimum \u03b2\u0303 of the nonconvex program (12) satisfies\n\n\u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_2 \u2264 c_0 \u03bb\u221ak/(\u03bb_min(\u03a3_x) \u2212 2\u00b5),  and  \u2016\u03b2\u0303 \u2212 \u03b2\u2217\u2016_1 \u2264 c\u2032_0 \u03bbk/(\u03bb_min(\u03a3_x) \u2212 2\u00b5),\n\nwith probability at least 1 \u2212 c_1 exp(\u2212c_2 log p), where \u2016\u03b2\u2217\u2016_0 = k.\n\n4 Optimization algorithm\n\nWe now describe how a version of composite gradient descent may be applied to efficiently optimize the nonconvex program (1). We focus on a version of the optimization problem with the side function\n\ng_{\u03bb,\u00b5}(\u03b2) := (1/\u03bb) { \u03c1_\u03bb(\u03b2) + \u00b5\u2016\u03b2\u2016_2^2 },    (13)\n\nwhich is convex by Assumption 1. We may then write the program (1) as\n\n\u03b2\u0302 \u2208 arg min_{g_{\u03bb,\u00b5}(\u03b2) \u2264 R} { (L_n(\u03b2) \u2212 \u00b5\u2016\u03b2\u2016_2^2) + \u03bb g_{\u03bb,\u00b5}(\u03b2) },  where L\u0304_n(\u03b2) := L_n(\u03b2) \u2212 \u00b5\u2016\u03b2\u2016_2^2.    (14)\n\nThe objective function then decomposes nicely into a sum of a differentiable but nonconvex function and a possibly nonsmooth but convex penalty. Applied to the representation (14), the composite gradient descent procedure of Nesterov [10] produces a sequence of iterates {\u03b2^t}_{t=0}^\u221e via the updates\n\n\u03b2^{t+1} \u2208 arg min_{g_{\u03bb,\u00b5}(\u03b2) \u2264 R} { (1/2) \u2016\u03b2 \u2212 (\u03b2^t \u2212 \u2207L\u0304_n(\u03b2^t)/\u03b7)\u2016_2^2 + (\u03bb/\u03b7) g_{\u03bb,\u00b5}(\u03b2) },    (15)\n\nwhere 1/\u03b7 is the stepsize. Define the Taylor error around \u03b2_2 in the direction \u03b2_1 \u2212 \u03b2_2 by\n\nT(\u03b2_1, \u03b2_2) := L_n(\u03b2_1) \u2212 L_n(\u03b2_2) \u2212 \u27e8\u2207L_n(\u03b2_2), \u03b2_1 \u2212 \u03b2_2\u27e9.    (16)\n\nFor all vectors \u03b2_2 \u2208 B_2(3) \u2229 B_1(R), we require the following form of restricted strong convexity:\n\nT(\u03b2_1, \u03b2_2) \u2265 \u03b1_1 \u2016\u03b2_1 \u2212 \u03b2_2\u2016_2^2 \u2212 \u03c4_1 (log p / n) \u2016\u03b2_1 \u2212 \u03b2_2\u2016_1^2,  \u2200 \u2016\u03b2_1 \u2212 \u03b2_2\u2016_2 \u2264 3,    (17a)\n\nT(\u03b2_1, \u03b2_2) \u2265 \u03b1_2 \u2016\u03b2_1 \u2212 \u03b2_2\u2016_2 \u2212 \u03c4_2 \u221a(log p / n) \u2016\u03b2_1 \u2212 \u03b2_2\u2016_1,  \u2200 \u2016\u03b2_1 \u2212 \u03b2_2\u2016_2 \u2265 3.    (17b)\n\nThe conditions (17) are similar but not identical to the earlier RSC conditions (4). The main difference is that we now require the Taylor difference to be bounded below uniformly over \u03b2_2 \u2208 B_2(3) \u2229 B_1(R), as opposed to for a fixed \u03b2_2 = \u03b2\u2217. We also assume an upper bound:\n\nT(\u03b2_1, \u03b2_2) \u2264 \u03b1_3 \u2016\u03b2_1 \u2212 \u03b2_2\u2016_2^2 + \u03c4_3 (log p / n) \u2016\u03b2_1 \u2212 \u03b2_2\u2016_1^2,  for all \u03b2_1, \u03b2_2 \u2208 R^p,    (18)\n\na condition referred to as restricted smoothness in past work [1]. Throughout this section, we assume \u03b1_i > \u00b5 for all i, where \u00b5 is the coefficient ensuring the convexity of the function g_{\u03bb,\u00b5} from equation (13). 
Furthermore, we define \u03b1 = min{\u03b1_1, \u03b1_2} and \u03c4 = max{\u03c4_1, \u03c4_2, \u03c4_3}.\n\nThe following theorem applies to any population loss function L for which the population minimizer \u03b2\u2217 is k-sparse and \u2016\u03b2\u2217\u2016_2 \u2264 1, and under the scaling n > C k log p, for a constant C depending on the \u03b1_i\u2019s and \u03c4_i\u2019s. We show that the composite gradient updates (15) exhibit a type of globally geometric convergence in terms of the quantity\n\n\u03ba := (1 \u2212 (\u03b1 \u2212 \u00b5)/(4\u03b7) + \u03d5(n, p, k))/(1 \u2212 \u03d5(n, p, k)),  where  \u03d5(n, p, k) := 128\u03c4 (k log p / n)/(\u03b1 \u2212 \u00b5).    (19)\n\nUnder the stated scaling on the sample size, we are guaranteed that \u03ba \u2208 (0, 1). Let\n\nT\u2217(\u03b4) := 2 log((\u03c6(\u03b2^0) \u2212 \u03c6(\u03b2\u0302))/\u03b4^2)/log(1/\u03ba) + (1 + log 2/log(1/\u03ba)) log log(\u03bbRL/\u03b4^2),    (20)\n\nwhere \u03c6(\u03b2) := L_n(\u03b2) + \u03c1_\u03bb(\u03b2), and define \u03b5_stat := \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016_2.\n\nTheorem 2. Suppose L_n satisfies the RSC/RSM conditions (17) and (18), and suppose \u03c1_\u03bb satisfies Assumption 1. Suppose \u03b2\u0302 is any global minimum of the program (14), with\n\nR \u221a(log p / n) \u2264 c,  and  \u03bb \u2265 (4/L) \u00b7 max{ \u2016\u2207L_n(\u03b2\u2217)\u2016_\u221e, \u03c4 \u221a(log p / n) }.\n\nThen for any stepsize \u03b7 \u2265 2 \u00b7 max{\u03b1_3 \u2212 \u00b5, \u00b5} and tolerance \u03b4^2 \u2265 c \u03b5_stat^2/(1 \u2212 \u03ba), we have\n\n\u2016\u03b2^t \u2212 \u03b2\u0302\u2016_2^2 \u2264 (2/(\u03b1 \u2212 \u00b5)) (\u03b4^2 + \u03b4^4/\u03c4 + 128\u03c4 (k log p / n) \u03b5_stat^2),  \u2200 t \u2265 T\u2217(\u03b4).    (21)\n\nRemark 3. Note that for the optimal choice of tolerance parameter \u03b4 \u224d \u03b5_stat, the bound in inequality (21) takes the form c \u03b5_stat^2/(\u03b1 \u2212 \u00b5), meaning successive iterates are guaranteed to converge to a region within statistical accuracy of the true global optimum \u03b2\u0302. Combining Theorems 1 and 2, we have\n\nmax{ \u2016\u03b2^t \u2212 \u03b2\u0302\u2016_2, \u2016\u03b2^t \u2212 \u03b2\u2217\u2016_2 } = O(\u221a(k log p / n)),  \u2200 t \u2265 T\u2217(c\u2032 \u03b5_stat).\n\n5 Simulations\n\nIn this section, we report the results of simulations for two versions of the loss function L_n, corresponding to linear and logistic regression, and three penalty functions: Lasso, SCAD, and MCP. 
In\nall cases, we chose regularization parameters R = 1.1\n\n(cid:113) log p\n\n\u03bb \u00b7 \u03c1\u03bb(\u03b2\u2217) and \u03bb =\n\nn .\n\nLinear regression:\nnoise according to the mechanism described in Section 3.2, giving the estimator\n\nIn the case of linear regression, we simulated covariates corrupted by additive\n\n(cid:98)\u03b2 \u2208 arg min\n\ng\u03bb,\u00b5(\u03b2)\u2264R\n\n(cid:26) 1\n\n2\n\n\u03b2T\n\n(cid:18) X T X\n\nn\n\n(cid:19)\n\n\u2212 \u03a3w\n\n\u03b2 \u2212 yT Z\nn\n\n\u03b2 + \u03c1\u03bb(\u03b2)\n\n.\n\n(22)\n\nWe generated i.i.d. samples xi \u223c N (0, I) and \u0001i \u223c N (0, (0.1)2), and set \u03a3w = (0.2)2I.\n\nLogistic regression:\nSince \u03c8(t) = log(1 + exp(t)), the program (12) becomes\n\nIn the case of logistic regression, we generated i.i.d. samples xi \u223c N (0, I).\n\n{log(1 + exp((cid:104)\u03b2, xi(cid:105)) \u2212 yi(cid:104)\u03b2, xi(cid:105)} + \u03c1\u03bb(\u03b2)\n\n.\n\n(23)\n\n(cid:27)\n\n(cid:41)\n\n(cid:98)\u03b2 \u2208 arg min\n\ng\u03bb,\u00b5(\u03b2)\u2264R\n\n(cid:40)\n\nn(cid:88)\n\ni=1\n\n1\nn\n\nWe optimized the programs (22) and (23) using the composite gradient updates (15). Figure 1\ndifferent problem sizes p. In each case, \u03b2\u2217 is a k-sparse vector with k = (cid:98)\u221a\nshows the results of corrected linear regression with Lasso, SCAD, and MCP regularizers for three\np(cid:99), where the nonzero\nentries were generated from a normal distribution and the vector was then rescaled so (cid:107)\u03b2\u2217(cid:107)2 = 1.\nwhen the estimation error (cid:107)(cid:98)\u03b2 \u2212 \u03b2\u2217(cid:107)2 is plotted against the rescaled sample size\nAs predicted by Theorem 1, the curves corresponding to the same penalty function stack up nicely\nk log p, and the (cid:96)2-\nerror decreases to zero as the number of samples increases, showing that the estimators (22) and (23)\nare statistically consistent. 
We chose the parameter a = 3.7 for SCAD and b = 3.5 for MCP.\n\nn\n\n(a)\n\n(b)\n\nFigure 1. Plots showing statistical consistency of (a) linear and (b) logistic regression with Lasso,\n\nSCAD, and MCP. Each point represents an average over 20 trials. The estimation error (cid:107)(cid:98)\u03b2 \u2212 \u03b2\u2217(cid:107)2\n\nn\n\nk log p . Lasso, SCAD, and MCP results are represented by\n\nis plotted against the rescaled sample size\nsolid, dotted, and dashed lines, respectively.\n\nThe simulations in Figure 2 depict the optimization-theoretic conclusions of Theorem 2. Each panel\nshows two different families of curves, corresponding to statistical error (red) and optimization error\n\n7\n\n0102030405000.10.20.30.40.5n/(k log p)l2 norm errorcomparing penalties for corrected linear regression  p=128p=256p=5120204060801000.40.50.60.70.80.91n/(k log p)l2 norm errorcomparing penalties for logistic regression  p=128p=256p=512\f(blue). The vertical axis measures the (cid:96)2-error on a log scale, while the horizontal axis tracks the\nstarting points. We used p = 128, k = (cid:98)\u221a\niteration number. The curves were obtained by running composite gradient descent from 10 random\np(cid:99), and n = (cid:98)20k log p(cid:99). As predicted by our theory,\nthe optimization error decreases at a linear rate until it falls to the level of statistical error. Panels\n(b) and (c) provide simulations for two values of the SCAD parameter a; the larger choice a = 3.7\ncorresponds to a higher level of curvature and produces a tighter cluster of local optima.\n\n(a)\n\n(b)\n\n(c)\n\nFigure 2. Plots illustrating linear rates of convergence for corrected linear regression with MCP and\n\nSCAD. Red lines depict statistical error log(cid:0)(cid:107)(cid:98)\u03b2 \u2212 \u03b2\u2217(cid:107)2\nlog(cid:0)(cid:107)\u03b2t \u2212(cid:98)\u03b2(cid:107)2\n\n(cid:1) and blue lines depict optimization error\n(cid:1). 
As predicted by Theorem 2, the optimization error decreases linearly up to statistical\n\naccuracy. Each plot shows the solution trajectory for 10 initializations of composite gradient descent.\nPanel (a) shows results for MCP; panels (b) and (c) show results for SCAD with different values of a.\n\nFigure 3 provides analogous results to Figure 2 for logistic regression, using p = 64, k = (cid:98)\u221a\np(cid:99), and\nn = (cid:98)20k log p(cid:99). The plot shows solution trajectories for 20 different initializations of composite\nlocal/global optimum(cid:98)\u03b2, SCAD and MCP produce multiple local optima.\ngradient descent. Again, the log optimization error decreases at a linear rate up to the level of\nstatistical error, as predicted by Theorem 2. Whereas the convex Lasso penalty yields a unique\n\n(a)\n\n(b)\n\n(c)\n\nFigure 3. Plots showing linear rates of convergence on a log scale for logistic regression. Red lines\ndepict statistical error and blue lines depict optimization error. (a) Lasso penalty. (b) SCAD penalty.\n(c) MCP. Each plot shows the solution trajectory for 20 initializations of composite gradient descent.\n\n6 Discussion\n\nWe have analyzed theoretical properties of local optima of regularized M-estimators, where both\nthe loss and penalty function are allowed to be nonconvex. Our results are the \ufb01rst to establish that\nall local optima of such nonconvex problems are close to the truth, implying that any optimization\nmethod guaranteed to converge to a local optimum will provide statistically consistent solutions. We\nshow that a variant of composite gradient descent may be used to obtain near-global optima in linear\ntime, and verify our theoretical results with simulations.\n\nAcknowledgments\n\nPL acknowledges support from a Hertz Foundation Fellowship and an NSF Graduate Research Fel-\nlowship. MJW and PL were also partially supported by grants NSF-DMS-0907632 and AFOSR-\n09NL184. 
The authors thank the anonymous reviewers for helpful feedback.\n\n[Figure 2 panels: log error plots for corrected linear regression with (a) MCP, b = 1.5; (b) SCAD, a = 3.7; (c) SCAD, a = 2.5. Figure 3 panels: log error plots for logistic regression with (a) Lasso; (b) SCAD, a = 3.7; (c) MCP, b = 3. Horizontal axis: iteration count; vertical axis: log \u21132 error; legend: opt err, stat err.]\n\nReferences\n\n[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5):2452\u20132482, 2012.\n[2] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1):232\u2013253, 2011.\n[3] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348\u20131360, 2001.\n[4] D. R. Hunter and R. Li. Variable selection using MM algorithms. 
Annals of Statistics, 33(4):1617\u20131642, 2005.\n[5] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998.\n[6] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637\u20131664, 2012.\n[7] P. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv e-prints, May 2013. Available at http://arxiv.org/abs/1305.2436.\n[8] P. McCullagh and J. A. Nelder. Generalized Linear Models (Second Edition). London: Chapman & Hall, 1989.\n[9] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557, December 2012. See arXiv version for lemma/propositions cited here.\n[10] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers 2007076, Universit\u00e9 Catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.\n[11] Y. Nesterov and A. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. SIAM studies in applied and numerical mathematics. Society for Industrial and Applied Mathematics, 1987.\n[12] S. A. Vavasis. Complexity issues in global optimization: A survey. In Handbook of Global Optimization, pages 27\u201341. Kluwer, 1995.\n[13] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894\u2013942, 2010.\n[14] C.-H. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576\u2013593, 2012.\n[15] H. Zou and R. Li. 
One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509\u20131533, 2008.\n", "award": [], "sourceid": 299, "authors": [{"given_name": "Po-Ling", "family_name": "Loh", "institution": "UC Berkeley"}, {"given_name": "Martin", "family_name": "Wainwright", "institution": "UC Berkeley"}]}