{"title": "Adaptive Accelerated Gradient Converging Method under H\\\"{o}lderian Error Bound Condition", "book": "Advances in Neural Information Processing Systems", "page_first": 3104, "page_last": 3114, "abstract": "Recent studies have shown that proximal gradient (PG) method and accelerated gradient method (APG) with restarting can enjoy a linear convergence under a weaker condition than strong convexity, namely a quadratic growth condition (QGC). However, the faster convergence of restarting APG method relies on the potentially unknown constant in QGC to appropriately restart APG, which restricts its applicability. We address this issue by developing a novel adaptive gradient converging methods, i.e., leveraging the magnitude of proximal gradient as a criterion for restart and termination. Our analysis extends to a much more general condition beyond the QGC, namely the H\\\"{o}lderian error bound (HEB) condition. {\\it The key technique} for our development is a novel synthesis of {\\it adaptive regularization and a conditional restarting scheme}, which extends previous work focusing on strongly convex problems to a much broader family of problems. Furthermore, we demonstrate that our results have important implication and applications in machine learning: (i) if the objective function is coercive and semi-algebraic, PG's convergence speed is essentially $o(\\frac{1}{t})$, where $t$ is the total number of iterations; (ii) if the objective function consists of an $\\ell_1$, $\\ell_\\infty$, $\\ell_{1,\\infty}$, or huber norm regularization and a convex smooth piecewise quadratic loss (e.g., square loss, squared hinge loss and huber loss), the proposed algorithm is parameter-free and enjoys a {\\it faster linear convergence} than PG without any other assumptions (e.g., restricted eigen-value condition). It is notable that our linear convergence results for the aforementioned problems are global instead of local. 
To the best of our knowledge, these improved results are first shown in this work.", "full_text": "Adaptive Accelerated Gradient Converging Method under Hölderian Error Bound Condition

Mingrui Liu, Tianbao Yang
Department of Computer Science
The University of Iowa, Iowa City, IA 52242
mingrui-liu, tianbao-yang@uiowa.edu

Abstract

Recent studies have shown that the proximal gradient (PG) method and the accelerated proximal gradient (APG) method with restarting can enjoy a linear convergence under a weaker condition than strong convexity, namely a quadratic growth condition (QGC). However, the faster convergence of the restarting APG method relies on the potentially unknown constant in the QGC to appropriately restart APG, which restricts its applicability. We address this issue by developing a novel adaptive gradient converging method, i.e., leveraging the magnitude of the proximal gradient as a criterion for restart and termination. Our analysis extends to a much more general condition beyond the QGC, namely the Hölderian error bound (HEB) condition. The key technique for our development is a novel synthesis of adaptive regularization and a conditional restarting scheme, which extends previous work focusing on strongly convex problems to a much broader family of problems. Furthermore, we demonstrate that our results have important implications and applications in machine learning: (i) if the objective function is coercive and semi-algebraic, PG's convergence speed is essentially o(1/t), where t is the total number of iterations; (ii) if the objective function consists of an ℓ1, ℓ∞, ℓ1,∞, or Huber norm regularization and a convex smooth piecewise quadratic loss (e.g., square loss, squared hinge loss and Huber loss), the proposed algorithm is parameter-free and enjoys a faster linear convergence than PG without any other assumptions (e.g., a restricted eigen-value condition). 
It is notable that our linear convergence results for the aforementioned problems are global instead of local. To the best of our knowledge, these improved results are first shown in this work.

1 Introduction

We consider the following smooth composite optimization:

    min_{x ∈ R^d} F(x) ≜ f(x) + g(x),    (1)

where g(x) is a proper lower semi-continuous convex function and f(x) is a continuously differentiable convex function, whose gradient is L-Lipschitz continuous. The above problem has been studied extensively in the literature and many algorithms have been developed with convergence guarantees. In particular, by employing the proximal mapping associated with g(x), i.e.,

    P_{ηg}(u) = arg min_{x ∈ R^d} (1/2)‖x − u‖₂² + ηg(x),    (2)

proximal gradient (PG) and accelerated proximal gradient (APG) methods have been developed for solving (1) with O(1/ε) and O(1/√ε)¹ iteration complexities for finding an ε-optimal solution.

¹ For the moment, we neglect the constant factor.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Summary of iteration complexities in this work under the HEB condition with θ ∈ (0, 1/2], where G(x) denotes the proximal gradient, C(1/ε^α) = max(1/ε^α, log(1/ε)) and Õ(·) suppresses a logarithmic term. If θ > 1/2, all algorithms can converge with a finite number of proximal mapping steps. rAPG stands for restarting APG. 
* marks results available for certain subclasses of problems.

    algo.          | PG                                      | rAPG                    | adaAGC
    F(x) − F* ≤ ε  | O(c²L C(1/ε^{1−2θ}))                    | O(c√L C(1/ε^{1/2−θ}))   | Õ(c^{1/(2(1−θ))} √L C(1/ε^{(1−2θ)/(2(1−θ))}))*
    ‖G(x)‖₂ ≤ ε    | O(c^{1/(1−θ)} L C(1/ε^{(1−2θ)/(1−θ)}))  | –                       | Õ(c^{1/(2(1−θ))} √L C(1/ε^{(1−2θ)/(2(1−θ))}))
    requires θ     | No                                      | Yes                     | Yes
    requires c     | No                                      | Yes                     | No

When either f(x) or g(x) is strongly convex, both PG and APG enjoy a linear convergence, i.e., the iteration complexity is improved to O(log(1/ε)).

Recently, a wave of studies has tried to generalize the linear convergence to problems without strong convexity but under a certain structured condition on the objective function or, more generally, a quadratic growth condition [8, 32, 21, 23, 7, 31, 3, 15, 9, 29, 4, 24, 26, 25]. Earlier work along this line dates back to [12, 13, 14]. An example of the structured condition is that f(x) = h(Ax), where h(·) is a strongly convex function, ∇h(x) is Lipschitz continuous on any compact set, and g(x) is a polyhedral function. Under such a structured condition, a local error bound condition can be established [12, 13, 14], which renders an asymptotic (local) linear convergence for the proximal gradient method. A quadratic growth condition (QGC) prescribes that the objective function satisfies for any x ∈ R^d ²: (α/2)‖x − x*‖₂² ≤ F(x) − F(x*), where x* denotes a closest point to x in the optimal set. Under such a quadratic growth condition, several recent studies have established the linear convergence of PG, APG and many other algorithms (e.g., coordinate descent methods) [3, 15, 4, 9, 29]. A notable result is that PG enjoys an iteration complexity of O((L/α) log(1/ε)) without knowing the value of α, while a restarting version of APG studied in [15] enjoys an improved iteration complexity of O(√(L/α) log(1/ε)) hinging on the value of α to appropriately restart APG periodically. Other equivalent or more restricted conditions are also considered in several studies to show the linear convergence of the (proximal) gradient method and other methods [9, 15, 29, 30].

In this paper, we extend this line of work to a more general error bound condition, i.e., the Hölderian error bound (HEB) condition on a compact sublevel set S_ξ = {x ∈ R^d : F(x) − F(x*) ≤ ξ}: there exist θ ∈ (0, 1] and 0 < c < ∞ such that

    ‖x − x*‖₂ ≤ c(F(x) − F(x*))^θ, ∀x ∈ S_ξ.    (3)

Note that when θ = 1/2 and c = √(1/α), the HEB reduces to the QGC. In the sequel, we will refer to C = Lc² as the condition number of the problem. It is worth mentioning that Bolte et al. [3] considered the same condition or an equivalent Kurdyka-Łojasiewicz inequality, but they only focused on descent methods that bear a sufficient decrease condition for each update, consequently excluding APG. In addition, they do not provide explicit iteration complexities under the general HEB condition.

As a warm-up and motivation, we will first present a straightforward analysis to show that PG is automatically adaptive and APG can be made adaptive to the HEB by restarting. In particular, if F(x) satisfies a HEB condition on the initial sublevel set, PG has an iteration complexity of O(max(C/ε^{1−2θ}, C log(1/ε)))³, and restarting APG enjoys an iteration complexity of O(max(√C/ε^{1/2−θ}, √C log(1/ε))) for the convergence of the objective value, where C = Lc² is the condition number. 
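As an illustrative aside (not part of the paper), the HEB exponent θ in (3) can be sanity-checked numerically on a toy function: for F(x) = |x|^p in one dimension with minimizer x* = 0, the distance |x − x*| scales exactly as (F(x) − F*)^{1/p}, so a log-log fit recovers θ = 1/p. A minimal sketch under these toy assumptions:

```python
import numpy as np

# Toy objective F(x) = |x|^4 with optimal set {0}, so F* = 0 and
# dist(x, Omega*) = |x|. The HEB (3) then holds with theta = 1/4.
p = 4
xs = np.logspace(-3, -1, 20)   # points approaching the minimizer
fs = xs ** p                   # F(x) - F* at those points
# slope of log dist(x, Omega*) vs. log (F(x) - F*) estimates theta
theta_hat, _ = np.polyfit(np.log(fs), np.log(xs), 1)
print(theta_hat)  # 0.25 = 1/p
```

The same log-log fit can be run on non-trivial objectives near a computed minimizer, though only the smallest function values are informative about the local exponent.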
These two results resemble but generalize recent works that establish linear convergence of PG and restarting APG under the QGC, a special case of HEB. Although enjoying faster convergence, restarting APG has a critical caveat: it requires knowledge of the constant c in HEB to restart APG, which is usually difficult to compute or estimate. In this paper, we make nontrivial contributions to obtain faster convergence of the proximal gradient's norm under the HEB condition by developing an adaptive accelerated gradient converging method.

² It can be relaxed to a fixed domain as done in this work.
³ When θ > 1/2, all algorithms can converge in finite steps.

The main results of this paper are summarized in Table 1. The contributions of this paper are: (i) we extend the analysis of PG and restarting APG under the quadratic growth condition to the more general HEB condition, and establish the adaptive iteration complexities of both algorithms; (ii) to enjoy the faster convergence of restarting APG and to eliminate the algorithmic dependence on the unknown parameter c, we propose and analyze an adaptive accelerated gradient converging (adaAGC) method. The developed algorithms and theory have important implications and applications in machine learning. Firstly, if the considered objective function is also coercive and semi-algebraic (e.g., a norm regularized problem in machine learning with a semi-algebraic loss function), then PG's convergence speed is essentially o(1/t) instead of O(1/t), where t is the total number of iterations. Secondly, for solving ℓ1, ℓ∞ or ℓ1,∞ regularized smooth loss minimization problems including least-squares loss, squared hinge loss and Huber loss, the proposed adaAGC method enjoys a linear convergence and a square root dependence on the "condition" number. 
In contrast to previous work, the proposed algorithm is parameter-free and does not rely on any restricted conditions (e.g., restricted eigen-value conditions).

2 Notations and Preliminaries

In this section, we present some notations and preliminaries. In the sequel, we let ‖·‖_p (p ≥ 1) denote the p-norm of a vector. A function g(x) : R^d → (−∞, ∞] is a proper function if g(x) < +∞ for at least one x. g(x) is lower semi-continuous at a point x₀ if lim inf_{x→x₀} g(x) = g(x₀). A function F(x) is coercive if and only if F(x) → ∞ as ‖x‖₂ → ∞. We will also refer to semi-algebraic sets and semi-algebraic functions several times in the paper, which are standard concepts in mathematics [2]. Due to the limit of space, we present the definitions in the supplement.

Denote by N the set of all positive integers. A function h(x) is a real polynomial if there exists r ∈ N such that h(x) = Σ_{0≤|α|≤r} λ_α x^α, where λ_α ∈ R, x^α = x₁^{α₁} ··· x_d^{α_d}, α_j ∈ N ∪ {0}, |α| = Σ_{j=1}^d α_j, and r is referred to as the degree of h(x). A continuous function f(x) is said to be a piecewise convex polynomial if there exist finitely many polyhedra P₁, . . . , P_k with ∪_{j=1}^k P_j = R^n such that the restriction of f on each P_j is a convex polynomial. Let f_j be the restriction of f on P_j. The degree of a piecewise convex polynomial function f, denoted by deg(f), is the maximum of the degrees of the f_j. If deg(f) = 2, the function is referred to as a piecewise convex quadratic function. Note that a piecewise convex polynomial function is not necessarily a convex function [10].

A function f(x) is L-smooth w.r.t. ‖·‖₂ if it is differentiable and has a Lipschitz continuous gradient with Lipschitz constant L, i.e., ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y. Let ∂g(x) denote the subdifferential of g at x, and denote ‖∂g(x)‖₂ = min_{u∈∂g(x)} ‖u‖₂. A function g(x) is α-strongly convex w.r.t. ‖·‖₂ if it satisfies, for any u ∈ ∂g(y), g(x) ≥ g(y) + uᵀ(x − y) + (α/2)‖x − y‖₂², ∀x, y.

Denote by η > 0 a positive scalar, and let P_{ηg} be the proximal mapping associated with ηg(·) defined in (2). Given an objective function F(x) = f(x) + g(x), where f(x) is L-smooth and convex and g(x) is a simple non-smooth function which is closed and convex, define a proximal gradient G_η(x) as:

    G_η(x) = (1/η)(x − x⁺_η), where x⁺_η = P_{ηg}(x − η∇f(x)).

When g(x) = 0, we have G_η(x) = ∇f(x), i.e., the proximal gradient is the gradient. It is known that x is an optimal solution iff G_η(x) = 0. If η = 1/L, for simplicity we denote G(x) = G_{1/L}(x) and x⁺ = P_{g/L}(x − ∇f(x)/L). Let F* denote the optimal objective value of min_{x∈R^d} F(x) and Ω* denote the optimal set. Denote by S_ξ = {x : F(x) − F* ≤ ξ} the ξ-sublevel set of F(x). Let D(x, Ω) = min_{y∈Ω} ‖x − y‖₂.

The proximal gradient (PG) method solves problem (1) by the update

    x_{t+1} = P_{ηg}(x_t − η∇f(x_t)),    (4)

with η ≤ 1/L starting from some initial solution x₁ ∈ R^d. 
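To make the update (4) concrete, the following is a minimal sketch (not from the paper) for the ℓ1-regularized least-squares case, where the proximal mapping (2) of g(x) = λ‖x‖₁ has the closed-form soft-thresholding solution; the quadratic loss and the step size η = 1/L are illustrative assumptions.

```python
import numpy as np

def prox_l1(u, lam):
    # Proximal mapping (2) for g(x) = lam * ||x||_1: soft-thresholding.
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def proximal_gradient(A, b, lam, eta, T):
    # PG update (4) for f(x) = 0.5 * ||Ax - b||_2^2, g(x) = lam * ||x||_1,
    # with step size eta <= 1/L, where L = ||A^T A||_2.
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part f
        x = prox_l1(x - eta * grad, eta * lam)
    return x
```

The proximal gradient G_η(x) = (x − x⁺_η)/η can then be monitored as a termination criterion, which is the role it plays later in the paper.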
It can be shown that PG has an iteration complexity of O(LD(x₁, Ω*)²/ε). Nevertheless, accelerated proximal gradient (APG) converges faster than PG. There are many variants of APG in the literature [22], including the well-known FISTA [1]. The simplest variant adopts the following update

    y_t = x_t + β_t(x_t − x_{t−1}),  x_{t+1} = P_{ηg}(y_t − η∇f(y_t)),

where η ≤ 1/L and β_t is an appropriate sequence (e.g., β_t = (t−1)/(t+2)). APG enjoys an iteration complexity of O(√L D(x₁, Ω*)/√ε) [22]. Furthermore, if f(x) is both L-smooth and α-strongly convex, one can set β_t = (√L − √α)/(√L + √α) and deduce a linear convergence [16, 11] with a better dependence on the condition number than that of PG. If g(x) is α-strongly convex and f(x) is L-smooth, Nesterov [17] proposed a different variant based on dual averaging, which is referred to as the accelerated dual gradient (ADG) method and will be useful for our development. The key steps are presented in Algorithm 1.

Algorithm 1: ADG
  x₀ ∈ Ω, A₀ = 0, v₀ = x₀
  for t = 0, . . . , T do
    Find a_{t+1} from the quadratic equation a²/(A_t + a) = 2(1 + αA_t)/L
    Set A_{t+1} = A_t + a_{t+1}
    Set y_t = (A_t/A_{t+1}) x_t + (a_{t+1}/A_{t+1}) v_t
    Compute x_{t+1} = P_{g/L}(y_t − ∇f(y_t)/L)
    Compute v_{t+1} = arg min_x Σ_{τ=1}^{t+1} a_τ ∇f(x_τ)ᵀx + A_{t+1} g(x) + (1/2)‖x − x₀‖₂²

2.1 Hölderian error bound (HEB) condition

Definition 1 (Hölderian error bound (HEB)). 
A function F(x) is said to satisfy a HEB condition on the ξ-sublevel set if there exist θ ∈ (0, 1] and 0 < c < ∞ such that for any x ∈ S_ξ

    dist(x, Ω*) ≤ c(F(x) − F*)^θ.    (5)

The HEB condition is closely related to the Łojasiewicz inequality or, more generally, the Kurdyka-Łojasiewicz (KL) inequality in real algebraic geometry. When functions are semi-algebraic and continuous, the above inequality is known to hold on any compact set [3]. We refer the readers to [3] for more discussions on HEB and KL inequalities.

In the remainder of this section, we review some previous results to demonstrate that HEB is a generic condition that holds for a broad family of problems of interest. The following proposition states that any proper, coercive, convex, lower-semicontinuous and semi-algebraic function satisfies the HEB condition.

Proposition 1. [3] Let F(x) be a proper, coercive, convex, lower semicontinuous and semi-algebraic function. Then there exist θ ∈ (0, 1] and 0 < c < ∞ such that F(x) satisfies the HEB on any ξ-sublevel set.

Example: Most optimization problems in machine learning with an objective that consists of an empirical loss that is semi-algebraic (e.g., hinge loss, squared hinge loss, absolute loss, square loss) and a norm regularization ‖·‖_p (p ≥ 1 a rational) or a norm constraint are proper, coercive, lower semicontinuous and semi-algebraic functions.

The next two propositions exhibit the value of θ for piecewise convex quadratic functions and piecewise convex polynomial functions.

Proposition 2. [10] Let F(x) be a piecewise convex quadratic function on R^d. Suppose F(x) is convex. 
Then for any ξ > 0, there exists 0 < c < ∞ such that D(x, Ω*) ≤ c(F(x) − F*)^{1/2}, ∀x ∈ S_ξ.

Many problems in machine learning are piecewise convex quadratic functions, which will be discussed more in Section 5.

Proposition 3. [10] Let F(x) be a piecewise convex polynomial function on R^d. Suppose F(x) is convex. Then for any ξ > 0, there exists c > 0 such that D(x, Ω*) ≤ c(F(x) − F*)^{1/((deg(F)−1)^d + 1)}, ∀x ∈ S_ξ.

Algorithm 2: restarting APG (rAPG)
  Input: the number of stages K and x₀ ∈ Ω
  for k = 1, . . . , K do
    Set y₁^k = x_{k−1} and x₁^k = x_{k−1}
    for τ = 1, . . . , t_k do
      Update x_{τ+1}^k = P_{g/L}(y_τ^k − ∇f(y_τ^k)/L)
      Update y_{τ+1}^k = x_{τ+1}^k + (τ/(τ+3))(x_{τ+1}^k − x_τ^k)
    Let x_k = x^k_{t_k+1} and update t_k
  Output: x_K

Indeed, for a polyhedrally constrained convex polynomial, we have a tighter result, as shown below.

Proposition 4. [27] Let F(x) be a convex polynomial function on R^d with degree m. If P ⊂ R^d is a polyhedral set, then the problem min_{x∈P} F(x) admits a global error bound: ∀x ∈ P there exists 0 < c < ∞ such that

    D(x, Ω*) ≤ c [ (F(x) − F*) + (F(x) − F*)^{1/m} ].    (6)

From the global error bound (6), one can easily derive the HEB condition (3). 
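Algorithm 2's restarting scheme can be sketched in a few lines of Python. This is an illustrative skeleton only (the `grad_f`/`prox_g` callables are our own assumptions, not from the paper): each stage runs the simplest APG variant for t_k steps and then restarts the momentum from the last iterate.

```python
import math
import numpy as np

def rapg(grad_f, prox_g, x0, L, c, theta, eps0, K):
    """Sketch of rAPG. prox_g(u, eta) computes P_{eta g}(u); each stage
    runs t_k = ceil(2 c sqrt(L) eps_{k-1}^{theta - 1/2}) APG steps and
    the target accuracy eps is halved after every stage."""
    x = x0.copy()
    eps = eps0
    for _ in range(K):
        t_k = math.ceil(2.0 * c * math.sqrt(L) * eps ** (theta - 0.5))
        y = x.copy()
        x_prev = x.copy()
        for tau in range(1, t_k + 1):
            x_new = prox_g(y - grad_f(y) / L, 1.0 / L)
            y = x_new + (tau / (tau + 3.0)) * (x_new - x_prev)  # momentum
            x_prev = x_new
        x = x_prev          # restart momentum from the last iterate
        eps = eps / 2.0
    return x
```

For θ = 1/2 (the QGC case) t_k is a constant, recovering a fixed restart period; note that this sketch still presupposes the constant c, which is exactly the dependence the paper's adaAGC method removes.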
As an example, the ℓ1-constrained ℓp norm regression below [19] satisfies the HEB condition (3) with θ = 1/p:

    min_{‖x‖₁ ≤ s} F(x) ≜ (1/n) Σ_{i=1}^n (a_iᵀx − b_i)^p,  p ∈ 2N.    (7)

Many previous papers have considered a family of structured smooth composite functions F(x) = h(Ax) + g(x), where g(x) is a polyhedral function and h(·) is a smooth and strongly convex function on any compact set. Suppose the optimal set of the above problem is non-empty and compact (e.g., the function is coercive), so that the sublevel set S_ξ is as well; then it can be shown that such a function satisfies HEB with θ = 1/2 on any sublevel set S_ξ [15, Theorem 10]. Examples of h(u) include the logistic loss h(u) = Σ_i log(1 + exp(−u_i)) and the square loss h(u) = ‖u‖₂².

Finally, we note that there exist problems that admit HEB with θ > 1/2. A trivial example is given by F(x) = (1/2)‖x‖₂² + ‖x‖_p^p with p ∈ [1, 2), which satisfies HEB with θ = 1/p ∈ (1/2, 1]. An interesting non-trivial family of problems is that where f(x) = 0 and g(x) is a piecewise linear function, according to Proposition 3. PG or APG applied to such a family of problems is closely related to the proximal point algorithm [20]. Exploration of such algorithmic connections is not the focus of this paper.

3 PG and restarting APG under HEB

As a warm-up and motivation for the major contribution presented in the next section, we present a convergence result of PG and a restarting APG under the HEB condition. The analysis is mostly straightforward and is included in the supplement. We first present a result of PG using the update (4).

Theorem 1. Suppose F(x₀) − F* ≤ ε₀ and F(x) satisfies HEB on S_{ε₀}. 
The iteration complexity of PG with option I (which returns the last solution, see the supplementary material) for achieving F(x_t) − F* ≤ ε is O(c²L ε₀^{2θ−1}) if θ > 1/2, and is O(max{c²L/ε^{1−2θ}, c²L log(ε₀/ε)}) if θ ≤ 1/2.

Next, we show that APG can be made adaptive to HEB by periodically restarting it given c and θ. This is similar to [15] under the QGC. The steps of restarting APG (rAPG) are presented in Algorithm 2, where we employ the simplest variant of APG.

Theorem 2. Suppose F(x₀) − F* ≤ ε₀ and F(x) satisfies HEB on S_{ε₀}. By running Algorithm 2 with K = ⌈log₂(ε₀/ε)⌉ and t_k = ⌈2c√L ε_{k−1}^{θ−1/2}⌉, we have F(x_K) − F* ≤ ε. The iteration complexity of rAPG is O(c√L ε₀^{θ−1/2}) if θ > 1/2, and if θ ≤ 1/2 it is O(max{c√L/ε^{1/2−θ}, c√L log(ε₀/ε)}).

From Algorithm 2, we can see that rAPG requires knowledge of c besides θ to restart APG. However, for many problems of interest, the value of c is unknown, which makes rAPG impractical. To address this issue, we propose to use the magnitude of the proximal gradient as a measure for restart and termination. It is worth mentioning the difference between the development in this paper and previous studies. Previous work [16, 11] considered strongly convex optimization problems where the strong convexity parameter is unknown, and also used the magnitude of the proximal gradient as a measure for restart and termination. However, in order to achieve faster convergence under the HEB condition without strong convexity, we have to introduce a novel technique of adaptive regularization that adapts to the HEB. With a novel synthesis of the adaptive regularization and a conditional restarting scheme that searches for c, we are able to develop practical adaptive accelerated gradient methods. We also notice a recent work [6] that proposed unconditionally restarted accelerated gradient methods under the QGC. Their restart of APG/FISTA does not involve evaluation of the gradient or the objective value, but rather depends on a restarting frequency parameter and a convex combination parameter for computing the restarting solution, which can be set based on a rough estimate of the strong convexity parameter. As a result, their linear convergence (established for the distance of solutions to the optimal set) heavily depends on that rough estimate of the strong convexity parameter.

Before diving into the details of the proposed algorithm, we first present a variant of PG as a baseline for comparison, motivated by [18] for smooth problems, which enjoys a faster convergence than vanilla PG in terms of the proximal gradient's norm. The idea is to return a solution that achieves the minimum magnitude of the proximal gradient, i.e., min_{1≤τ≤t} ‖G(x_τ)‖₂. The convergence of min_{1≤τ≤t} ‖G(x_τ)‖₂ under HEB is presented in the following theorem.

Theorem 3. Suppose F(x₀) − F* ≤ ε₀ and F(x) satisfies HEB on S_{ε₀}. 
The iteration complexity of PG (option II, which returns the solution with historically minimal proximal gradient, see the supplementary material) for achieving min_{1≤τ≤t} ‖G(x_τ)‖₂ ≤ ε is O(c^{1/(1−θ)} L max{1/ε^{(1−2θ)/(1−θ)}, log(ε₀/ε)}) if θ ≤ 1/2, and is O(c²L ε₀^{2θ−1}) if θ > 1/2.

The final theorem in this section summarizes an o(1/t) convergence result of PG for minimizing a proper, coercive, convex, lower semicontinuous and semi-algebraic function, which could be interesting in its own right.

Theorem 4. Let F(x) be a proper, coercive, convex, lower semicontinuous and semi-algebraic function. Then PG (with option I and option II) converges at a speed of o(1/t) for F(x) − F* and G(x), respectively, where t is the total number of iterations.

Remark: This can be easily proved by combining Proposition 1 and Theorems 1 and 3.

4 Adaptive Accelerated Gradient Converging Methods

We first present a key lemma for our development that serves as the foundation of the adaptive regularization and conditional restarting.

Lemma 1. Assume F(x) satisfies HEB for any x ∈ S_ξ with θ ∈ (0, 1]. If θ ∈ (0, 1/2], then for any x ∈ S_ξ, we have D(x, Ω*) ≤ (2/L)‖G(x)‖₂ + 2^{θ/(1−θ)} c^{1/(1−θ)} ‖G(x)‖₂^{θ/(1−θ)}. If θ ∈ (1/2, 1], then for any x ∈ S_ξ, we have D(x, Ω*) ≤ (2/L + 2c²ξ^{2θ−1})‖G(x)‖₂.

A building block of the proposed algorithm is to solve a problem of the following style by employing Algorithm 1 (i.e., Nesterov's ADG):

    F_δ(x) = F(x) + (δ/2)‖x − x₀‖₂² = f(x) + g(x) + (δ/2)‖x − x₀‖₂²,    (8)

which consists of an L-smooth function f(x) and a δ-strongly convex function g_δ(x) = g(x) + (δ/2)‖x − x₀‖₂². A key result for our development of conditional restarting is the following theorem for each call of Algorithm 1 for solving the above problem.

Theorem 5. By running Algorithm 1 for minimizing f(x) + g_δ(x) with an initial solution x₀, for t ≥ √(L/(2δ)) log(L/δ) we have

    ‖G(x_{t+1})‖₂ ≤ √(L(L + δ)) ‖x₀ − x*‖₂ [1 + √(δ/(2L))]^{−t} + 2√(2δ) ‖x₀ − x*‖₂,

where x* is any optimal solution to the original problem.

Finally, we present the proposed adaptive accelerated gradient converging (adaAGC) method for solving the smooth composite optimization in Algorithm 3 and prove the main theorem of this section.

Algorithm 3: adaAGC for solving (1)
  Input: x₀ ∈ Ω, c₀ and γ > 1
  Let c_e = c₀ and ε₀ = ‖G(x₀)‖₂
  for k = 1, . . . , K do
    for s = 1, . . . do
      Let δ_k be given in (9) and g_{δ_k}(x) = g(x) + (δ_k/2)‖x − x_{k−1}‖₂²
      A₀ = 0, v₀ = x_{k−1}, x₀^k = x_{k−1}
      for t = 0, . . . do
        Let a_{t+1} be the root of a²/(A_t + a) = 2(1 + δ_k A_t)/L
        Set A_{t+1} = A_t + a_{t+1}
        Set y_t = (A_t/A_{t+1}) x_t^k + (a_{t+1}/A_{t+1}) v_t
        Compute x_{t+1}^k = P_{g_{δ_k}/L}(y_t − ∇f(y_t)/L)
        Compute v_{t+1} = arg min_x Σ_{τ=1}^{t+1} a_τ ∇f(x_τ^k)ᵀx + A_{t+1} g_{δ_k}(x) + (1/2)‖x − x_{k−1}‖₂²
        if ‖G(x_{t+1}^k)‖₂ ≤ ε_{k−1}/2 then                          // step S1
          let x_k = x_{t+1}^k and ε_k = ε_{k−1}/2; break the enclosing two for loops
        if t = ⌈√(2L/δ_k) log(√(L(L + δ_k))/δ_k)⌉ then                // condition (*)
          let c_e = γ c_e and break the enclosing for loop             // step S2
  Output: x_K

The adaAGC runs in multiple stages (k = 1, . . . , K). We start with an initial guess c₀ of the parameter c in the HEB. With the current guess c_e of c, at the k-th stage adaAGC employs ADG to solve a problem of the form (8) with an adaptive regularization parameter δ_k given by

    δ_k = min( L/32, ε_{k−1}^{(1−2θ)/(1−θ)} / (16 c_e^{1/(1−θ)}) )   if θ ∈ (0, 1/2],
    δ_k = min( L/32, 1/(32 c_e² ε₀^{2θ−1}) )                          if θ ∈ (1/2, 1].    (9)

The condition (*) specifies the condition for restarting with an increased value of c_e. When the flow enters step S2 before step S1 for some s, it means that the current guess c_e is not sufficiently large according to Theorem 5 and Lemma 1; then we increase c_e and repeat the same process (the next iteration over s). We refer to this machinery as conditional restarting. We present the main result of this section in the following theorem.

Theorem 6. 
Suppose F(x_0) − F_* ≤ ϵ_0, F(x) satisfies the HEB on S_{ϵ_0}, and c_0 ≤ c. Let ε_0 = ||G(x_0)||_2, K = ⌈log_2(ε_0/ϵ)⌉, and p = (1 − 2θ)/(1 − θ) for θ ∈ (0, 1/2]. The iteration complexity of Algorithm 3 for having ||G(x_K)||_2 ≤ ϵ is Õ( √L c^{1/(2(1−θ))} max(1/ϵ^{p/2}, log(ε_0/ϵ)) ) if θ ∈ (0, 1/2], and Õ( √L c ε_0^{θ−1/2} log(ε_0/ϵ) ) if θ ∈ (1/2, 1], where Õ(·) suppresses a log term depending on c, c_0, L, γ.

We sketch the idea of the proof here: for each k, we can bound the number of cycles (indexed by s in the algorithm) needed to enter step S1, denoted by s_k. We can bound s_k ≤ log_γ(c/c_0) + 1, and then the total number of iterations across all stages is bounded by Σ_{k=1}^K s_k t_k, where t_k = ⌈√(2L/δ_k) log(√(L(L + δ_k))/δ_k)⌉.

Before ending this section, we remark that if the smoothness parameter L is unknown, one can also employ the backtracking technique, paired with each update, to search for L [17].

4.1 Convergence of Objective Gap

In this subsection, we show that convergence of the proximal gradient also implies convergence of the objective gap F(x) − F_* for certain subclasses of the general problems we have considered. Our first result applies to the case when F(x) satisfies the HEB with θ ∈ (0, 1) and the nonsmooth part g(x) is absent, i.e., F(x) = f(x). In this case, we can establish convergence of the objective gap, since the objective gap can be bounded by a function of the magnitude of the gradient, i.e., f(x) − f_* ≤ c^{1/(1−θ)} ||∇f(x)||_2^{1/(1−θ)} (c.f. the proof of Lemma 2 in the supplement). One can easily prove the following result.

Theorem 7. Assume F(x) = f(x) and the same conditions as in Theorem 6 hold. The iteration complexity of Algorithm 3 for having F(x_K) − F(x_*) ≤ ϵ is Õ( √L c max(1/ϵ^{1/2−θ}, log(ε_0/ϵ)) ) if θ ∈ (0, 1/2], and Õ( √L c ε_0^{θ−1/2} log(ε_0/ϵ) ) if θ ∈ (1/2, 1), where Õ(·) suppresses a log term depending on c, c_0, L, γ.

Remark: Note that the above iteration complexity of adaAGC is the same as that of rAPG (shown in Table 1), where the latter is established with knowledge of c.

Our second result applies to a subclass of the general problems where either g(x) or f(x) is µ-strongly convex, or F(x) = f(x) + g(x), where f(x) = h(Ax) with h(·) being a strongly convex function and g(x) the indicator function of a polyhedral set Ω = {x : Cx ≤ b}. Examples include square loss minimization under an ℓ1 or ℓ∞ constraint [15, Theorem 8]. It has been shown that in the last case, for any x ∈ dom(F), there exists µ > 0 such that

    f(x_*) ≥ f(x) + ∇f(x)^T (x_* − x) + (µ/2)||x − x_*||_2^2,        (10)

where x_* is the closest optimal solution to x, and the HEB condition of F(x) with θ = 1/2 and c = √(2/µ) holds [15, Theorem 1]. In the three cases mentioned above, we can establish that F(x^+) − F_* ≤ O(1/µ)||G(x)||_2^2, where x^+ = P_{g/L}(x − ∇f(x)/L), and the following result.

Theorem 8. Assume f(x) or g(x) is µ-strongly convex, or f(x) = h(Ax) and g(x) is the indicator function of a polyhedral set such that (10) holds for some µ > 0, and the other conditions of Theorem 6 hold. The iteration complexity of Algorithm 3 for having F(x^+_K) − F(x_*) ≤ ϵ is Õ( √(L/µ) log(ε_0/(√µ ϵ)) ), where Õ(·) suppresses a log term depending on µ, c_0, L, γ.

5 Applications and Experiments

In this section, we present some applications of our theorems and algorithms in machine learning. In particular, we consider regularized problems with a smooth loss:

    min_{x ∈ R^d} (1/n) Σ_{i=1}^n ℓ(x^T a_i, b_i) + λ R(x),        (11)

where (a_i, b_i), i = 1, . . . , n, denote a set of training examples, and R(x) could be the ℓ1 norm ||x||_1, the ℓ∞ norm ||x||_∞, a huber norm [28], or the ℓ_{1,p} norm Σ_{k=1}^K ||x^k||_p, where x^k is the k-th component vector of x. Next, we present several results about the HEB condition that cover a broad family of loss functions enjoying the faster convergence of adaAGC.

Corollary 1. Assume the loss function ℓ(z, b) is nonnegative, convex, smooth, and piecewise quadratic. Then the problems in (11) with ℓ1 norm, ℓ∞ norm, huber norm, or ℓ_{1,∞} norm regularization satisfy the HEB condition with θ = 1/2 on any sublevel set S_ξ with ξ > 0. Hence adaAGC has a global linear convergence in terms of the proximal gradient's norm and a square-root dependence on the condition number.

Remark: The above corollary follows directly from Proposition 2 and Theorem 6. If the loss function is a logistic loss and the regularizer is a polyhedral function (e.g., the ℓ1, ℓ∞ and ℓ_{1,∞} norms), we can prove the same result.
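Corollary 1 measures progress by the proximal gradient's norm, the same quantity adaAGC uses for restart and termination. For the ℓ1-regularized case this quantity is cheap to evaluate, since the proximal operator of λ||·||_1 is soft-thresholding. A minimal sketch (the helper names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_mapping(x, grad_f, L, lam):
    """G(x) = L * (x - P_{g/L}(x - grad_f(x)/L)) for g = lam * ||.||_1;
    the restart/termination tests monitor ||G(x)||_2."""
    x_plus = soft_threshold(x - grad_f(x) / L, lam / L)
    return L * (x - x_plus)

# Toy instance: f(x) = 0.5 * ||A x - b||^2 (square loss) with l1 regularizer.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2        # smoothness constant: largest singular value squared
g_norm = np.linalg.norm(prox_grad_mapping(np.zeros(5), grad_f, L, lam=0.1))
```

At a minimizer the mapping vanishes (e.g., with λ ≥ ||A^T b||_∞ the origin is optimal and G(0) = 0), so ||G(x)||_2 is a computable surrogate for optimality even when the objective gap is not.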
Examples of convex, smooth, and piecewise quadratic loss functions include: the square loss ℓ(z, b) = (z − b)^2 for b ∈ R; the squared hinge loss ℓ(z, b) = max(0, 1 − bz)^2 for b ∈ {1, −1}; and the huber loss ℓ(z, b) = ρ(|z − b| − ρ/2) if |z − b| > ρ, and ℓ(z, b) = (z − b)^2/2 if |z − b| ≤ ρ, for b ∈ R.

Experimental Results  We conduct experiments to demonstrate the effectiveness of adaAGC for solving problems of type (1). Specifically, we compare adaAGC, PG with option II (which returns the solution with the historically minimal proximal gradient), FISTA, and unconditionally restarting FISTA (urFISTA) [6] for optimizing the squared hinge loss (classification), the square loss (regression), and the huber loss with ρ = 1 (regression), each with ℓ1 and ℓ∞ regularization — all cases of (11) — and we also consider the ℓ1 constrained ℓp norm regression (7) with varying p.
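The three losses above transcribe directly into code; in particular, the two branches of the huber loss meet with matching value and slope at |z − b| = ρ, which is the smoothness Corollary 1 relies on. A small sketch (helper names are ours, not from the paper):

```python
import numpy as np

def square_loss(z, b):
    """Square loss (z - b)^2, b in R."""
    return (z - b) ** 2

def squared_hinge_loss(z, b):
    """Squared hinge max(0, 1 - b*z)^2, b in {1, -1}; squaring removes the
    hinge's kink, so the loss is smooth and piecewise quadratic."""
    return np.maximum(0.0, 1.0 - b * z) ** 2

def huber_loss(z, b, rho=1.0):
    """Huber loss: (z - b)^2 / 2 when |z - b| <= rho, and the linear
    continuation rho * (|z - b| - rho/2) otherwise."""
    r = np.abs(z - b)
    return np.where(r <= rho, 0.5 * r ** 2, rho * (r - 0.5 * rho))
```

Both huber branches evaluate to ρ^2/2 at the junction, so the loss is continuously differentiable there, unlike the absolute loss it robustifies.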
We use three datasets from the LibSVM website [5]: splice (n = 1000, d = 60) for classification, and bodyfat (n = 252, d = 14) and cpusmall (n = 8192, d = 12) for regression. For problems covered by (11), we fix λ = 1/n, and the parameter s in (7) is set to s = 100. We use backtracking in PG, adaAGC and FISTA to search for the smoothness parameter. In adaAGC, we set c_0 = 2, γ = 2 for the ℓ1 constrained ℓp norm regression and c_0 = 10, γ = 2 for the remaining problems. For fairness, urFISTA and adaAGC use the same initial estimate of the unknown parameter (i.e., c). Each algorithm starts from the same initial point (the zero vector), and we stop each algorithm when the norm of its proximal gradient falls below a prescribed threshold ϵ, reporting the total number of proximal mappings.

Table 2: squared hinge loss with ℓ1 norm (left) and ℓ∞ norm (right) regularization on splice data
Algorithm | ϵ=10^-4  10^-5  10^-6  10^-7 | ϵ=10^-4  10^-5  10^-6  10^-7
PG        |    2040   2040   2040   2040 |    3514   3724   3724   3724
FISTA     |    1289   1289   1289   1289 |    5526   5526   5526   5526
urFISTA   |    1666   2371   2601   3480 |    1674   2379   2605   3488
adaAGC    |    1410   1410   1410   1410 |    2382   2382   2382   2382
(fastest to slowest: FISTA > adaAGC > PG > urFISTA | adaAGC > urFISTA > PG > FISTA)

Table 3: square loss with ℓ1 norm (left) and ℓ∞ norm (right) regularization on cpusmall data
Algorithm | ϵ=10^-4  10^-5  10^-6  10^-7 | ϵ=10^-4  10^-5  10^-6  10^-7
PG        |  139505 159908 170915 170915 |  109298 204120 210874 210874
FISTA     |    6610  16387  23779  23779 |    6781  16418  20082  20082
urFISTA   |   18276  26706  35173  43603 |   18278  26704  35169  43601
adaAGC    |    9881  12623  13575  13575 |    9571  13033  13632  13632
(fastest to slowest: adaAGC > FISTA > urFISTA > PG | adaAGC > FISTA > urFISTA > PG)

Table 4: ℓ1 regularized huber loss (left) and ℓ1 constrained square loss (right) on bodyfat data
Algorithm | ϵ=10^-4  10^-5  10^-6  10^-7 | ϵ=10^-4   10^-5   10^-6   10^-7
PG        |  258723 423181 602043 681488 | 1006880 1768482 2530085 2632578
FISTA     |    6630  25020  74416 124261 |   15805   66319  180977  181176
urFISTA   |    6855  12662  17994  23933 |  138359  235081  331203  426341
adaAGC    |   16976  16980  23844  25697 |   23054   33818   44582   48127
(fastest to slowest: urFISTA > adaAGC > FISTA > PG | adaAGC > FISTA > urFISTA > PG)

Table 5: ℓ1 constrained ℓp norm regression on bodyfat data (ϵ = 10^-3)
Algorithm |       p = 2 |          p = 4 |          p = 6 |           p = 8
PG        |  250869 (1) | 979401 (3.90)  | 1559753 (6.22) | 4015665 (16.00)
adaAGC    |    8710 (1) |  17494 (2.0)   |   22481 (2.58) |    33081 (3.80)

The results are presented in Tables 2–5. They indicate that adaAGC converges faster than PG and FISTA (except for solving the squared hinge loss with ℓ1 norm regularization) when ϵ is very small, which is consistent with the theoretical results. Note that urFISTA sometimes outperforms adaAGC but is worse than adaAGC in most cases. It is notable that for some problems (see Table 2) the number of proximal mappings is the same for different precisions ϵ. This is because that value is the minimum number of proximal mappings after which the magnitude of the proximal gradient suddenly becomes zero.
In Table 5, the numbers in parentheses indicate the increase factor in the number of proximal mappings relative to the base case p = 2. The increase factors of adaAGC are approximately the square root of those of PG, which is consistent with our theory.

6 Conclusions

In this paper, we have considered smooth composite optimization problems under a general Hölderian error bound condition. We have established iteration complexities of the proximal gradient and accelerated proximal gradient methods that adapt to the Hölderian error bound condition. To eliminate the dependence on the unknown parameter in the error bound condition while enjoying the faster convergence of the accelerated proximal gradient method, we have developed a novel parameter-free adaptive accelerated gradient converging method that uses the magnitude of the (proximal) gradient as a measure for restart and termination. We have also considered a broad family of norm-regularized problems in machine learning and shown faster convergence of the proposed adaptive accelerated gradient converging method.

Acknowledgments  We thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by the National Science Foundation (IIS-1463988, IIS-1545995).

References
[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2:183–202, 2009.
[2] E. Bierstone and P. D. Milman. Semianalytic and subanalytic sets. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 67(1):5–42, 1988.
[3] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.
[4] D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. CoRR, abs/1602.06661, 2016.
[5] R.-E.
Fan and C.-J. Lin. LIBSVM data: classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets, 2011.
[6] O. Fercoq and Z. Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. CoRR, abs/1609.07358, 2016.
[7] P. Gong and J. Ye. Linear convergence of variance-reduced projected stochastic gradient without strong convexity. CoRR, abs/1406.1102, 2014.
[8] K. Hou, Z. Zhou, A. M. So, and Z. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In Advances in Neural Information Processing Systems (NIPS), pages 710–718, 2013.
[9] H. Karimi, J. Nutini, and M. W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML-PKDD), pages 795–811, 2016.
[10] G. Li. Global error bounds for piecewise convex polynomials. Math. Program., 137(1-2):37–64, 2013.
[11] Q. Lin and L. Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 73–81, 2014.
[12] Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
[13] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.
[14] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46:157–178, 1993.
[15] I. Necoara, Y. Nesterov, and F. Glineur.
Linear convergence of first-order methods for non-strongly convex optimization. CoRR, abs/1504.06298, 2015.
[16] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publ., 2004.
[17] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE discussion papers, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.
[18] Y. Nesterov. How to make the gradients small. Optima 88, 2012.
[19] H. Nyquist. The optimal Lp norm estimator in linear regression models. Communications in Statistics - Theory and Methods, 12(21):2511–2524, 1983.
[20] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. on Control and Optimization, 14, 1976.
[21] A. M. So. Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. CoRR, abs/1309.0113, 2013.
[22] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.
[23] P. Wang and C. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15(1):1523–1548, 2014.
[24] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In International Conference on Machine Learning (ICML), pages 3821–3830, 2017.
[25] Y. Xu, Y. Yan, Q. Lin, and T. Yang. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ϵ). In Advances in Neural Information Processing Systems 29 (NIPS), pages 1208–1216, 2016.
[26] T. Yang and Q. Lin. RSG: Beating subgradient method without smoothness and strong convexity. CoRR, abs/1512.03107, 2016.
[27] W. H. Yang. Error bounds for convex polynomials. SIAM Journal on Optimization, 19(4):1633–1647, 2009.
[28] O.
Zadorozhnyi, G. Benecke, S. Mandt, T. Scheffer, and M. Kloft. Huber-norm regularization for linear prediction models. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML-PKDD), pages 714–730, 2016.
[29] H. Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. CoRR, abs/1606.00269, 2016.
[30] H. Zhang. The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optimization Letters, pages 1–17, 2016.
[31] Z. Zhou and A. M. So. A unified approach to error bounds for structured convex optimization problems. CoRR, abs/1512.03518, 2015.
[32] Z. Zhou, Q. Zhang, and A. M. So. ℓ1,p-norm regularization: Error bounds and convergence rate analysis of first-order methods. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1501–1510, 2015.