{"title": "Linear Convergence with Condition Number Independent Access of Full Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 980, "page_last": 988, "abstract": "For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is $O(\\sqrt{\\kappa}\\log 1/\\epsilon)$, where $\\kappa$ is the conditional number. In the case that the optimization problem is ill-conditioned, we need to evaluate a larger number of full gradients, which could be computationally expensive. In this paper, we propose to reduce the number of full gradient required by allowing the algorithm to access the stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use an combination of the gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descents, we are able to improve the sub-optimality of the solution by a constant factor, and thus achieve a linear convergence rate. 
Theoretical analysis shows that EMGD is able to find an $\epsilon$-optimal solution by computing $O(\log 1/\epsilon)$ full gradients and $O(\kappa^2\log 1/\epsilon)$ stochastic gradients.", "full_text": "Linear Convergence with Condition Number Independent Access of Full Gradients\n\nLijun Zhang, Mehrdad Mahdavi, Rong Jin\nDepartment of Computer Science and Engineering\nMichigan State University, East Lansing, MI 48824, USA\n{zhanglij,mahdavim,rongjin}@msu.edu\n\nAbstract\n\nFor smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is O(√κ log(1/ϵ)), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log(1/ϵ)) full gradients and O(κ² log(1/ϵ)) stochastic gradients.\n\n1 Introduction\n\nConvex optimization has become a tool central to many areas of engineering and applied sciences, such as signal processing [20] and machine learning [24]. The problem of convex optimization is typically given as\n\nmin_{w∈W} F(w),\n\nwhere W is a convex domain, and F(·) is a convex function. 
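As a concrete, entirely illustrative instance of this setup (not from the paper): projected gradient descent on a hypothetical strongly convex quadratic over an ℓ2 ball. The problem data, radius, step size, and iteration count below are all assumptions chosen for the example.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the convex domain W = {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

# Toy instance: F(w) = 0.5 * w^T A w - b^T w over W = {||w|| <= 10}.
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 10.0, 5))  # strong convexity 1, smoothness 10
b = rng.standard_normal(5)

w = np.zeros(5)
for _ in range(200):
    grad = A @ w - b                           # full gradient of F
    w = project_l2_ball(w - 0.1 * grad, 10.0)  # step with 1/L, then project

gap = np.linalg.norm(w - np.linalg.solve(A, b))  # distance to the minimizer
```

Since the unconstrained minimizer lies inside the ball here, the iterates converge to it linearly, as the smooth and strongly convex row of Table 1 predicts.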
In most cases, the optimization algorithm for solving the above problem is an iterative process, and the convergence rate is characterized by the iteration complexity, i.e., the number of iterations needed to find an ϵ-optimal solution [3,17]. In this study, we focus on first order methods, where we only have access to the (stochastic) gradient of the objective function. For most convex optimization problems, the iteration complexity of an optimization algorithm depends on the following two factors.\n\n1. The analytical properties of the objective function. For example, is F(·) smooth or strongly convex?\n\n2. The information that can be elicited about the objective function. For example, do we have access to the full gradient or the stochastic gradient of F(·)?\n\nThe optimal iteration complexities for some popular combinations of the above two factors are summarized in Table 1 and elaborated in the related work section. We observe that when the objective function is smooth (and strongly convex), the convergence rate for full gradients is much faster than that for stochastic gradients. On the other hand, the evaluation of a stochastic gradient is usually significantly more efficient than that of a full gradient. Thus, replacing full gradients with stochastic gradients essentially trades a larger number of iterations for a lower computational cost per iteration.\n\nTable 1: The optimal iteration complexity of convex optimization. L and λ are the moduli of smoothness and strong convexity, respectively. 
κ = L/λ is the condition number.\n\n                     | Lipschitz continuous | Smooth   | Smooth & Strongly Convex\nFull Gradient        | O(1/ϵ²)              | O(L/√ϵ)  | O(√κ log(1/ϵ))\nStochastic Gradient  | O(1/ϵ²)              | O(1/ϵ²)  | O(1/(λϵ))\n\nIn this work, we consider the case when the objective function is both smooth and strongly convex, where the optimal iteration complexity is O(√κ log(1/ϵ)) if the optimization method is first order and has access to the full gradients [17]. For optimization problems that are ill-conditioned, the condition number κ can be very large, leading to many evaluations of full gradients, an operation that is computationally expensive for large data sets. To reduce the computational cost, we are interested in the possibility of making the number of full gradients required independent from κ. Although the O(√κ log(1/ϵ)) rate is in general not improvable for any first order method, we bypass this difficulty by allowing the algorithm to have access to both full and stochastic gradients. Our objective is to reduce the iteration complexity from O(√κ log(1/ϵ)) to O(log(1/ϵ)) by replacing most of the evaluations of full gradients with evaluations of stochastic gradients. Under the assumption that stochastic gradients can be computed efficiently, this tradeoff could lead to a significant improvement in computational efficiency.\n\nTo this end, we develop a novel optimization algorithm named Epoch Mixed Gradient Descent (EMGD). It divides the optimization process into a sequence of epochs, an idea that is borrowed from the epoch gradient descent [9]. At each epoch, the proposed algorithm performs mixed gradient descent by evaluating one full gradient and O(κ²) stochastic gradients. 
It achieves a constant reduction in the optimization error in every epoch, leading to a linear convergence rate. Our analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log(1/ϵ)) full gradients and O(κ² log(1/ϵ)) stochastic gradients. In other words, with the help of stochastic gradients, the number of full gradients required is reduced from O(√κ log(1/ϵ)) to O(log(1/ϵ)), independent from the condition number.\n\n2 Related Work\n\nDuring the last three decades, there have been significant advances in convex optimization [3,15,17]. In this section, we provide a brief review of first order optimization methods.\n\nWe first discuss deterministic optimization, where the gradient of the objective function is available. For the general convex and Lipschitz continuous optimization problem, the iteration complexity of gradient (subgradient) descent is O(1/ϵ²), which is optimal up to constant factors [15]. When the objective function is convex and smooth, the optimal optimization scheme is the accelerated gradient descent developed by Nesterov, whose iteration complexity is O(L/√ϵ) [16, 18]. With slight modifications, the accelerated gradient descent algorithm can also be applied to optimize a smooth and strongly convex objective function, for which its iteration complexity is O(√κ log(1/ϵ)) and is in general not improvable [17, 19]. The objective of our work is to reduce the number of accesses to the full gradients by exploiting the availability of stochastic gradients.\n\nIn stochastic optimization, we have access to the stochastic gradient, which is an unbiased estimate of the full gradient [14]. 
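For intuition (an illustration, not from the paper): with a finite-sum least-squares objective, the gradient of one uniformly sampled component is exactly such an unbiased estimate, and averaging many of them recovers the full gradient. The data, sizes, and seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def full_gradient(w):
    """Gradient of F(w) = (1/(2n)) * ||X w - y||^2: one pass over all n points."""
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w):
    """Gradient of one uniformly sampled component: its expectation is full_gradient(w)."""
    i = rng.integers(n)
    return X[i] * (X[i] @ w - y[i])

w = np.zeros(d)
# Averaging many stochastic gradients approaches the full gradient (law of large numbers).
avg = np.mean([stochastic_gradient(w) for _ in range(20000)], axis=0)
err = np.linalg.norm(avg - full_gradient(w))  # small relative to the gradient norm
```

The point of the tradeoff discussed above is that each `stochastic_gradient` call touches one data point, while each `full_gradient` call touches all n.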
Similar to the case in deterministic optimization, if the objective function is convex and Lipschitz continuous, stochastic gradient (subgradient) descent is the optimal algorithm and the iteration complexity is also O(1/ϵ²) [14, 15]. When the objective function is λ-strongly convex, the algorithms proposed in very recent works [9, 10, 21, 26] achieve the optimal O(1/(λϵ)) iteration complexity [1]. Since the convergence rate of stochastic optimization is dominated by the randomness in the gradient [6, 11], smoothness usually does not lead to a faster convergence rate for stochastic optimization. A variant of stochastic optimization is the “semi-stochastic” approximation, which interleaves stochastic gradient descent and full gradient descent [12]. In the strongly convex case, if the stochastic gradients are taken at a decreasing rate, the convergence rate can be improved to approach O(1/(λ√ϵ)) [13].\n\nFrom the above discussion, we observe that the iteration complexity in stochastic optimization is polynomial in 1/ϵ, making it difficult to find high-precision solutions. However, when the objective function is strongly convex and can be written as a sum of a finite number of functions, i.e.,\n\nF(w) = (1/n) Σ_{i=1}^{n} fi(w),    (1)\n\nwhere each fi(·) is smooth, the iteration complexity of some specific algorithms may exhibit a logarithmic dependence on 1/ϵ, i.e., a linear convergence rate. Two very recent examples are the stochastic average gradient (SAG) [22], whose iteration complexity is O(n log(1/ϵ)), provided n ≥ 8κ, and the stochastic dual coordinate ascent (SDCA) [23], whose iteration complexity is O((n + κ) log(1/ϵ)).¹ Under appropriate conditions, the incremental gradient method [2] and the hybrid method [5] can also minimize the function in (1) with a linear convergence rate. 
But those algorithms usually treat one pass over all fi’s (or a subset of the fi’s) as one iteration, and thus have a high computational cost per iteration.\n\n3 Epoch Mixed Gradient Descent\n\n3.1 Preliminaries\n\nIn this paper, we assume there exist two oracles.\n\n1. The first one is a gradient oracle Og, which for a given input point w ∈ W returns the gradient ∇F(w), that is, Og(w) = ∇F(w).\n\n2. The second one is a function oracle Of, each call of which returns a random function f(·) such that\n\nF(w) = Ef[f(w)], ∀w ∈ W,\n\nand f(·) is L-smooth, that is,\n\n‖∇f(w) − ∇f(w′)‖ ≤ L‖w − w′‖, ∀w, w′ ∈ W.    (2)\n\nAlthough we do not define a stochastic gradient oracle directly, the function oracle Of allows us to evaluate the stochastic gradient of F(·) at any point w ∈ W.\n\nNotice that the assumption about the function oracle Of implies that the objective function F(·) is also L-smooth. Since ∇F(w) = Ef[∇f(w)], by Jensen’s inequality, we have\n\n‖∇F(w) − ∇F(w′)‖ ≤ Ef‖∇f(w) − ∇f(w′)‖ ≤ L‖w − w′‖, ∀w, w′ ∈ W,    (3)\n\nwhere the second inequality follows from (2). Besides, we further assume F(·) is λ-strongly convex, that is,\n\n‖∇F(w) − ∇F(w′)‖ ≥ λ‖w − w′‖, ∀w, w′ ∈ W.    (4)\n\nFrom (3) and (4), it is obvious that L ≥ λ. The condition number κ is defined as the ratio between them, i.e., κ = L/λ ≥ 1.\n\n3.2 The Algorithm\n\nThe detailed steps of the proposed Epoch Mixed Gradient Descent (EMGD) are shown in Algorithm 1, where we use the superscript for the index of epochs, and the subscript for the index of iterations within each epoch. 
We denote by B(x; r) the ℓ2 ball of radius r around the point x.\n\nSimilar to the epoch gradient descent (EGD) [9], we divide the optimization process into a sequence of epochs (steps 3 to 10). While the number of accesses to the gradient oracle in EGD increases exponentially over the epochs, the number of accesses to the two oracles in EMGD is fixed.\n\n¹In order to apply SDCA, we need to assume each function fi is λ-strongly convex, so that we can rewrite fi(w) as gi(w) + (λ/2)‖w‖², where gi(w) = fi(w) − (λ/2)‖w‖² is convex.\n\nAlgorithm 1 Epoch Mixed Gradient Descent (EMGD)\nInput: step size η, the initial domain size Δ^1, the number of iterations T per epoch, and the number of epochs m\n1: Initialize w̄^1 = 0\n2: for k = 1, . . . , m do\n3:   Set w^k_1 = w̄^k\n4:   Call the gradient oracle Og to obtain ∇F(w̄^k)\n5:   for t = 1, . . . , T do\n6:     Call the function oracle Of to obtain a random function f^k_t(·)\n7:     Compute the mixed gradient as g̃^k_t = ∇F(w̄^k) + ∇f^k_t(w^k_t) − ∇f^k_t(w̄^k)\n8:     Update the solution by w^k_{t+1} = argmin_{w ∈ W ∩ B(w̄^k; Δ^k)} η⟨w − w^k_t, g̃^k_t⟩ + (1/2)‖w − w^k_t‖²\n9:   end for\n10:  Set w̄^{k+1} = (1/(T+1)) Σ_{t=1}^{T+1} w^k_t and Δ^{k+1} = Δ^k/√2\n11: end for\nReturn w̄^{m+1}\n\nAt the beginning of each epoch, we initialize the solution w^k_1 to be the average solution w̄^k obtained from the last epoch, and then call the gradient oracle Og to obtain ∇F(w̄^k). 
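The epoch structure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical least-squares finite sum, takes W = R^d so the projection in step 8 reduces to clipping onto the ball B(w̄; Δ), and the epoch/iteration counts and initial domain size are chosen for the toy problem rather than by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def full_grad(w):                  # plays the role of the gradient oracle O_g
    return X.T @ (X @ w - y) / n

def stoch_grad(w, i):              # gradient of the random component f_i from O_f
    return X[i] * (X[i] @ w - y[i])

def emgd(T=400, m=15, delta1=10.0):
    L = np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness of F
    eta = 1.0 / (L * np.sqrt(T))               # step size eta = 1/(L sqrt(T))
    w_bar, delta = np.zeros(d), delta1
    for _ in range(m):                         # one epoch: 1 full + T stochastic gradients
        g_bar = full_grad(w_bar)               # the single O_g call of this epoch
        w, w_sum = w_bar.copy(), w_bar.copy()
        for _ in range(T):
            i = rng.integers(n)                # same component f_i at both points (step 7)
            mixed = g_bar + stoch_grad(w, i) - stoch_grad(w_bar, i)
            w = w - eta * mixed                # gradient step ...
            dev = w - w_bar                    # ... projected back onto B(w_bar; delta)
            norm = np.linalg.norm(dev)
            if norm > delta:
                w = w_bar + dev * (delta / norm)
            w_sum += w
        w_bar = w_sum / (T + 1)                # average of the T+1 iterates
        delta /= np.sqrt(2)                    # shrink the domain size
    return w_bar

w_hat = emgd()
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
err = np.linalg.norm(w_hat - w_star)
```

Note how the stochastic part of the mixed gradient, `stoch_grad(w, i) - stoch_grad(w_bar, i)`, shrinks as the iterate approaches the epoch anchor, which is exactly the variance-control effect the text describes.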
At each iteration t of epoch k, we call the function oracle Of to obtain a random function f^k_t(·) and define the mixed gradient at the current solution w^k_t as\n\ng̃^k_t = ∇F(w̄^k) + ∇f^k_t(w^k_t) − ∇f^k_t(w̄^k),\n\nwhich involves both the full gradient and the stochastic gradient. The mixed gradient can be divided into two parts: the deterministic part ∇F(w̄^k) and the stochastic part ∇f^k_t(w^k_t) − ∇f^k_t(w̄^k). Due to the smoothness property of f^k_t(·) and the shrinkage of the domain size, the norm of the stochastic part is well bounded, which is the reason why our algorithm can achieve linear convergence.\n\nBased on the mixed gradient, we update w^k_t by a gradient mapping over a shrinking domain (i.e., W ∩ B(w̄^k; Δ^k)) in step 8. Since the update is similar to standard gradient descent except for the domain constraint, we refer to it as mixed gradient descent for short. At the end of each epoch k, we compute the average value of the T + 1 solutions, instead of T solutions, and reduce the domain size by a factor of √2.\n\n3.3 The Convergence Rate\n\nThe following theorem shows the convergence rate of the proposed algorithm.\n\nTheorem 1. Assume\n\nδ ≤ e^{−1/2}, T ≥ (1152L²/λ²) ln(1/δ), and Δ^1 ≥ √((2/λ)(F(0) − F(w*))).    (5)\n\nSet η = 1/[L√T]. Let w̄^{m+1} be the solution returned by Algorithm 1 after m epochs, which makes m accesses to oracle Og and mT accesses to oracle Of. 
Then, with a probability at least 1 − mδ, we have\n\nF(w̄^{m+1}) − F(w*) ≤ λ[Δ^1]²/2^{m+1}, and ‖w̄^{m+1} − w*‖² ≤ [Δ^1]²/2^m.\n\nTheorem 1 immediately implies that EMGD is able to achieve an ϵ optimization error by computing O(log(1/ϵ)) full gradients and O(κ² log(1/ϵ)) stochastic gradients.\n\nTable 2: The computational complexity for minimizing (1/n) Σ_{i=1}^{n} fi(w)\n\nNesterov's algorithm [17]: O(√κ n log(1/ϵ))\nSAG (n ≥ 8κ) [22]: O(n log(1/ϵ))\nSDCA [23]: O((n + κ) log(1/ϵ))\nEMGD: O((n + κ²) log(1/ϵ))\n\n3.4 Comparisons\n\nCompared to the optimization algorithms that rely only on full gradients [17], the number of full gradients needed in EMGD is O(log(1/ϵ)) instead of O(√κ log(1/ϵ)). Compared to the optimization algorithms that rely only on stochastic gradients [9,10,21], EMGD is more efficient since it achieves a linear convergence rate.\n\nThe proposed EMGD algorithm can also be applied to the special optimization problem considered in [22, 23], where F(w) = (1/n) Σ_{i=1}^{n} fi(w). To make quantitative comparisons, let's assume the full gradient is n times more expensive to compute than the stochastic gradient. Table 2 lists the computational complexities of the algorithms that enjoy linear convergence. As can be seen, the computational complexity of EMGD is lower than that of Nesterov's algorithm [17] as long as the condition number κ ≤ n^{2/3}, the complexity of SAG [22] is lower than Nesterov's algorithm if κ ≤ n/8, and the complexity of SDCA [23] is lower than Nesterov's algorithm if κ ≤ n².² The complexity of EMGD is on the same order as SAG and SDCA when κ ≤ n^{1/2}, but higher in other cases. 
Thus, in terms of computational cost, EMGD may not be the best one, but it has advantages in other aspects.\n\n1. Unlike SAG and SDCA, which only work for unconstrained optimization problems, the proposed algorithm works for both constrained and unconstrained optimization problems, provided that the constrained problem in step 8 can be solved efficiently.\n\n2. Unlike SAG and SDCA, which require Ω(n) storage space, the proposed algorithm only requires Ω(d) storage space, where d is the dimension of w.\n\n3. The only step in Algorithm 1 that depends on n is step 4, which computes the gradient ∇F(w̄^k). By utilizing distributed computing, the running time of this step can be reduced to O(n/k), where k is the number of computers, and the convergence rate remains the same. For SAG and SDCA, it is unclear whether they can reduce the running time without affecting the convergence rate.\n\n4. The linear convergence of SAG and SDCA only holds in expectation, whereas the linear convergence of EMGD holds with a high probability, which is much stronger.\n\n4 The Analysis\n\nIn the proof, we frequently use the following property of strongly convex functions [9].\n\nLemma 1. Let f(x) be a λ-strongly convex function over the domain X, and x* = argmin_{x∈X} f(x). Then, for any x ∈ X, we have\n\nf(x) − f(x*) ≥ (λ/2)‖x − x*‖².    (6)\n\n4.1 The Main Idea\n\nThe proof of Theorem 1 is based on induction. From the assumption about Δ^1 in (5), we have\n\nF(w̄^1) − F(w*) ≤ λ[Δ^1]²/2, and ‖w̄^1 − w*‖² ≤ [Δ^1]²,\n\nwhere the first inequality follows from (5) and the second from (5) and (6). This means Theorem 1 is true for m = 0. Suppose Theorem 1 is true for m = k. That is, with a probability at least 1 − kδ, we have\n\nF(w̄^{k+1}) − F(w*) ≤ λ[Δ^1]²/2^{k+1}, and ‖w̄^{k+1} − w*‖² ≤ [Δ^1]²/2^k.\n\nOur goal is to show that after running the (k+1)-th epoch, with a probability at least 1 − (k + 1)δ, we have\n\nF(w̄^{k+2}) − F(w*) ≤ λ[Δ^1]²/2^{k+2}, and ‖w̄^{k+2} − w*‖² ≤ [Δ^1]²/2^{k+1}.\n\n²In machine learning, we usually face a regularized optimization problem min_{w∈W} (1/n) Σ_{i=1}^{n} ℓ(yi; x_i^⊤ w) + (τ/2)‖w‖², where ℓ(·; ·) is some loss function. When the norm of the data is bounded, the smoothness parameter L can be treated as a constant. The strong convexity parameter λ is lower bounded by τ. Thus, as long as τ > Ω(n^{−2/3}), which is a reasonable scenario [25], we have κ < O(n^{2/3}), indicating our proposed EMGD can be applied.\n\n4.2 The Details\n\nFor simplicity of presentation, we drop the index k for the epoch. Let w̄ be the solution obtained from epoch k. Given the condition\n\nF(w̄) − F(w*) ≤ (λ/2)Δ², and ‖w̄ − w*‖² ≤ Δ²,    (7)\n\nwe will show that after running the T iterations in one epoch, the new solution, denoted by ŵ, satisfies\n\nF(ŵ) − F(w*) ≤ (λ/4)Δ², and ‖ŵ − w*‖² ≤ (1/2)Δ²,    (8)\n\nwith a probability at least 1 − δ.\n\nDefine\n\ng = ∇F(w̄), F̂(w) = F(w) − ⟨w, g⟩, and g_t(w) = f_t(w) − ⟨w, ∇f_t(w̄)⟩.    (9)\n\nThe objective function can be rewritten as\n\nF(w) = ⟨w, g⟩ + F̂(w),    (10)\n\nand the mixed gradient can be rewritten as\n\ng̃_t = g + ∇g_t(w_t).\n\nThen, the updating rule given in Algorithm 1 becomes\n\nw_{t+1} = argmin_{w ∈ W ∩ B(w̄; Δ)} η⟨w − w_t, g + ∇g_t(w_t)⟩ + (1/2)‖w − w_t‖².    (11)\n\nNotice that the objective function in (11) is 1-strongly convex. 
Using the fact that w* ∈ W ∩ B(w̄; Δ) and Lemma 1 (with x* = w_{t+1} and x = w*), we have\n\nη⟨w_{t+1} − w_t, g + ∇g_t(w_t)⟩ + (1/2)‖w_{t+1} − w_t‖² ≤ η⟨w* − w_t, g + ∇g_t(w_t)⟩ + (1/2)‖w* − w_t‖² − (1/2)‖w* − w_{t+1}‖².    (12)\n\nFor each iteration t in the current epoch, we have\n\nF(w_t) − F(w*) ≤ ⟨∇F(w_t), w_t − w*⟩ − (λ/2)‖w_t − w*‖² = ⟨g + ∇g_t(w_t), w_t − w*⟩ + ⟨∇F̂(w_t) − ∇g_t(w_t), w_t − w*⟩ − (λ/2)‖w_t − w*‖²,    (13)\n\nwhere the inequality follows from (4) and the equality from (10), and\n\n⟨g + ∇g_t(w_t), w_t − w*⟩ ≤ ⟨g + ∇g_t(w_t), w_t − w_{t+1}⟩ + ‖w_t − w*‖²/(2η) − ‖w_{t+1} − w*‖²/(2η) − ‖w_t − w_{t+1}‖²/(2η) ≤ ⟨g, w_t − w_{t+1}⟩ + ‖w_t − w*‖²/(2η) − ‖w_{t+1} − w*‖²/(2η) + max_w (⟨∇g_t(w_t), w_t − w⟩ − ‖w_t − w‖²/(2η)) = ⟨g, w_t − w_{t+1}⟩ + ‖w_t − w*‖²/(2η) − ‖w_{t+1} − w*‖²/(2η) + (η/2)‖∇g_t(w_t)‖²,    (14)\n\nwhere the first inequality follows from (12). Combining (13) and (14), we have\n\nF(w_t) − F(w*) ≤ ‖w_t − w*‖²/(2η) − ‖w_{t+1} − w*‖²/(2η) − (λ/2)‖w_t − w*‖² + ⟨g, w_t − w_{t+1}⟩ + (η/2)‖∇g_t(w_t)‖² + ⟨∇F̂(w_t) − ∇g_t(w_t), w_t − w*⟩.\n\nBy adding the inequalities of all iterations, we have\n\nΣ_{t=1}^{T} (F(w_t) − F(w*)) ≤ ‖w̄ − w*‖²/(2η) − ‖w_{T+1} − w*‖²/(2η) − (λ/2) Σ_{t=1}^{T} ‖w_t − w*‖² + ⟨g, w̄ − w_{T+1}⟩ + (η/2) A_T + B_T,    (15)\n\nwhere we denote A_T = Σ_{t=1}^{T} ‖∇g_t(w_t)‖² and B_T = Σ_{t=1}^{T} ⟨∇F̂(w_t) − ∇g_t(w_t), w_t − w*⟩.\n\nSince F(·) is L-smooth, we have\n\nF(w_{T+1}) − F(w̄) ≤ ⟨∇F(w̄), w_{T+1} − w̄⟩ + (L/2)‖w̄ − w_{T+1}‖²,\n\nwhich implies\n\n⟨g, w̄ − w_{T+1}⟩ ≤ F(w̄) − F(w_{T+1}) + (L/2)‖w̄ − w_{T+1}‖² ≤ F(w*) − F(w_{T+1}) + (λ/2)Δ² + (L/2)Δ² ≤ F(w*) − F(w_{T+1}) + LΔ²,    (16)\n\nwhere the second inequality follows from (7). From (15) and (16), we have\n\nΣ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ Δ²(1/(2η) + L) + (η/2) A_T + B_T.    (17)\n\nNext, we consider how to bound A_T and B_T. The upper bound of A_T is given by\n\nA_T = Σ_{t=1}^{T} ‖∇g_t(w_t)‖² = Σ_{t=1}^{T} ‖∇f_t(w_t) − ∇f_t(w̄)‖² ≤ L² Σ_{t=1}^{T} ‖w_t − w̄‖² ≤ T L² Δ²,    (18)\n\nwhere the first inequality follows from (2). To bound B_T, we need the Hoeffding-Azuma inequality stated below [4].\n\nLemma 2. Let V_1, V_2, . . . be a martingale difference sequence with respect to some sequence X_1, X_2, . . . such that V_i ∈ [A_i, A_i + c_i] for some random variable A_i, measurable with respect to X_1, . . . , X_{i−1}, and a positive constant c_i. If S_n = Σ_{i=1}^{n} V_i, then for any t > 0,\n\nPr[S_n > t] ≤ exp(−2t² / Σ_{i=1}^{n} c_i²).\n\nDefine\n\nV_t = ⟨∇F̂(w_t) − ∇g_t(w_t), w_t − w*⟩, t = 1, . . . , T.\n\nRecall the definition of F̂(·) and g_t(·) in (9). Based on our assumption about the function oracle Of, it is straightforward to check that V_1, . . . is a martingale difference sequence with respect to g_1, . . .. 
The value of V_t can be bounded by\n\n|V_t| ≤ ‖∇F̂(w_t) − ∇g_t(w_t)‖ ‖w_t − w*‖ ≤ 2Δ (‖∇F(w_t) − ∇F(w̄)‖ + ‖∇f_t(w_t) − ∇f_t(w̄)‖) ≤ 4LΔ‖w_t − w̄‖ ≤ 4LΔ²,\n\nwhere the third inequality follows from (2) and (3). Following Lemma 2, with a probability at least 1 − δ, we have\n\nB_T ≤ 4LΔ²√(2T ln(1/δ)).    (19)\n\nBy adding the inequalities in (17), (18) and (19) together, with a probability at least 1 − δ, we have\n\nΣ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ Δ²(1/(2η) + ηTL²/2 + L + 4L√(2T ln(1/δ))).\n\nBy choosing η = 1/[L√T], we have\n\nΣ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ LΔ²(√T + 1 + 4√(2T ln(1/δ))) ≤ 6LΔ²√(2T ln(1/δ)),    (20)\n\nwhere in the second inequality we use the condition δ ≤ e^{−1/2} in (5). By Jensen's inequality, we have\n\nF(ŵ) − F(w*) ≤ (1/(T+1)) Σ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ Δ² · 6L√(2 ln(1/δ))/√(T+1),\n\nwhere the second inequality follows from (20), and therefore, by (6),\n\n‖ŵ − w*‖² ≤ (2/λ)(F(ŵ) − F(w*)) ≤ Δ² · 12L√(2 ln(1/δ))/(λ√(T+1)).\n\nThus, when\n\nT ≥ (1152L²/λ²) ln(1/δ),\n\nwith a probability at least 1 − δ, we have\n\nF(ŵ) − F(w*) ≤ (λ/4)Δ², and ‖ŵ − w*‖² ≤ (1/2)Δ².\n\n5 Conclusion and Future Work\n\nIn this paper, we consider how to reduce the number of full gradients needed for smooth and strongly convex optimization problems. 
Under the assumption that both the gradient and the stochastic gradient are available, a novel algorithm named Epoch Mixed Gradient Descent (EMGD) is proposed. Theoretical analysis shows that with the help of stochastic gradients, we are able to reduce the number of full gradients needed from O(√κ log(1/ϵ)) to O(log(1/ϵ)). In the case that the objective function is in the form of (1), i.e., a sum of n smooth functions, EMGD has a lower computational cost than the full gradient method [17] if the condition number κ ≤ n^{2/3}.\n\nIn practice, a drawback of EMGD is that it requires the condition number κ to be known beforehand. We will investigate how to find a good estimate of κ in the future. When the objective function is a sum of some special functions, such as the square loss (i.e., (y_i − x_i^⊤ w)²), we can estimate the condition number by sampling. In particular, the Hessian matrix estimated from a subset of functions, combined with the concentration inequalities for matrices [7], can be used to bound the eigenvalues of the true Hessian matrix and consequently κ. Furthermore, if there exists a strongly convex regularizer in the objective function, which happens in many machine learning problems [8], the knowledge of the regularizer itself allows us to find an upper bound on κ.\n\nAcknowledgments\n\nThis work is partially supported by ONR Award N000141210431 and NSF (IIS-1251031).\n\nReferences\n\n[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.\n\n[2] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[4] N. Cesa-Bianchi and G. Lugosi. 
Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n[5] M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.\n\n[6] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.\n\n[7] A. Gittens and J. A. Tropp. Tail bounds for all eigenvalues of a sum of random matrices. ArXiv e-prints, arXiv:1104.4513, 2011.\n\n[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York, 2009.\n\n[9] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436, 2011.\n\n[10] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical report, 2010.\n\n[11] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133:365–397, 2012.\n\n[12] K. Marti. On solutions of stochastic programming problems by descent procedures with stochastic and deterministic directions. Methods of Operations Research, 33:281–293, 1979.\n\n[13] K. Marti and E. Fuchs. Rates of convergence of semi-stochastic approximation procedures for solving stochastic optimization problems. Optimization, 17(2):243–265, 1986.\n\n[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.\n\n[15] A. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd, 1983.\n\n[16] Y. Nesterov. 
A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547, 1983.\n\n[17] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.\n\n[18] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.\n\n[19] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007.\n\n[20] D. P. Palomar and Y. C. Eldar, editors. Convex Optimization in Signal Processing and Communications. Cambridge University Press, 2010.\n\n[21] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 449–456, 2012.\n\n[22] N. L. Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672–2680, 2012.\n\n[23] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.\n\n[24] S. Sra, S. Nowozin, and S. J. Wright, editors. Optimization for Machine Learning. The MIT Press, 2011.\n\n[25] Q. Wu and D.-X. Zhou. SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17(5):1160–1187, 2005.\n\n[26] L. Zhang, T. Yang, R. Jin, and X. He. O(log T) projections for stochastic optimization of smooth and strongly convex functions. 
In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 621–629, 2013.", "award": [], "sourceid": 529, "authors": [{"given_name": "Lijun", "family_name": "Zhang", "institution": "Michigan State University (MSU)"}, {"given_name": "Mehrdad", "family_name": "Mahdavi", "institution": "Michigan State University (MSU)"}, {"given_name": "Rong", "family_name": "Jin", "institution": "Michigan State University (MSU)"}]}