{"title": "Stochastic Composite Mirror Descent: Optimal Bounds with High Probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 1519, "page_last": 1529, "abstract": "We study stochastic composite mirror descent, a class of scalable algorithms able to exploit the geometry and composite structure of a problem. We consider both convex and strongly convex objectives with non-smooth loss functions, for each of which we establish high-probability convergence rates optimal up to a logarithmic factor. We apply the derived computational error bounds to study the generalization performance of multi-pass stochastic gradient descent (SGD) in a non-parametric setting. Our high-probability generalization bounds enjoy a logarithmical dependency on the number of passes provided that the step size sequence is square-summable, which improves the existing bounds in expectation with a polynomial dependency and therefore gives a strong justification on the ability of multi-pass SGD to overcome overfitting. Our analysis removes boundedness assumptions on subgradients often imposed in the literature. Numerical results are reported to support our theoretical findings.", "full_text": "Stochastic Composite Mirror Descent: Optimal\n\nBounds with High Probabilities\n\nShenzhen Key Laboratory of Computational Intelligence, Department of Computer Science\nand Engineering, Southern University of Science and Technology, Shenzhen 518055, China\n\nYunwen Lei and Ke Tang\u2217\n\nleiyw@sustc.edu.cn tangk3@sustc.edu.cn\n\nAbstract\n\nWe study stochastic composite mirror descent, a class of scalable algorithms able\nto exploit the geometry and composite structure of a problem. We consider both\nconvex and strongly convex objectives with non-smooth loss functions, for each\nof which we establish high-probability convergence rates optimal up to a loga-\nrithmic factor. 
We apply the derived computational error bounds to study the generalization performance of multi-pass stochastic gradient descent (SGD) in a non-parametric setting. Our high-probability generalization bounds enjoy a logarithmic dependence on the number of passes provided that the step size sequence is square-summable, which improves the existing bounds in expectation with a polynomial dependence and therefore gives a strong justification of the ability of multi-pass SGD to overcome overfitting. Our analysis removes boundedness assumptions on subgradients often imposed in the literature. Numerical results are reported to support our theoretical findings.\n\n1 Introduction\n\nStochastic gradient descent (SGD) has found wide applications in machine learning problems due to its simplicity in implementation, low memory requirement and low computational complexity per iteration, as well as good practical behavior [2, 6, 28, 32, 41]. As an iterative method, SGD minimizes empirical errors by moving iterates along the negative direction of a gradient calculated from a loss function on a single training example or a small batch of examples. This strategy of processing few examples per iteration makes SGD particularly suitable for large-scale applications with very large datasets [2, 41], which are becoming ubiquitous in the big data era.\n\nStochastic composite mirror descent (SCMD) is a powerful extension of SGD based on two motivations [12]. Firstly, it relaxes the Hilbert space structure of SGD by using a mirror map to capture geometric properties of data from a Banach space [4, 25]. 
Secondly, it exploits the problem structure by separating, at every iteration, a data-fitting term and a regularization term in structured optimization problems to obtain a desired regularization effect; such structured problems arise naturally since a regularizer is often introduced either to avoid overfitting or to impose a priori information [12, 37].\n\nAlthough much theoretical analysis has been performed to understand the practical behavior of SGD and SCMD, the existing theoretical results are still not quite satisfactory. Firstly, most of the existing theoretical results are stated in expectation, which inevitably ignores information on high-order moments of the random variable we are interested in. In practice, we may be more interested in high-probability bounds to understand the variability of the learned model, which is also an important factor to take into account when measuring the quality of models [32]. Secondly, the existing generalization bounds for SGD, stated in expectation, are either suboptimal or require a smoothness assumption on loss functions [13, 21]. Thirdly, a non-trivial assumption on the boundedness of subgradients is often imposed in the literature to proceed with the analysis [11, 12, 28, 32], especially in the derivation of high-probability bounds. However, this boundedness assumption may not hold if the optimization is conducted in an unbounded domain, in which scenario the derived bounds may not be informative.\n\n\u2217Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we aim to contribute towards a refined analysis of both the convergence rates and the generalization properties of SCMD. 
We consider both general convex and strongly convex objectives, for each of which we show that SCMD can achieve almost optimal convergence rates with high probability, matching the minimax lower rates for stochastic approximation up to a logarithmic factor [1, 25]. In particular, we identify a constraint on step sizes that guarantees the boundedness of iterates with high probability (up to a logarithmic factor). Furthermore, we apply these convergence rates for computational errors to establish high-probability generalization bounds for the model trained by SGD through multiple passes over the training examples, which is a typical way of using SGD to process large datasets [20]. Our generalization bounds do not require smoothness assumptions on loss functions and can be optimal up to a logarithmic factor. Surprisingly, we show that estimation errors scale logarithmically with respect to (w.r.t.) the number of passes provided that the step size sequence is square-summable, which implies that SGD may be immune to overfitting. In contrast, estimation error bounds based on stability arguments [13] and uniform deviation arguments [21] scale polynomially w.r.t. the number of passes, which may not adequately justify the ability of SGD to overcome overfitting in practice. All our theoretical results are derived without any boundedness assumptions on subgradients, based on two tricks. The first trick is to use a self-bounding property of loss functions (Assumption 1) to show that a (weighted) summation of function values can be controlled by step sizes (Lemma 2). The second trick is to show that conditional variances of martingales in a one-step progress inequality of SCMD can be partially offset by some other terms in the same inequality.\n\nThe paper is organized as follows. We introduce SCMD and state convergence rates in Section 2 and Section 3, respectively. We study generalization bounds of SGD in Section 4. 
Discussions are given in Section 5. Simulation results and conclusions are given in Section 6 and Section 7, respectively.\n\n2 Stochastic Composite Mirror Descent\n\nMany machine learning problems involve optimization problems of a composite structure [12, 37]\n\nmin_{w\u2208W} \u03c6(w) = E_z[f(w, z)] + r(w),    (2.1)\n\nwhere W is a Banach space with a norm \u2016\u00b7\u2016, F(w) := E_z[f(w, z)] is a data-fitting term and r : W \u2192 R+ is a simple regularizer possibly inducing sparsity. Here f : W \u00d7 Z \u21a6 R+ is a function with f(w, z) measuring the quality of a model indexed by w \u2208 W on a random example z = (x, y) drawn from a probability measure \u03c1\u0303 defined on a sample space Z = X \u00d7 Y with an input space X \u2282 W\u2217 and an output space Y \u2282 R. We denote by E_z the expectation w.r.t. z, and by W\u2217 the dual of W with the dual norm \u2016\u00b7\u2016\u2217. A typical choice of the data-fitting term takes the form f(w, z) = \u2113(\u27e8w, x\u27e9, y), where \u2113 : R \u00d7 Y \u21a6 R+ is a loss function and \u27e8w, x\u27e9 is the dual element x \u2208 W\u2217 acting on w \u2208 W. With specific instantiations of the loss function \u2113 and the regularizer r, the formulation (2.1) covers many well-known machine learning problems in a unifying framework, including least squares, support vector machines, logistic regression, the lasso and the elastic net [12, 37].\n\nAs an extension of SGD, SCMD uses a strongly convex and Fr\u00e9chet differentiable mirror map \u03a8 to generate an appropriate Bregman distance D_\u03a8(w, w\u0303) := \u03a8(w) \u2212 \u03a8(w\u0303) \u2212 \u27e8w \u2212 w\u0303, \u2207\u03a8(w\u0303)\u27e9 to capture the involved non-Euclidean geometry [4, 25], where \u2207\u03a8(w\u0303) denotes the gradient of \u03a8 at w\u0303. Let w_1 = 0 \u2208 W and {\u03b7_t}_{t\u2208N} be a positive step size sequence. 
Upon the arrival of z_t at the t-th iteration, SCMD calculates a subgradient f\u2032(w_t, z_t) \u2208 \u2202_w f(w_t, z_t) as an unbiased estimate of F\u2032(w_t) \u2208 \u2202F(w_t), and updates the model as follows\n\nw_{t+1} = arg min_{w\u2208W} \u03b7_t[\u27e8w \u2212 w_t, f\u2032(w_t, z_t)\u27e9 + r(w)] + D_\u03a8(w, w_t).    (2.2)\n\nHere \u2202_w f(w_t, z_t) := {g : f(w, z_t) \u2212 f(w_t, z_t) \u2265 \u27e8w \u2212 w_t, g\u27e9 for all w} denotes the subdifferential of f(\u00b7, z_t) at w_t. Intuitively, SCMD uses f\u2032(w_t, z_t) to form a first-order approximation of f(\u00b7, z_t) at w_t and uses the Bregman distance D_\u03a8(w, w_t) to keep w_{t+1} not far away from the current iterate. The regularizer r is kept intact here for a regularization effect [12, 37]. A typical choice of \u03a8 is the p-norm divergence \u03a8_p(w) = \u00bd\u2016w\u2016_p\u00b2 (1 < p \u2264 2), which works favorably for sparse problems by setting p close to 1 [12, 37]. Here \u2016\u00b7\u2016_p is the p-norm defined by \u2016w\u2016_p = (\u2211_{i=1}^d |w^{(i)}|^p)^{1/p} for w = (w^{(1)}, . . . , w^{(d)}) \u2208 R^d. SCMD recovers SGD by taking \u03a8 = \u03a8_2 and r(w) = 0, stochastic forward-backward splitting by taking \u03a8 = \u03a8_2 [11], stochastic mirror descent by taking r(w) = 0 [24], and the stochastic mirror descent algorithm made sparse by taking \u03a8 = \u03a8_p and r(w) = \u03bb\u2016w\u2016_1 [30].\n\n3 Convergence Rates\n\nBefore stating our high-probability convergence rates, we introduce some assumptions. Throughout the paper, we assume that the mirror map \u03a8 is Fr\u00e9chet differentiable and \u03c3_\u03a8-strongly convex in the sense that D_\u03a8(w, w\u0303) \u2265 2^{\u22121}\u03c3_\u03a8\u2016w \u2212 w\u0303\u2016\u00b2 for all w, w\u0303 \u2208 W \u2282 R^d (\u03c3_\u03a8 > 0), and that f(w, z) is convex w.r.t. the first argument. 
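For concreteness, below is a minimal sketch (our illustration, not code from the paper) of one SCMD step with the p-norm mirror map \u03a8_p and the \u21131 regularizer r(w) = \u03bb\u2016w\u2016_1, using a closed-form soft-thresholding step in the dual in the spirit of the sparse mirror descent algorithm [30]; for p \u2260 2 this is an approximation of the exact minimizer of (2.2), which in general requires a numerical solve. All function names are illustrative.

```python
import numpy as np

def link(w, p):
    """Gradient of the p-norm mirror map Psi_p(w) = 0.5 * ||w||_p^2."""
    norm = np.linalg.norm(w, p)
    if norm == 0.0:
        return np.zeros_like(w)
    # componentwise: sign(w_i) * |w_i|^(p-1) * ||w||_p^(2-p)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

def scmd_step(w, grad, eta, lam, p):
    """One SCMD step with Psi = Psi_p and r(w) = lam * ||w||_1.

    The update maps w_t to the dual via grad Psi_p, takes a gradient step,
    soft-thresholds (the l1 part), and maps back via grad Psi_q, where
    q = p/(p-1) is the conjugate exponent (grad Psi_q inverts grad Psi_p).
    """
    q = p / (p - 1.0)
    theta = link(w, p) - eta * grad
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
    return link(theta, q)
```

For p = 2 the link function is the identity and the step reduces to plain proximal SGD; smaller p biases the update toward sparse iterates.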
We also always assume that Assumption 1 and Assumption 2 hold, that the sample space Z is bounded, and that sup_{z\u2208Z} f(0, z) < \u221e.\n\nAssumption 1. We assume that there exist A, B \u2265 0 such that the following inequalities hold for any w \u2208 W, z \u2208 Z and any f\u2032(w, z) \u2208 \u2202f(w, z), r\u2032(w) \u2208 \u2202r(w):\n\n\u2016f\u2032(w, z)\u2016\u2217\u00b2 \u2264 Af(w, z) + B and \u2016r\u2032(w)\u2016\u2217\u00b2 \u2264 Ar(w) + B.    (3.1)\n\nThis is a standard assumption satisfied in many practical problems [11, 41]. For example, Lemma A.5 shows that r(w) = \u03bb\u2016w\u2016_p^p satisfies the second inequality of (3.1) with \u2016\u00b7\u2016 = \u2016\u00b7\u2016_p (1 \u2264 p \u2264 2), A = 2\u03bbp(p \u2212 1) and B = \u03bbp(2 \u2212 p). Furthermore, if f(w, z) = \u2113(\u27e8w, x\u27e9, y), then Lemma A.4 shows that \u2016f\u2032(w, z)\u2016\u2217\u00b2 = |\u2113\u2032(\u27e8w, x\u27e9, y)|\u00b2\u2016x\u2016\u2217\u00b2 satisfies the first inequality of (3.1) if\n\n|\u2113\u2032(a, y)|\u00b2 \u2264 \u00c3\u2113(a, y) + B\u0303,  \u2200a \u2208 R, y \u2208 Y    (3.2)\n\nfor some \u00c3, B\u0303 > 0 [41], where \u2113\u2032(a, y) denotes a subgradient of \u2113 w.r.t. the first argument. Many popular loss functions satisfy (3.2), including the p-norm hinge loss \u2113(a, y) = max{0, 1 \u2212 ya}^p (1 \u2264 p \u2264 2) [34] and the logistic loss \u2113(a, y) = log(1 + exp(\u2212ya)) for classification, as well as the p-th power absolute distance loss \u2113(a, y) = |a \u2212 y|^p (1 \u2264 p \u2264 2) and the Huber loss (\u2113(a, y) = (a \u2212 y)\u00b2 if |a \u2212 y| \u2264 1 and \u2113(a, y) = 2|a \u2212 y| \u2212 1 otherwise) for regression [41]. We refer the interested reader to [41] for the constants \u00c3, B\u0303 in (3.2) for different loss functions \u2113.\nAssumption 2. 
We assume the existence of \u03c3_F, \u03c3_r \u2265 0 such that\n\nF(w) \u2212 F(w\u0303) \u2212 \u27e8w \u2212 w\u0303, F\u2032(w\u0303)\u27e9 \u2265 \u03c3_F D_\u03a8(w, w\u0303) and r(w) \u2212 r(w\u0303) \u2212 \u27e8w \u2212 w\u0303, r\u2032(w\u0303)\u27e9 \u2265 \u03c3_r D_\u03a8(w, w\u0303)    (3.3)\n\nhold for all w, w\u0303 \u2208 W and any F\u2032(w\u0303) \u2208 \u2202F(w\u0303), r\u2032(w\u0303) \u2208 \u2202r(w\u0303).\n\nThe case \u03c3_\u03c6 := \u03c3_F + \u03c3_r = 0 corresponds to general convex objectives, while the case \u03c3_\u03c6 > 0 corresponds to strongly convex objectives. Let w\u2217 = arg min_{w\u2208W} \u03c6(w) be the minimizer of \u03c6 in W with the minimal norm. We always assume \u2016w\u2217\u2016 < \u221e in this paper.\n\nOur theoretical analysis is based on the following lemma quantifying the one-step progress of SCMD measured by the Bregman distance, which shows how D_\u03a8(w, w_t) changes in a single iteration.\n\nLemma 1. Let {w_t}_{t\u2208N} be generated by (2.2). Then the following inequality holds for any w \u2208 W:\n\nD_\u03a8(w, w_{t+1}) \u2212 D_\u03a8(w, w_t) \u2264 \u03b7_t\u27e8w \u2212 w_t, f\u2032(w_t, z_t)\u27e9 + \u03b7_t(r(w) \u2212 r(w_t)) + \u03c3_\u03a8^{\u22121}\u03b7_t\u00b2A_t \u2212 \u03c3_r\u03b7_t D_\u03a8(w, w_{t+1}),    (3.4)\n\nwhere A_t := Af(w_t, z_t) + Ar(w_t) + 2B.\n\nExisting one-step progress inequalities can be found in the literature with A_t replaced by B_t := \u2016f\u2032(w_t, z_t)\u2016\u2217\u00b2 + \u2016r\u2032(w_t)\u2016\u2217\u00b2, see, e.g., [12]. Then, a non-trivial assumption B_t \u2264 G for all t \u2208 N and some G \u2208 R is imposed to control \u2211_{t=1}^T \u03b7_t\u00b2B_t by O(\u2211_{t=1}^T \u03b7_t\u00b2). We refine these discussions by using Assumption 1 to replace B_t with A_t. Equation (3.6) allows us to control \u2211_{t=1}^T \u03b7_t\u00b2A_t by O(\u2211_{t=1}^T \u03b7_t\u00b2) without imposing any boundedness assumptions on subgradients. In our discussion of strongly convex objectives, we need to divide both sides of (3.4) by \u03b7_t\u00b2. In this way, Eq. 
(3.7) plays an analogous role in removing boundedness assumptions in the strongly convex case. The proofs of Lemma 1 and Lemma 2 are given in Supplementary Material B.\n\nLemma 2. Let {w_t}_{t\u2208N} be the sequence produced by (2.2) with \u03b7_t \u2264 (2A)^{\u22121}\u03c3_\u03a8. Then we have\n\n\u2016w_{t+1}\u2016\u00b2 \u2264 2C_1\u03c3_\u03a8^{\u22121}\u2211_{k=1}^t \u03b7_k,  \u2200t \u2208 N,    (3.5)\n\nand\n\n\u2211_{k=1}^t \u03b7_k\u00b2(f(w_k, z_k) + r(w_k)) \u2264 2C_1\u2211_{k=1}^t \u03b7_k\u00b2,    (3.6)\n\nwhere C_1 = sup_{z\u2208Z} f(0, z) + r(0) + A^{\u22121}B. Furthermore, if \u03b7_{t+1} \u2264 \u03b7_t, then for all t \u2208 N\n\n\u2211_{k=1}^t (f(w_k, z_k) + r(w_k)) \u2264 2C_1 t + 2C_1(\u2211_{k=1}^t \u03b7_k)\u03b7_t^{\u22121}.    (3.7)\n\n3.1 Convex Objectives\n\nWe study the behavior of SCMD for convex objectives with \u03c3_\u03c6 = 0. The assumption \u2211_{t=1}^\u221e \u03b7_t\u00b2 < \u221e is satisfied if \u03b7_t = \u03b7_1 t^{\u2212\u03b8} with \u03b8 > 1/2 or \u03b7_t = \u03b7_1(t log^\u03b2(et))^{\u22121/2} with \u03b2 > 1. Our idea is to take a summation of Eq. (3.4) with w = w\u2217, and show that the conditional variance of the involved martingale \u2211_{k=1}^t \u03b7_k\u27e8w\u2217 \u2212 w_k, f\u2032(w_k, z_k) \u2212 E_{z_k}[f\u2032(w_k, z_k)]\u27e9 can be partially offset by some other terms. The proofs of Theorems 3 and 4 are given in Supplementary Material C.\n\nTheorem 3. Let {w_t}_{t\u2208N} be the sequence produced by (2.2) with \u03b7_t \u2264 (2A)^{\u22121}\u03c3_\u03a8, \u03b7_{t+1} \u2264 \u03b7_t and \u2211_{t=1}^\u221e \u03b7_t\u00b2 < \u221e. 
Then, there exists a constant C_2 independent of T (explicitly given in the proof) such that for any \u03b4 \u2208 (0, 1) the following inequality holds with probability at least 1 \u2212 \u03b4:\n\nmax_{1\u2264t\u2264T} \u2016w_t\u2016\u00b2 \u2264 C_2 log(T/\u03b4).    (3.8)\n\nRemark 1. Although implemented in a possibly unbounded domain, Theorem 3 shows that the sequence {w_t}_{t\u2208N} produced by (2.2) falls into a bounded ball (up to a logarithmic factor) with high probability. Intuitively, this suggests that SCMD is immune to overfitting if we take appropriate step sizes. In this case, we can run SCMD for many iterations without essentially harming the quality of the output model.\n\nBased on Theorem 3, we establish high-probability convergence rates for a weighted average of iterates without any assumptions on the boundedness of iterates. In Theorem 4 and Corollary 5, we establish bounds on the suboptimality of objectives w.r.t. any w and an optimal solution w\u2217, respectively.\n\nTheorem 4. Let w \u2208 W and \u03b4 \u2208 (0, 2/e). Let w\u0304^{(1)}_T = (\u2211_{t=1}^T \u03b7_t)^{\u22121}\u2211_{t=1}^T \u03b7_t w_t be a weighted average of the first T iterates. Under the conditions of Theorem 3, with probability 1 \u2212 \u03b4 we have\n\n\u03c6(w\u0304^{(1)}_T) \u2212 \u03c6(w) \u2264 (\u2211_{t=1}^T \u03b7_t)^{\u22121}(2C_3 D_\u03a8(w, 0) + C_4) log^{3/2}(2T/\u03b4),    (3.9)\n\nwhere C_3 and C_4 are two constants (explicitly given in the proof) independent of T.\n\nRemark 2. A similar high-probability bound was established for SCMD in [12]. However, that discussion needs to impose an additional almost-sure boundedness assumption on iterates, i.e., \u2016w_t\u2016\u00b2 \u2264 G for a G > 0 and all t \u2208 N. Such boundedness assumptions on either subgradients or iterates are fundamental to the existing analysis but hard to check in practice. 
Moreover, the high-probability analysis makes these assumptions non-trivial to remove, since one also needs to consider high-order moments of random variables.\n\nCorollary 5. If \u03b4 \u2208 (0, 2/e) and the conditions of Theorem 4 are satisfied, then (3.9) holds with probability 1 \u2212 \u03b4 with w = w\u2217. Furthermore, if we choose \u03b7_t = \u03b7_1 t^{\u2212\u03b8} with \u03b8 > 1/2, then with probability 1 \u2212 \u03b4 we have \u03c6(w\u0304^{(1)}_T) \u2212 \u03c6(w\u2217) = O(T^{\u03b8\u22121} log^{3/2}(T/\u03b4)); if we choose \u03b7_t = \u03b7_1(t log^\u03b2(et))^{\u22121/2} with \u03b2 > 1, then with probability 1 \u2212 \u03b4 we have \u03c6(w\u0304^{(1)}_T) \u2212 \u03c6(w\u2217) = O((T^{\u22121} log^\u03b2 T)^{1/2} log^{3/2}(T/\u03b4)).\n\nThe convergence rate O((T^{\u22121} log^\u03b2 T)^{1/2} log^{3/2}(T/\u03b4)) in Corollary 5 is optimal up to a logarithmic factor [1], and follows directly from Theorem 4 together with \u2211_{t=1}^T t^{\u2212\u03b8} \u2265 (1 \u2212 \u03b8)^{\u22121}(T^{1\u2212\u03b8} \u2212 1) for \u03b8 \u2208 (0, 1). We omit the proof for brevity.\n\nIn Theorem 6, we give sufficient conditions for the almost sure finiteness of lim_{t\u2192\u221e} D_\u03a8(w\u2217, w_t) and \u2211_{t=1}^\u221e \u03b7_t(\u03c6(w_t) \u2212 \u03c6(w\u2217)). As a direct corollary, we also establish convergence rates with probability one in Corollary 7. Theorem 6 is a part of Proposition E.3 to be presented and proved in Supplementary Material E, while the proof of Corollary 7 is omitted for brevity.\n\nTheorem 6. 
Consider {w_t}_{t\u2208N} produced by (2.2) with \u2211_{t=1}^\u221e \u03b7_t\u00b2 < \u221e. Then {D_\u03a8(w\u2217, w_t)}_t converges almost surely (a.s.) to a non-negative random variable and lim_{t\u2192\u221e} D_\u03a8(w\u2217, w_t) < \u221e a.s.. Furthermore, if \u03b7_t \u2264 (2A)^{\u22121}\u03c3_\u03a8 and \u03b7_{t+1} \u2264 \u03b7_t, then \u2211_{t=1}^\u221e \u03b7_t(\u03c6(w_t) \u2212 \u03c6(w\u2217)) < \u221e a.s..\n\nCorollary 7. Let {w_t}_{t\u2208N} be produced by (2.2) and \u03b7_1 \u2264 (2A)^{\u22121}\u03c3_\u03a8. If we choose \u03b7_t = \u03b7_1 t^{\u2212\u03b8} with \u03b8 > 1/2, then lim_{T\u2192\u221e} T^{1\u2212\u03b8}(\u03c6(w\u0304^{(1)}_T) \u2212 \u03c6(w\u2217)) < \u221e a.s.. If we choose \u03b7_t = \u03b7_1(t log^\u03b2(et))^{\u22121/2} with \u03b2 > 1, then lim_{T\u2192\u221e} (T/log^\u03b2 T)^{1/2}(\u03c6(w\u0304^{(1)}_T) \u2212 \u03c6(w\u2217)) < \u221e a.s..\n\n3.2 Strongly Convex Objectives\n\nWe now turn to strongly convex objectives with \u03c3_\u03c6 > 0. In Theorem 8, we establish high-probability bounds for both \u2016w_t \u2212 w\u2217\u2016\u00b2 and \u03c6(w\u0304^{(2)}_t) \u2212 \u03c6(w\u2217), with w\u0304^{(2)}_t being another weighted average of the first t iterates, for each of which we derive optimal convergence rates up to a logarithmic factor [1]. The optimality means that not only the dependency on t but also the dependency on the strong-convexity parameter \u03c3_\u03c6 cannot be improved, up to a logarithmic factor [16, 28] (\u03c3_\u03c6 is often chosen to be very small in practical learning problems [28, 31]). It should be mentioned that our analysis removes the boundedness assumptions on subgradients found in the literature [28]. Our idea is to take a weighted summation of (3.4) with w = w\u2217, and show that the conditional variance of an involved martingale \u2211_{k=1}^t (k + t_0 + 1)\u27e8w\u2217 \u2212 w_k, f\u2032(w_k, z_k) \u2212 E_{z_k}[f\u2032(w_k, z_k)]\u27e9 can be partially offset by another term in this weighted summation of (3.4), which is another trick to remove boundedness assumptions on subgradients. 
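The two step size schedules appearing in Corollaries 5 and 7 can be checked numerically: both have divergent partial sums \u2211\u03b7_t while \u2211\u03b7_t\u00b2 stays bounded, which is exactly the regime the almost-sure results above require. A small sanity check (our illustration; the constants \u03b7_1 = 1, \u03b8 = 3/4 and \u03b2 = 2 are arbitrary admissible choices):

```python
import numpy as np

def partial_sums(step, T):
    """Return (sum of eta_t, sum of eta_t^2) for t = 1..T."""
    t = np.arange(1, T + 1, dtype=float)
    eta = step(t)
    return eta.sum(), (eta ** 2).sum()

# eta_t = t^{-theta} with theta = 3/4 > 1/2: square-summable, not summable
poly = lambda t: t ** -0.75
# eta_t = (t * log^beta(e t))^{-1/2} with beta = 2 > 1: same behavior
logstep = lambda t: (t * np.log(np.e * t) ** 2) ** -0.5

s1, q1 = partial_sums(poly, 10_000)
s2, q2 = partial_sums(poly, 100_000)
# the step sum keeps growing with T, while the squared sum has
# essentially converged (its tail beyond t = 10^4 is tiny)
```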
We also give a sufficient condition for the almost sure convergence of w_t to w\u2217 in Theorem 9. The proof of Theorem 8 is given in Supplementary Material D. Theorem 9 is a part of Proposition E.3 to be presented in Supplementary Material E.\n\nTheorem 8. Assume \u03c3_\u03c6 > 0 and \u03b4 \u2208 (0, e^{\u22121/4}). Let {w_t}_{t\u2208N} be produced by (2.2) with \u03b7_t = 2/(\u03c3_\u03c6 t + 2\u03c3_F + \u03c3_\u03c6 t_0), where t_0 \u2265 (16A/(\u03c3_\u03c6\u03c3_\u03a8)) log(T/\u03b4). Let w\u0304^{(2)}_t = (\u2211_{k=1}^t (k + t_0 + 1))^{\u22121}\u2211_{k=1}^t (k + t_0 + 1)w_k, t \u2208 N. Then, the following inequalities hold with probability 1 \u2212 \u03b4 for all t = 1, . . . , T:\n\n\u2016w\u2217 \u2212 w_t\u2016\u00b2 \u2264 C_T/(t + t_0 + 1) and \u03c6(w\u0304^{(2)}_t) \u2212 \u03c6(w\u2217) \u2264 C\u0303_T/(t + t_0 + 1).    (3.10)\n\nMoreover, the dependencies of C_T and C\u0303_T on T/\u03b4 are logarithmic, and the dependencies of C_T and C\u0303_T on \u03c3_\u03c6^{\u22121} are quadratic and linear, respectively.\n\nTheorem 9. Let {w_t}_{t\u2208N} be the sequence produced by (2.2) with \u03c3_\u03c6 > 0. If \u2211_{t=1}^\u221e \u03b7_t = \u221e and \u2211_{t=1}^\u221e \u03b7_t\u00b2 < \u221e, then lim_{t\u2192\u221e} D_\u03a8(w\u2217, w_t) = 0 a.s..\n\n4 Generalization Error Bounds\n\nHere we apply our high-probability convergence rates for SCMD to establish generalization error bounds for SGD. In this setting, we assume a training sample z = {z_1, . . . , z_n} of size n \u2208 N is drawn independently from a probability measure \u03c1 defined on the sample space Z, and our aim is to learn a hypothesis h : X \u21a6 R from a hypothesis space W with good generalization performance. 
The quality of h at (x, y) is quantified by \u2113(h(x), y), where \u2113 : R \u00d7 Y \u21a6 R+ is convex w.r.t. the first argument. The generalization error and the empirical error of h are defined respectively by E(h) = E_z[\u2113(h(x), y)] and E_z(h) = (1/n)\u2211_{i=1}^n \u2113(h(x_i), y_i). The best model minimizing the generalization error then becomes h_\u03c1 = arg min_h E(h). We consider a non-parametric learning setting with W being a reproducing kernel Hilbert space (RKHS) associated to a Mercer kernel K : X \u00d7 X \u21a6 R which is continuous, symmetric and positive semi-definite [9, 34]. In this learning setting, the candidate models take the form h_w(x) = \u27e8w, K_x\u27e9 with w \u2208 W. For brevity, we denote the norm in the RKHS W by \u2016\u00b7\u2016_2 and introduce the abbreviations E(w) = E(h_w) and E_z(w) = E_z(h_w). We assume (3.2) and apply the SGD scheme to minimize E_z(w). To be specific, we let w_1 = 0. At the t-th iteration, we randomly choose an index j_t from the uniform distribution over {1, . . . , n} and produce w_{t+1} by\n\nw_{t+1} = w_t \u2212 \u03b7_t \u2113\u2032(\u27e8w_t, K_{x_{j_t}}\u27e9, y_{j_t})K_{x_{j_t}},  t \u2208 N.    (4.1)
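Since each update in (4.1) adds a multiple of K_{x_{j_t}}, the iterate w_t stays in the span of {K_{x_i}}_{i=1}^n and can be stored as a coefficient vector. A minimal sketch of this representation (our illustration, not code from the paper; the Gaussian kernel, the non-smooth absolute loss \u2113(a, y) = |a \u2212 y| and the step sizes are arbitrary choices satisfying (3.2)):

```python
import numpy as np

def kernel_sgd(X, y, loss_grad, steps, eta, sigma=1.0, seed=0):
    """Multi-pass kernel SGD (4.1).

    Because w_{t+1} = w_t - eta_t * l'(<w_t, K_{x_j}>, y_j) * K_{x_j},
    w_t = sum_i alpha_i K_{x_i}, so only the coefficients alpha are updated.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Gram matrix of the Gaussian kernel K(x, x') = exp(-||x-x'||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    gram = np.exp(-sq / (2.0 * sigma ** 2))
    alpha = np.zeros(n)
    for t in range(1, steps + 1):
        j = rng.integers(n)                  # uniform index over {1, ..., n}
        pred = alpha @ gram[:, j]            # <w_t, K_{x_j}> = h_{w_t}(x_j)
        alpha[j] -= eta(t) * loss_grad(pred, y[j])
    return alpha, gram

# a subgradient of the non-smooth absolute loss |a - y| w.r.t. a
abs_loss_grad = lambda a, y: np.sign(a - y)
```

Running this with a square-summable step size sequence such as eta(t) = 0.5 * t**-0.6 drives the empirical error down over multiple passes through the sample.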
Assumption 3 is standard in learning theory and satis\ufb01ed under some mild conditions\non the smoothness of the function h\u03c1 and the representation power of W [9, 33]. If (cid:96) is smooth,\n2, which quanti\ufb01es the\n\u03c1X (square-integrable function class with marginal measure\n\nthen D(\u03bb) can be controlled by (cid:101)D(\u03bb) := inf w\u2208W (cid:107)hw \u2212 h\u03c1(cid:107)2\n\u03c1X ) and is well studied in approximation theory. (cid:101)D(\u03bb) decays polynomially with \u03b1 \u2208 (0, 1] if\n\napproximation of h\u03c1 by RKHS in L2\n\nh\u03c1 \u2208 L\u03b1/2\n\u03c1X is the integral operator associated to K [9, Proposition\n8.5]. Similar results hold if (cid:96) is Lipschitz continuous. Assumption 3 also holds if we use Gaussian\nkernels with \ufb02exible variances and distributions with geometric noise conditions [35]. It should be\nmentioned that kernels need not to be universal for Assumption 3 since it concerns the target function\nh\u03c1, which may admit more regularity (e.g., expressed by LK) than continuity, while universality\nmeans that D(\u03bb) \u2192 0 as \u03bb \u2192 0 for all continuous h\u03c1 [34].\nWe now establish a generalization error bound for a weighted average of iterates produced by (4.1) to\nbe proved in Supplementary Material F, which is derived by decomposing the excess generalization\nT ) \u2212 E(h\u03c1) into three components: an estimation error, an approximation error and a\nerror E( \u00afw(1)\n\ncomputational error. As we will see in the proof, the term(cid:0)(cid:80)T\n\n(cid:1)\u2212\u03b1 is due to the approximation\n\nand computational error, while the term n\u2212 \u03b1\n1+\u03b1 is due to the estimation and approximation error. The\nbound becomes n\u2212 \u03b1\n\u03b4 for suf\ufb01ciently large T , which enjoys a logarithmic dependency on\nT and demonstrates the ability of SGD to avoid over\ufb01tting.\nTheorem 10. 
Let {w_t}_{t\u2208N} be the sequence produced by (4.1) with \u03b7_t \u2264 (2A)^{\u22121}\u03c3_\u03a8, \u03b7_{t+1} \u2264 \u03b7_t and \u2211_{t=1}^\u221e \u03b7_t\u00b2 < \u221e. Suppose Assumption 3 holds. Then, for any T satisfying \u2211_{t=1}^T \u03b7_t \u2265 1 and any \u03b4 \u2208 (0, 2/e), the following inequality holds with probability at least 1 \u2212 \u03b4:\n\nE(w\u0304^{(1)}_T) \u2212 E(h_\u03c1) \u2264 C_5 max{(\u2211_{t=1}^T \u03b7_t)^{\u2212\u03b1}, n^{\u2212\u03b1/(1+\u03b1)}} log^{3/2}(8T/\u03b4),    (4.2)\n\nwhere C_5 is a constant independent of T (explicitly given in the proof).\n\nWe consider specific step sizes in Theorem 10 and choose an appropriate time index to get concrete generalization bounds, as shown in Corollary 11. The bound O(n^{\u2212\u03b1/(1+\u03b1)} log^{(3+\u03b1\u03b2)/2}(n/\u03b4)) coincides with the bound O(n^{\u2212\u03b1/(1+\u03b1)} log n) (up to a logarithmic factor) in expectation for convex and smooth loss functions [21], and largely improves the bound O(n^{\u2212\u03b1/(1+2\u03b1)} log n) in expectation for convex and non-smooth loss functions [21]. In particular, if \u03b1 = 1 we derive the optimal bound O(n^{\u22121/2} log^{(3+\u03b2)/2}(n/\u03b4)) in a general case with neither Bernstein conditions on variances nor capacity assumptions on hypothesis spaces (up to a logarithmic factor). It is also clear that SGD with different step sizes can achieve similar generalization bounds. However, the computational complexity to fulfill this statistical potential can be significantly different. Corollary 11, with the proof omitted, follows directly from Theorem 10 together with \u2211_{t=1}^T t^{\u2212\u03b8} \u2265 (1 \u2212 \u03b8)^{\u22121}(T^{1\u2212\u03b8} \u2212 1), \u03b8 \u2208 (0, 1). Denote by \u2308a\u2309 the least integer no less than a.\n\nCorollary 11. Consider {w_t}_{t\u2208N} produced by (4.1) and \u03b4 \u2208 (0, 2/e). Let Assumption 3 hold and \u2211_{t=1}^T \u03b7_t \u2265 1.\n\n(a) If we take \u03b7_t = \u03b7_1 t^{\u2212\u03b8} with \u03b7_1 \u2264 (2A)^{\u22121} and \u03b8 \u2208 (1/2, 1), then with probability 1 \u2212 \u03b4\n\nE(w\u0304^{(1)}_T) \u2212 E(h_\u03c1) = O((T^{\u2212\u03b1(1\u2212\u03b8)} + n^{\u2212\u03b1/(1+\u03b1)}) log^{3/2}(T/\u03b4)),\n\nsince (\u2211_{t=1}^T \u03b7_t)^{\u2212\u03b1} = O(T^{\u2212\u03b1(1\u2212\u03b8)}). If we further take T\u2217 = \u2308n^{1/((1+\u03b1)(1\u2212\u03b8))}\u2309, then we get E(w\u0304^{(1)}_{T\u2217}) \u2212 E(h_\u03c1) = O(n^{\u2212\u03b1/(1+\u03b1)} log^{3/2}(n/\u03b4)).\n\n2\u03c1 is related to the draw of training examples, while \u03c1\u0303 is related to the draw of indices for SGD.
However, in this case it becomes a challenge to derive estimation error bounds with a logarithmic dependency on $T$.

5 Related Work and Discussions

5.1 Convex Objectives

For general convex objectives, regret bounds $O(\sqrt{T})$ were established for online gradient descent with $T$ iterations [44], from which one can directly derive convergence rates $O(T^{-\frac{1}{2}})$ for SGD with some averaging schemes. This result was extended to stochastic forward-backward splitting [11]. A convergence rate $O(T^{-\frac{1}{2}}\log T)$ was established for the $T$-th individual iterate of SGD [32]. All the above mentioned rates were stated in expectation and derived based on the assumption $\mathbb{E}[\|f'(w_t, z_t)\|_*^2 + \|r'(w_t)\|_*^2] \le G$ for some $G \ge 0$ and all $t \in \mathbb{N}$. This boundedness assumption was successfully removed for studying convergence rates in expectation under a smoothness assumption [23, 40, 42] or Assumption 1 [30]. As compared to these convergence rates in expectation, high-probability convergence rates were much less studied and were often based on a stronger assumption on the almost sure boundedness of subgradients. Under the assumption $\max\{D_\Psi(w^*, w_t), \sup_z \|f'(w_t, z)\|_*\} \le G$ for some $G > 0$ and all $t \in \mathbb{N}$, it was shown with probability $1-\delta$ that $\phi(\bar{w}_T^{(1)}) - \phi(w^*) = O\big(T^{-\frac{1}{2}}\log^{\frac{1}{2}}\frac{1}{\delta}\big)$ for $\bar{w}_T^{(1)}$ defined in Theorem 4 [12, 24]. High-probability bounds were also established for stochastic dual averaging under the boundedness assumption on iterates and subgradients [37]. In our discussion, we show that the same high-probability convergence rate (up to a logarithmic factor) holds without any boundedness assumptions on either the iterates $\{w_t\}$ or the associated subgradients. In particular, we show that $\{w_t\}_{t\le T}$ automatically falls into a ball with radius $O(\sqrt{\log(T/\delta)})$ with high probability. It was shown with probability $1-\delta$ that $\|w_t - w^*\|_2^2 = O\big(\|w^*\|_2^2 \log\frac{T}{\delta}\big)$ for a particular SGD [19]. However, the discussion in [19] requires a stronger assumption on the Hölder continuity of loss functions, which excludes non-differentiable loss functions such as the hinge loss and the absolute loss satisfying (3.2). Secondly, the analysis in [19] only considers one-pass SGD, where each training example is used only once. We also give a sufficient condition for almost sure finiteness of $\sum_{t=1}^{\infty}\eta_t\big(\phi(w_t) - \phi(w^*)\big)$, while most results on almost sure convergence are achieved for strongly convex objectives.

5.2 Strongly Convex Objectives

For $\lambda$-exp-concave loss functions, a regret bound $O(\lambda^{-1}\log T)$ was established for an online Newton method [15], which implies convergence rates $O\big((\lambda T)^{-1}\log T\big)$ for some average of iterates produced by the stochastic counterpart. This result was extended to online forward-backward splitting [11] and SCMD [12] applied to $\lambda$-strongly convex objectives. Optimal convergence rates $O\big((\lambda T)^{-1}\big)$ for the suboptimality of objective values were derived based on a suffix averaging scheme [28], an epoch-GD scheme based on a doubling trick [14] and a weighted averaging with a weight of $t+1$ for $w_t$ [16].
However, the above mentioned results are all convergence rates in expectation and require boundedness assumptions on the subgradients encountered during the iterations. This boundedness assumption was relaxed to $\mathbb{E}_z[\|f'(w_t, z)\|_*^2] \le A_1 + B_1\|F'(w_t)\|_*^2$ with $A_1, B_1 \ge 0$ for SGD [6], and was further removed for SGD [26] and stochastic mirror descent [17] by imposing smoothness assumptions on loss functions. All the above mentioned results are stated in expectation. With probability $1-\delta$, it was shown that $\|w_T - w^*\|_2^2 = O\big((\lambda^2 T)^{-1} + (\lambda T)^{-1}\log(\delta^{-1}\log T)\big)$ for SGD [28]. High-probability convergence rates $O\big((\lambda T)^{-1}\log(\delta^{-1}\log T)\big)$ were also established for the suboptimality of objective values for the $T$-th iterate of epoch-GD [14]. These two high-probability rates were derived based on an assumption of almost sure boundedness of subgradients, which is more challenging to remove [14, 28]. As a comparison, we establish the same convergence rate (up to a logarithmic factor) for the more general SCMD without boundedness assumptions on subgradients. Sufficient conditions as in Theorem 9 were established for almost sure convergence of SGD [5, 26] and stochastic mirror descent [17]; we extend them to SCMD in Theorem 9.

5.3 Generalization Error Bounds

While the computational complexity of SGD has been extensively studied in the optimization community, there is much less work on the generalization property of models trained by SGD. Classical generalization bounds only hold for one-pass SGD [24, 27, 28, 32, 36, 38, 39], where each training example can be used at most once. In practice, however, multiple passes are often used to produce a model with good generalization behavior [13].
The landmark work in [7] developed a framework to analyze the generalization performance of multi-pass stochastic learning algorithms by taking into account the computational complexity of learning algorithms. Under this framework, the interplay among estimation errors, computational errors and approximation errors can be studied, showing that an implicit regularization can be achieved in the absence of penalization or constraints by tuning either the step size or the number of passes (the iteration number divided by the training set size) [13, 20, 21, 29]. In a parametric setting, it was shown that SGD is algorithmically stable and the stability measure of SGD with $T$ iterates scales as $O(n^{-1}\sum_{t=1}^{T}\eta_t)$ [13], based on which a generalization bound $\mathbb{E}[E(\bar{w}_T^{(1)})] - \inf_{w\in W} E(w) = O(n^{-\frac{1}{2}})$ was established for $\eta_t = O(1/\sqrt{n})$ and $T = O(n)$, without considering approximation errors. The discussion in [13] requires a smoothness assumption on loss functions. Generalization analysis was considered separately for smooth and non-smooth loss functions in [21]. For smooth loss functions, it was shown that $\mathbb{E}[E(\bar{w}_T^{(1)})] - E(h_\rho) = O(n^{-\frac{\alpha}{1+\alpha}}\log n)$ for $\eta_t = \eta_1/\sqrt{t}$ with $T = \lceil n^{\frac{2}{\alpha+1}}\rceil$ [21], based on the stability property of SGD established in [13]. For non-smooth loss functions, it was shown that $\mathbb{E}[E(\bar{w}_T^{(1)})] - E(h_\rho) = O(n^{-\frac{\alpha}{2\alpha+1}}\log n)$ for $\eta_t = \eta_1/\sqrt{t}$ and $T = \lceil n^{\frac{2}{2\alpha+1}}\rceil$, by controlling estimation errors with Rademacher complexities [3, 21]. Still, the bounds in [13, 21] require a boundedness assumption on subgradients and are stated in expectation. As a comparison, we establish high-probability bounds without any boundedness assumptions on subgradients. Furthermore, our generalization analysis extends the analysis in [13] to non-smooth loss functions and substantially improves the bound $O(n^{-\frac{\alpha}{2\alpha+1}}\log n)$ [21] in this setting.

The generalization error bound $O\big(n^{-\frac{\alpha}{1+\alpha}}\log^{\frac{3+\alpha\beta}{2}}\frac{n}{\delta}\big)$ in Corollary 11 is optimal in the sense that it matches the best available bound for Tikhonov regularization (up to a logarithmic factor) [9, 21, 34]. We achieve this improvement by controlling estimation errors more carefully. Specifically, estimation errors were shown to scale polynomially w.r.t. the number of passes [13, 21], and thus dominate the other two errors for large $T$. In this way, one needs to tune $T$ to balance the estimation, approximation and computational errors. As a comparison, we show bounds scaling logarithmically w.r.t. the number of passes for $E(\bar{w}_T^{(1)}) - E_z(\bar{w}_T^{(1)})$ (Theorem 10). This implies that estimation errors will never essentially dominate the other two errors, and one can run SGD with a sufficient number of passes with little overfitting if the step sizes are square-summable, due to the key observation on the almost boundedness of iterates established in Theorem 3. Another trick in getting almost optimal bounds is the use of Assumption 3 to control $E(w_\lambda) - E_z(w_\lambda)$ with a linear (instead of quadratic) function of $\sup_z f(w_\lambda, z)$ and to select a suitable $\lambda$, where $w_\lambda = \arg\min_{w\in W} E(w) + \lambda\|w\|_2^2$. Optimal learning rates were given for multi-pass SGD with the least squares loss function [10, 20, 29].
However, their analysis is based on an integral operator approach and does not apply to general loss functions. Generalization bounds for SGD were also studied from a PAC-Bayesian perspective [22]. However, the high-probability bounds there require Lipschitz continuity, smoothness and strong convexity assumptions on loss functions, and ignore computational and approximation errors [22].

6 Simulations

Our analysis implies that SGD can be run with a sufficient number of iterations with little overfitting if the step sizes are square-summable, while achieving similar generalization performance with different computational complexities. In this section, we include some experimental results to validate these theoretical findings. We apply SGD (4.1) with a linear kernel $K_x = x$ and the hinge loss $\ell(a, y) = \max\{0, 1 - ya\}$ to several binary classification datasets (ADULT, GISETTE, IJCNN, MUSHROOMS, PHISHING and SPLICE). All these datasets, described in Supplementary Material G, can be downloaded from the LIBSVM website [8]. We consider polynomially decaying step sizes of the form $\eta_t = 5t^{-\theta}$ with $\theta \in \{0.25, 0.51, 0.75\}$ (we consider $\theta = 0.51$, instead of $\theta = 0.5$, since the associated step size sequence is square-summable). We repeat each experiment 12 times and report the average of the results. In Figure 1, we plot test errors of $\bar{w}_t^{(3)} = \big(\sum_{k=\tilde{t}+1}^{t}\eta_k\big)^{-1}\sum_{k=\tilde{t}+1}^{t}\eta_k w_k$ versus the number of passes (the iteration number divided by the training set size), where $\tilde{t} = 2^{\lfloor\log_2 t\rfloor - 1}$. Intuitively, $\bar{w}_t^{(3)}$ returns an $\alpha$-suffix average of iterates [28] with $\alpha \in [1/2, 3/4]$, and one can adapt the proof of Theorem 4 to show that $\bar{w}_t^{(3)}$ enjoys similar generalization bounds as $\bar{w}_t^{(1)}$. Moreover, $\bar{w}_t^{(3)}$ is easily computable on-the-fly by storing only $\sum_{j=1}^{k}\eta_j w_j$ with $k = 2^0, 2^1, 2^2, \ldots$. From Figure 1, we see that SGD is resistant to overfitting for appropriate step sizes.
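The training loop just described — SGD with the hinge-loss subgradient, step sizes $\eta_t = \eta_1 t^{-\theta}$, and the suffix average $\bar{w}_t^{(3)}$ maintained on-the-fly from partial sums stored at powers of two — can be sketched as follows. This is a minimal illustration on a synthetic linearly separable problem; the function name, toy data and number of passes are ours and do not reproduce the experimental configuration behind Figure 1.

```python
import numpy as np

def sgd_hinge_suffix(X, y, theta=0.51, eta1=5.0, passes=5, seed=0):
    """SGD for the hinge loss with a linear kernel (K_x = x) and step sizes
    eta_t = eta1 * t**(-theta).  Returns the suffix average
    (sum_{k>t~} eta_k w_k) / (sum_{k>t~} eta_k), t~ = 2**(floor(log2 T) - 1),
    computed on-the-fly by storing partial sums only at k = 1, 2, 4, 8, ...
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    S, H = np.zeros(d), 0.0      # running sums of eta_k * w_k and of eta_k
    checkpoints = {}             # partial sums saved at powers of two
    T = passes * n
    for t in range(1, T + 1):
        i = rng.integers(n)      # sample one training example uniformly
        eta = eta1 * t ** (-theta)
        if y[i] * np.dot(w, X[i]) < 1.0:   # subgradient step for the hinge loss
            w = w + eta * y[i] * X[i]
        S, H = S + eta * w, H + eta
        if t & (t - 1) == 0:     # t is a power of two: save a checkpoint
            checkpoints[t] = (S.copy(), H)
    t_tilde = 2 ** (int(np.floor(np.log2(T))) - 1) if T >= 2 else 0
    S0, H0 = checkpoints.get(t_tilde, (np.zeros(d), 0.0))
    return (S - S0) / (H - H0)   # suffix average over (t~, T]

# toy usage: labels given by the sign of x1 + x2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
w_bar = sgd_hinge_suffix(X, y)
err = np.mean(np.sign(X @ w_bar) != y)
```

Storing only the logarithmically many checkpoints $\big(\sum_{j\le 2^m}\eta_j w_j,\ \sum_{j\le 2^m}\eta_j\big)$ is what makes the suffix average computable without keeping the whole trajectory in memory.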
For example, we observe no overfitting even if the number of passes exceeds 1000 for SGD with $\theta \in \{0.51, 0.75\}$. Moreover, SGD with $\theta \in \{0.51, 0.75\}$ can achieve similar generalization errors on ADULT, IJCNN, PHISHING and SPLICE, while SGD with $\theta = 0.51$ requires a significantly smaller number of passes to do so. This is well consistent with Corollary 11.

Figure 1: Test errors versus the number of passes, with panels (a) ADULT, (b) GISETTE, (c) IJCNN, (d) MUSHROOMS, (e) PHISHING and (f) SPLICE.

7 Conclusions

In this paper, we establish a rigorous theoretical foundation for SCMD by providing optimal convergence rates (up to a logarithmic factor) in the stochastic optimization setting without boundedness assumptions on either subgradients or iterates, which in turn sheds new light on the generalization behavior of multi-pass SGD in the statistical learning theory setting. In particular, we justify the immunity of multi-pass SGD to overfitting by giving estimation error bounds with a logarithmic dependency on the number of passes for square-summable step sizes, while existing bounds scale polynomially [13, 21]. This improvement is based on the key observation on the almost boundedness of iterates with high probability. Our generalization analysis of SGD also substantially improves the learning rates in [21], removes the bounded subgradient assumptions in [13, 21, 22], removes the smoothness assumptions in [13, 22], and is performed in high probability instead of in expectation [13, 21].
It would be interesting to extend our results to a non-convex setting [43] and to general mirror descent algorithms with a non-differentiable mirror map [18].

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China (Grant No. 2017YFB1003102), the National Natural Science Foundation of China (Grant Nos. 61806091 and 61672478), the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284) and Shenzhen Peacock Plan (Grant No. KQTD2016112514355531).

References

[1] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[2] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.

[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[5] L. Bottou. On-line learning and stochastic approximations. In D. Saad, editor, On-line Learning in Neural Networks, pages 9–42. Cambridge University Press, New York, NY, USA, 1998.

[6] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[7] O.
Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

[9] F. Cucker and D.-X. Zhou. Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007.

[10] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Annals of Statistics, 44(4):1363–1399, 2016.

[11] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. In Advances in Neural Information Processing Systems, pages 495–503, 2009.

[12] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Conference on Learning Theory, pages 14–26, 2010.

[13] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.

[14] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[15] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.

[16] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

[17] Y. Lei and D.-X. Zhou. Convergence of online mirror descent. Applied and Computational Harmonic Analysis, 2018. doi: https://doi.org/10.1016/j.acha.2018.05.005.

[18] Y. Lei and D.-X. Zhou. Learning theory of randomized sparse Kaczmarz method.
SIAM Journal on Imaging Sciences, 11(1):547–574, 2018.

[19] Y. Lei, L. Shi, and Z.-C. Guo. Convergence of unregularized online learning algorithms. Journal of Machine Learning Research, 18(171):1–33, 2018.

[20] J. Lin and L. Rosasco. Optimal learning for multi-pass stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 4556–4564, 2016.

[21] J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, pages 2340–2348, 2016.

[22] B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2931–2940, 2017.

[23] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[24] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[25] A.-S. Nemirovsky and D.-B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.

[26] L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. arXiv preprint arXiv:1802.03801, 2018.

[27] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.

[28] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pages 449–456, 2012.

[29] L. Rosasco and S. Villa.
Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[30] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.

[31] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In International Conference on Machine Learning, pages 807–814. ACM, 2007.

[32] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

[33] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.

[34] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[35] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35(2):575–607, 2007.

[36] P. Tarres and Y. Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.

[37] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

[38] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.

[39] Y. Ying and D.-X. Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52(11):4775–4788, 2006.

[40] Y. Ying and D.-X. Zhou. Unregularized online learning algorithms with general loss functions. Applied and Computational Harmonic Analysis, 42(2):224–244, 2017.

[41] T. Zhang.
Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, pages 919–926, 2004.

[42] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1–9, 2015.

[43] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems, pages 7043–7052, 2017.

[44] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003.