{"title": "Fast Rates for Exp-concave Empirical Risk Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1477, "page_last": 1485, "abstract": "We consider Empirical Risk Minimization (ERM) in the context of stochastic optimization with exp-concave and smooth losses---a general optimization framework that captures several important learning problems including linear and logistic regression, learning SVMs with the squared hinge-loss, portfolio selection and more. In this setting, we establish the first evidence that ERM is able to attain fast generalization rates, and show that the expected loss of the ERM solution in $d$ dimensions converges to the optimal expected loss in a rate of $d/n$. This rate matches existing lower bounds up to constants and improves by a $\\log{n}$ factor upon the state-of-the-art, which is only known to be attained by an online-to-batch conversion of computationally expensive online algorithms.", "full_text": "Fast Rates for Exp-concave Empirical Risk Minimization

Tomer Koren, Technion, Haifa 32000, Israel, tomerk@technion.ac.il
Kfir Y. Levy, Technion, Haifa 32000, Israel, kfiryl@tx.technion.ac.il

Abstract

We consider Empirical Risk Minimization (ERM) in the context of stochastic optimization with exp-concave and smooth losses---a general optimization framework that captures several important learning problems including linear and logistic regression, learning SVMs with the squared hinge-loss, portfolio selection and more. In this setting, we establish the first evidence that ERM is able to attain fast generalization rates, and show that the expected loss of the ERM solution in d dimensions converges to the optimal expected loss in a rate of d/n.
This rate matches existing lower bounds up to constants and improves by a log n factor upon the state-of-the-art, which is only known to be attained by an online-to-batch conversion of computationally expensive online algorithms.

1 Introduction

Statistical learning and stochastic optimization with exp-concave loss functions captures several fundamental problems in statistical machine learning, which include linear regression, logistic regression, learning support-vector machines (SVMs) with the squared hinge loss, and portfolio selection, amongst others. Exp-concave functions constitute a rich class of convex functions, which is substantially richer than its more familiar subclass of strongly convex functions.

Similarly to their strongly-convex counterparts, it is well-known that exp-concave loss functions are amenable to fast generalization rates. Specifically, a standard online-to-batch conversion [6] of either the Online Newton Step algorithm [8] or exponential weighting schemes [5, 8] in d dimensions gives rise to a convergence rate of d/n, as opposed to the standard 1/√n rate of generic (Lipschitz) stochastic convex optimization. Unfortunately, the latter online methods are highly inefficient computationally; e.g., the runtime complexity of the Online Newton Step algorithm scales as d⁴ with the dimension of the problem, even in very simple optimization scenarios [13].

An alternative and widely-used learning paradigm is that of Empirical Risk Minimization (ERM), which is often regarded as the strategy of choice due to its generality and its statistical efficiency. In this scheme, a sample of training instances is drawn from the underlying data distribution, and the minimizer of the sample average (or the regularized sample average) is computed. As opposed to methods based on online-to-batch conversions, the ERM approach enables the use of any optimization procedure of choice and does not restrict one to use a specific online algorithm. Furthermore, the ERM solution often enjoys several distribution-dependent generalization bounds in conjunction, and thus is able to obliviously adapt to the properties of the underlying data distribution.

In the context of exp-concave functions, however, nothing is known about the generalization abilities of ERM besides the standard 1/√n convergence rate that applies to any convex losses. Surprisingly, it appears that even in the specific and extensively-studied case of linear regression with the squared loss, the state of affairs remains unsettled: this important case was recently addressed by Shamir [19], who proved an Ω(d/n) lower bound on the convergence rate of any algorithm, and conjectured that the rate of an ERM approach should match this lower bound.

In this paper, we explore the convergence rate of ERM for stochastic exp-concave optimization. We show that when the exp-concave loss functions are also smooth, a slightly-regularized ERM approach yields a convergence rate of O(d/n), which matches the lower bound of Shamir [19] up to constants. In fact, our result shows for ERM a generalization rate tighter than the state-of-the-art obtained by the Online Newton Step algorithm, improving upon the latter by a log n factor. Even in the specific case of linear regression with the squared loss, our result improves by a log(n/d) factor upon the best known fast rates provided by the Vovk-Azoury-Warmuth algorithm [3, 22].

Our results open an avenue for potential improvements to the runtime complexity of exp-concave stochastic optimization, by permitting the use of accelerated methods for large-scale regularized loss minimization.
The latter has been the topic of an extensive research effort in recent years, and numerous highly-efficient methods have been developed; see, e.g., Johnson and Zhang [10], Shalev-Shwartz and Zhang [16, 17] and the references therein.

On the technical side, our convergence analysis relies on stability arguments introduced by Bousquet and Elisseeff [4]. We prove that the expected loss of the regularized ERM solution does not change significantly when a single instance, picked uniformly at random from the training sample, is discarded. Then, the technique of Bousquet and Elisseeff [4] allows us to translate this average stability property into a generalization guarantee. We remark that in all previous stability analyses that we are aware of, stability was shown to hold uniformly over all discarded training instances, either with probability one [4, 16] or in expectation [20]; in contrast, in the case of exp-concave functions it is crucial to look at the average stability.

In order to bound the average stability of ERM, we make use of a localized notion of strong convexity, defined with respect to a local norm at a certain point in the optimization domain. Roughly speaking, we show that when looking at the right norm, which is determined by the local properties of the empirical risk at the right point, the minimizer of the empirical risk becomes stable. This part of our analysis is inspired by recent analysis techniques of regularization-based online learning algorithms [1], that use local norms to study the regret performance of online linear optimization algorithms.

1.1 Related Work

The study of exp-concave loss functions was initiated in the online learning community by Kivinen and Warmuth [12], who considered the problem of prediction with expert advice with exp-concave losses. Later, Hazan et al. [8] considered a more general framework that allows for a continuous decision set, and proposed the Online Newton Step (ONS) algorithm that attains a regret bound that grows logarithmically with the number of optimization rounds. Mahdavi et al. [15] considered the ONS algorithm in the statistical setting, and showed how it can be used to establish generalization bounds that hold with high probability, while still keeping the fast 1/n rate.

Fast convergence rates in stochastic optimization are known to be achievable under various conditions. Bousquet and Elisseeff [4] and Shalev-Shwartz et al. [18] have shown, via a uniform stability argument, that ERM guarantees a convergence rate of 1/n for strongly convex functions. Sridharan et al. [21] proved a similar result, albeit using the notion of localized Rademacher complexity. For the case of smooth and non-negative losses, Srebro et al. [20] established a 1/n rate in low-noise conditions, i.e., when the expected loss of the best hypothesis is of order 1/n. For further discussion of fast rates in stochastic optimization and learning, see [20] and the references therein.

2 Setup and Main Results

We consider the problem of minimizing a stochastic objective

F(w) = E[f(w, Z)]   (1)

over a closed and convex domain W ⊆ R^d in d-dimensional Euclidean space. Here, the expectation is taken with respect to a random variable Z distributed according to an unknown distribution over a parameter space Z. Given a budget of n samples z1, ..., zn of the random variable Z, we are required to produce an estimate ŵ ∈ W whose expected excess loss, defined by E[F(ŵ)] − min_{w∈W} F(w), is small. (Here, the expectation is with respect to the randomization of the training set z1, ..., zn used to produce ŵ.)

We make the following assumptions over the loss function f.
First, we assume that for any fixed parameter z ∈ Z, the function f(·, z) is α-exp-concave over the domain W for some α > 0, namely, that the function exp(−αf(·, z)) is concave over W. We will also assume that f(·, z) is β-smooth over W with respect to the Euclidean norm ‖·‖₂, which means that its gradient is β-Lipschitz with respect to the same norm:

∀ w, w′ ∈ W,  ‖∇f(w, z) − ∇f(w′, z)‖₂ ≤ β‖w − w′‖₂.   (2)

In particular, this property implies that f(·, z) is differentiable. For simplicity, and without loss of generality, we assume β ≥ 1. Finally, we assume that f(·, z) is bounded over W, in the sense that |f(w, z) − f(w′, z)| ≤ C for all w, w′ ∈ W for some C > 0.

In this paper, we analyze a regularized Empirical Risk Minimization (ERM) procedure for optimizing the stochastic objective in Eq. (1), that based on the sample z1, ..., zn computes

ŵ = arg min_{w∈W} F̂(w),   (3)

where

F̂(w) = (1/n) Σ_{i=1}^n f(w, zi) + (1/n) R(w).   (4)

The function R : W → R serves as a regularizer, which is assumed to be 1-strongly-convex with respect to the Euclidean norm; for instance, one can simply choose R(w) = ½‖w‖₂². The strong convexity of R implies in particular that F̂ is also strongly convex, which ensures that the optimizer ŵ is unique. For our bounds, we will assume that |R(w) − R(w′)| ≤ B for all w, w′ ∈ W for some constant B > 0.

Our main result, which we now present, establishes a fast 1/n convergence rate for the expected excess loss of the ERM estimate ŵ given in Eq. (3).

Theorem 1. Let f : W × Z → R be a loss function defined over a closed and convex domain W ⊆ R^d, which is α-exp-concave, β-smooth and C-bounded with respect to its first argument. Let R : W → R be a 1-strongly-convex and B-bounded regularization function. Then, for the regularized ERM estimate ŵ defined in Eqs. (3) and (4) based on an i.i.d. sample z1, ..., zn, the expected excess loss is bounded as

E[F(ŵ)] − min_{w∈W} F(w) ≤ 24βd/(αn) + 100Cd/n + B/n = O(d/n).

In other words, the theorem states that for ensuring an expected excess loss of at most ε, a sample of size n = O(d/ε) suffices. This result improves upon the best known fast convergence rates for exp-concave functions by an O(log n) factor, and matches the lower bound of Shamir [19] for the special case where the loss function is the squared loss. For this particular case, our result affirms the conjecture of Shamir [19] regarding the sample complexity of ERM for the squared loss; see Section 2.1 below for details.

It is important to note that Theorem 1 establishes a fast convergence rate with respect to the actual expected loss F itself, and not for a regularized version thereof (and in particular, not with respect to the expectation of F̂).
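To make the procedure concrete, the following is a minimal Python sketch (an illustration, not the paper's implementation) of the regularized ERM of Eqs. (3) and (4) for the squared loss with R(w) = ½‖w‖₂². It ignores the domain constraint w ∈ W and solves the unconstrained first-order optimality condition in closed form; the function name and synthetic data are our own.

```python
import numpy as np

def regularized_erm_squared_loss(X, y):
    """Sketch of regularized ERM (Eqs. (3)-(4)) for the squared loss
    f(w; x, y) = (w.x - y)^2 / 2 with regularizer R(w) = ||w||^2 / 2.

    F_hat(w) = (1/n) sum_i f(w; x_i, y_i) + (1/n) R(w); the regularizer is
    weighted by 1/n, so it does not grow with the sample size. Setting
    grad F_hat(w) = 0 gives (X^T X + I) w = X^T y (unconstrained case)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
```

Note that a standard O(1/√n)-magnitude penalty would instead replace `np.eye(d)` above by `np.sqrt(n) * np.eye(d)`; the whole point of Theorem 1 is that the much lighter 1/n regularization already suffices for the fast rate.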
Notably, the magnitude of the regularization we use is only O(1/n), as opposed to the O(1/√n) regularization used in standard regularized loss minimization methods (that can only give rise to a traditional O(1/√n) rate).

2.1 Results for the Squared Loss

In this section we focus on the important special case where the loss function f is the squared loss, namely, f(w; x, y) = ½(w·x − y)² where x ∈ R^d is an instance vector and y ∈ R is a target value. This case, that was extensively studied in the past, was recently addressed by Shamir [19] who gave lower bounds on the sample complexity of any learning algorithm under mild assumptions.

Shamir [19] analyzed learning with the squared loss in a setting where the domain is W = {w ∈ R^d : ‖w‖₂ ≤ B} for some constant B > 0, and the parameters distribution is supported over {x ∈ R^d : ‖x‖₂ ≤ 1} × {y ∈ R : |y| ≤ B}. It is not hard to verify that in this setup, for the squared loss we can take β = 1, α = 1/(4B²) and C = 2B². Furthermore, if we choose the standard regularizer R(w) = ½‖w‖₂², we have |R(w) − R(w′)| ≤ ½B² for all w, w′ ∈ W. As a consequence, Theorem 1 implies that the expected excess loss of the regularized ERM estimator ŵ defined in Eq. (3) is bounded by O(B²d/n).

On the other hand, standard uniform convergence results for generalized linear functions [e.g., 11] show that, under the same conditions, ERM also enjoys an upper bound of O(B²/√n) over its expected excess risk. Overall, we conclude:

Corollary 2. For the squared loss f(w; x, y) = ½(w·x − y)² over the domain W = {w ∈ R^d : ‖w‖₂ ≤ B} with Z = {x ∈ R^d : ‖x‖₂ ≤ 1} × {y ∈ R : |y| ≤ B}, the regularized ERM estimator ŵ defined in Eqs. (3) and (4) based on an i.i.d. sample of n instances has

E[F(ŵ)] − min_{w∈W} F(w) = O(min{B²d/n, B²/√n}).

This result slightly improves, by a log(n/d) factor, upon the bound conjectured by Shamir [19] for the ERM estimator, and matches the lower bound proved therein up to constants.¹ Previous fast-rate results for ERM that we are aware of either included excess log factors [2] or were proven under additional distributional assumptions [14, 9]; see also the discussion in [19]. We remark that Shamir conjectures this bound for ERM without any regularization. For the specific case of the squared loss, it is indeed possible to obtain the same rates without regularizing; we defer details to the full version of the paper. However, in practice, regularization has several additional benefits: it renders the ERM optimization problem well-posed (i.e., ensures that the underlying matrix that needs to be inverted is well-conditioned), and guarantees it has a unique minimizer.

3 Proof of Theorem 1

Our proof of Theorem 1 proceeds as follows. First, we relate the expected excess risk of the ERM estimator ŵ to its average leave-one-out stability [4]. Then, we bound this stability in terms of certain local properties of the empirical risk at the point ŵ. To introduce the average stability notion we study, we first define for each i = 1, ..., n the following empirical leave-one-out risk:

F̂ᵢ(w) = (1/n) Σ_{j≠i} f(w, zj) + (1/n) R(w)   (i = 1, ..., n).

Namely, F̂ᵢ is the regularized empirical risk corresponding to the sample obtained by discarding the instance zi.
Then, for each i we let ŵᵢ = arg min_{w∈W} F̂ᵢ(w) be the ERM estimator corresponding to F̂ᵢ. The average leave-one-out stability of ŵ is then defined as the quantity (1/n) Σ_{i=1}^n (f(ŵᵢ, zi) − f(ŵ, zi)).

Intuitively, the average leave-one-out stability serves as an unbiased estimator of the amount of change in the expected loss of the ERM estimator when one of the instances z1, ..., zn, chosen uniformly at random, is removed from the training sample. We note that looking at the average is crucial for us, and the stronger condition of (expected) uniform stability does not hold for exp-concave functions. For further discussion of the various stability notions, refer to Bousquet and Elisseeff [4].

Our main step in proving Theorem 1 involves bounding the average leave-one-out stability of ŵ defined in Eq. (3), which is the purpose of the next theorem.

Theorem 3 (average leave-one-out stability). For any z1, ..., zn ∈ Z and for ŵ₁, ..., ŵₙ and ŵ as defined above, we have

(1/n) Σ_{i=1}^n (f(ŵᵢ, zi) − f(ŵ, zi)) ≤ 24βd/(αn) + 100Cd/n.

¹We remark that Shamir's result assumes two different bounds over the magnitude of the predictors w and the target values y, while here we assume both are bounded by the same constant B. We did not attempt to capture this refined dependence on the two different parameters.

Before proving this theorem, we first show how it can be used to obtain our main theorem. The proof follows arguments similar to those of Bousquet and Elisseeff [4] and Shalev-Shwartz et al. [18].

Proof of Theorem 1. To obtain the stated result, it is enough to upper bound the expected excess loss of ŵₙ, which is the minimizer of the regularized empirical risk over the i.i.d. sample {z1, ..., zn−1}. To this end, fix an arbitrary w⋆ ∈ W. We first write

F(w⋆) + (1/n) R(w⋆) = E[F̂(w⋆)] ≥ E[F̂(ŵ)],

which holds true since ŵ is the minimizer of F̂ over W. Hence,

E[F(ŵₙ)] − F(w⋆) ≤ E[F(ŵₙ) − F̂(ŵ)] + (1/n) R(w⋆).   (5)

Next, notice that the random variables ŵ₁, ..., ŵₙ have exactly the same distribution: each is the output of regularized ERM on an i.i.d. sample of n − 1 examples. Also, notice that ŵᵢ, which is the minimizer of the sample obtained by discarding the i'th example, is independent of zi. Thus, we have

E[F(ŵₙ)] = (1/n) Σ_{i=1}^n E[F(ŵᵢ)] = (1/n) Σ_{i=1}^n E[f(ŵᵢ, zi)].

Furthermore, we can write

E[F̂(ŵ)] = (1/n) Σ_{i=1}^n E[f(ŵ, zi)] + (1/n) E[R(ŵ)].

Plugging these expressions into Eq. (5) gives a bound over the expected excess loss of ŵₙ in terms of the average stability:

E[F(ŵₙ)] − F(w⋆) ≤ (1/n) Σ_{i=1}^n E[f(ŵᵢ, zi) − f(ŵ, zi)] + (1/n) E[R(w⋆) − R(ŵ)].

Using Theorem 3 for bounding the average stability term on the right-hand side, and our assumption that sup_{w,w′∈W} |R(w) − R(w′)| ≤ B to bound the second term, we obtain the stated bound over the expected excess loss of ŵₙ.

The remainder of the section is devoted to the proof of Theorem 3.
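As an aside, the average leave-one-out stability is easy to estimate numerically. The sketch below (our own illustration, under the unconstrained squared-loss setup with R(w) = ½‖w‖₂²; not an experiment from the paper) recomputes the regularized ERM solution with each instance removed and averages the loss differences.

```python
import numpy as np

def loo_stability(X, y):
    """Empirical average leave-one-out stability
        (1/n) sum_i [ f(w_i, z_i) - f(w, z_i) ]
    for regularized ERM with the squared loss and R(w) = ||w||^2 / 2,
    on an unconstrained domain (an illustration of the quantity in Theorem 3)."""
    n, d = X.shape
    loss = lambda w, x, t: 0.5 * (w @ x - t) ** 2
    # argmin of (1/n) sum_j loss + (1/n) R  =  argmin of  sum_j loss + R
    erm = lambda A, b: np.linalg.solve(A.T @ A + np.eye(d), A.T @ b)
    w_hat = erm(X, y)                    # minimizer of F_hat (full sample)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        w_i = erm(X[keep], y[keep])      # minimizer of the leave-one-out risk F_hat_i
        total += loss(w_i, X[i], y[i]) - loss(w_hat, X[i], y[i])
    return total / n
```

Each individual term turns out to be nonnegative (removing zi can only increase the loss suffered on zi), so the average is a nonnegative quantity of order d/n in this setting.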
Before we begin with the proof of the theorem itself, we first present a useful tool for analyzing the stability of minimizers of convex functions, which we later apply to the empirical leave-one-out risks.

3.1 Local Strong Convexity and Stability

Our stability analysis for exp-concave functions is inspired by recent analysis techniques of regularization-based online learning algorithms, that make use of strong convexity with respect to local norms [1]. The crucial strong-convexity property is summarized in the following definition.

Definition 4 (Local strong convexity). We say that a function g : K → R is locally σ-strongly-convex over a domain K ⊆ R^d at x with respect to a norm ‖·‖, if

∀ y ∈ K,  g(y) ≥ g(x) + ∇g(x)·(y − x) + (σ/2)‖y − x‖².

In words, a function is locally strongly-convex at x if it can be lower bounded (globally over its entire domain) by a quadratic tangent to the function at x; the nature of the quadratic term in this lower bound is determined by a choice of a local norm, which is typically adapted to the local properties of the function at the point x.

With the above definition, we can now prove the following stability result for optima of convex functions, that underlies our stability analysis for exp-concave functions.

Lemma 5. Let g₁, g₂ : K → R be two convex functions defined over a closed and convex domain K ⊆ R^d, and let x₁ ∈ arg min_{x∈K} g₁(x) and x₂ ∈ arg min_{x∈K} g₂(x). Assume that g₂ is locally σ-strongly-convex at x₁ with respect to a norm ‖·‖. Then, for h = g₂ − g₁ we have

‖x₂ − x₁‖ ≤ (2/σ)‖∇h(x₁)‖∗.

Furthermore, if h is convex then

0 ≤ h(x₁) − h(x₂) ≤ (2/σ)(‖∇h(x₁)‖∗)².

Proof. The local strong convexity of g₂ at x₁ implies

∇g₂(x₁)·(x₁ − x₂) ≥ g₂(x₁) − g₂(x₂) + (σ/2)‖x₂ − x₁‖².

Notice that g₂(x₁) − g₂(x₂) ≥ 0, since x₂ is a minimizer of g₂. Also, since x₁ is a minimizer of g₁, first-order optimality conditions imply that ∇g₁(x₁)·(x₁ − x₂) ≤ 0, whence

∇g₂(x₁)·(x₁ − x₂) = ∇g₁(x₁)·(x₁ − x₂) + ∇h(x₁)·(x₁ − x₂) ≤ ∇h(x₁)·(x₁ − x₂).

Combining the observations yields

(σ/2)‖x₂ − x₁‖² ≤ ∇h(x₁)·(x₁ − x₂) ≤ ‖∇h(x₁)‖∗·‖x₁ − x₂‖,

where we have used Hölder's inequality in the last inequality. This gives the first claim of the lemma. To obtain the second claim, we first observe that

g₁(x₂) + h(x₂) ≤ g₁(x₁) + h(x₁) ≤ g₁(x₂) + h(x₁),

where we used the fact that x₂ is the minimizer of g₂ = g₁ + h for the first inequality, and the fact that x₁ is the minimizer of g₁ for the second. This establishes the lower bound 0 ≤ h(x₁) − h(x₂). For the upper bound, we use the assumed convexity of h to write

h(x₁) − h(x₂) ≤ ∇h(x₁)·(x₁ − x₂) ≤ ‖∇h(x₁)‖∗·‖x₁ − x₂‖ ≤ (2/σ)(‖∇h(x₁)‖∗)²,

where the second inequality follows from Hölder's inequality, and the final one from the first claim of the lemma.

3.2 Average Stability Analysis

With Lemma 5 at hand, we now turn to prove Theorem 3.
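As a quick numerical sanity check of Lemma 5 (our own illustration, not part of the argument), both claims can be verified on a pair of quadratics, for which local strong convexity with respect to the Hessian-induced norm holds exactly; the matrices A, A1 and the points b, c below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
Q = rng.standard_normal((d, d))
A = Q @ Q.T + np.eye(d)      # SPD Hessian of g2
A1 = 0.5 * A                 # Hessian of g1, so h = g2 - g1 is convex

b, c = rng.standard_normal(d), rng.standard_normal(d)
# g1(x) = (x-c)^T A1 (x-c)/2 is minimized at x1 = c;
# g2(x) = (x-b)^T A  (x-b)/2 is minimized at x2 = b, and (being quadratic)
# is locally 1-strongly-convex at x1 w.r.t. the norm ||v||_A = sqrt(v^T A v).
x1, x2 = c, b
h = lambda x: 0.5 * (x - b) @ A @ (x - b) - 0.5 * (x - c) @ A1 @ (x - c)

g = A @ (x1 - b) - A1 @ (x1 - c)                 # grad h(x1)
dual = np.sqrt(g @ np.linalg.solve(A, g))        # ||grad h(x1)||_* = ||g||_{A^{-1}}
dist = np.sqrt((x2 - x1) @ A @ (x2 - x1))        # ||x2 - x1||_A

assert dist <= 2 * dual + 1e-9                             # first claim (sigma = 1)
assert -1e-9 <= h(x1) - h(x2) <= 2 * dual ** 2 + 1e-9      # second claim (h convex)
```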
First, a few definitions are needed. For brevity, we henceforth denote fᵢ(·) = f(·, zᵢ) for all i. We let hᵢ = ∇fᵢ(ŵ) be the gradient of fᵢ at the point ŵ defined in Eq. (3), and let

H = (1/σ) I_d + Σ_{i=1}^n hᵢhᵢᵀ  and  Hᵢ = (1/σ) I_d + Σ_{j≠i} hⱼhⱼᵀ  for all i,

where σ = ½ min{1/(4C), α}. Finally, we will use ‖·‖_M to denote the norm induced by a positive definite matrix M, i.e., ‖x‖_M = √(xᵀMx). In this case, the dual norm ‖x‖∗_M induced by M simply equals ‖x‖_{M⁻¹} = √(xᵀM⁻¹x).

In order to obtain an upper bound over the average stability, we first bound each of the individual stability expressions fᵢ(ŵᵢ) − fᵢ(ŵ) in terms of a certain norm of the gradient hᵢ of the corresponding function fᵢ. As the proof below reveals, this norm is the local norm at ŵ with respect to which the leave-one-out risk F̂ᵢ is locally strongly convex.

Lemma 6. For all i = 1, ..., n it holds that

fᵢ(ŵᵢ) − fᵢ(ŵ) ≤ (6β/σ)(‖hᵢ‖∗_{Hᵢ})².

Notice that the expression on the right-hand side might be quite large for a particular function fᵢ; indeed, uniform stability does not hold in our case. However, as we show below, the average of these expressions is small. The proof of Lemma 6 relies on Lemma 5 above and the following property of exp-concave functions, established by Hazan et al. [8].

Lemma 7 (Hazan et al. [8], Lemma 3). Let f : K → R be an α-exp-concave function over a convex domain K ⊆ R^d such that |f(x) − f(y)| ≤ C for any x, y ∈ K. Then for any σ ≤ ½ min{1/(4C), α} it holds that

∀ x, y ∈ K,  f(y) ≥ f(x) + ∇f(x)·(y − x) + (σ/2)(∇f(x)·(y − x))².   (6)

Proof of Lemma 6. We apply Lemma 5 with g₁ = F̂ and g₂ = F̂ᵢ (so that h = −(1/n)fᵢ). We should first verify that F̂ᵢ is indeed (σ/n)-strongly-convex at ŵ with respect to the norm ‖·‖_{Hᵢ}. Since each fⱼ is α-exp-concave, Lemma 7 shows that for all w ∈ W,

fⱼ(w) ≥ fⱼ(ŵ) + ∇fⱼ(ŵ)·(w − ŵ) + (σ/2)(hⱼ·(w − ŵ))²,   (7)

with our choice of σ = ½ min{1/(4C), α}. Also, the strong convexity of the regularizer R implies that

R(w) ≥ R(ŵ) + ∇R(ŵ)·(w − ŵ) + ½‖w − ŵ‖₂².   (8)

Summing Eq. (7) over all j ≠ i with Eq. (8) and dividing through by n gives

F̂ᵢ(w) ≥ F̂ᵢ(ŵ) + ∇F̂ᵢ(ŵ)·(w − ŵ) + (σ/2n) Σ_{j≠i} (hⱼ·(w − ŵ))² + (1/2n)‖w − ŵ‖₂²
      = F̂ᵢ(ŵ) + ∇F̂ᵢ(ŵ)·(w − ŵ) + (σ/2n)‖w − ŵ‖²_{Hᵢ},

which establishes the strong convexity. Now, applying Lemma 5 gives

‖ŵᵢ − ŵ‖_{Hᵢ} ≤ (2n/σ)‖∇h(ŵ)‖∗_{Hᵢ} = (2/σ)‖hᵢ‖∗_{Hᵢ}.   (9)

On the other hand, since fᵢ is convex, we have

fᵢ(ŵᵢ) − fᵢ(ŵ) ≤ ∇fᵢ(ŵᵢ)·(ŵᵢ − ŵ) = ∇fᵢ(ŵ)·(ŵᵢ − ŵ) + (∇fᵢ(ŵᵢ) − ∇fᵢ(ŵ))·(ŵᵢ − ŵ).   (10)

The first term can be bounded using Hölder's inequality and Eq. (9) as

∇fᵢ(ŵ)·(ŵᵢ − ŵ) = hᵢ·(ŵᵢ − ŵ) ≤ ‖hᵢ‖∗_{Hᵢ}·‖ŵᵢ − ŵ‖_{Hᵢ} ≤ (2/σ)(‖hᵢ‖∗_{Hᵢ})².

Also, since fᵢ is β-smooth (with respect to the Euclidean norm), we can bound the second term in Eq. (10) as follows:

(∇fᵢ(ŵᵢ) − ∇fᵢ(ŵ))·(ŵᵢ − ŵ) ≤ ‖∇fᵢ(ŵᵢ) − ∇fᵢ(ŵ)‖₂·‖ŵᵢ − ŵ‖₂ ≤ β‖ŵᵢ − ŵ‖₂²,

and since Hᵢ ⪰ (1/σ)I_d, we can further bound using Eq. (9),

‖ŵᵢ − ŵ‖₂² ≤ σ‖ŵᵢ − ŵ‖²_{Hᵢ} ≤ (4/σ)(‖hᵢ‖∗_{Hᵢ})².

Combining the bounds (and simplifying using our assumption β ≥ 1) gives the lemma.

Next, we bound a sum involving the local-norm terms introduced in Lemma 6.

Lemma 8. Let I = {i ∈ [n] : (‖hᵢ‖∗_H)² > ½}. Then |I| ≤ 2d, and we have

Σ_{i∉I} (‖hᵢ‖∗_{Hᵢ})² ≤ 2d.

Proof. Denote aᵢ = hᵢᵀH⁻¹hᵢ for all i = 1, ..., n. First, we claim that aᵢ > 0 for all i, and Σᵢ aᵢ ≤ d. The fact that aᵢ > 0 is evident from H⁻¹ being positive-definite. For the sum of the aᵢ's, we write:

Σ_{i=1}^n aᵢ = Σ_{i=1}^n hᵢᵀH⁻¹hᵢ = Σ_{i=1}^n tr(H⁻¹hᵢhᵢᵀ) ≤ tr(H⁻¹H) = tr(I_d) = d,   (11)

where we have used the linearity of the trace, and the fact that H ⪰ Σ_{i=1}^n hᵢhᵢᵀ. Now, our claim that |I| ≤ 2d is evident: if (‖hᵢ‖∗_H)² > ½ for more than 2d terms, then the sum Σ_{i∈I} aᵢ = Σ_{i∈I} hᵢᵀH⁻¹hᵢ must be larger than d, which is a contradiction to Eq. (11). To prove our second claim, we first write Hᵢ = H − hᵢhᵢᵀ and use the Sherman-Morrison identity [e.g., 7] to obtain

Hᵢ⁻¹ = (H − hᵢhᵢᵀ)⁻¹ = H⁻¹ + (H⁻¹hᵢhᵢᵀH⁻¹)/(1 − hᵢᵀH⁻¹hᵢ)

for all i ∉ I. Note that for i ∉ I we have hᵢᵀH⁻¹hᵢ < 1, so that the identity applies and the inverse on the right-hand side is well defined. We therefore have:

(‖hᵢ‖∗_{Hᵢ})² = hᵢᵀHᵢ⁻¹hᵢ = hᵢᵀH⁻¹hᵢ + (hᵢᵀH⁻¹hᵢ)²/(1 − hᵢᵀH⁻¹hᵢ) = aᵢ + aᵢ²/(1 − aᵢ) ≤ 2aᵢ,

where the inequality follows from the fact that 1 − aᵢ ≥ aᵢ for i ∉ I.
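The Sherman-Morrison step and the trace bound of Eq. (11) are easy to verify numerically; the sketch below (our own illustration, with random stand-ins for the gradients hᵢ) checks the rank-one-update identity, the bound by 2aᵢ, and Σᵢ aᵢ ≤ d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 4, 12, 0.5
h = rng.standard_normal((n, d)) * 0.2                  # stand-ins for the gradients h_i
H = np.eye(d) / sigma + sum(np.outer(v, v) for v in h) # H = (1/sigma) I_d + sum_i h_i h_i^T

for i in range(n):
    Hi = H - np.outer(h[i], h[i])                      # H_i = H - h_i h_i^T
    a = h[i] @ np.linalg.solve(H, h[i])                # a_i = h_i^T H^{-1} h_i
    lhs = h[i] @ np.linalg.solve(Hi, h[i])             # (||h_i||^*_{H_i})^2
    assert abs(lhs - (a + a ** 2 / (1 - a))) < 1e-8    # Sherman-Morrison identity
    if a <= 0.5:                                       # i.e., i not in I
        assert lhs <= 2 * a + 1e-12                    # the bound a_i + a_i^2/(1-a_i) <= 2 a_i

# the trace argument of Eq. (11): sum_i a_i <= d
assert sum(v @ np.linalg.solve(H, v) for v in h) <= d + 1e-9
```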
Summing this inequality over i ∉ I and recalling that the aᵢ's are nonnegative, we obtain

Σ_{i∉I} (‖hᵢ‖∗_{Hᵢ})² ≤ 2 Σ_{i∉I} aᵢ ≤ 2 Σ_{i=1}^n aᵢ = 2d,

which concludes the proof.

Theorem 3 is now obtained as an immediate consequence of our lemmas above.

Proof of Theorem 3. As a consequence of Lemmas 6 and 8, we have

(1/n) Σ_{i∈I} (fᵢ(ŵᵢ) − fᵢ(ŵ)) ≤ C|I|/n ≤ 2Cd/n,

and

(1/n) Σ_{i∉I} (fᵢ(ŵᵢ) − fᵢ(ŵ)) ≤ (6β/σn) Σ_{i∉I} (‖hᵢ‖∗_{Hᵢ})² ≤ 12βd/(σn).

Summing the inequalities and using 1/σ = 2 max{4C, 1/α} ≤ 2(4C + 1/α) gives the result.

4 Conclusions and Open Problems

We have proved the first fast convergence rate for a regularized ERM procedure for exp-concave loss functions. Our bounds match the existing lower bounds in the specific case of the squared loss up to constants, and improve by a logarithmic factor upon the best known upper bounds achieved by online methods.

Our stability analysis required us to assume smoothness of the loss functions, in addition to their exp-concavity. We note, however, that the Online Newton Step algorithm of Hazan et al. [8] for online exp-concave optimization does not require such an assumption.
Even though most of the popular exp-concave loss functions are also smooth, it would be interesting to understand whether smoothness is indeed required for the convergence of the ERM estimator we study in the present paper, or whether it is simply a limitation of our analysis.

Another interesting issue left open in our work is how to obtain bounds on the excess risk of ERM that hold with high probability, and not only in expectation. Since the excess risk is non-negative, one can always apply Markov's inequality to obtain a bound that holds with probability $1 - \delta$ but scales linearly with $1/\delta$. Also, using standard concentration inequalities (or success amplification techniques), we may obtain high-probability bounds that scale with $\sqrt{\log(1/\delta)/n}$, losing the fast $1/n$ rate. We leave the problem of obtaining bounds that depend both linearly on $1/n$ and logarithmically on $1/\delta$ to future work.

References

[1] J. D. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.

[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[3] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[4] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[7] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3.
JHU Press, 2012.

[8] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[9] D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.

[10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[11] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.

[12] J. Kivinen and M. K. Warmuth. Averaging expert predictions. In Computational Learning Theory, pages 153–167. Springer, 1999.

[13] T. Koren. Open problem: Fast stochastic exp-concave optimization. In Conference on Learning Theory, pages 1073–1075, 2013.

[14] G. Lecué and S. Mendelson. Performance of empirical risk minimization in linear aggregation. arXiv preprint arXiv:1402.5763, 2014.

[15] M. Mahdavi, L. Zhang, and R. Jin. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of The 28th Conference on Learning Theory, 2015.

[16] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

[17] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1–41, 2014.

[18] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

[19] O. Shamir.
The sample complexity of learning linear predictors with the squared loss. arXiv preprint arXiv:1406.5143, 2014.

[20] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.

[21] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, pages 1545–1552, 2009.

[22] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.