{"title": "Smoothness, Low Noise and Fast Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 2199, "page_last": 2207, "abstract": "We establish an excess risk bound of O(H R_n^2 + sqrt{H L*} R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = sqrt{R/n}, this translates to a learning rate of ~O(RH/n) in the separable (L* = 0) case and O(RH/n + sqrt{L* RH/n}) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.", "full_text": "Smoothness, Low-Noise and Fast Rates

Nathan Srebro (nati@ttic.edu), Karthik Sridharan (karthik@ttic.edu)
Toyota Technological Institute at Chicago
Ambuj Tewari (ambuj@cs.utexas.edu)
Computer Science Dept., University of Texas at Austin

Abstract

We establish an excess risk bound of ~O(H R_n^2 + sqrt(H L*) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = sqrt(R/n), this translates to a learning rate of ~O(RH/n) in the separable (L* = 0) case and ~O(RH/n + sqrt(L* RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.

1 Introduction

Consider empirical risk minimization for a hypothesis class H = {h : X -> R} w.r.t. some non-negative loss function phi(t, y).
That is, we would like to learn a predictor h with small risk L(h) = E[phi(h(X), Y)] by minimizing the empirical risk hat{L}(h) = (1/n) sum_{i=1}^n phi(h(x_i), y_i) of an i.i.d. sample (x_1, y_1), ..., (x_n, y_n).

Statistical guarantees on the excess risk are well understood for parametric (i.e. finite-dimensional) hypothesis classes. More formally, these are hypothesis classes with finite VC-subgraph dimension [23] (aka pseudo-dimension). For such classes, learning guarantees can be obtained for any bounded loss function (i.e. s.t. |phi| <= b < infinity), and the relevant measure of complexity is the VC-subgraph dimension.

Alternatively, even for some non-parametric hypothesis classes (i.e. those with infinite VC-subgraph dimension), e.g. the class of low-norm linear predictors H_B = {h_w : x -> <w, x>, ||w|| <= B}, guarantees can be obtained in terms of scale-sensitive measures of complexity such as fat-shattering dimensions [1], covering numbers [23] or Rademacher complexity [2]. The classical statistical learning theory approach for obtaining learning guarantees for such scale-sensitive classes is to rely on the Lipschitz constant D of phi(t, y) w.r.t. t (i.e. a bound on its derivative w.r.t. t). The excess risk can then be bounded as (in expectation over the sample):

L(hat{h}) <= L* + 2 D R_n(H) = L* + 2 sqrt(D^2 R / n)     (1)

where hat{h} = argmin hat{L}(h) is the empirical risk minimizer (ERM), L* = inf_h L(h) is the approximation error, and R_n(H) is the Rademacher complexity, which typically scales as R_n(H) = sqrt(R/n). E.g. for l2-bounded linear predictors, R = B^2 sup ||X||_2^2.

In this paper we address two deficiencies of the guarantee (1). First, the bound applies only to loss functions with bounded derivative, like the hinge loss and logistic loss popular for classification, or the absolute-value (l1) loss for regression.
It is not directly applicable to the squared loss phi(t, y) = (1/2)(t - y)^2, for which the second derivative is bounded, but not the first. We could try to simply bound the derivative of the squared loss in terms of a bound on the magnitude of h(x), but e.g. for norm-bounded linear predictors H_B this results in a very disappointing excess risk bound of the form O(sqrt(B^4 (max ||X||)^4 / n)). One aim of this paper is to provide clean bounds on the excess risk for smooth loss functions, such as the squared loss, with a bounded second, rather than first, derivative.

The second deficiency of (1) is the dependence on 1/sqrt(n), which might be unavoidable in general. But at least for finite-dimensional (parametric) classes, we know it can be improved to a 1/n rate when the distribution is separable, i.e. when there exists h in H with L(h) = 0 and so L* = 0. In particular, if H is a class of bounded functions with VC-subgraph dimension d (e.g. d-dimensional linear predictors), then in expectation over the sample [22]:

L(hat{h}) <= L* + O( sqrt(d D L* log n / n) + d D log n / n )     (2)

The sqrt(1/n) term disappears in the separable case, and we get a graceful degradation between the sqrt(1/n) rate and the 1/n rate for the separable case. Could we get a 1/n separable rate, and such a graceful degradation, in the non-parametric case? As we will show, the two deficiencies are actually related.
For non-parametric classes and a non-smooth Lipschitz loss, such as the hinge loss, the excess risk might scale as sqrt(1/n) and not 1/n, even in the separable case. However, for H-smooth non-negative loss functions, where the second derivative of phi(t, y) w.r.t. t is bounded by H, a 1/n separable rate is possible. In Section 2 we obtain the following bound on the excess risk (up to logarithmic factors):

L(hat{h}) <= L* + ~O( H R_n^2(H) + sqrt(H L*) R_n(H) ) = L* + ~O( HR/n + sqrt(H R L* / n) ) <= 2 L* + ~O( HR/n ).     (3)

In particular, for l2-norm-bounded linear predictors H_B with sup ||X||_2^2 <= 1, the excess risk is bounded by ~O(H B^2 / n + sqrt(H B^2 L* / n)). Another interesting distinction between parametric and non-parametric classes is that, even for the squared loss, the bound (3) is tight and the non-separable rate of 1/sqrt(n) is unavoidable. This is in contrast to the parametric (finite-dimensional) case, where a rate of 1/n is always possible for the squared loss, regardless of the approximation error L* [16]. The differences between parametric and scale-sensitive classes, and between non-smooth, smooth and strongly convex loss functions, are discussed in Section 4 and summarized in Table 1.

The guarantees discussed thus far are general learning guarantees for the stochastic setting that rely only on the Rademacher complexity of the hypothesis class, and are phrased in terms of minimizing some scalar loss function. In Section 3 we consider also the online setting, in addition to the stochastic setting, and present similar guarantees for online and stochastic convex optimization [32, 24].
The guarantees of Section 3 match equation (3) for the special case of a convex loss function and norm-bounded linear predictors, but Section 3 captures a more general setting of optimizing an arbitrary non-negative convex objective, which we require to be smooth (there is no separate discussion of a “predictor” and a scalar loss function in Section 3). Results in Section 3 are expressed in terms of properties of the norm, rather than a measure of concentration like the Rademacher complexity as in (3) and Section 2. However, the online and stochastic convex optimization setting of Section 3 is also more restrictive, as we require the objective to be convex (in Section 2 we make no assumption about the convexity of the hypothesis class H nor of the loss function phi). Specifically, for a non-negative H-smooth convex objective over a domain bounded by B, we prove that the average online regret (and the excess risk of stochastic optimization) is bounded by O(H B^2 / n + sqrt(H B^2 L* / n)). Comparing with the bound of O(sqrt(D^2 B^2 / n)) when the loss is D-Lipschitz rather than H-smooth [32, 21], we see the same relationship discussed above for ERM. Unlike the bound (3) for the ERM, the convex optimization bound avoids polylogarithmic factors. The results in Section 3 also generalize to smoothness and boundedness with respect to non-Euclidean norms.

Studying the online and stochastic convex optimization setting (Section 3), in addition to ERM (Section 2), has several advantages. First, it allows us to obtain a learning guarantee for an efficient single-pass learning method, namely stochastic gradient descent (or mirror descent), as well as for the non-stochastic regret. Second, the bound we obtain in the convex optimization setting (Section 3) is actually better than the bound for the ERM (Section 2), as it avoids all polylogarithmic and large constant factors.
Third, the bound is applicable to other non-negative online or stochastic optimization problems beyond classification, including problems for which ERM is not applicable (see, e.g., [24]). The detailed proofs of the statements claimed in this paper can be found in the supplementary material accompanying the paper.

2 Empirical Risk Minimization with Smooth Loss

Recall that the Rademacher complexity of H for any n in N is given by [2]:

R_n(H) = sup_{x_1,...,x_n in X} E_{sigma ~ Unif({+-1}^n)} [ sup_{h in H} | (1/n) sum_{i=1}^n sigma_i h(x_i) | ].     (4)

Throughout we shall consider this “worst case” Rademacher complexity. Our starting point is the learning bound (1) that applies to D-Lipschitz loss functions, i.e. such that |phi'(t, y)| <= D (we always take derivatives w.r.t. the first argument). What type of bound can we obtain if we instead bound the second derivative phi''(t, y)? We will actually avoid talking about the second derivative explicitly, and instead say that a function is H-smooth iff its derivative is H-Lipschitz. For twice-differentiable phi, this just means that |phi''| <= H. The central observation, which allows us to obtain guarantees for smooth loss functions, is that for a smooth loss, the derivative can be bounded in terms of the function value:

Lemma 2.1. For an H-smooth non-negative function f : R -> R, we have |f'(t)| <= sqrt(4 H f(t)).

This Lemma allows us to argue that close to the optimum, where the value of the loss is small, so is its derivative.
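Lemma 2.1 is easy to sanity-check numerically. The sketch below is our illustration, not part of the paper; it verifies the self-bounding property on a grid for the squared loss f(t) = (t - y)^2, which is 2-smooth:

```python
import math

def self_bounding_holds(f, fprime, H, ts):
    """Check Lemma 2.1's self-bounding property |f'(t)| <= sqrt(4*H*f(t))
    at each grid point t, for an H-smooth non-negative function f."""
    return all(abs(fprime(t)) <= math.sqrt(4.0 * H * f(t)) + 1e-12 for t in ts)

# The squared loss f(t) = (t - y)^2 is 2-smooth (|f''| = 2) and non-negative;
# here sqrt(4*H*f(t)) = 2*sqrt(2)*|t - y| >= |f'(t)| = 2*|t - y|.
y = 0.7
ts = [(-10 + 0.02 * i) for i in range(1001)]
print(self_bounding_holds(lambda t: (t - y) ** 2,
                          lambda t: 2.0 * (t - y),
                          2.0, ts))  # prints True
```

The slack factor sqrt(2) here reflects that the inequality is tight only for functions like f(t) = (H/4) t^2.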
Looking at the dependence of (1) on the derivative bound D, we are guided by the following heuristic intuition: since we should be concerned only with the behavior around the ERM, perhaps it is enough to bound phi'(hat{w}, x) at the ERM hat{w}. Applying Lemma 2.1 to L(hat{h}), we can bound |E[phi'(hat{w}, X)]| <= sqrt(4 H L(hat{h})). What we would actually want is to bound each |phi'(hat{w}, x)| separately, or at least have the absolute value inside the expectation; this is where the non-negativity of the loss plays an important role. Ignoring this important issue for the moment and plugging this quantity instead of D into (1) yields L(hat{h}) <= L* + 4 sqrt(H L(hat{h})) R_n(H). Solving for L(hat{h}) yields the desired bound (3). This rough intuition is captured by the following theorem:

Theorem 1. For an H-smooth non-negative loss phi s.t. for all x, y, h, |phi(h(x), y)| <= b, for any delta > 0 we have, with probability at least 1 - delta over a random sample of size n, for any h in H:

L(h) <= hat{L}(h) + K ( sqrt(hat{L}(h)) ( sqrt(H) log^{1.5}(n) R_n(H) + sqrt(b log(1/delta) / n) ) + H log^3(n) R_n^2(H) + b log(1/delta) / n )

and so:

L(hat{h}) <= L* + K ( sqrt(L*) ( sqrt(H) log^{1.5}(n) R_n(H) + sqrt(b log(1/delta) / n) ) + H log^3(n) R_n^2(H) + b log(1/delta) / n )

where K < 10^5 is a numeric constant derived from [20] and [6].

Note that only the “confidence” terms depend on b = sup |phi|, and this is typically not the dominant term; we believe it is possible to also obtain a bound that holds in expectation over the sample (rather than with high probability) and that avoids a direct dependence on sup |phi|.

To prove Theorem 1 we use the notion of Local Rademacher Complexity [3], which allows us to focus on the behavior close to the ERM.
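To make definition (4) concrete, here is a small Monte-Carlo illustration (ours, not from the paper) for l2-bounded linear predictors: for a fixed sign vector sigma, the supremum over {w : ||w|| <= B} has the closed form (B/n) ||sum_i sigma_i x_i||, so the empirical Rademacher complexity on a fixed sample can be estimated by averaging over random sigma. For unit-norm inputs the estimate comes out close to sqrt(B^2/n), matching the R = B^2 sup ||X||^2 scaling from the Introduction.

```python
import math, random

def rademacher_linear(xs, B, trials=500, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity (4) of
    H_B = {x -> <w, x> : ||w|| <= B} on the sample xs. For fixed sigma, the
    supremum over w is attained at w = B * v / ||v|| with v = sum_i sigma_i x_i,
    giving the value (B / n) * ||v||."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        v = [sum(s * x[j] for s, x in zip(sigma, xs)) for j in range(d)]
        total += B * math.sqrt(sum(vj * vj for vj in v)) / n
    return total / trials

# Random unit-norm inputs: the estimate should be close to sqrt(B^2 / n).
n, d, B = 200, 20, 1.0
rng = random.Random(1)
xs = []
for _ in range(n):
    x = [rng.gauss(0, 1) for _ in range(d)]
    nrm = math.sqrt(sum(c * c for c in x))
    xs.append([c / nrm for c in x])
est = rademacher_linear(xs, B)
print(est)  # close to sqrt(1/200) ~= 0.07
```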
To this end, consider the following empirically restricted loss class:

L_phi(r) := { (x, y) -> phi(h(x), y) : h in H, hat{L}(h) <= r }

Lemma 2.2, presented below, solidifies the heuristic intuition discussed above by showing that the Rademacher complexity of L_phi(r) scales with sqrt(H r). The Lemma can be seen as a higher-order version of the Lipschitz Composition Lemma [2], which states that the Rademacher complexity of the unrestricted loss class is bounded by D R_n(H). Here we use the second, rather than first, derivative, and obtain a bound that depends on the empirical restriction:

Lemma 2.2. For a non-negative H-smooth loss phi bounded by b and any function class H bounded by B:

R_n(L_phi(r)) <= sqrt(12 H r) R_n(H) ( 16 log^{3/2}( n B / R_n(H) ) - 14 log^{3/2}( sqrt(12 H) B sqrt(n) / sqrt(b) ) )

Applying Lemma 2.2, Theorem 1 follows using a standard Local Rademacher argument [3].

2.1 Related Results

Rates faster than 1/sqrt(n) have been previously explored under various conditions, including when L* is small.

The Finite-Dimensional Case: Lee et al [16] showed faster rates for the squared loss, exploiting the strong convexity of this loss function, even when L* > 0, but only with finite VC-subgraph dimension. Panchenko [22] provides fast-rate results for general Lipschitz bounded loss functions, still in the finite VC-subgraph-dimension case. Bousquet [6] provided similar guarantees for linear predictors in Hilbert spaces when the spectrum of the kernel matrix (the covariance of X) is exponentially decaying, making the situation almost finite-dimensional. All these methods rely on finiteness of the effective dimension to provide fast rates. In this case, smoothness is not necessary.
Our method, on the other hand, establishes fast rates, when L* = 0, for function classes that do not have finite VC-subgraph dimension. We show how, in this non-parametric case, smoothness is necessary and plays an important role (see also Table 1).

Aggregation: Tsybakov [29] studied learning rates for aggregation, where a predictor is chosen from the convex hull of a finite set of base predictors. This is equivalent to an l1 constraint where each base predictor is viewed as a “feature”. As with l1-based analyses, since the bounds depend only logarithmically on the number of base predictors (i.e. the dimensionality), and rely on the scale of change of the loss function, they are of a “scale sensitive” nature. For such an aggregate classifier, Tsybakov obtained a rate of 1/n when zero (or small) risk is achieved by one of the base classifiers. Using Tsybakov's result, it is not enough for zero risk to be achieved by an aggregate (i.e. bounded-l1) classifier in order to obtain the faster rate. Tsybakov's core result is thus in a sense more similar to the finite-dimensional results, since it allows for a rate of 1/n when zero error is achieved by a finite-cardinality (and hence finite-dimension) class. Tsybakov then used the approximation error of a small class of base predictors w.r.t. a large hypothesis class (i.e. a covering) to obtain learning rates for the large hypothesis class by considering aggregation within the small class. However, these results only imply fast learning rates for hypothesis classes with very low complexity. Specifically, to get learning rates better than 1/sqrt(n) using these results, the covering number of the hypothesis class at scale epsilon needs to behave as 1/epsilon^p for some p < 2.
But typical classes, including the class of linear predictors with bounded norm, have covering numbers that scale as 1/epsilon^2, and so these methods do not imply fast rates for such function classes. In fact, to get rates of 1/n with these techniques, even when L* = 0, requires covering numbers that do not increase with epsilon at all, and so actually finite VC-subgraph dimension. Chesneau et al [10] extend Tsybakov's work also to general losses, deriving similar results for Lipschitz loss functions. The same caveats hold: even when L* = 0, rates faster than 1/sqrt(n) require covering numbers that grow slower than 1/epsilon^2, and rates of 1/n essentially require finite VC-subgraph dimension. Our work, on the other hand, is applicable whenever the Rademacher complexity (equivalently, covering numbers) can be controlled. Although it uses some similar techniques, it is also rather different from the work of Tsybakov and Chesneau et al, in that it points out the importance of smoothness for obtaining fast rates in the non-parametric case: Chesneau et al relied only on the Lipschitz constant, which we show, in Section 4, is not enough for obtaining fast rates in the non-parametric case, even when L* = 0.

Local Rademacher Complexities: Bartlett et al [3] developed general machinery for proving possible fast rates based on local Rademacher complexities. However, it is important to note that the localized complexity term typically dominates the rate and still needs to be controlled. For example, Steinwart [27] used Local Rademacher Complexity to provide fast rates on the 0/1 loss of Support Vector Machines (SVMs) (l2-regularized hinge-loss minimization) based on the so-called “geometric margin condition” and Tsybakov's margin condition.
Steinwart's analysis is specific to SVMs. We also use Local Rademacher Complexities in order to obtain fast rates, but do so for general hypothesis classes, based only on the standard Rademacher complexity R_n(H) of the hypothesis class, as well as the smoothness of the loss function and the magnitude of L*, but without any further assumptions on the hypothesis class itself.

Non-Lipschitz Loss: Beyond the strong connections between smoothness and fast rates which we highlight, we are also not aware of prior work providing an explicit and easy-to-use result for controlling a generic non-Lipschitz loss (such as the squared loss) solely in terms of the Rademacher complexity.

3 Online and Stochastic Optimization of Smooth Convex Objectives

We now turn to online and stochastic convex optimization. In these settings a learner chooses w in W, where W is a closed convex subset of a normed vector space, attempting to minimize an objective l(w, z) on instances z in Z, where l : W x Z -> R is an objective function which is convex in w. This captures learning linear predictors w.r.t. a convex loss function phi(t, z), where Z = X x Y and l(w, (x, y)) = phi(<w, x>, y), and it extends beyond supervised learning. We consider the case where the objective l(w, z) is H-smooth w.r.t. some norm ||w|| (the reader may choose to think of W as a subset of a Euclidean or Hilbert space, and of ||w|| as the l2 norm): by this we mean that for any z in Z and all w, w' in W,

||grad l(w, z) - grad l(w', z)||_* <= H ||w - w'||

where ||.||_* is the dual norm. The key here is to generalize Lemma 2.1 to smoothness w.r.t. a vector w, rather than scalar smoothness:

Lemma 3.1.
For an H-smooth non-negative f : W -> R, for all w in W: ||grad f(w)||_* <= sqrt(4 H f(w)).

In order to handle general norms, we will also need a non-negative regularizer F : W -> R that is 1-strongly convex (see the definition in, e.g., [31]) w.r.t. the norm ||w|| on all of W. For the Euclidean norm we can use the squared-Euclidean-norm regularizer F(w) = (1/2) ||w||^2.

3.1 Online Optimization Setting

In the online convex optimization setting we consider an n-round game played between a learner and an adversary (Nature), where at each round i the player chooses w_i in W and then the adversary picks z_i in Z. The player's choice w_i may depend only on the adversary's choices in previous rounds. The goal of the player is to have low average objective value (1/n) sum_{i=1}^n l(w_i, z_i) compared to the best single choice in hindsight [9].

A classic algorithm for this setting is Mirror Descent [4], which starts at some arbitrary w_1 in W and updates w_{i+1} according to z_i and a stepsize eta (to be discussed later) as follows:

w_{i+1} <- argmin_{w in W} <eta grad l(w_i, z_i) - grad F(w_i), w> + F(w)     (5)

For the Euclidean norm with F(w) = (1/2)||w||^2, the update (5) becomes projected online gradient descent [32]: w_{i+1} <- Pi_W(w_i - eta grad l(w_i, z_i)), where Pi_W(w) = argmin_{w' in W} ||w - w'|| is the projection onto W.

Theorem 2. For any B in R and L*, if we use stepsize eta = 1 / (H B^2 + sqrt(H^2 B^4 + H B^2 n L*)) for the Mirror Descent algorithm, then for any instance sequence z_1, ..., z_n in Z, the average regret w.r.t. any w* in W s.t.
F(w*) <= B^2 and (1/n) sum_{i=1}^n l(w*, z_i) <= L* is bounded by:

(1/n) sum_{i=1}^n l(w_i, z_i) - (1/n) sum_{i=1}^n l(w*, z_i) <= 4 H B^2 / n + 2 sqrt(H B^2 L* / n)

Note that the stepsize depends on the bound L* on the loss in hindsight. The above theorem can be proved using Lemma 3.1 and Theorem 1 of [26].

3.2 Stochastic Optimization

An online algorithm can also serve as an efficient one-pass learning algorithm in the stochastic setting. Here, we again consider an i.i.d. sample z_1, ..., z_n from some unknown distribution (as in Section 2), and we would like to find w with low risk L(w) = E[l(w, Z)]. When z = (x, y) and l(w, z) = phi(<w, x>, y), this agrees with the supervised learning risk discussed in the Introduction and analyzed in Section 2. But instead of focusing on the ERM, we run Mirror Descent on the sample and then take tilde{w} = (1/n) sum_{i=1}^n w_i. Standard arguments [8] allow us to convert the online regret bound of Theorem 2 to a bound on the excess risk:

Corollary 3. For any B in R and L*, if we run Mirror Descent on the sample with eta = 1 / (H B^2 + sqrt(H^2 B^4 + H B^2 n L*)), then for any w* in W with F(w*) <= B^2 and L(w*) <= L*, in expectation over the sample:

L(tilde{w}_n) - L(w*) <= 4 H B^2 / n + 2 sqrt(H B^2 L* / n).

It is instructive to contrast this guarantee with similar-looking guarantees derived recently in the stochastic convex optimization literature [14]. There, the model is stochastic first-order optimization, i.e. the learner gets to see an unbiased estimate grad l(w, z_i) of the gradient of L(w).
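In the Euclidean case, update (5) is just projected online gradient descent, and Theorem 2 prescribes the stepsize explicitly. The sketch below is our illustration (the toy data stream and dimensions are ours): it runs the update on a separable stream with the squared loss, which is 2-smooth when ||x|| <= 1.

```python
import math, random

def project_ball(w, radius):
    """Euclidean projection Pi_W onto the ball {w : ||w|| <= radius}."""
    nrm = math.sqrt(sum(c * c for c in w))
    return w if nrm <= radius else [c * radius / nrm for c in w]

def ogd_smooth(data, d, H, B, Lstar):
    """Projected online gradient descent (Euclidean Mirror Descent, update (5))
    with the Theorem-2 stepsize eta = 1/(H*B^2 + sqrt(H^2*B^4 + H*B^2*n*Lstar)).
    Returns the average squared loss of the iterates on the sequence."""
    n = len(data)
    eta = 1.0 / (H * B * B + math.sqrt(H**2 * B**4 + H * B * B * n * Lstar))
    w = [0.0] * d
    total = 0.0
    for (x, y) in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2                    # squared loss, 2-smooth for ||x|| <= 1
        grad = [2.0 * (pred - y) * xi for xi in x]  # gradient w.r.t. w
        w = project_ball([wi - eta * gi for wi, gi in zip(w, grad)], B)
    return total / n

# Separable toy stream (L* = 0): y = <w*, x> with ||w*|| <= B = 1, ||x|| = 1.
rng = random.Random(0)
d, n = 5, 2000
wstar = project_ball([rng.gauss(0, 1) for _ in range(d)], 1.0)
data = []
for _ in range(n):
    x = [rng.gauss(0, 1) for _ in range(d)]
    nx = math.sqrt(sum(c * c for c in x))
    x = [c / nx for c in x]
    data.append((x, sum(wi * xi for wi, xi in zip(wstar, x))))
avg_loss = ogd_smooth(data, d, H=2.0, B=1.0, Lstar=0.0)
print(avg_loss)  # small; Theorem 2 bounds the average regret here by 4*H*B^2/n = 0.004
```

With L* = 0 the stepsize reduces to eta = 1/(2 H B^2), and the average loss decays at the fast 1/n rate, as the theorem predicts.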
The variance of the estimate is assumed to be bounded by sigma^2. The expected accuracy after n gradient evaluations then has two terms: an “accelerated” term that is O(H/n^2) and a slow O(sigma/sqrt(n)) term. While this result is applicable more generally (since it does not require non-negativity of l), it is not immediately clear whether our guarantees can be derived from it. The main difficulty is that sigma depends on the norm of the gradient estimates; thus, it cannot be bounded in advance even if we know that L(w*) is small. That said, it is intuitively clear that towards the end of the optimization process the gradient norms will typically be small if L(w*) is small, because of the self-bounding property (Lemma 3.1).

It is interesting to note that using stability arguments, a guarantee very similar to Corollary 3, avoiding the polylogarithmic factors of Theorem 1 as well as the dependence on the bound on the loss, can be obtained also for a “batch” learning rule similar to ERM, but incorporating regularization. For a given regularization parameter lambda > 0, define the regularized empirical loss as hat{L}_lambda(w) := hat{L}(w) + lambda F(w) and consider the Regularized Empirical Risk Minimizer

hat{w}_lambda = argmin_{w in W} hat{L}_lambda(w)     (6)

The following theorem provides a bound on the excess risk similar to Corollary 3:

Theorem 4. For any B in R and L*, if we set lambda = 128 H / n + sqrt(128^2 H^2 / n^2 + 128 H L* / (n B^2)), then for all w* in W with F(w*) <= B^2 and L(w*) <= L*, we have, in expectation over a sample of size n:

L(hat{w}_lambda) - L(w*) <= 256 H B^2 / n + sqrt(2048 H B^2 L* / n).

To prove Theorem 4 we use stability arguments similar to the ones used by Shalev-Shwartz et al [24], which are in turn based on Bousquet and Elisseeff [7].
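For the squared loss with F(w) = ||w||^2 / 2, the regularized ERM (6) is just ridge regression. As a minimal illustration (ours; the toy data and lambda values are arbitrary and are not the Theorem 4 choice), the scalar case can be solved in closed form:

```python
import math, random

def regularized_erm_1d(data, lam):
    """Closed-form minimizer of the regularized empirical loss (6) for scalar
    linear prediction with squared loss and F(w) = w^2 / 2:
      hat{L}_lam(w) = (1/n) * sum (w*x - y)^2 + lam * w^2 / 2,
    whose stationarity condition gives w = sum(x*y) / (sum(x^2) + lam*n/2)."""
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam * n / 2.0)

# Noisy 1-D data around w* = 0.8; a larger lam shrinks the estimate toward 0.
rng = random.Random(0)
data = [(x, 0.8 * x + rng.gauss(0, 0.1))
        for x in (rng.uniform(-1, 1) for _ in range(500))]
w_small = regularized_erm_1d(data, lam=1e-4)
w_big = regularized_erm_1d(data, lam=10.0)
print(abs(w_small - 0.8) < 0.05, abs(w_big) < abs(w_small))  # True True
```

Theorem 4's point is that a lambda of order H/n (plus a sqrt(H L* / (n B^2)) correction) recovers the fast-rate guarantee without the polylogarithmic factors of Theorem 1.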
However, while Shalev-Shwartz et al [24] use the notion of uniform stability, here it is necessary to look at stability in expectation to get the faster rates.

4 Tightness

In this section we return to the learning rates for the ERM for parametric and for scale-sensitive hypothesis classes (i.e. in terms of the dimensionality and in terms of scale-sensitive complexity measures), discussed in the Introduction and analyzed in Section 2. We compare the guarantees on the learning rates in different situations, identify differences between the parametric and scale-sensitive cases and between the smooth and non-smooth cases, and argue that these differences are real by showing that the corresponding guarantees are tight. Although we discuss the tightness of the learning guarantees for ERM in the stochastic setting, similar arguments can also be made for online learning.

Table 1 summarizes the bounds on the excess risk of the ERM implied by Theorem 1, as well as previous bounds for Lipschitz loss on finite-dimensional [22] and scale-sensitive [2] classes, and a bound for the squared loss on finite-dimensional classes [9, Theorem 11.7] that can be generalized to any smooth strongly convex loss.

Loss function is:                    | Parametric: dim(H) <= d, |h| <= 1 | Scale-sensitive: R_n(H) <= sqrt(R/n)
D-Lipschitz                          | dD/n + sqrt(dD L* / n)            | sqrt(D^2 R / n)
H-smooth                             | dH/n + sqrt(dH L* / n)            | HR/n + sqrt(HR L* / n)
H-smooth and lambda-strongly convex  | dH/n                              | (H/lambda)(HR/n) + sqrt(HR L* / n)

Table 1: Bounds on the excess risk, up to polylogarithmic factors.

We shall now show that the 1/sqrt(n) dependencies in Table 1 are unavoidable.
To do so, we will consider the class H = {x -> <w, x> : ||w|| <= 1} of l2-bounded linear predictors (all norms in this section are Euclidean), with different loss functions and various specific distributions over X x Y, where X = {x in R^d : ||x|| <= 1} and Y = [0, 1]. For the non-parametric lower bounds, we will allow the dimensionality d to grow with the sample size n.

Infinite-dimensional, Lipschitz (non-smooth), separable: Consider the absolute-difference loss phi(h(x), y) = |h(x) - y|, take d = 2n, and consider the following distribution: X is uniformly distributed over the d standard basis vectors e_i, and if X = e_i then Y = r_i / sqrt(n), where r_1, ..., r_d in {+-1} is an arbitrary sequence of signs unknown to the learner. Taking w* = (1/sqrt(n)) sum_{i=1}^n r_i e_i, we have ||w*|| = 1 and L* = L(w*) = 0. However, any sample (x_1, y_1), ..., (x_n, y_n) reveals at most n of the 2n signs r_i, and no information on the remaining signs. This means that for any learning algorithm there exists a choice of the r_i's such that on at least n of the remaining points not seen by the learner, it must suffer a loss of at least 1/sqrt(n), yielding an overall risk of at least 1/sqrt(4n).

Infinite-dimensional, smooth, non-separable, even if strongly convex: Consider the squared loss phi(h(x), y) = (h(x) - y)^2, which is 2-smooth and 2-strongly convex. For any sigma >= 0, let d = sqrt(n)/sigma and consider the following distribution: X is uniform over the e_i as before, but this time Y|X is random, with Y | (X = e_i) ~ N(r_i / (2 sqrt(d)), sigma), where again the r_i are pre-determined random signs unknown to the learner.
The minimizer of the expected risk is w* = sum_{i=1}^d (r_i / (2 sqrt(d))) e_i, with ||w*|| = 1/2 and L* = L(w*) = sigma^2. Furthermore, for any w in W,

L(w) - L(w*) = E[<w - w*, X>^2] = (1/d) sum_{i=1}^d (w[i] - w*[i])^2 = (1/d) ||w - w*||^2

If the norm constraint becomes tight, i.e. ||hat{w}|| = 1, then L(hat{w}) - L(w*) >= 1/(4d) = sigma/(4 sqrt(n)) = sqrt(L*)/(4 sqrt(n)). Otherwise, each coordinate is a separate mean-estimation problem with n_i samples, where n_i is the number of appearances of e_i in the sample. We have E[(hat{w}[i] - w*[i])^2] = sigma^2 / n_i, and so L(hat{w}) - L* = (1/d) sum_{i=1}^d sigma^2 / n_i >= d sigma^2 / n = sqrt(L* / n).

Finite-dimensional, smooth, not strongly convex, non-separable: Take d = 1, with X = 1 with probability q and X = 0 with probability 1 - q. Conditioned on X = 0, let Y = 0 deterministically, while conditioned on X = 1, let Y = +1 with probability p = 1/2 + 0.2/sqrt(qn) and Y = -1 with probability 1 - p. Consider the following 1-smooth loss:

phi(h(x), y) = (h(x) - y)^2 if |h(x) - y| <= 1/2;  |h(x) - y| - 1/4 if |h(x) - y| >= 1/2

First, irrespective of the choice of w, when x = 0 we always have h(x) = 0 and so suffer no loss; this happens with probability 1 - q. Next, observe that for p > 0.5 the optimal predictor is w* >= 1/2. However, for n > 20, with probability at least 0.25, sum_{i=1}^n y_i < 0, and so hat{w} <= -1/2.
Hence, L(hat{w}) - L* > L(-1/2) - L(1/2) = sqrt(0.16 q / n). But for p > 0.5 and n > 20 we have L* < q/2, and so with probability 0.25, L(hat{w}) - L* > sqrt(0.32 L* / n).

5 Implications

5.1 Improved Margin Bounds

“Margin bounds” bound the expected zero-one loss of a classifier based on its margin 0/1 error on the training sample. Koltchinskii and Panchenko [13] provide margin bounds for a generic class H based on the Rademacher complexity of the class. This is done by using a non-smooth Lipschitz “ramp” loss that upper bounds the zero-one loss and is upper bounded by the margin zero-one loss. However, such an analysis unavoidably leads to a 1/sqrt(n) rate, even in the separable case. Following the same idea, we use the following smooth “ramp”:

phi(t) = 1 for t <= 0;  (1 + cos(pi t / gamma)) / 2 for 0 < t < gamma;  0 for t >= gamma

This loss function is (pi^2 / (4 gamma^2))-smooth, is lower bounded by the zero-one loss, and is upper bounded by the gamma-margin loss. Using Theorem 1 we can now provide improved margin bounds for the zero-one loss of any classifier based on its empirical margin error. Denote by err(h) = E[1{h(x) != y}] the zero-one risk, and for any gamma > 0 and sample (x_1, y_1), ..., (x_n, y_n) in X x {+-1} define the gamma-margin empirical zero-one loss as hat{err}_gamma(h) := (1/n) sum_{i=1}^n 1{y_i h(x_i) < gamma}.

Theorem 5.
For any hypothesis class $\mathcal{H}$ with $|h| \leq b$, and any $\delta > 0$, with probability at least $1 - \delta$, simultaneously for all margins $\gamma > 0$ and all $h \in \mathcal{H}$:

$$\mathrm{err}(h) \leq \widehat{\mathrm{err}}_\gamma(h) + K\left(\sqrt{\widehat{\mathrm{err}}_\gamma(h)}\left(\frac{\log^{1.5} n}{\gamma}\, \mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\log(\log(\frac{4b}{\gamma})/\delta)}{n}}\right) + \frac{\log^3 n}{\gamma^2}\, \mathcal{R}_n^2(\mathcal{H}) + \frac{\log(\log(\frac{4b}{\gamma})/\delta)}{n}\right)$$

where $K$ is a numeric constant from Theorem 1. In particular, for an appropriate numeric constant $K'$:

$$\mathrm{err}(h) \leq 1.01\, \widehat{\mathrm{err}}_\gamma(h) + K'\left(\frac{2\log^3 n}{\gamma^2}\, \mathcal{R}_n^2(\mathcal{H}) + \frac{2\log(\log(\frac{4b}{\gamma})/\delta)}{n}\right)$$

Improved margin bounds of the above form have previously been shown specifically for linear prediction in a Hilbert space, based on the PAC-Bayes theorem [19, 15]. However, PAC-Bayes based results are specific to certain linear function classes. Theorem 5, in contrast, is a generic concentration-based result that can be applied to any function class.

5.2 Interaction of Norm and Dimension

Consider the problem of learning a low-norm linear predictor with respect to the squared loss $\phi(t, z) = (t - z)^2$, where $X \in \mathbb{R}^d$, for finite but very large $d$, and where the expected norm of $X$ is low. Specifically, let $X$ be Gaussian with $\mathbb{E}\|X\|^2 = B$, $Y = \langle w^*, X\rangle + \mathcal{N}(0, \sigma^2)$ with $\|w^*\| = 1$, and consider learning a linear predictor using $\ell_2$ regularization. What determines the sample complexity? How does the error decrease as the sample size increases? From a scale-sensitive statistical learning perspective, we expect that the sample complexity, and the decrease of the error, should depend on the norm $B$, especially if $d \gg B^2$.
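This setup is easy to probe with a small simulation. The sketch below is illustrative only: the concrete values of $d$, $B$, $\sigma$, $n$ and $\lambda$ are arbitrary choices (not from the paper), $X \sim \mathcal{N}(0, (B/d) I)$ is one way to realize a Gaussian with $\mathbb{E}\|X\|^2 = B$, and ridge regression is solved in closed form:

```python
import numpy as np

# A minimal simulation of the Gaussian setup above (all concrete values are
# illustrative choices, not taken from the paper): X ~ N(0, (B/d) I), so that
# E||X||^2 = B, and Y = <w*, X> + N(0, sigma^2) with ||w*|| = 1.
rng = np.random.default_rng(0)
d, B, sigma = 200, 25.0, 0.5
w_star = np.zeros(d)
w_star[0] = 1.0                                   # ||w*|| = 1

def ridge_excess_risk(n, lam):
    """Fit ridge regression on n samples; return L(w_hat) - L*."""
    X = rng.normal(scale=np.sqrt(B / d), size=(n, d))
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # With covariance (B/d) I, the excess risk is (B/d) * ||w_hat - w*||^2.
    return (B / d) * np.sum((w_hat - w_star) ** 2)

for n in [50, 500, 5000]:
    print(n, ridge_excess_risk(n, lam=1.0))
```

Because the covariance is isotropic here, the excess risk has the closed form $(B/d)\|\hat{w} - w^*\|^2$, which the sketch uses directly instead of estimating risk from held-out data.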
However, for any fixed $d$ and $B$, even if $d \gg B^2$, asymptotically as the number of samples increases, the excess risk of norm-constrained or norm-regularized regression actually behaves as $L(\hat{w}) - L^* \approx \frac{d}{n}\sigma^2$, and depends (to first order) only on the dimensionality $d$ and not on $B$ [17].

The asymptotic dependence on the dimensionality alone can be understood through Table 1: in this non-separable situation, parametric complexity controls can lead to a $1/n$ rate, ultimately dominating the $1/\sqrt{n}$ rate resulting from $L^* > 0$ when considering the scale-sensitive, non-parametric complexity control $B$. Combining Theorem 4 with the asymptotic $\frac{d}{n}\sigma^2$ behavior, and noting that in the worst case we can predict using the zero vector, yields the following overall picture of the expected excess risk of ridge regression with an optimally chosen $\lambda$:

$$L(\hat{w}_\lambda) - L^* \leq O\left(\min\left(B^2,\; B^2/n + B\sigma/\sqrt{n},\; d\sigma^2/n\right)\right)$$

Roughly speaking, each term above describes the behavior in a different regime of the sample size. The first regime has excess risk of order $B^2$, which lasts until $n = \Theta(B^2)$. The second ("low-noise") regime is one where the excess risk is dominated by the norm and behaves as $B^2/n$, until $n = \Theta(B^2/\sigma^2)$ and $L(\hat{w}) = \Theta(L^*)$. The third ("slow") regime, where the excess risk is controlled by the norm and the approximation error, behaves as $B\sigma/\sqrt{n}$, until $n = \Theta(d^2\sigma^2/B^2)$ and $L(\hat{w}) = L^* + \Theta(B^2/d)$. The fourth ("asymptotic") regime is where the excess risk behaves as $\frac{d}{n}\sigma^2$. This sheds further light on recent work by Liang and Srebro [18] based on exact asymptotics.

5.3 Sparse Prediction

The use of the $\ell_1$ norm has become popular for learning sparse predictors in high dimensions, as in the LASSO.
The LASSO estimator [28] $\hat{w}$ is obtained by considering the squared loss $\phi(z, y) = (z - y)^2$ and minimizing $\hat{L}(w)$ subject to $\|w\|_1 \leq B$. Let us assume there is some (unknown) sparse reference predictor $w_0$ that has low expected loss and sparsity (number of non-zeros) $\|w_0\|_0 = k$, and that $\|x\|_\infty \leq 1$, $|y| \leq 1$. In order to choose $B$ and apply Theorem 1 in this setting, we need to bound $\|w_0\|_1$. This can be done by, e.g., assuming that the features $x[i]$ in the support of $w_0$ are mutually uncorrelated. Under such an assumption, we have $\|w_0\|_1^2 \leq k\, \mathbb{E}\langle w_0, x\rangle^2 \leq 2k\left(L(w_0) + \mathbb{E} y^2\right) \leq 4k$. Thus, Theorem 1 along with Rademacher complexity bounds from [11] gives us

$$L(\hat{w}) \leq L(w_0) + \tilde{O}\left(k \log(d)/n + \sqrt{k\, L(w_0) \log(d)/n}\right). \qquad (7)$$

It is possible to relax the no-correlation assumption to a bound on the correlations, as in mutual incoherence, or to other weaker conditions [25]. But in any case, unlike typical analyses for compressed sensing, where the goal is recovering $w_0$ itself, here we are only concerned with correlations inside the support of $w_0$. Furthermore, we do not require that the optimal predictor is sparse or that the model is well specified: only that there exists a low-risk predictor using a small number of fairly uncorrelated features.

Bounds similar to (7) have been derived using specialized arguments [12, 30, 5]; here we demonstrate that bounds of these forms can be obtained under simple conditions, using the generic framework we suggest. It is also interesting to note that the methods and results of Section 3 can also be applied to this setting.
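The norm-bound step used above, $\|w_0\|_1^2 \leq k\, \mathbb{E}\langle w_0, x\rangle^2$ for uncorrelated support features, is easy to check numerically. A minimal sketch, assuming i.i.d. Rademacher ($\pm 1$) features (so $\|x\|_\infty \leq 1$ and $\mathbb{E}[x[i]x[j]] = \mathbb{1}_{\{i=j\}}$) and an arbitrary illustrative $k$-sparse $w_0$:

```python
import numpy as np

# Numerical sanity check of ||w0||_1^2 <= k * E<w0, x>^2 when the support
# features are uncorrelated with unit second moment. Features here are i.i.d.
# Rademacher (+-1), an illustrative choice satisfying ||x||_inf <= 1; the
# k-sparse reference predictor w0 is likewise arbitrary.
rng = np.random.default_rng(1)
d, k = 100, 5
w0 = np.zeros(d)
w0[:k] = rng.normal(size=k)                     # ||w0||_0 = k non-zeros

x = rng.choice([-1.0, 1.0], size=(20000, d))    # uncorrelated, bounded features
lhs = np.sum(np.abs(w0)) ** 2                   # ||w0||_1^2
rhs = k * np.mean((x @ w0) ** 2)                # k * empirical E<w0, x>^2
assert lhs <= rhs * 1.05                        # Cauchy-Schwarz, up to sampling error
```

Replacing the Rademacher draws with strongly correlated support features can break the first inequality, which is exactly why the no-correlation (or weaker incoherence-type) assumption is needed.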
We use the entropy regularizer

$$F(w) = B \sum_i w[i] \log\left(\frac{w[i]}{B/d}\right) + \frac{B^2}{e} \qquad (8)$$

which is non-negative and 1-strongly convex with respect to $\|w\|_1$ on $\mathcal{W} = \left\{w \in \mathbb{R}^d \,\middle|\, w[i] \geq 0, \|w\|_1 \leq B\right\}$, with $F(w) \leq B^2(1 + \log d)$ (we consider here only non-negative weights; in order to allow $w[i] < 0$ we can also include each feature's negation). Recalling that $\|w_0\|_1 \leq 2\sqrt{k}$ and using $B = 2\sqrt{k}$ in the entropy regularizer (8), we have from Theorem 4 that $L(\hat{w}_\lambda) \leq L(w_0) + O\left(k \log(d)/n + \sqrt{k\, L(w_0) \log(d)/n}\right)$, where $\hat{w}_\lambda$ is the regularized empirical minimizer (6) using the entropy regularizer (8) with $\lambda$ as in Theorem 4. The advantage here is that using Theorem 4 instead of Theorem 1 avoids the extra logarithmic factors.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In FOCS, pages 292-301, 1993.

[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.

[3] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167-175, 2003.

[5] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.

[6] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

[7] O. Bousquet and A. Elisseeff.
Stability and generalization. J. Mach. Learn. Res., 2:499-526, 2002.

[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. In NIPS, pages 359-366, 2002.

[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[10] C. Chesneau and G. Lecué. Adapting to unknown smoothness by aggregation of thresholded wavelet estimators. 2006.

[11] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 2008.

[12] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Statist., 45(1):7-57, 2009.

[13] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. of Stats., 30(1):1-50, 2002.

[14] G. Lan. Convex Optimization Under Inexact First-order Information. PhD thesis, Georgia Institute of Technology, 2009.

[15] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems 15, pages 423-430, 2003.

[16] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Trans. on Information Theory, 1998.

[17] P. Liang, F. Bach, G. Bouchard, and M. I. Jordan. Asymptotically optimal regularization in smooth parametric models. In NIPS, 2010.

[18] P. Liang and N. Srebro. On the interaction between norm and dimensionality: Multiple regimes in learning. In ICML, 2010.

[19] D. A. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203-215, 2003.

[20] S. Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Trans. on Information Theory, 48(1):251-263, 2002.

[21] A. Nemirovski and D. Yudin.
Problem Complexity and Method Efficiency in Optimization. Nauka Publishers, Moscow, 1978.

[22] D. Panchenko. Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65, 2002.

[23] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[24] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT, 2009.

[25] S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTI-C, 2009. Available at ttic.uchicago.edu/~shai.

[26] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, Hebrew University of Jerusalem, 2007.

[27] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575, 2007.

[28] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267-288, 1996.

[29] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.

[30] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614-645, 2008.

[31] C. Zalinescu. Convex Analysis in General Vector Spaces. World Scientific Publishing Co. Inc., River Edge, NJ, 2002.

[32] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.