{"title": "On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": "We provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes. These bounds make short work of providing a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either $L_2$ or $L_1$ constraints), margin bounds (including both $L_2$ and $L_1$ margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and $L_2$ covering numbers (with $L_p$ norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds (improving upon a number of previous results). Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction (up to a constant factor of 2).", "full_text": "On the Complexity of Linear Prediction:\n\nRisk Bounds, Margin Bounds, and Regularization\n\nSham M. Kakade\n\nTTI Chicago\n\nChicago, IL 60637\nsham@tti-c.org\n\nKarthik Sridharan\n\nTTI Chicago\n\nChicago, IL 60637\n\nAmbuj Tewari\n\nTTI Chicago\n\nChicago, IL 60637\n\nkarthik@tti-c.org\n\ntewari@tti-c.org\n\nAbstract\n\nThis work characterizes the generalization ability of algorithms whose predic-\ntions are linear in the input vector. To this end, we provide sharp bounds for\nRademacher and Gaussian complexities of (constrained) linear classes, which di-\nrectly lead to a number of generalization bounds. 
This derivation provides simpli\ufb01ed proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a uni\ufb01ed analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2.\n\n1 Introduction\n\nLinear prediction is the cornerstone of an extensive number of machine learning algorithms, including SVMs, logistic and linear regression, the lasso, boosting, etc. A paramount question is to understand the generalization ability of these algorithms in terms of the attendant complexity restrictions imposed by the algorithm. For example, for sparse methods (e.g. regularizing based on the L1 norm of the weight vector) we seek generalization bounds in terms of the sparsity level. For margin based methods (e.g. SVMs or boosting), we seek generalization bounds in terms of either the L2 or L1 margins. The focus of this paper is to provide a more uni\ufb01ed analysis for methods which use linear prediction.\n\nGiven a training set {(xi, yi)}_{i=1}^n, the paradigm is to compute a weight vector \u02c6w which minimizes the F -regularized \u2113-risk. More speci\ufb01cally,\n\n\u02c6w = argmin_w (1/n) \u03a3_{i=1}^n \u2113(hw, xii , yi) + \u03bbF (w)    (1)\n\nwhere \u2113 is the loss function, F is the regularizer, and hw, xi is the inner product between vectors x and w. 
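As a concrete, hypothetical instance of the regularized objective in Equation 1, the sketch below minimizes an L2-regularized logistic loss by plain gradient descent. The loss choice, synthetic data, step size, and iteration count are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def regularized_erm(X, y, lam=0.1, eta=0.1, steps=500):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||^2 / 2.

    An illustrative instance of Equation 1 with logistic loss and F(w) = ||w||^2 / 2.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the average logistic loss, plus the regularizer's gradient
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
Xd = rng.normal(size=(200, 5))
w_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
yd = np.sign(Xd @ w_true)          # linearly separable labels
w_hat = regularized_erm(Xd, yd)
```

Swapping the regularizer `F` (e.g. for an entropy-based penalty) changes only the `lam * w` term, which is the sense in which the paper treats a whole family of regularizers at once.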
In a formulation closely related to the dual problem, we have:\n\n\u02c6w = argmin_{w : F (w) \u2264 c} (1/n) \u03a3_{i=1}^n \u2113(hw, xii , yi)    (2)\n\nwhere, instead of regularizing, a hard restriction over the parameter space is imposed (by the constant c). This work provides generalization bounds for an extensive family of regularization functions F .\n\nRademacher complexities (a measure of the complexity of a function class) provide a direct route to obtaining such generalization bounds, and this is the route we take. Such bounds are analogous to VC dimension bounds, but they are typically much sharper and allow for distribution dependent bounds. There are a number of methods in the literature to use Rademacher complexities to obtain either generalization bounds or margin bounds. Bartlett and Mendelson [2002] provide a generalization bound for Lipschitz loss functions. For binary prediction, the results in Koltchinskii and Panchenko [2002] provide means to obtain margin bounds through Rademacher complexities.\n\nIn this work, we provide sharp bounds for Rademacher and Gaussian complexities of linear classes, with respect to a strongly convex complexity function F (as in Equation 1). These bounds provide simpli\ufb01ed proofs of a number of corollaries: generalization bounds for the regularization algorithm in Equation 2 (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including L2 and L1 margins, and, more generally, for Lp margins), a proof of the PAC-Bayes theorem, and L2 covering numbers (with Lp norm constraints and relative entropy constraints). 
Our bounds are often tighter than previous results and our proofs are all under this more uni\ufb01ed methodology.\n\nOur proof techniques, reminiscent of those used for deriving regret bounds for online learning algorithms, are rooted in convex duality (following Meir and Zhang [2003]) and use a more general notion of strong convexity (as in Shalev-Shwartz and Singer [2006]). Interestingly, the risk bounds we provide closely match the regret bounds for online learning algorithms (up to a constant factor of 2), thus showing that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms (for linear prediction). The Discussion provides this more detailed comparison.\n\n1.1 Related Work\n\nA staggering number of results have focused on this problem in varied special cases. Perhaps the most extensively studied are margin bounds for the 0-1 loss. For L2-margins (relevant for SVMs, perceptron based algorithms, etc.), the sharpest bounds are those provided by Bartlett and Mendelson [2002] (using Rademacher complexities) and Langford and Shawe-Taylor [2003], McAllester [2003] (using the PAC-Bayes theorem). For L1-margins (relevant for boosting, winnow, etc.), bounds are provided by Schapire et al. [1998] (using a self-contained analysis) and Langford et al. [2001] (using PAC-Bayes, with a different analysis). Another active line of work is on sparse methods, particularly methods which impose sparsity via L1 regularization (in lieu of the non-convex L0 norm). For L1 regularization, Ng [2004] provides generalization bounds for this case, which follow from the covering number bounds of Zhang [2002]. 
However, these bounds are only stated as polynomial in the relevant quantities (the precise dependencies are not provided).\n\nPrior to this work, the most uni\ufb01ed framework for providing generalization bounds for linear prediction stems from the covering number bounds in Zhang [2002]. Using these covering number bounds, Zhang [2002] derives margin bounds in a variety of cases. However, providing sharp generalization bounds for problems with L1 regularization (or L1 constraints in the dual) requires more delicate arguments. As mentioned, Ng [2004] provides bounds for this case, but the techniques used by Ng [2004] would result in rather loose dependencies (the dependence on the sample size n would be n^{\u22121/4} rather than n^{\u22121/2}). We discuss this later in Section 4.\n\n2 Preliminaries\n\nOur input space, X , is a subset of a vector space, and our output space is Y. Our samples (X, Y ) \u2208 X \u00d7 Y are distributed according to some unknown distribution P . The inner product between vectors x and w is denoted by hw, xi, where w \u2208 S (here, S is a subset of the dual space to our input vector space). A norm of a vector x is denoted by kxk, and the dual norm is de\ufb01ned as kwk\u22c6 = sup{hw, xi : kxk \u2264 1}. We further assume that for all x \u2208 X , kxk \u2264 X.\n\nLet \u2113 : R \u00d7 Y \u2192 R+ be our loss function of interest. Throughout we shall consider linear predictors of the form hw, xi. The expected loss of w is denoted by L(w) = E[\u2113(hw, xi , y)]. As usual, we are provided with a sequence of i.i.d. samples {(xi, yi)}_{i=1}^n, and our goal is to minimize our expected loss. We denote the empirical loss as \u02c6L(w) = (1/n) \u03a3_{i=1}^n \u2113(hw, xii , yi).\n\nThe restriction we make on our complexity function F is that it is a strongly convex function. In particular, we assume it is strongly convex with respect to our dual norm: a function F : S \u2192 R is said to be \u03c3-strongly convex w.r.t. 
to k \u00b7 k\u2217 iff \u2200u, v \u2208 S, \u2200\u03b1 \u2208 [0, 1], we have\n\nF (\u03b1u + (1 \u2212 \u03b1)v) \u2264 \u03b1F (u) + (1 \u2212 \u03b1)F (v) \u2212 (\u03c3/2) \u03b1(1 \u2212 \u03b1) ku \u2212 vk^2_\u2217 .\n\nSee Shalev-Shwartz and Singer [2006] for more discussion on this generalized de\ufb01nition of strong convexity.\n\nRecall the de\ufb01nition of the Rademacher and Gaussian complexity of a function class F :\n\nRn(F) = E[ sup_{f \u2208 F} (1/n) \u03a3_{i=1}^n \u01ebi f (xi) ] ,    Gn(F) = E[ sup_{f \u2208 F} (1/n) \u03a3_{i=1}^n \u01ebi f (xi) ]\n\nwhere, in the former, \u01ebi independently takes values in {\u22121, +1} with equal probability, and, in the latter, the \u01ebi are independent, standard normal random variables. In both expectations, (x1, . . . , xn) are i.i.d.\n\nAs mentioned in the Introduction, there are a number of methods in the literature to use Rademacher complexities to obtain either generalization bounds or margin bounds. Two results are particularly useful to us. First, Bartlett and Mendelson [2002] provides the following generalization bound for Lipschitz loss functions. Here, L(f ) = E[\u2113(f (x), y)] is the expected loss of f : X \u2192 R, and \u02c6L(f ) = (1/n) \u03a3_{i=1}^n \u2113(f (xi), yi) is the empirical loss.\n\nTheorem 1. (Bartlett and Mendelson [2002]) Assume the loss \u2113 is Lipschitz (with respect to its \ufb01rst argument) with Lipschitz constant L\u2113 and that \u2113 is bounded by c. For any \u03b4 > 0 and with probability at least 1 \u2212 \u03b4 simultaneously for all f \u2208 F , we have that\n\nL(f ) \u2264 \u02c6L(f ) + 2L\u2113 Rn(F) + c \u221a(log(1/\u03b4)/(2n))\n\nwhere Rn(F) is the Rademacher complexity of the function class F , and n is the sample size.\n\nThe second result, for binary prediction, from Koltchinskii and Panchenko [2002] provides a margin bound in terms of the Rademacher complexity. 
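For intuition, the Rademacher complexity of a concrete L2-constrained linear class can be estimated by Monte Carlo, since for each fixed sign vector the supremum over the L2 ball has a closed form via Cauchy-Schwarz. The sketch below is illustrative only; the data, trial count, and normalization are assumptions, and the comparison bound XW2/sqrt(n) is the L2 instance of the complexity bounds derived in Section 3.

```python
import numpy as np

def empirical_rademacher_l2(X, W2, trials=2000, seed=0):
    """Monte Carlo estimate of R_n for {x -> <w, x> : ||w||_2 <= W2}.

    For a fixed sign vector eps, sup_{||w||_2 <= W2} (1/n) sum_i eps_i <w, x_i>
    equals (W2/n) * ||sum_i eps_i x_i||_2 (Cauchy-Schwarz, attained at the
    rescaled sum), so each trial needs only a norm computation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
    sups = W2 / n * np.linalg.norm(eps @ X, axis=1)
    return float(sups.mean())

rng = np.random.default_rng(1)
n, d = 400, 10
Xd = rng.normal(size=(n, d))
Xd /= np.linalg.norm(Xd, axis=1, keepdims=True)  # enforce ||x_i||_2 = X = 1
W2 = 1.0
est = empirical_rademacher_l2(Xd, W2)
bound = 1.0 * W2 / np.sqrt(n)  # X * W2 / sqrt(n)
```

On data like this the estimate typically lands just below the bound, reflecting that the L2 bound is tight up to the Jensen gap between E||S|| and sqrt(E||S||^2).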
The following is a variant of Theorem 2 in Koltchinskii and Panchenko [2002]:\n\nTheorem 2. (Koltchinskii and Panchenko [2002]) The zero-one loss function is given by \u2113(f (x), y) = 1[yf (x) \u2264 0], where y \u2208 {+1, \u22121}. Denote the fraction of the data having \u03b3-margin mistakes by K\u03b3(f ) := |{i : yif (xi) < \u03b3}|/n. Assume that \u2200f \u2208 F we have sup_{x\u2208X} |f (x)| \u2264 C. Then, with probability at least 1 \u2212 \u03b4 over the sample, for all margins \u03b3 > 0 and all f \u2208 F we have\n\nL(f ) \u2264 K\u03b3(f ) + 4Rn(F)/\u03b3 + \u221a(log(log2(4C/\u03b3))/n) + \u221a(log(1/\u03b4)/(2n)) .\n\n(We provide a proof in the appendix.) The above results show that if we provide sharp bounds on the Rademacher complexities then we obtain sharp generalization bounds. Typically, we desire upper bounds on the Rademacher complexity that decrease with n.\n\n3 Complexities of Linear Function Classes\n\nGiven a subset W \u2286 S, de\ufb01ne the associated class of linear functions FW as FW := {x 7\u2192 hw, xi : w \u2208 W}. Our main theorem bounds the complexity of FW for certain sets W.\n\nTheorem 3. (Complexity Bounds) Let S be a closed convex set and let F : S \u2192 R be \u03c3-strongly convex w.r.t. k \u00b7 k\u2217 s.t. inf_{w\u2208S} F (w) = 0. Further, let X = {x : kxk \u2264 X}. De\ufb01ne W = {w \u2208 S : F (w) \u2264 W^2_\u2217}. Then, we have\n\nRn(FW ) \u2264 XW\u2217 \u221a(2/(\u03c3n)) ,    Gn(FW ) \u2264 XW\u2217 \u221a(2/(\u03c3n)) .\n\nThe restriction inf_{w\u2208S} F (w) = 0 is not a signi\ufb01cant one since adding a constant to F still keeps it strongly convex. Interestingly, the complexity bounds above precisely match the regret bounds for online learning algorithms (for linear prediction), a point which we return to in the Discussion. We \ufb01rst provide a few examples, before proving this result.\n\n3.1 Examples\n\n(1) Lp/Lq norms. Let S = Rd. 
Take k\u00b7k, k\u00b7k\u2217 to be the Lp, Lq norms for p \u2208 [2, \u221e), 1/p + 1/q = 1, where kxkp := (\u03a3_{j=1}^d |xj|^p)^{1/p}. Choose F (w) = kwk^2_q and note that it is 2(q\u22121)-strongly convex on Rd w.r.t. itself. Set X , W as in Theorem 3. Then, we have\n\nRn(FW ) \u2264 XW\u2217 \u221a((p \u2212 1)/n) .    (3)\n\n(2) L\u221e/L1 norms. Let S = {w \u2208 Rd : kwk1 = W1 , wj \u2265 0} be the W1-scaled probability simplex. Take k \u00b7 k, k \u00b7 k\u2217 to be the L\u221e, L1 norms, kxk\u221e = max_{1\u2264j\u2264d} |xj|. Fix a probability distribution \u00b5 > 0 and let F (w) = entro\u00b5(w) := \u03a3_j (wj/W1) log(wj/(W1\u00b5j)). For any \u00b5, entro\u00b5(w) is 1/W^2_1-strongly convex on S w.r.t. k \u00b7 k1. Set X as in Theorem 3 and let W(E) = {w \u2208 S : entro\u00b5(w) \u2264 E}. Then, we have\n\nRn(FW(E)) \u2264 XW1 \u221a(2E/n) .    (4)\n\nNote that if we take \u00b5 to be the uniform distribution then for any w \u2208 S we have the trivial upper bound entro\u00b5(w) \u2264 log d. Hence, if we let W := W(log d) with uniform \u00b5, then W is the entire scaled probability simplex, and\n\nRn(FW ) \u2264 XW1 \u221a((2 log d)/n) .    (5)\n\nThe restriction wj \u2265 0 can be removed in the de\ufb01nition of S by the standard trick of doubling the dimension of x to include negated copies of each coordinate. So, if we have S = {w \u2208 Rd : kwk1 \u2264 W1} and we set X as above and W = S, then we get Rn(FW ) \u2264 XW1 \u221a(2 log(2d)/n).\n\nIn this way, even though the L1 norm is not strongly convex (so our previous theorem does not directly apply to it), the class of functions imposed by this L1 norm restriction is equivalent to that imposed by the above entropy restriction. Hence, we are able to analyze the generalization properties of the optimization problem in Equation 2.\n\n(3) Smooth norms. 
A norm is (2, D)-smooth on S if for any x, y \u2208 S,\n\n(d^2/dt^2) kx + tyk^2 \u2264 2D^2 kyk^2 .\n\nLet k\u00b7k be a (2, D)-smooth norm and k\u00b7k\u2217 be its dual. Lemma 11 in the appendix proves that k\u00b7k\u2217 is 2/D^2-strongly convex w.r.t. itself. Set X , W as in Theorem 3. Then, we have\n\nRn(FW ) \u2264 XW\u2217D/\u221an .    (6)\n\n(4) Bregman divergences. For a strongly convex F , de\ufb01ne the Bregman divergence \u2206F (wkv) := F (w) \u2212 F (v) \u2212 h\u2207F (v), w \u2212 vi. It is interesting to note that Theorem 3 is still valid if we choose W\u2217 = {w \u2208 S : \u2206F (wkv) \u2264 W^2_\u2217} for some \ufb01xed v \u2208 S. This is because the Bregman divergence \u2206F (\u00b7kv) inherits the strong convexity of F .\n\nExcept for (5), none of the above bounds depend explicitly on the dimension of the underlying space and hence can be easily extended to in\ufb01nite dimensional spaces under appropriate assumptions.\n\n3.2 The Proof\n\nFirst, some background on convex duality is in order. The Fenchel conjugate of F : S \u2192 R is de\ufb01ned as:\n\nF \u2217(\u03b8) := sup_{w\u2208S} hw, \u03b8i \u2212 F (w) .\n\nA simple consequence of this de\ufb01nition is the Fenchel-Young inequality:\n\n\u2200\u03b8, w \u2208 S, hw, \u03b8i \u2264 F (w) + F \u2217(\u03b8) .\n\nIf F is \u03c3-strongly convex, then F \u2217 is differentiable and\n\n\u2200\u03b8, \u03b7, F \u2217(\u03b8 + \u03b7) \u2264 F \u2217(\u03b8) + h\u2207F \u2217(\u03b8), \u03b7i + (1/(2\u03c3)) k\u03b7k^2_\u2217 .    (7)\n\nSee the Appendix in Shalev-Shwartz [2007] for a proof. Using this inequality we can control the expectation of F \u2217 applied to a sum of independent random variables.\n\nLemma 4. Let S be a closed convex set and let F : S \u2192 R be \u03c3-strongly convex w.r.t. k\u00b7k\u2217. Let Zi be mean zero independent random vectors such that E[kZik^2_\u2217] \u2264 V^2. De\ufb01ne Si := \u03a3_{j\u2264i} Zj . Then F \u2217(Si) \u2212 iV^2/(2\u03c3) is a supermartingale. 
Furthermore, if inf_{w\u2208S} F (w) = 0, then E[F \u2217(Sn)] \u2264 nV^2/(2\u03c3).\n\nProof. Note that inf_{w\u2208S} F (w) = 0 implies F \u2217(0) = 0. Inequality (7) gives\n\nF \u2217(Si) = F \u2217(Si\u22121 + Zi) \u2264 F \u2217(Si\u22121) + h\u2207F \u2217(Si\u22121), Zii + (1/(2\u03c3)) kZik^2_\u2217 .\n\nTaking conditional expectation w.r.t. Z1, . . . , Zi\u22121 and noting that Ei\u22121[Zi] = 0 and Ei\u22121[kZik^2_\u2217] \u2264 V^2, we get\n\nEi\u22121[F \u2217(Si)] \u2264 F \u2217(Si\u22121) + V^2/(2\u03c3)\n\nwhere Ei\u22121[\u00b7] abbreviates E[\u00b7 | Z1, . . . , Zi\u22121]. This establishes the supermartingale property. To end the proof, note that F \u2217(S0) = F \u2217(0) = 0, so iterating the above inequality gives E[F \u2217(Sn)] \u2264 nV^2/(2\u03c3).\n\nLike Meir and Zhang [2003] (see Section 5 therein), we begin by using conjugate duality to bound the Rademacher complexity. To \ufb01nish the proof, we exploit the strong convexity of F by applying the above lemma.\n\nProof. Fix x1, . . . , xn such that kxik \u2264 X. Let \u03b8 = (1/n) \u03a3_i \u01ebi xi where the \u01ebi are i.i.d. Rademacher or Gaussian random variables (our proof only requires that E[\u01ebi] = 0 and E[\u01eb^2_i] = 1). Choose arbitrary \u03bb > 0. By Fenchel's inequality, we have hw, \u03bb\u03b8i \u2264 F (w) + F \u2217(\u03bb\u03b8), which implies\n\nhw, \u03b8i \u2264 F (w)/\u03bb + F \u2217(\u03bb\u03b8)/\u03bb .\n\nSince F (w) \u2264 W^2_\u2217 for all w \u2208 W, we have\n\nsup_{w\u2208W} hw, \u03b8i \u2264 W^2_\u2217/\u03bb + F \u2217(\u03bb\u03b8)/\u03bb .\n\nTaking expectation (w.r.t. the \u01ebi), we get\n\nE[ sup_{w\u2208W} hw, \u03b8i ] \u2264 W^2_\u2217/\u03bb + (1/\u03bb) E[F \u2217(\u03bb\u03b8)] .\n\nNow set Zi = \u03bb\u01ebi xi/n (so that Sn = \u03bb\u03b8) and note that the conditions of Lemma 4 are satis\ufb01ed with V^2 = \u03bb^2X^2/n^2, and hence E[F \u2217(\u03bb\u03b8)] \u2264 \u03bb^2X^2/(2\u03c3n). 
Plugging this above, we have\n\nE[ sup_{w\u2208W} hw, \u03b8i ] \u2264 W^2_\u2217/\u03bb + \u03bbX^2/(2\u03c3n) .\n\nSetting \u03bb = \u221a(2\u03c3nW^2_\u2217/X^2) gives\n\nE[ sup_{w\u2208W} hw, \u03b8i ] \u2264 XW\u2217 \u221a(2/(\u03c3n)) ,\n\nwhich completes the proof.\n\n4 Corollaries\n\n4.1 Risk Bounds\n\nWe now provide generalization error bounds for any Lipschitz loss function \u2113, with Lipschitz constant L\u2113. Based on the Rademacher generalization bound provided in the Introduction (see Theorem 1) and the bounds on Rademacher complexity proved in the previous section, we obtain the following corollaries.\n\nCorollary 5. Each of the following statements holds with probability at least 1 \u2212 \u03b4 over the sample:\n\n\u2022 Let W be as in the Lp/Lq norms example. For all w \u2208 W,\n\nL(w) \u2264 \u02c6L(w) + 2L\u2113XW\u2217 \u221a((p \u2212 1)/n) + L\u2113XW\u2217 \u221a(log(1/\u03b4)/(2n))\n\n\u2022 Let W be as in the L\u221e/L1 norms example. For all w \u2208 W,\n\nL(w) \u2264 \u02c6L(w) + 2L\u2113XW1 \u221a((2 log d)/n) + L\u2113XW1 \u221a(log(1/\u03b4)/(2n))\n\nNg [2004] provides bounds for methods which use L1 regularization. These bounds are only stated as polynomial bounds, and the methods used (covering number techniques from Pollard [1984] and covering number bounds from Zhang [2002]) would provide rather loose bounds (the n dependence would be n^{\u22121/4}). In fact, even a more careful analysis via Dudley's entropy integral using the covering numbers from Zhang [2002] would result in a worse bound (with additional log n factors). The above argument is sharp and rather direct.\n\n4.2 Margin Bounds\n\nIn this section we restrict ourselves to binary classi\ufb01cation where Y = {+1, \u22121}. Our prediction is given by sign(hw, xi). The zero-one loss function is given by \u2113(hw, xi , y) = 1[y hw, xi \u2264 0]. 
Denote the fraction of the data having \u03b3-margin mistakes by K\u03b3(w) := |{i : yi hw, xii < \u03b3}|/n. We now demonstrate how to get improved margin bounds using the upper bounds for the Rademacher complexity derived in Section 3.\n\nBased on the Rademacher margin bound provided in the Introduction (see Theorem 2), we get the following corollary which will directly imply the margin bounds we are aiming for. The bound for the p = 2 case has been used to explain the performance of SVMs. Our bound essentially matches the best known bound [Bartlett and Mendelson, 2002], which was an improvement over previous bounds [Bartlett and Shawe-Taylor, 1999] proved using fat-shattering dimension estimates. For the L\u221e/L1 case, our bound improves the best known bound [Schapire et al., 1998] by removing a factor of \u221a(log n).\n\nCorollary 6. (Lp Margins) Each of the following statements holds with probability at least 1 \u2212 \u03b4 over the sample:\n\n\u2022 Let W be as in the Lp/Lq norms example. For all \u03b3 > 0, w \u2208 W,\n\nL(w) \u2264 K\u03b3(w) + 4 (XW\u2217/\u03b3) \u221a((p \u2212 1)/n) + \u221a(log(log2(4XW\u2217/\u03b3))/n) + \u221a(log(1/\u03b4)/(2n))\n\n\u2022 Let W be as in the L\u221e/L1 norms example. For all \u03b3 > 0, w \u2208 W,\n\nL(w) \u2264 K\u03b3(w) + 4 (XW1/\u03b3) \u221a((2 log d)/n) + \u221a(log(log2(4XW1/\u03b3))/n) + \u221a(log(1/\u03b4)/(2n))\n\nThe following result improves the best known results of the same kind, [Langford et al., 2001, Theorem 5] and [Zhang, 2002, Theorem 7], by removing a factor of \u221a(log n). These results themselves were an improvement over previous results obtained using fat-shattering dimension estimates.\n\nCorollary 7. (Entropy Based Margins) Let X be such that for all x \u2208 X , kxk\u221e \u2264 X. Consider the class W = {w \u2208 Rd : kwk1 \u2264 W1}. Fix an arbitrary prior \u00b5. 
We have that with probability at least 1 \u2212 \u03b4 over the sample, for all margins \u03b3 > 0 and all weight vectors w \u2208 W,\n\nL(w) \u2264 K\u03b3(w) + 8.5 (XW1/\u03b3) \u221a((entro\u00b5(w) + 2.5)/n) + \u221a(log(log2(4XW1/\u03b3))/n) + \u221a(log(1/\u03b4)/(2n))\n\nwhere entro\u00b5(w) := \u03a3_i (|wi|/kwk1) log(|wi|/(\u00b5i kwk1)).\n\nProof. Provided in the appendix.\n\n4.3 PAC-Bayes Theorem\n\nWe now show that (a form of) the PAC-Bayesian theorem [McAllester, 1999] is a consequence of Theorem 3. In the PAC-Bayesian setting, we have a set of hypotheses (possibly in\ufb01nite) C. We choose some prior distribution \u00b5 over this hypothesis set and, after observing the training data, we choose an arbitrary posterior \u03bd. The loss we are interested in is \u2113\u03bd(x, y) = E_{c\u223c\u03bd} \u2113(c, x, y), which is the expected loss when a hypothesis c \u2208 C is drawn according to the distribution \u03bd. Note that in this section we are considering a more general form of the loss.\n\nThe key observation is that we can view \u2113\u03bd(x, y) as the inner product hd\u03bd(\u00b7), \u2113(\u00b7, x, y)i between the measure d\u03bd(\u00b7) and the loss \u2113(\u00b7, x, y). This leads to the following straightforward corollary.\n\nCorollary 8. (PAC-Bayes) For a \ufb01xed prior \u00b5 over the hypothesis set C, and any loss bounded by 1, with probability at least 1 \u2212 \u03b4 over the sample, simultaneously for all choices of posterior \u03bd over C we have\n\nL\u03bd \u2264 \u02c6L\u03bd + 4.5 \u221a(max{KL(\u03bdk\u00b5), 2}/n) + \u221a(log(1/\u03b4)/(2n)) .    (8)\n\nProof. Provided in the appendix.\n\nInterestingly, this result is an improvement over the original statement, in which the last term was \u221a(log(n/\u03b4)/n). Our bound removes this extra log(n) factor, so, in the regime where we \ufb01x \u03bd and examine large n, this bound is sharper. 
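The relative entropy quantity entro_mu(w) appearing in Corollary 7 is just the KL divergence between the normalized weight magnitudes and the prior mu, so it is straightforward to compute; a minimal sketch (the particular vectors and dimension are illustrative assumptions):

```python
import numpy as np

def entro(w, mu):
    """entro_mu(w) = sum_i (|w_i|/||w||_1) * log(|w_i| / (mu_i * ||w||_1)).

    This equals KL(p || mu) for p_i = |w_i|/||w||_1, hence it is nonnegative,
    and for uniform mu it is at most log(d) (the bound used for Equation 5).
    """
    p = np.abs(w) / np.abs(w).sum()
    mask = p > 0  # 0 * log(0) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / mu[mask])))

d = 8
mu = np.full(d, 1.0 / d)           # uniform prior
w = np.array([3.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.5])
val = entro(w, mu)
```

Sparser weight vectors give smaller entro_mu(w) under a uniform prior, which is how the entropy-based margin bound rewards sparsity without invoking the L0 norm.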
We note that our goal was not to prove the PAC-Bayes theorem, and we have made little attempt to optimize the constants.\n\n4.4 Covering Number Bounds\n\nIt is worth noting that using Sudakov's minoration results we can obtain upper bounds on the L2 (and hence also L1) covering numbers using the Gaussian complexities. The following is a direct corollary of the Sudakov minoration theorem for Gaussian complexities (Theorem 3.18, page 80 of Ledoux and Talagrand [1991]).\n\nCorollary 9. Let FW be the function class from Theorem 3. There exists a universal constant K > 0 such that its L2 covering number is bounded as follows:\n\n\u2200\u01eb > 0,    log(N2(FW , \u01eb, n)) \u2264 2K^2X^2W^2_\u2217/(\u03c3\u01eb^2) .\n\nThis bound is sharper than those that could be derived from the N\u221e covering number bounds of Zhang [2002].\n\n5 Discussion: Relations to Online, Regret Minimizing, Algorithms\n\nIn this section, we make the further assumption that the loss \u2113(hw, xi , y) is convex in its \ufb01rst argument. We now show that, in the online setting, the regret bounds for linear prediction closely match our risk bounds. The algorithm we consider performs the update\n\nwt+1 = \u2207F^{\u22121}(\u2207F (wt) \u2212 \u03b7\u2207w \u2113(hwt, xti , yt))    (9)\n\nThis algorithm captures gradient updates, multiplicative updates, and updates based on the Lp norms, through appropriate choices of F . See Shalev-Shwartz [2007] for discussion.\n\nFor the algorithm given by the above update, the following theorem is a bound on the cumulative regret. It is a corollary of Theorem 1 in Shalev-Shwartz and Singer [2006] (and also of Corollary 1 in Shalev-Shwartz [2007]), applied to our linear case.\n\nCorollary 10. (Shalev-Shwartz and Singer [2006]) Let S be a closed convex set and let F : S \u2192 R be \u03c3-strongly convex w.r.t. k \u00b7 k\u2217. Further, let X = {x : kxk \u2264 X} and W = {w \u2208 S : F (w) \u2264 W^2_\u2217}. 
Then, for the update given by Equation 9, if we start with w1 = argmin_{w\u2208S} F (w), we have that for all sequences {(xt, yt)}_{t=1}^n,\n\n\u03a3_{t=1}^n \u2113(hwt, xti , yt) \u2212 min_{w\u2208W} \u03a3_{t=1}^n \u2113(hw, xti , yt) \u2264 L\u2113XW\u2217 \u221a(2n/\u03c3) .\n\nFor completeness, we provide a direct proof in the Appendix. Interestingly, the average regret above is precisely our complexity bound (when L\u2113 = 1). Also, our risk bounds are a factor of 2 worse, essentially due to the symmetrization step used in proving Theorem 1.\n\nReferences\n\nP. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\nP. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classi\ufb01ers. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods \u2013 Support Vector Learning, pages 43\u201354. MIT Press, 1999.\n\nN. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\nV. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classi\ufb01ers. Annals of Statistics, 30(1):1\u201350, 2002.\n\nJ. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems 15, pages 423\u2013430, 2003.\n\nJ. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classi\ufb01ers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 290\u2013297, 2001.\n\nM. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, 1991.\n\nDavid A. McAllester. Simpli\ufb01ed PAC-Bayesian margin bounds. 
In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, pages 203\u2013215, 2003.\n\nDavid A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164\u2013170, 1999.\n\nRon Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839\u2013860, 2003.\n\nA. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.\n\nDavid Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.\n\nR. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651\u20131686, October 1998.\n\nS. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.\n\nS. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.\n\nM. Warmuth and A. K. Jagota. Continuous versus discrete-time non-linear gradient descent: Relative loss bounds and convergence. In Fifth International Symposium on Arti\ufb01cial Intelligence and Mathematics, 1997.\n\nT. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527\u2013550, 2002.\n", "award": [], "sourceid": 501, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Karthik", "family_name": "Sridharan", "institution": null}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": null}]}