{"title": "Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 10803, "page_last": 10813, "abstract": "The developments of Rademacher complexity and PAC-Bayesian theory have been largely independent. One exception is the PAC-Bayes theorem of Kakade, Sridharan, and Tewari (2008), which is established via Rademacher complexity theory by viewing Gibbs classifiers as linear operators. The goal of this paper is to extend this bridge between Rademacher complexity and state-of-the-art PAC-Bayesian theory. We first demonstrate that one can match the fast rate of Catoni's PAC-Bayes bounds (Catoni, 2007) using shifted Rademacher processes (Wegkamp, 2003; Lecu\u00e9 and Mitchell, 2012; Zhivotovskiy and Hanneke, 2018). We then derive a new fast-rate PAC-Bayes bound in terms of the \"flatness\" of the empirical risk surface on which the posterior concentrates. Our analysis establishes a new framework for deriving fast-rate PAC-Bayes bounds and yields new insights on PAC-Bayesian theory.", "full_text": "Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes\n\nShengyang Sun∗\nDepartment of Computer Science\nUniversity of Toronto, Vector Institute\nssy@cs.toronto.edu\n\nJun Yang∗\nDepartment of Statistical Sciences\nUniversity of Toronto, Vector Institute\njun@utstat.toronto.edu\n\nDaniel M. Roy\nDepartment of Statistical Sciences\nUniversity of Toronto, Vector Institute\ndroy@utstat.toronto.edu\n\nAbstract\n\nThe developments of Rademacher complexity and PAC-Bayesian theory have been largely independent. One exception is the PAC-Bayes theorem of Kakade, Sridharan, and Tewari [21], which is established via Rademacher complexity theory by viewing Gibbs classifiers as linear operators. The goal of this paper is to extend this bridge between Rademacher complexity and state-of-the-art PAC-Bayesian theory. 
We first demonstrate that one can match the fast rate of Catoni's PAC-Bayes bounds [8] using shifted Rademacher processes [27, 43, 44]. We then derive a new fast-rate PAC-Bayes bound in terms of the “flatness” of the empirical risk surface on which the posterior concentrates. Our analysis establishes a new framework for deriving fast-rate PAC-Bayes bounds and yields new insights on PAC-Bayesian theory.\n\n1 Introduction\n\nPAC-Bayes theory [33, 38] was developed to provide probably approximately correct (PAC) guarantees for supervised learning algorithms whose outputs can be expressed as a weighted majority vote. Its uses have expanded considerably since [3, 6, 14, 17, 18, 28, 39, 40]. See [12, 25, 32] for gentle introductions. Indeed, there has been a surge of interest and work in PAC-Bayes theory and its application to large-scale neural networks, especially towards studying generalization in overparametrized neural networks trained by variants of gradient descent [9–11, 30, 36, 37].\nPAC-Bayes bounds are one of several tools available for the study of the generalization and risk properties of learning algorithms. One advantage of the PAC-Bayes framework is its ease of use: one can obtain high-probability risk bounds for arbitrary (“posterior”) Gibbs classifiers provided one can compute or bound relative entropies with respect to some fixed (“prior”) Gibbs classifier. Another tool for studying generalization is Rademacher complexity, a distribution-dependent complexity measure for classes of real-valued functions [4, 5, 23, 29, 34, 44].\nThe literatures on PAC-Bayes bounds and on bounds based on Rademacher complexity are essentially disjoint. One point of contact is the work of Kakade, Sridharan, and Tewari [21], which builds the first bridge between PAC-Bayes theory and Rademacher complexity. 
∗These authors contributed equally.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBy viewing Gibbs classifiers as linear operators and relative entropy as a strictly convex regularizer, they were able to use their general Rademacher complexity bounds for strictly convex linear classes to develop a slightly sharper version of McAllester's PAC-Bayes bound [33]. This result offers new insight on PAC-Bayes theory, including potential roles for data-dependent complexity estimates and stability. However, even within the PAC-Bayes community, this result is relatively unknown.\nWhile the PAC-Bayes bound established by Kakade, Sridharan, and Tewari improves on McAllester's bound, it still converges at a slow 1/√m rate, where m denotes the number of data points used to form the empirical risk estimate. This observation raises the question of whether one can match state-of-the-art PAC-Bayes bounds via a Rademacher-process argument. In particular, can one match Catoni's bound [8, Thm. 1.2.6], which can obtain a fast 1/m rate of convergence?\nThere is an extensive literature on the problem of obtaining fast 1/m rates of convergence for the generalization error of (approximate) empirical risk minimization (ERM). Available approaches include the use of local Rademacher complexity [4, 22], shifted empirical processes [27], offset Rademacher complexities [29], and local empirical entropy [44]. See also [16, 19, 20, 26, 31, 35] and [13] for an extensive survey. To date, these techniques have not been connected to PAC-Bayesian theory, which presents the opportunity to obtain new PAC-Bayes theory for ERM.\n\n1.1 Contributions\n\nIn this paper, we extend the bridge between Rademacher process theory and PAC-Bayes theory by constructing new bounds using Rademacher process techniques. 
Among our contributions:\n\ni) We show how to recover Catoni's fast-rate PAC-Bayes bound [8], up to constants, using tail bounds on shifted Rademacher processes, which are special cases of shifted empirical processes [27, 43, 44]; see Section 3.\n\nii) We derive a new fast-rate PAC-Bayes bound, building on our shifted-Rademacher-process approach. This bound is determined by the “flatness” of the empirical risk surface on which the posterior Gibbs classifier concentrates. The notion of “flatness” is inspired by the proposal of Dziugaite and Roy [9] to formalize the empirical connection between “flat minima” and generalization using PAC-Bayes bounds; see Section 4.\n\niii) More generally, we introduce a new approach to deriving fast-rate PAC-Bayes bounds and, in turn, offer new insight on PAC-Bayesian theory.\n\n2 Background\n\nLet D be an unknown distribution over a space Z of labeled examples, and let H be a hypothesis class. Relative to a binary loss function ℓ : H × Z → {0,1}, we define the associated loss class F := {ℓ(h,·) : h ∈ H} of functions from Z to {0,1}, each associated to one or more hypotheses. Let L_D(f) := E_{z∼D} f(z) denote the expected loss, i.e., risk, of every hypothesis associated to f. Let S = (z_1, …, z_m) ∼ D^m be a sequence of i.i.d. random variables. Let ˆL_S(f) = (1/m) ∑_{i=1}^m f(z_i) denote the empirical risk of every hypothesis associated to f.\nWe will be primarily interested in Gibbs classifiers, i.e., distributions P on F, which are interpreted as randomized classifiers that classify each new example according to a hypothesis drawn independently from P. (It is more common to work with distributions over H, but these lead to looser results.) For a Gibbs classifier P and labeled example z ∈ Z, let E_P f(z) = E_{f∼P}[f(z)] be the expected loss P suffers when labeling z. 
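The quantity E_P f(z) just defined is straightforward to compute for a finite-support Gibbs classifier. The following minimal Python sketch evaluates it exactly; the threshold hypotheses, weights, and example are invented for illustration and are not from the paper:

```python
# Toy setup (all names here are illustrative, not from the paper):
# hypotheses are threshold classifiers h_t(x) = 1[x >= t] on the real line,
# and the 0-1 loss of h_t on a labeled example z = (x, y) is 1[h_t(x) != y].
def zero_one_loss(t, z):
    x, y = z
    return int((x >= t) != bool(y))

# A Gibbs classifier P is a distribution over hypotheses; here, a finite
# mixture over three thresholds. E_P f(z) is the P-average loss on z.
thresholds = [0.2, 0.4, 0.6]
weights = [0.5, 0.3, 0.2]

def gibbs_expected_loss(z):
    # E_P f(z) = sum_t P(t) * loss(t, z)  (exact, since P has finite support)
    return sum(w * zero_one_loss(t, z) for t, w in zip(thresholds, weights))

z = (0.5, 1)  # example with x = 0.5, true label y = 1
print(gibbs_expected_loss(z))  # total mass of thresholds that misclassify z
```

For a posterior with continuous support, the same quantity would instead be estimated by averaging the loss over hypotheses sampled from P.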
For Gibbs classifiers, the (expected) risk is defined to be L_D(P) := E_{f∼P} L_D(f) = E_{z∼D} E_P f(z). The (expected) empirical risk is ˆL_S(P) := E_{f∼P} ˆL_S(f) = (1/m) ∑_{i=1}^m E_P f(z_i).\n\n2.1 PAC-Bayes\n\nThe PAC-Bayes framework [33] provides data-dependent generalization guarantees for Gibbs classifiers. Each bound is specified in terms of a Gibbs classifier P called the prior, as it must be independent of the training sample. The bound then holds for all posterior distributions, i.e., Gibbs classifiers that may be defined in terms of the training sample.\nTheorem 2.1 (PAC-Bayes [33]). For any prior distribution P over F, for any δ ∈ (0,1), with probability at least 1 − δ over draws of training data S ∼ D^m, for all distributions Q over F,\n\nL_D(Q) ≤ ˆL_S(Q) + √( (KL(Q‖P) + log(m/δ)) / (2(m−1)) ).    (1)\n\nNote that in Theorem 2.1 the generalization bound scales as O(m^(−1/2)). Catoni [8] presents a fast-rate PAC-Bayesian bound, in which the generalization bound scales as O(m^(−1)).\nTheorem 2.2 (Fast-Rate PAC-Bayes [8, Thm 1.2.6]). For any prior distribution P over F, for any δ ∈ (0,1) and C > 0, with probability at least 1 − δ over draws of training data S ∼ D^m, for all distributions Q over F,\n\nL_D(Q) ≤ (1/(1 − e^(−C))) [ C ˆL_S(Q) + (KL(Q‖P) + log(1/δ)) / m ].    (2)\n\nBecause C/(1 − e^(−C)) > 1 for any C > 0, the generalization bound in Theorem 2.2 is always bounded below by the empirical risk. Usually, for a distribution Q well trained on the training set, the empirical risk ˆL_S(Q) is small, and therefore the generalization bound is dominated by the KL term. 
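For numerical intuition, the sketch below evaluates the right-hand sides of Eqs. (1) and (2); the empirical risk, KL value, sample sizes, and C are arbitrary illustrative choices, not values from the paper:

```python
import math

def mcallester_bound(emp_risk, kl, m, delta):
    # Right-hand side of Eq. (1): slow O(m^(-1/2)) dependence on the KL term.
    return emp_risk + math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))

def catoni_bound(emp_risk, kl, m, delta, C):
    # Right-hand side of Eq. (2): fast O(1/m) dependence on the KL term,
    # at the price of inflating the empirical risk by C / (1 - e^(-C)) > 1.
    return (C * emp_risk + (kl + math.log(1 / delta)) / m) / (1 - math.exp(-C))

# Illustrative numbers only: small empirical risk, moderate KL divergence.
for m in [1_000, 100_000]:
    mc = mcallester_bound(emp_risk=0.05, kl=100.0, m=m, delta=0.05)
    ca = catoni_bound(emp_risk=0.05, kl=100.0, m=m, delta=0.05, C=1.0)
    print(m, round(mc, 4), round(ca, 4))
```

The comparison illustrates the trade-off discussed above: the fast-rate bound's KL term decays as 1/m, but its empirical-risk term is inflated by the factor C/(1 − e^(−C)).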
Compared to the standard PAC-Bayes bound in Theorem 2.1, where the KL term decreases at a rate O(m^(−1/2)), the KL term of Catoni's bound decreases at a rate O(m^(−1)). For this reason, we say that Catoni's bound achieves a fast rate of convergence. Note that fast-rate bounds can lead to much tighter bounds. Of course, C/(1 − e^(−C)) → 1 as C → 0, but, in that limit, the constants ignored in the asymptotic rate O(m^(−1)) degrade. (See [28] for more discussion.)\n\n2.2 Rademacher Viewpoint\n\nFix a prior Gibbs classifier P on F. Then, for measurable functions g, h, consider the inner product ⟨g, h⟩ = ∫ g(f) h(f) P(df). The key observation of Kakade, Sridharan, and Tewari is that one can view L_D(Q) (resp., ˆL_S(Q)) as the inner product ⟨dQ/dP, L_D(·)⟩ (resp., ⟨dQ/dP, ˆL_S(·)⟩) between the posterior Q, represented by its Radon–Nikodym derivative with respect to P, and the risk (resp., empirical risk), viewed as a measurable function on F. Thus, Gibbs classifiers can be viewed as linear predictors. Using their distribution-independent bounds on the Rademacher complexity of certain classes of linear predictors, Kakade, Sridharan, and Tewari [21] derive a PAC-Bayes bound similar to Theorem 2.1. We refer to this as the “Rademacher viewpoint” on PAC-Bayes.\nWe now summarize their argument in more detail. Let Q(κ) := {Q : KL(Q‖P) ≤ κ}. One can follow the classical steps for controlling the generalization error uniformly over Q(κ) using Rademacher complexity. 
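As a reminder of the object being bounded in this argument, the empirical Rademacher complexity E_ε sup over the class of (1/m) ∑_i ε_i times the function values can, for a small finite class, be estimated by plain Monte Carlo. A toy sketch (the loss matrix is fabricated for illustration):

```python
import random

random.seed(1)

def empirical_rademacher(loss_matrix, n_draws=2000):
    # Monte Carlo estimate of E_eps sup_f (1/m) sum_i eps_i * f(z_i),
    # where loss_matrix[f][i] holds f(z_i) for each f in a finite class.
    m = len(loss_matrix[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [random.choice((-1, 1)) for _ in range(m)]
        total += max(sum(e * row[i] for i, e in enumerate(eps)) / m
                     for row in loss_matrix)
    return total / n_draws

# A tiny {0,1}-valued class evaluated on m = 8 examples (illustrative only).
loss_matrix = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
print(round(empirical_rademacher(loss_matrix), 3))
```

The constrained classes in [21] are infinite, of course; the sketch only illustrates the definition being manipulated.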
Their first step is to connect sup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] to E_S sup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] by the bounded difference inequality (McDiarmid's inequality). In particular, with probability at least 1 − δ,\n\nsup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] ≤ E_S sup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] + √( log(1/δ) / m ).    (3)\n\nThen they apply a symmetrization argument to obtain an upper bound in terms of Rademacher complexity [5]. In particular, recalling that S = (z_1, …, z_m) is our training data,\n\nE_S sup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] ≤ 2 E_S E_ε sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m ε_i E_Q f(z_i),    (4)\n\nwhere {ε_i} are i.i.d. Rademacher random variables, i.e., P(ε_i = +1) = P(ε_i = −1) = 1/2. Their last step is to bound the Rademacher complexity E_S E_ε sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m ε_i E_Q f(z_i), which can be seen as the Rademacher complexity of a linear class with a (strongly) convex constraint [21]. According to [21], the Rademacher complexity in Eq. (4) is of order √(κ/m), which eventually leads to a term of order √(KL(Q‖P)/m) after applying a union-bound argument over κ.\nIn the end, using the above arguments and their sharp bounds on the Rademacher and Gaussian complexities of (constrained) linear classes [21, Thm. 1], Kakade, Sridharan, and Tewari obtain the following PAC-Bayes bound [21, Cor. 
8]: for every prior P over F, with probability at least 1 − δ over draws of training data S ∼ D^m, for all distributions Q over F,\n\nL_D(Q) ≤ ˆL_S(Q) + 4.5 √( max{KL(Q‖P), 2} / m ) + √( log(1/δ) / m ).    (5)\n\nNote that this PAC-Bayes bound has a slow rate of √(1/m), but it slightly improves the term √(log(m/δ)/m) of McAllester's bound [33] to √(log(1/δ)/m).\nSince McAllester's bound is far from the state of the art in PAC-Bayesian theory, this raises the question of whether one can extend the “Rademacher viewpoint” on PAC-Bayes to derive more advanced bounds, such as one matching the fast rate of Catoni's bound.\n\n3 Extending the Rademacher Viewpoint\n\nThere are at least two difficulties in the “Rademacher viewpoint” that prevent fast rates. First, if we connect the generalization error to Rademacher complexity using the bounded difference inequality, a slow-rate term √(log(1/δ)/m) will appear. Second, as is shown by Kakade, Sridharan, and Tewari [21], the standard Rademacher complexity of (constrained) linear classes leads to an upper bound with a slow rate of order O(√(KL(Q‖P)/m)). Therefore, in order to derive fast-rate PAC-Bayes bounds, we need to extend the “Rademacher viewpoint”.\nIn order to obtain fast rates, we work with so-called shifted Rademacher processes, i.e., processes of the form { (1/m) ∑_{i=1}^m ε′_i f(z_i) }_{f∈F}, where the variables {ε′_i} are independent of S, i.i.d., and take two values with equal probability. (These shifted Rademacher variables, {ε′_i}, are not necessarily zero mean. When they take values in {±1}, we obtain a standard Rademacher process.) 
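Concretely, a shifted Rademacher variable ε_i − k takes the two values 1 − k and −1 − k with equal probability, so its mean is −k. A quick sketch (the shift k = 0.25 is an arbitrary illustrative value; in the proofs below the shift is tied to the constants c and c2):

```python
import random

random.seed(2)

def sample_shifted_rademacher(k, n):
    # eps'_i = eps_i - k, where eps_i is a standard Rademacher sign.
    return [random.choice((-1, 1)) - k for _ in range(n)]

k = 0.25  # illustrative shift, chosen arbitrarily for this demo
draws = sample_shifted_rademacher(k, 100_000)
print(sorted(set(draws)))       # exactly two values: -1 - k and 1 - k
print(sum(draws) / len(draws))  # sample mean, close to -k = -0.25
```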
Shifted Rademacher processes are examples of shifted empirical processes [27, 43, 44].\nRecall that Rademacher complexity is the expected value of the supremum of a Rademacher process over a class [5]. In order to get a fast rate, we connect the tail probabilities of the supremum of the generalization error to the tail probabilities of shifted Rademacher processes via a symmetrization-in-deviation argument instead of the symmetrization-in-expectation argument. The key is that we can avoid using the bounded difference inequality by bounding the deviation. This removes the slow-rate term of √(log(1/δ)/m). It remains to bound the deviation of shifted Rademacher processes to get a fast-rate bound of order O(KL(Q‖P)/m).\nIn the following, we demonstrate how the extended “Rademacher viewpoint” via shifted Rademacher processes can be applied to derive a fast-rate PAC-Bayes bound that matches the fast rate of Catoni's bound. Note that, since C/(1 − e^(−C)) > 1 for fixed C > 0 in Catoni's bound in Eq. (2), we can write C/(1 − e^(−C)) = 1 + c for some constant c > 0. Furthermore, note that our goal in this section is not to derive new PAC-Bayes bounds. Therefore, we make no attempt to optimize the constants.\nProposition 3.1 (Matching Catoni's Fast Rate via Shifted Rademacher Processes). For any given c > 0 and prior P over F, there exist constants C_1, C_2, and C_3 such that, with probability at least 1 − δ, for all distributions Q over F,\n\nL_D(Q) ≤ (1 + c) ˆL_S(Q) + C_1 KL(Q‖P)/m + C_2 log(1/δ)/m + C_3 / m.    (6)\n\nOutline of the proof. We wish to emphasize two key differences from the traditional machinery for deriving Rademacher-complexity-based generalization bounds. The complete proof is given in Appendix A.1.\nFix P and let Q(κ) := {Q : KL(Q‖P) ≤ κ} be defined as in Section 2.2. 
Rather than control sup_{Q∈Q(κ)} [L_D(Q) − ˆL_S(Q)] in terms of its expectation via the bounded difference inequality and Rademacher complexity, we bound the tail/deviation of sup_{Q∈Q(κ)} [L_D(Q) − (1 + c) ˆL_S(Q)], thus avoiding the use of the bounded differences inequality altogether. In particular, we can obtain fast rates by bounding this tail in terms of the tail of the supremum of shifted Rademacher processes [27, 43, 44].\nDefine G_κ := {E_Q f(·) : Q ∈ Q(κ)} and, by an abuse of notation, let L_D(g) denote E_{z∼D}[g(z)]. Then we can write sup_{Q∈Q(κ)} [L_D(Q) − (1 + c) ˆL_S(Q)] as sup_{g∈G_κ} [L_D(g) − (1 + c) ˆL_S(g)]. We start by bounding the tail probability P_S( sup_{g∈G_κ} L_D(g) − (1 + c) ˆL_S(g) ≥ t ). For fixed constants c > c2 > 0, let c′ = (c − c2)/(1 + c2) and t′ = t/(2(1 + c2)). Then, by [44, Cor. 1], we have\n\nP_S( sup_{g∈G_κ} L_D(g) − (1 + c) ˆL_S(g) ≥ t ) ≤ 4 P_{S,ε}( sup_{g∈G_κ} ((2 + c′)/2) · (1/m) ∑_{i=1}^m ( ε_i − c′/(2 + c′) ) g(z_i) ≥ t′/2 ).    (7)\n\nLetting ε′_i := ε_i − c′/(2 + c′), one can see that {ε′_i} are i.i.d. “shifted” Rademacher random variables with mean −c′/(2 + c′). For any g ∈ G_κ, there exists Q ∈ Q(κ) such that\n\n(1/m) ∑_{i=1}^m ε′_i g(z_i) = (1/m) ∑_{i=1}^m ε′_i E_Q f(z_i) = E_Q[ (1/m) ∑_{i=1}^m ε′_i f(z_i) ],    (8)\n\nwhich can be viewed as a linear function of Q. 
Further, it can be verified that the set Q(κ) is (strongly) convex. Therefore, sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m ε′_i E_Q f(z_i) is a convex optimization problem. By duality [7, Chp. 5] and, in this particular case, the Legendre transform of the Kullback–Leibler divergence (see, e.g., [18]), we have\n\nsup_{g∈G_κ} (1/m) ∑_{i=1}^m ε′_i g(z_i) = sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m ε′_i E_Q f(z_i) = inf_{λ>0} { κ/λ + (1/λ) log E_P exp( (λ/m) ∑_{i=1}^m ε′_i f(z_i) ) }.    (9)\n\nCombining the shifted symmetrization in deviation in Eq. (7) and the dual problem in Eq. (9), Markov's inequality yields, for every λ > 0,\n\nP_S( sup_{Q∈Q(κ)} E_Q[L_D(f) − (1 + c) ˆL_S(f)] ≥ t ) ≤ 4 e^(κ − λt′/(2 + c′)) E_S E_ε E_P exp( (λ/m) ∑_{i=1}^m ε′_i f(z_i) ).    (10)\n\nWe then exploit the shifted property of ε′_i to bound the expectation term on the right-hand side and obtain fast rates. In particular, we show that, so long as k ≥ log cosh(λ/m) / (λ/m),\n\nE_P E_S E_ε exp( (λ/m) ∑_{i=1}^m (ε_i − k) f(z_i) ) ≤ 1.    (11)\n\nIn our case, k = c′/(2 + c′), which leads to constraints relating λ, c, and c2. In particular, when c = 0, the required condition for the above result, k ≥ log cosh(λ/m) / (λ/m), does not hold. Therefore, this approach obtains fast rates only if c > 0, i.e., only if we shift. Combining Eqs. 
(10) and (11), there exists a constant C′, depending only on c, c2, and δ, such that, with probability at least 1 − δ,\n\nsup_{Q∈Q(κ)} E_Q[L_D(f) − (1 + c) ˆL_S(f)] ≤ (C′/m)(κ + log(4/δ)).    (12)\n\nFinally, we may apply the same union-bound argument as in the proof of [21, Cor. 7] in order to cover all possible values of κ. This completes the proof.\n\n4 New Fast-Rate PAC-Bayes Bound Based on “Flatness”\n\nThe extended “Rademacher viewpoint” on PAC-Bayes provides a new approach for deriving fast-rate PAC-Bayes bounds. In this section, we demonstrate the use of shifted Rademacher processes to derive a new fast-rate PAC-Bayes bound using a notion of “flatness”. This notion is inspired by the proposal of Dziugaite and Roy [9] to formalize the empirical connection between “flat minima” and generalization using PAC-Bayes bounds and, in particular, posterior distributions that concentrate in these “flat minima”.\nDefinition 4.1 (Notion of “Flatness”). For given h ∈ [0,1], the “h-flatness” of Q (w.r.t. S) is\n\n(1/m) ∑_{i=1}^m E_Q[ f(z_i) − (1 + h) E_Q f(z_i) ]².    (13)\n\nOne way to understand this new notion is to observe that, under the zero–one loss, h-flatness can be written as the difference between the empirical risk and a quadratic empirical risk:\n\n(1/m) ∑_{i=1}^m E_Q[ f(z_i) − (1 + h) E_Q f(z_i) ]² = ˆL_S(Q) − ((1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))².    (14)\n\nNote that, for [0,1]-valued (bounded) loss, the equality is replaced by an inequality: the r.h.s. is an upper bound on the l.h.s.\nRemark 4.2. 
To see that optimizing h-flatness prefers “flat minima”, consider the following simplified case: call a posterior Q “completely flat” if f = g on S a.s. when f, g ∼ Q. It can be verified that, if the posterior is “completely flat”, then under the zero–one loss the h-flatness is h² ˆL_S(Q). That is, given a “completely flat” posterior, the h-flatness goes to zero as h → 0. For h > 0, the h-flatness is zero when Q is “completely flat” and ˆL_S(Q) = 0.\nThe following PAC-Bayes theorem establishes favorable bounds for h-flat posteriors:\nTheorem 4.3 (Fast-Rate PAC-Bayes using “Flatness”). For any given c > 0 and h ∈ (0,1), with probability at least 1 − δ over random draws of the training set S ∼ D^m, for all distributions Q over F,\n\nL_D(Q) ≤ ˆL_S(Q) + (c/m) ∑_{i=1}^m E_Q[ f(z_i) − (1 + h) E_Q f(z_i) ]² + (4/(Cm)) [ 3 KL(Q‖P) + log(1/δ) + 5 ],    (15)\n\nwhere C = 2h⁴c / (1 + 16h²c).\nThis bound can be tighter than Catoni's bound under certain conditions. We delay the comparison with Catoni's bound to Section 4.1. We now give an outline of the proof of Theorem 4.3, highlighting the technical differences from the proof of Proposition 3.1. The complete proof is given in Appendix A.2.\n\nOutline of the proof of Theorem 4.3. By Eq. (14), we can write\n\nL_D(Q) − ˆL_S(Q) − (c/m) ∑_{i=1}^m E_Q[ f(z_i) − (1 + h) E_Q f(z_i) ]² = L_D(Q) − (1 + c) ˆL_S(Q) + (c(1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))².    (16)\n\nThere are at least two new challenges compared with the proof of Proposition 3.1. First, the shifted symmetrization in Eq. (7) cannot be applied because of the presence of the quadratic term (c(1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))². 
This means we need to derive a new shifted symmetrization involving the quadratic term. Second, the quadratic term (c(1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))² cannot be seen as a linear function of Q. Therefore, some technical arguments are required in order to apply the Legendre transform of the Kullback–Leibler divergence.\nFirst, we derive a new shifted symmetrization that involves quadratic terms. The proof is inspired by an argument due to Zhivotovskiy and Hanneke [44]. The result extends [44, Cor. 1], which is recovered as a special case when h = 1. For κ > 0, recall that we have defined Q(κ) = {Q : KL(Q‖P) ≤ κ} and G_κ = {E_Q f(·) : Q ∈ Q(κ)}. Then, for any g ∈ G_κ, there exists a Q ∈ Q(κ) such that g = E_Q f(·). We first show a tail bound: for any given c2 > 0 and g ∈ G_κ, if t ≥ (1 + c2)(1 + c2h²) / (m c2 h²), then\n\nP_S( L_D(g) − (1 + c2) ˆL_S(g) + c2(1 − h²) ˆL_S(g²) ≥ t ) ≤ 1/2.    (17)\n\nThen, consider another independent random data set S′ = {z′_1, …, z′_m} ∼ D^m. For c > c2, by taking the difference of L_D(g) − (1 + c) ˆL_S(g) + c(1 − h²) ˆL_S(g²) and L_D(g) − (1 + c2) ˆL_{S′}(g) + c2(1 − h²) ˆL_{S′}(g²) and using Eq. 
(17), we obtain\n\n(1/4) P_S( sup_{g∈G_κ} L_D(g) − (1 + c) ˆL_S(g) + c(1 − h²) ˆL_S(g²) ≥ t )    (18)\n\n≤ (1/2) P_{S,S′}( sup_{g∈G_κ} (1 + c2) ˆL_{S′}(g) − c2(1 − h²) ˆL_{S′}(g²) − (1 + c) ˆL_S(g) + c(1 − h²) ˆL_S(g²) ≥ t/2 ).    (19)\n\nNow, by writing (1 + c2) ˆL_{S′}(g) − c2(1 − h²) ˆL_{S′}(g²) − (1 + c) ˆL_S(g) + c(1 − h²) ˆL_S(g²) as\n\n(1 + (c + c2)/2)( ˆL_{S′}(g) − ˆL_S(g) ) − ((c + c2)/2)(1 − h²)( ˆL_{S′}(g²) − ˆL_S(g²) ) − ((c − c2)/2) ˆL_S( g − (1 − h²)g² ) − ((c − c2)/2) ˆL_{S′}( g − (1 − h²)g² ),    (20)\n\none can apply the symmetrization argument to get\n\n(1/4) P_S( sup_{g∈G_κ} L_D(g) − (1 + c) ˆL_S(g) + c(1 − h²) ˆL_S(g²) ≥ t ) ≤ P_{S,ε}( sup_{g∈G_κ} (1/m) ∑_{i=1}^m ε_i( (1 + c′) g(z_i) − c′(1 − h²) g²(z_i) ) − c″ ˆL_S( g − (1 − h²)g² ) ≥ t/4 ),    (21)\n\nwhere c′ = (c + c2)/2 and c″ = (c − c2)/2. 
Therefore, we have derived a new shifted symmetrization in deviation involving a quadratic term.\nRecalling the definition of G_κ, we have\n\nsup_{g∈G_κ} (1/m) ∑_{i=1}^m ε_i( (1 + c′) g(z_i) − c′(1 − h²) g(z_i)² ) − c″ ˆL_S( g − (1 − h²)g² ) = sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m [ ε_i(1 + c′) − c″ ] E_Q f(z_i) − [ ε_i c′ − c″ ](1 − h²)[E_Q f(z_i)]².    (22)\n\nNote that there are two shifted Rademacher random variables, ε_i(1 + c′) − c″ and ε_i c′ − c″, which involve not only a shift term −c″ but also scale factors (1 + c′) and c′, respectively. Furthermore, the term [E_Q f(z_i)]² cannot be seen as a linear function of Q. This prevents the use of the key argument in [21], which formulates an upper bound using Rademacher complexities of constrained linear classes by viewing the generalization error as a linear function of Q.\nIn order to sidestep this obstruction, define ε := {ε_i}_{i=1}^m, z := {z_i}_{i=1}^m, and suppose ˆQ(ε, z) achieves the supremum above. 
(If the supremum cannot be achieved, one can use a carefully chosen sequence {ˆQ_i(ε, z)} to prove the same statement, as the supremum can be approximated arbitrarily closely.) The following inequality then holds:\n\nsup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m [ ε_i(1 + c′) − c″ ] E_Q f(z_i) − [ ε_i c′ − c″ ](1 − h²)[E_Q f(z_i)]² ≤ sup_{Q∈Q(κ)} (1/m) ∑_{i=1}^m [ ε_i(1 + c′) − c″ ] E_Q f(z_i) − [ ε_i c′ − c″ ](1 − h²) E_Q f(z_i) E_{ˆQ(ε,z)} f(z_i).    (23)\n\nTo see this, note that, on the one hand, if we plug in Q = ˆQ(ε, z), the inequality is tight; on the other hand, by definition, Q = ˆQ(ε, z) already achieves the supremum of the l.h.s. Note that the r.h.s. can be seen as a linear function of Q, because ˆQ(ε, z) is a random variable that does not depend on Q.\nLet ε″_i := ε_i c′ − c″ = ε_i (c + c2)/2 − (c − c2)/2. Then, by keeping the term ˆQ(ε, z) fixed, one can apply the convex conjugate of the relative entropy to get\n\nP( sup_{Q∈Q(κ)} L_D(Q) − (1 + c) ˆL_S(Q) + (c(1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))² ≥ t ) ≤ 4 exp( κ − λt/4 ) E_S E_ε E_P exp( (λ/m) ∑_{i=1}^m f(z_i)[ (ε_i + ε″_i) − ε″_i(1 − h²) E_{ˆQ(ε,z)} f(z_i) ] ).    (24)\n\nTherefore, the problem turns to bounding the expectation of a function involving shifted Rademacher processes. 
Although the expectation looks quite complicated, since it involves two scaled and shifted Rademacher variables as well as the unknown ˆQ(ε, z), fortunately we are able to show that, for any random variables Y_i ∈ [0,1], we have\n\nE_S E_ε E_P exp( (λ/m) ∑_{i=1}^m f(z_i)[ (ε_i + ε″_i) − ε″_i(1 − h²) Y_i ] ) ≤ 1,    (25)\n\nif h ∈ (0,1], 1 > h²c > c2 > 0, and 0 < λ/m < C = (h²c − c2) / (2(1 + h²c)(1 + c2)). This result removes the term ˆQ(ε, z) by letting Y_i = E_{ˆQ(ε,z)} f(z_i). Finally, we combine different values of κ by a union-bound argument similar to that in the proof of Proposition 3.1 to complete the proof.\n\n4.1 Comparison with Catoni's Bound\n\nAs we have shown in Proposition 3.1, using shifted Rademacher processes we can match Catoni's fast-rate PAC-Bayesian bound (Theorem 2.2) up to constants. We have also presented a new fast-rate PAC-Bayes bound based on “flatness”. Although both our bound and Catoni's bound exhibit fast O(1/m) rates of convergence, our bound can exploit flatness in the posterior distribution.\nIn particular, our PAC-Bayes bound based on flatness (Eq. (15)) can be much tighter than Catoni's bound (Eq. (6)) when the posterior is chosen to concentrate on a “flat minimum” where the empirical risk ˆL_S(Q) is nonzero. It can be verified that the “flatness” term (c/m) ∑_{i=1}^m E_Q[ f(z_i) − (1 + h) E_Q f(z_i) ]² in Eq. (15) is smaller than the excess empirical risk term c ˆL_S(Q) when ((1 − h²)/m) ∑_{i=1}^m (E_Q f(z_i))² is greater than zero, which is precisely when the empirical risk is greater than zero. (See Eq. 
(14).)

Based on this observation, we expect our bound to be tighter for sufficiently flat posteriors, nonzero empirical risk, and sufficient training data. To see this, note that Catoni’s bound has the form $(1+c_c)\hat{L}_S(Q) + \frac{C_c}{m}\left(\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}\right)$, while our bound based on Eq. (14) can be written $(1+c_r)\hat{L}_S(Q) - \frac{c_r(1-h^2)}{m}\sum_{i=1}^m \left(E_Q f(z_i)\right)^2 + \frac{C_r}{m}\left(\mathrm{KL}(Q\|P) + \log\frac{1}{\delta} + 1\right)$. Here $c_c, c_r$ inflate the empirical risk and $C_c, C_r$ are constants. Let $T_m$ be $\frac{c_r(1-h^2)}{m}\sum_{i=1}^m \left(E_Q f(z_i)\right)^2$. Note that $c_c$ and $c_r$ must be fixed before seeing the data. Assuming we equate the inflation of the empirical risk, i.e., $c_c = c_r$, the proposed bound is tighter than Catoni’s bound provided $m > \frac{1}{T_m}\left((C_r - C_c)\left(\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}\right) + C_r\right)$. If $T_m$ converges to a positive number (a reasonable assumption), then our proposed bound will be tighter for sufficiently many samples. If we assume $c_c \ne c_r$, our bound can still be tighter than Catoni’s bound under more involved conditions.

5 Related Work

There is a large literature on obtaining fast $1/m$ convergence rates for generalization error and excess risk using Rademacher processes and their generalizations [4, 22, 27, 29, 44]. As far as we know, this literature does not connect with the PAC-Bayesian literature. There do exist, however, PAC-Bayesian analyses for specific learning algorithms that achieve fast rates [2, 15, 24]. These specific analyses do not lead to general PAC-Bayes bounds, like those produced by Catoni [8].

Our new PAC-Bayes bound based on flatness bears a superficial resemblance to a number of bounds in the literature. However, our notion of flatness is not related to the variance of the randomized classifier caused by the randomness of the observed data.
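A toy computation (ours, not from the paper) illustrates the distinction: take $Q$ to be a point mass on a single classifier whose zero–one losses alternate over the sample. Then the flatness term $\frac{1}{m}\sum_{i=1}^m E_Q\left[f(z_i) - E_Q f(z_i)\right]^2$ is exactly zero, while the empirical variance of the losses, $E_Q \frac{1}{m}\sum_{i=1}^m \left[f(z_i) - \frac{1}{m}\sum_{j=1}^m f(z_j)\right]^2$, equals $1/4$.

```python
import numpy as np

# Toy example (ours): Q is a point mass on one classifier, so E_Q f(z_i) = f(z_i)
# and the flatness term (1/m) sum_i E_Q[f(z_i) - E_Q f(z_i)]^2 vanishes, while
# the losses still vary across examples, making the empirical variance large.
f = np.tile([0.0, 1.0], 50)    # zero-one losses f(z_1), ..., f(z_m), m = 100
EQ_f = f                       # point-mass posterior: E_Q f(z_i) = f(z_i)

flatness = np.mean((f - EQ_f) ** 2)     # (1/m) sum_i E_Q[f(z_i) - E_Q f(z_i)]^2
emp_var = np.mean((f - f.mean()) ** 2)  # E_Q (1/m) sum_i [f(z_i) - mean]^2
print(flatness, emp_var)                # 0.0 0.25
```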
Therefore, our new bound is fundamentally different from existing PAC-Bayes bounds based on this type of variance [15, 24, 41].
For example, Tolstikhin and Seldin [41, Thm. 4] present a generalization bound based on the “empirical variance”, which is distinct from our “flatness”. The “empirical variance” is $E_Q \frac{1}{m}\sum_{i=1}^m \left[f(z_i) - \frac{1}{m}\sum_{i=1}^m f(z_i)\right]^2$, while our “flatness” is $\frac{1}{m}\sum_{i=1}^m E_Q\left[f(z_i) - E_Q f(z_i)\right]^2$. Note that it is possible for flatness to be zero, even when empirical variance is large.
To the best of our knowledge, the closest work to ours in the literature is that by Audibert [2]. The bound given in [2, Thm. 6.1] uses a notion similar to our “flatness”. The bound is, however, not comparable with ours for several reasons. First, [2, Theorem 6.1] holds only for the particular algorithm proposed by Audibert, and so it is not a general PAC-Bayes bound like ours. Second, our notion of “flatness” is empirical, while the “flatness” term in [2, Theorem 6.1] is defined by an expectation over the data distribution, which is often presumed unknown. Finally, the proof techniques used to establish [2, Theorem 6.1] are specialized to the proposed algorithm and not based on the use of Rademacher processes. Our proof techniques via shifted Rademacher processes provide a blueprint for other approaches to deriving fast-rate PAC-Bayes bounds.
Grünwald and Mehta [17] establish new excess risk bounds in terms of a novel complexity measure based on “luckiness” functions. In the setting of randomized classifiers, particular choices of luckiness functions can be related to PAC-Bayesian notions of complexity based on “priors”. Indeed, in this setting, their complexity measure can be bounded in terms of a KL divergence, as in PAC-Bayesian bounds.
In a setting with deterministic classifiers, the authors show that their complexity measure can be bounded in terms of Rademacher complexity. Thus, while their framework connects with both PAC-Bayesian and Rademacher-complexity bounds, it is not immediately clear whether it produces direct connections, as we have accomplished here. It is certainly interesting to consider whether our bounds can be achieved (or surpassed) by an appropriate use of their framework.

6 Conclusion

In this paper, we exploit the connections between modern PAC-Bayesian theory and Rademacher complexities. Using shifted Rademacher processes [27, 43, 44], we derive a novel fast-rate PAC-Bayes bound that depends on the empirical “flatness” of the posterior. Our work provides new insights on PAC-Bayesian theory and opens up new avenues for developing stronger bounds.
It is worth highlighting some potentially interesting directions for further investigation:
We have “rederived” Catoni’s bound via shifted Rademacher processes, up to constants. It is interesting to ask whether the Rademacher approach can dominate the direct PAC-Bayes bound. In the other direction, we have not derived our flatness bound via a direct PAC-Bayes approach. Whether this is possible, and what it achieves, might shed light on the relative strengths of these two distinct approaches to PAC-Bayes bounds. It may also be interesting to pursue PAC-Bayes bounds via some adaptation of Talagrand’s concentration inequalities [42, Ch. 3].
We have derived PAC-Bayes bounds for the zero–one loss. While the extension to bounded loss is straightforward, the problem of extending our approach to unbounded loss relates to a growing body of work on this problem within the PAC-Bayesian framework. (See, for example, [1] and the references therein.)
Whether the Rademacher perspective is helpful in this regard is not yet clear.
There has been a surge of interest in PAC-Bayes bounds and their application to the study of generalization in large-scale neural networks. One promising direction is to consider whether Rademacher-process techniques may aid in the development of PAC-Bayesian analyses of specific algorithms [2, 15, 24], especially when those algorithms are related to large-scale neural networks trained by stochastic gradient descent [30, 36, 37].
It would be interesting to perform a careful empirical study of our flatness bound in the context of large-scale neural networks, in the vein of the work of Dziugaite and Roy [9]. Preliminary work suggests that the posteriors found by PAC-Bayes bound optimization are not flat in our sense. After some investigation, we believe the reason is that optimizing the PAC-Bayes bound results in underfitting, due in part to the distribution-independent prior. It would be interesting to compare various PAC-Bayes bounds under strict constraints on the empirical risk.

Acknowledgments

We would also like to thank Peter Bartlett, Gintare Karolina Dziugaite, Roger Grosse, Yasaman Mahdaviyeh, Zacharie Naulet, and Sasha Rakhlin for helpful discussions. In particular, the authors would like to thank Sasha Rakhlin for introducing us to the work of Kakade, Sridharan, and Tewari [21]. The work benefitted also from constructive feedback from anonymous referees. JY was supported by an Alexander Graham Bell Canada Graduate Scholarship (NSERC CGS D), Ontario Graduate Scholarship (OGS), and Queen Elizabeth II Graduate Scholarship in Science and Technology (QEII-GSST). SS was supported by a Borealis AI Global Fellowship Award, Connaught New Researcher Award, and Connaught Fellowship. DMR was supported by an NSERC Discovery Grant and Ontario Early Researcher Award.

References

[1] P. Alquier and B. Guedj.
\u201cSimpler PAC-Bayesian bounds for hostile data\u201d. Machine Learning\n\n[2]\n\n[3]\n\n107.5 (2018), pp. 887\u2013902.\nJ.-Y. Audibert. \u201cFast learning rates in statistical inference through aggregation\u201d. The Annals\nof Statistics 37.4 (2009), pp. 1591\u20131646.\nJ.-Y. Audibert and O. Bousquet. \u201cCombining PAC-Bayesian and Generic Chaining Bounds\u201d.\nJournal of Machine Learning Research 8 (2007), pp. 863\u2013889.\n\n[4] P. L. Bartlett, O. Bousquet, and S. Mendelson. \u201cLocal Rademacher Complexities\u201d. The Annals\n\nof Statistics 33.4 (2005), pp. 1497\u20131537.\n\n[5] P. L. Bartlett and S. Mendelson. \u201cRademacher and Gaussian complexities: Risk bounds and\n\nstructural results\u201d. Journal of Machine Learning Research 3 (2002), pp. 463\u2013482.\n\n[6] L. B\u00e9gin, P. Germain, F. Laviolette, and J.-F. Roy. \u201cPAC-Bayesian bounds based on the R\u00e9nyi\n\ndivergence\u201d. In: Arti\ufb01cial Intelligence and Statistics. 2016, pp. 435\u2013444.\n\n[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[8] O. Catoni. PAC-Bayesian Supervised Classi\ufb01cation: The Thermodynamics of Statistical\nLearning. Vol. 56. Lecture Notes \u2013 Monograph Series. Institute of Mathematical Statistics,\n2007.\n\n[9] G. K. Dziugaite and D. M. Roy. \u201cComputing Nonvacuous Generalization Bounds for Deep\n(Stochastic) Neural Networks with Many More Parameters than Training Data\u201d. In: Proceed-\nings of the 33rd Annual Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI). 2017.\n\n[10] G. K. Dziugaite and D. M. Roy. \u201cData-dependent PAC-Bayes priors via differential privacy\u201d.\n\nIn: Advances in Neural Information Processing Systems. 2018, pp. 8430\u20138441.\n\n[11] G. K. Dziugaite and D. M. Roy. \u201cEntropy-SGD optimizes the prior of a PAC-Bayes bound:\nGeneralization properties of Entropy-SGD and data-dependent priors\u201d. In: International Con-\nference on Machine Learning. 
2018, pp. 1376–1385.

[12] T. van Erven. PAC-Bayes Mini-tutorial: A Continuous Union Bound. 2014. arXiv: 1405.1580.

[13] T. van Erven, P. D. Grünwald, N. A. Mehta, M. D. Reid, and R. C. Williamson. “Fast Rates in Statistical and Online Learning”. Journal of Machine Learning Research 16 (2015), pp. 1793–1861.

[14] P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien. “PAC-Bayesian theory meets Bayesian inference”. In: Advances in Neural Information Processing Systems. 2016, pp. 1884–1892.

[15] P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and J.-F. Roy. “Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm”. The Journal of Machine Learning Research 16.1 (2015), pp. 787–860.

[16] E. Giné and V. Koltchinskii. “Concentration inequalities and asymptotic results for ratio type empirical processes”. The Annals of Probability 34.3 (2006), pp. 1143–1216.

[17] P. D. Grünwald and N. A. Mehta. “A tight excess risk bound via a unified PAC-Bayesian–Rademacher–Shtarkov–MDL complexity”. In: Algorithmic Learning Theory. 2019, pp. 433–465. arXiv: 1710.07732.

[18] B. Guedj. A primer on PAC-Bayesian learning. 2019. arXiv: 1901.05353.

[19] S. Hanneke. “Refined error bounds for several learning algorithms”. The Journal of Machine Learning Research 17.1 (2016), pp. 4667–4721.

[20] S. Hanneke and L. Yang. “Minimax analysis of active learning”. The Journal of Machine Learning Research 16.1 (2015), pp. 3487–3602.

[21] S. M. Kakade, K. Sridharan, and A. Tewari. “On the complexity of linear prediction: risk bounds, margin bounds, and regularization”. In: Advances in Neural Information Processing Systems. 2008, pp. 793–800.

[22] V. Koltchinskii.
\u201cLocal Rademacher complexities and oracle inequalities in risk minimiza-\n\ntion\u201d. The Annals of Statistics 34.6 (2006), pp. 2593\u20132656.\n\n[23] V. Koltchinskii and D. Panchenko. \u201cEmpirical margin distributions and bounding the gener-\n\nalization error of combined classi\ufb01ers\u201d. The Annals of Statistics 30.1 (2002), pp. 1\u201350.\n\n[24] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. \u201cPAC-Bayes bounds for\nthe risk of the majority vote and the variance of the Gibbs classi\ufb01er\u201d. In: Advances in Neural\nInformation Processing Systems. 2007, pp. 769\u2013776.\nJ. Langford. \u201cTutorial on practical prediction theory for classi\ufb01cation\u201d. Journal of Machine\nLearning Research 6.Mar (2005), pp. 273\u2013306.\n\n[25]\n\n[26] G. Lecu\u00e9 and S. Mendelson. Learning subgaussian classes: Upper and minimax bounds.\n\n2013. arXiv: 1305.4825.\n\n[27] G. Lecu\u00e9 and C. Mitchell. \u201cOracle inequalities for cross-validation type procedures\u201d. Elec-\n\ntronic Journal of Statistics 6 (2012), pp. 1803\u20131837.\n\n[28] G. Lever, F. Laviolette, and J. Shawe-Taylor. \u201cTighter PAC-Bayes bounds through\n\ndistribution-dependent priors\u201d. Theoretical Computer Science 473 (2013), pp. 4\u201328.\n\n[29] T. Liang, A. Rakhlin, and K. Sridharan. \u201cLearning with square loss: Localization through\noffset Rademacher complexity\u201d. In: Conference on Learning Theory. 2015, pp. 1260\u20131285.\n[30] B. London. \u201cA PAC-Bayesian analysis of randomized learning with application to stochastic\ngradient descent\u201d. In: Advances in Neural Information Processing Systems. 2017, pp. 2931\u2013\n2940.\n\n[31] P. Massart and \u00c9. N\u00e9d\u00e9lec. \u201cRisk bounds for statistical learning\u201d. The Annals of Statistics\n\n34.5 (2006), pp. 2326\u20132366.\n\n[32] D. A. McAllester. A PAC-Bayesian Tutorial with A Dropout Bound. 2013. arXiv: 1307.2118.\n[33] D. A. McAllester. 
\u201cPAC-Bayesian Model Averaging\u201d. In: Conference on Learning Theory.\n\n1999, pp. 164\u2013170.\n\n[34] S. Mendelson. \u201cLearning without concentration\u201d. In: Conference on Learning Theory. 2014,\n\npp. 25\u201339.\n\n[35] S. Mendelson. \u201c\u201cLocal\u201d vs. \u201cglobal\u201d parameters\u2013breaking the Gaussian complexity barrier\u201d.\n\nThe Annals of Statistics 45.5 (2017), pp. 1835\u20131862.\n\n[36] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. \u201cExploring generalization in\ndeep learning\u201d. In: Advances in Neural Information Processing Systems. 2017, pp. 5947\u2013\n5956.\n\n[37] B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrally-\n\nnormalized margin bounds for neural networks. 2017. arXiv: 1707.09564.\nJ. Shawe-Taylor and R. C. Williamson. \u201cA PAC Analysis of a Bayesian Estimator\u201d. In: Con-\nference on Learning Theory. 1997, pp. 2\u20139.\n\n[38]\n\n[39] S. L. Smith and Q. V. Le. A Bayesian perspective on generalization and stochastic gradient\n\ndescent. 2017. arXiv: 1710.06451.\n\n[40] N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin. A strongly quasiconvex PAC-Bayesian\n\nbound. 2016. arXiv: 1608.05610.\nI. O. Tolstikhin and Y. Seldin. \u201cPAC-Bayes-empirical-Bernstein inequality\u201d. In: Advances in\nNeural Information Processing Systems. 2013, pp. 109\u2013117.\n\n[41]\n\n[42] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge\n\nUniversity Press, 2019.\n\n[43] M. Wegkamp. \u201cModel selection in nonparametric regression\u201d. The Annals of Statistics 31.1\n\n(2003), pp. 252\u2013273.\n\n[44] N. Zhivotovskiy and S. Hanneke. \u201cLocalization of VC classes: Beyond local Rademacher\n\ncomplexities\u201d. Theoretical Computer Science 742 (2018), pp. 
27–49.