{"title": "PAC-Bayes under potentially heavy tails", "book": "Advances in Neural Information Processing Systems", "page_first": 2715, "page_last": 2724, "abstract": "We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the first three moments of the loss distribution are bounded, the learning algorithm derived from this estimator achieves nearly sub-Gaussian statistical error, up to the quality of the prior.", "full_text": "PAC-Bayes under potentially heavy tails\n\nMatthew J. Holland\n\nOsaka University\n\nInstitute of Scienti\ufb01c and Industrial Research\n\nmatthew-h@ar.sanken.osaka-u.ac.jp\n\nAbstract\n\nWe derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain\na novel optimal Gibbs posterior which enjoys \ufb01nite-sample excess risk bounds\nat logarithmic con\ufb01dence. Our core technique itself makes use of PAC-Bayesian\ninequalities in order to derive a robust risk estimator, which by design is easy\nto compute.\nIn particular, only assuming that the \ufb01rst three moments of the\nloss distribution are bounded, the learning algorithm derived from this estimator\nachieves nearly sub-Gaussian statistical error, up to the quality of the prior.\n\n1\n\nIntroduction\n\nMore than two decades ago, the origins of PAC-Bayesian learning theory were developed with\nthe goal of strengthening traditional PAC learning guarantees1 by explicitly accounting for prior\nknowledge [17, 12, 5]. Subsequent work developed \ufb01nite-sample risk bounds for \u201cBayesian\u201d learning\nalgorithms which specify a distribution over the model [13]. 
These bounds are controlled using\nthe empirical risk and the relative entropy between \u201cprior\u201d and \u201cposterior\u201d distributions, and hold\nuniformly over the choice of the latter, meaning that the guarantees hold for data-dependent posteriors,\nhence the naming. Furthermore, choosing the posterior to minimize PAC-Bayesian risk bounds leads\nto practical learning algorithms which have seen numerous successful applications [2].\nFollowing this framework, a tremendous amount of work has been done to re\ufb01ne, extend, and apply\nthe PAC-Bayesian framework to new learning problems. Tight risk bounds for bounded losses are\ndue to Seeger [15] and Maurer [11], with the former work applying them to Gaussian processes.\nBounds constructed using the loss variance in a Bernstein-type inequality were given by Seldin et al.\n[16], with a data-dependent extension derived by Tolstikhin and Seldin [18]. As stated by McAllester\n[14], virtually all the bounds derived in the original PAC-Bayesian theory \u201conly apply to bounded\nloss functions.\u201d This technical barrier was solved by Alquier et al. [2], who introduce an additional\nerror term depending on the concentration of the empirical risk about the true risk. This technique\nwas subsequently applied to the log-likelihood loss in the context of Bayesian linear regression by\nGermain et al. [9], and further systematized by B\u00e9gin et al. [3]. While this approach lets us deal\nwith unbounded losses, naturally the statistical error guarantees are only as good as the con\ufb01dence\nintervals available for the empirical mean deviations. In particular, strong assumptions on all of\nthe moments of the loss are essentially unavoidable using the traditional tools espoused by B\u00e9gin\net al. [3], which means the \u201cheavy-tailed\u201d regime cannot be handled, where all we assume is that a\nfew higher-order moments are \ufb01nite (say \ufb01nite variance and/or \ufb01nite kurtosis). 
A new technique for\nderiving PAC-Bayesian bounds even under heavy-tailed losses is introduced by Alquier and Guedj\n[1]; their lucid procedure provides error rates even under heavy tails, but as the authors recognize, the\nguarantees are sub-optimal at high con\ufb01dence levels due to direct dependence on the empirical risk,\nleading in turn to sub-optimal algorithms derived from these bounds.2\n\n1PAC: Probably approximately correct [19].\n2See work by Catoni [6], Devroye et al. [8] and the references within for background on the fundamental\n\nlimitations of the empirical mean for real-valued random variables.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn this work, while keeping many core ideas of B\u00e9gin et al. [3] intact, using a novel approach we\nobtain exponential tail bounds on the excess risk using PAC-Bayesian bounds that hold even under\nheavy-tailed losses. Our key technique is to replace the empirical risk with a new mean estimator\ninspired by the dimension-free estimators of Catoni and Giulini [7], designed to be computationally\nconvenient. We review some key theory in section 2 before introducing the new estimator in section\n3. In section 4 we apply this estimator to the PAC-Bayes setting, deriving a new robust optimal Gibbs\nposterior. Empirical inquiries into the properties of the new mean estimator are given in section 5.\nAll proofs are relegated to supplementary materials.\n\n2 PAC-Bayesian theory based on the empirical mean\n\nLet us begin by brie\ufb02y reviewing the best available PAC-Bayesian learning guarantees under general\nlosses. Denote by z1, . . . , zn \u2208 Z a sequence of independent observations distributed according to\ncommon distribution \u00b5. Denote by H a model/hypothesis class, from which the learner selects a\ncandidate based on the n-sized sample. 
The quality of this choice can be measured in a pointwise fashion using a loss function l : H × Z → R, assumed to be l ≥ 0. The learning task is to achieve a small risk, defined by R(h) := E_µ l(h; z). Since the underlying distribution is inherently unknown, the canonical proxy is

$\hat{R}(h) := \frac{1}{n}\sum_{i=1}^{n} l(h; z_i), \quad h \in \mathcal{H}.$

Let ν and ρ respectively denote “prior” and “posterior” distributions on the model H. The so-called Gibbs risk induced by ρ, as well as its empirical counterpart, are given by

$G_\rho := E_\rho R = \int_{\mathcal{H}} R(h)\,d\rho(h), \qquad \hat{G}_\rho := E_\rho \hat{R} = \int_{\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} l(h; z_i)\,d\rho(h).$

When our losses are almost surely bounded, lucid guarantees are available.

Theorem 1 (PAC-Bayes under bounded losses [13, 3]). Assume 0 ≤ l ≤ 1, and fix any arbitrary prior ν on H. For any confidence level δ ∈ (0, 1), we have with probability no less than 1 − δ over the draw of the sample that

$G_\rho \le \hat{G}_\rho + \sqrt{\frac{K(\rho;\nu) + \log(2\sqrt{n}\,\delta^{-1})}{2n}}$

uniformly in the choice of ρ.

Since the “good event” where the inequality in Theorem 1 holds is valid for any choice of ρ, the result holds even when ρ depends on the sample, which justifies calling it a posterior distribution. Optimizing this upper bound with respect to ρ leads to the so-called optimal Gibbs posterior, which takes a form that is readily characterized (cf. Remark 13).

The above results fall apart when the loss is unbounded, and meaningful extensions become challenging when exponential moment bounds are not available. As highlighted in section 1 above, over the years, the analytical machinery has evolved to provide general-purpose PAC-Bayesian bounds even under heavy-tailed data.
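The bounded-loss guarantee of Theorem 1 is directly computable once the empirical Gibbs risk and the relative entropy are in hand. A minimal sketch (the function name and example values here are illustrative, not from the paper):

```python
import math

def pac_bayes_bound(gibbs_emp_risk, kl, n, delta):
    """Right-hand side of Theorem 1: empirical Gibbs risk plus the
    complexity term sqrt((K(rho; nu) + log(2 sqrt(n)/delta)) / (2n)),
    valid for losses bounded in [0, 1]."""
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return gibbs_emp_risk + slack

# Hypothetical example: n = 1000 samples, KL divergence 2.5, 95% confidence.
bound = pac_bayes_bound(gibbs_emp_risk=0.12, kl=2.5, n=1000, delta=0.05)
```

As the theorem suggests, the slack grows with the divergence K(ρ; ν) and shrinks at the 1/√n rate.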
The following theorem of Alquier and Guedj [1] extends the strategy of Bégin et al. [3] to obtain bounds under the weakest conditions we know of.

Theorem 2 (PAC-Bayes under heavy-tailed losses [1]). Take any p > 1 and set q = p/(p − 1). For any confidence level δ ∈ (0, 1), we have with probability no less than 1 − δ over the draw of the sample that

$G_\rho \le \hat{G}_\rho + \left(\frac{E_\nu |\hat{R} - R|^{q}}{\delta}\right)^{\frac{1}{q}} \left(\int_{\mathcal{H}} \left(\frac{d\rho}{d\nu}\right)^{p} d\nu\right)^{\frac{1}{p}}$

uniformly in the choice of ρ.

For concreteness, consider the case of p = 2, where q = 2/(2 − 1) = 2, and assume that the variance of the loss var_µ l(h; z) is ν-finite, namely that

$V_\nu := \int_{\mathcal{H}} \mathrm{var}_\mu\, l(h; z)\, d\nu(h) < \infty.$

From Proposition 4 of Alquier and Guedj [1], we have $E_\nu |\hat{R} - R|^2 \le V_\nu / n$. It follows that on the high-probability event, we have

$G_\rho \le \hat{G}_\rho + \sqrt{\frac{V_\nu}{n\delta}} \left(\int_{\mathcal{H}} \left(\frac{d\rho}{d\nu}\right)^{2} d\nu\right)^{\frac{1}{2}}.$

While the √n rate and dependence on a divergence between ν and ρ are similar, note that the dependence on the confidence level δ ∈ (0, 1) is polynomial; compare this with the logarithmic dependence available in Theorem 1 above when the losses were bounded.

For comparison, our main result of section 4 is a uniform bound on the Gibbs risk: with probability no less than 1 − δ, we have

$G_\rho \le \hat{G}_{\rho,\psi} + \frac{1}{\sqrt{n}}\left(K(\rho;\nu) + \frac{\log(8\pi M_2 \delta^{-2})}{2} + M_2 + \nu_n^{*}(\mathcal{H}) - 1\right) + O\left(\frac{1}{n}\right)$

where $\hat{G}_{\rho,\psi}$ is an estimator of G_ρ defined in section 3, ν*_n(H) is a term depending on the quality of the prior ν, and the key constants are bounds such that for all h ∈ H we have M₂ ≥ E_µ l(h; z)². As long as the first three moments are finite, this guarantee holds,
and thus both sub-Gaussian and heavy-tailed losses (e.g., with infinite higher-order moments) are permitted. Given any valid M₂, the PAC-Bayesian upper bound above can be minimized in ρ based on the data, and thus an optimal Gibbs posterior can also be computed in practice. In section 4, we characterize this “robust posterior.”

3 A new estimator using smoothed Bernoulli noise

Notation In this section, we are dealing with the specific problem of robust mean estimation, thus we specialize our notation here slightly. Data observations will be x₁, …, xₙ ∈ R, assumed to be independent copies of x ∼ µ. Denote the index set [k] := {1, 2, …, k}. Write M1+(Ω, A) for the set of all probability measures defined on the measurable space (Ω, A). Write K(P, Q) for the relative entropy between measures P and Q (also known as the KL divergence; definition in appendix). We shall typically suppress A and even Ω in the notation when it is clear from the context. Let ψ be a bounded, non-decreasing function such that for some b > 0 and all u ∈ R,

$-\log\left(1 - u + u^2/b\right) \le \psi(u) \le \log\left(1 + u + u^2/b\right).$  (1)

As a concrete and analytically useful example, we shall use the piecewise polynomial function of Catoni and Giulini [7], defined by

$\psi(u) := \begin{cases} u - u^3/6, & -\sqrt{2} \le u \le \sqrt{2} \\ 2\sqrt{2}/3, & u > \sqrt{2} \\ -2\sqrt{2}/3, & u < -\sqrt{2} \end{cases}$  (2)

which for b = 2 satisfies (1).
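The truncation function (2) is simple to implement, and the inequality (1) with b = 2 can be sanity-checked numerically; a minimal sketch (function and constant names are ours):

```python
import math

SQRT2 = math.sqrt(2.0)

def psi(u):
    """Piecewise polynomial function of Catoni and Giulini, eq. (2):
    odd, non-decreasing, and bounded by 2*sqrt(2)/3 in absolute value."""
    if u > SQRT2:
        return 2.0 * SQRT2 / 3.0
    if u < -SQRT2:
        return -2.0 * SQRT2 / 3.0
    return u - u ** 3 / 6.0

# Numerical check of the sandwich (1) with b = 2 on a small grid.
for k in range(-30, 31):
    u = k / 10.0
    assert -math.log(1.0 - u + u * u / 2.0) <= psi(u) + 1e-12
    assert psi(u) <= math.log(1.0 + u + u * u / 2.0) + 1e-12
```

Note that 1 − u + u²/2 > 0 for all real u, so both logarithms in the check are always defined.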
Slightly looser bounds hold with b = 1 for an analogous procedure using a Huber-type influence function.

Estimator definition We consider a straightforward procedure, in which the data are subject to a soft truncation after re-scaling, defined by

$\hat{x} := \frac{s}{n}\sum_{i=1}^{n} \psi\left(\frac{x_i}{s}\right)$  (3)

where s > 0 is a re-scaling parameter. Depending on the setting of s, this function can very closely approximate the sample mean, and indeed modifying this scaling parameter controls the bias of this estimator in a direct way, which can be quantified as follows. As the scale grows, note that

$s\,\psi\left(\frac{x}{s}\right) = x - \frac{x^3}{6s^2} \to x, \quad \text{as } s \to \infty,$

which implies that, taking expectation with respect to the sample and letting s → ∞, in the limit this estimator is unbiased, with

$E\left(\frac{s}{n}\sum_{i=1}^{n}\psi\left(\frac{x_i}{s}\right)\right) = E_\mu x - \frac{E_\mu x^3}{6s^2} \to E_\mu x.$

Figure 1: Graph of the Catoni function ψ(u) over ±√2 ± 2.5.

On the other hand, taking s closer to zero implies that more observations will be truncated. Taking s small enough,³ we have

$\frac{s}{n}\sum_{i=1}^{n}\psi\left(\frac{x_i}{s}\right) = \frac{2\sqrt{2}\,s}{3n}\left(|I_+| - |I_-|\right),$

which converges to zero as s → 0. Here the positive/negative indices are I₊ := {i ∈ [n] : x_i > 0} and I₋ := {i ∈ [n] : x_i < 0}. Thus taking s too small means that only the signs of the observations matter, and the absolute value of the estimator tends to become too small.

High-probability deviation bounds for $\hat{x}$ We are interested in high-probability bounds on the deviations $|\hat{x} - E_\mu x|$ under the weakest possible assumptions on the underlying data distribution.
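In code, the estimator (3) is a one-liner on top of ψ; a self-contained sketch (names ours) exhibiting both limiting behaviors discussed above:

```python
import math

SQRT2 = math.sqrt(2.0)

def psi(u):
    # Catoni-Giulini truncation (2).
    if u > SQRT2:
        return 2.0 * SQRT2 / 3.0
    if u < -SQRT2:
        return -2.0 * SQRT2 / 3.0
    return u - u ** 3 / 6.0

def robust_mean(xs, s):
    """Soft-truncated mean estimator (3): (s/n) * sum_i psi(x_i / s)."""
    n = len(xs)
    return (s / n) * sum(psi(x / s) for x in xs)

xs = [1.0, 2.0, 3.0]
near_sample_mean = robust_mean(xs, s=1e6)  # large s: close to the sample mean
tiny_scale = robust_mean(xs, s=0.01)       # small s: only the signs matter
```

Because ψ is bounded, a single wild observation can move the estimate by at most s·2√2/(3n), in contrast with the sample mean.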
To obtain such guarantees in a straightforward manner, we make the simple observation that the estimator $\hat{x}$ defined in (3) can be related to an estimator with smoothed noise as follows. Let ε₁, …, εₙ be an iid sample of noise ε ∈ {0, 1} with distribution Bernoulli(θ) for some 0 < θ < 1. Then, taking expectation with respect to the noise sample, one has that

$\hat{x} = \frac{1}{\theta}\, E\left(\frac{s}{n}\sum_{i=1}^{n} \psi\left(\frac{x_i \varepsilon_i}{s}\right)\right).$  (4)

This simple observation becomes useful to us in the context of the following technical fact.

Lemma 3. Assume we are given some independent data x₁, …, xₙ, assumed to be copies of the random variable x ∼ µ. In addition, let ε₁, …, εₙ similarly be independent observations of “strategic noise,” with distribution ε ∼ ρ that we can design. Fix an arbitrary prior distribution ν, and consider f : R² → R, assumed to be bounded and measurable. Write K(ρ; ν) for the Kullback-Leibler divergence between distributions ρ and ν. It follows that with probability no less than 1 − δ over the random draw of the sample, we have

$E\left(\frac{1}{n}\sum_{i=1}^{n} f(x_i, \varepsilon_i)\right) \le \int \log E_\mu \exp(f(x, \varepsilon))\, d\rho(\varepsilon) + \frac{K(\rho;\nu) + \log(\delta^{-1})}{n},$

uniform in the choice of ρ, where the expectation on the left-hand side is over the noise sample.

The special case of interest here is f(x, ε) = ψ(xε/s).
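Since ψ(0) = 0, the Bernoulli-noise expectation in (4) can be taken analytically, which makes the identity easy to verify; a small sketch (names ours):

```python
import math

SQRT2 = math.sqrt(2.0)

def psi(u):
    # Catoni-Giulini truncation (2); note psi(0) = 0.
    if u > SQRT2:
        return 2.0 * SQRT2 / 3.0
    if u < -SQRT2:
        return -2.0 * SQRT2 / 3.0
    return u - u ** 3 / 6.0

def robust_mean(xs, s):
    # Estimator (3).
    return (s / len(xs)) * sum(psi(x / s) for x in xs)

def smoothed_noise_form(xs, s, theta):
    """Right-hand side of (4). With eps_i ~ Bernoulli(theta),
    E psi(x * eps / s) = theta * psi(x / s) + (1 - theta) * psi(0)
                       = theta * psi(x / s)."""
    n = len(xs)
    noise_expectation = (s / n) * sum(theta * psi(x / s) for x in xs)
    return noise_expectation / theta
```

For any 0 < θ < 1 the two forms agree exactly; this is the observation that lets Lemma 3 be applied with Bernoulli prior/posterior pairs.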
Using (1) and Lemma 3, with prior ν = Bernoulli(1/2) and posterior ρ = Bernoulli(θ), it follows that on the 1 − δ high-probability event, uniform in the choice of 0 < θ < 1, we have

$\left(\frac{\theta}{s}\right)\hat{x} \le \int \left(\frac{\varepsilon\, E_\mu x}{s} + \frac{\varepsilon^2 E_\mu x^2}{2s^2}\right) d\rho(\varepsilon) + \frac{K(\rho;\nu) + \log(\delta^{-1})}{n} = \frac{\theta\, E_\mu x}{s} + \frac{\theta\, E_\mu x^2}{2s^2} + \frac{1}{n}\left(\theta\log(2\theta) + (1-\theta)\log(2(1-\theta)) + \log(\delta^{-1})\right)$  (5)

where we have used the fact that E ε² = E ε = θ in the Bernoulli case. Dividing both sides by (θ/s) and optimizing this as a function of s > 0 yields a closed-form expression for s depending on the second moment, the confidence δ, and θ. Analogous arguments yield lower bounds on the same quantity. Taking these facts together, we have the following proposition, which says that assuming only finite second moments E_µ x² < ∞, the proposed estimator achieves exponential tail bounds scaling with the second non-central moment.

Proposition 4 (Concentration of deviations). Scaling with s² = n E_µ x² / (2 log(δ⁻¹)), the estimator defined in (3) satisfies

$|\hat{x} - E_\mu x| \le \sqrt{\frac{2\, E_\mu x^2 \log(\delta^{-1})}{n}}$  (6)

with probability at least 1 − 2δ.

³More precisely, taking s ≤ min{|x_i| : i ∈ [n]}/√2.

Remark 5.
While the above bound (6) depends on the true second moment, the result is easily extended to hold for any valid upper bound on the moment, which is what will inevitably have to be used in practice.

Centered estimates Note that the bound (6) depends on the second moment of the underlying data; this is in contrast to M-estimators, which due to a natural “centering” of the data typically have tail bounds depending on the variance [6]. This results in a sensitivity to the absolute value of the location of the distribution: e.g., performance on a distribution with unit variance and E_µ x = 0 will tend to be much better than on a distribution with E_µ x = 10⁴. Fortunately, a simple centering strategy works well to alleviate this sensitivity, as follows. Without loss of generality, assume that the first 0 < k < n points are used for constructing a shifting device, with the remaining n − k > 0 points left for running the usual routine on shifted data. More concretely, define

$\bar{x}_\psi = \frac{s}{k}\sum_{i=1}^{k}\psi\left(\frac{x_i}{s}\right), \quad \text{where } s^2 = \frac{k\, E_\mu x^2}{2\log(\delta^{-1})}.$  (7)

From (6) in Proposition 4, we have

$|\bar{x}_\psi - E_\mu x| \le \epsilon_k := \sqrt{\frac{2\, E_\mu x^2 \log(\delta^{-1})}{k}}$

on an event with probability no less than 1 − 2δ, over the draw of the k-sized sub-sample. Using this, we shift the remaining data points as $x'_i := x_i - \bar{x}_\psi$. Note that the second moment of this shifted data is bounded as $E_\mu(x')^2 \le \mathrm{var}_\mu x + \epsilon_k^2$.
Passing these shifted points through (3) with analogous second moment bounds used for scaling, we have

$\hat{x}' = \frac{s}{n-k}\sum_{i=k+1}^{n}\psi\left(\frac{x'_i}{s}\right), \quad \text{where } s^2 = \frac{(n-k)\left(\mathrm{var}_\mu x + \epsilon_k^2\right)}{2\log(\delta^{-1})}.$  (8)

Shifting the resulting output $\hat{x}'$ back to the original location by adding $\bar{x}_\psi$, conditioned on $\bar{x}_\psi$, we have by (6) again that

$\left|(\hat{x}' + \bar{x}_\psi) - E_\mu x\right| = \left|\hat{x}' - E_\mu(x - \bar{x}_\psi)\right| \le \sqrt{\frac{2\left(\mathrm{var}_\mu x + \epsilon_k^2\right)\log(\delta^{-1})}{n-k}}$

with probability no less than 1 − 2δ over the draw of the remaining n − k points. Defining the centered estimator as $\hat{x} = \hat{x}' + \bar{x}_\psi$, and taking a union bound over the two “good events” on the independent sample subsets, we may thus conclude that

$P\left\{|\hat{x} - E_\mu x| > \epsilon\right\} \le 4\exp\left(\frac{-(n-k)\epsilon^2}{2\left(\mathrm{var}_\mu x + \epsilon_k^2\right)}\right)$  (9)

where probability is over the draw of the full n-sized sample. While one takes a hit in terms of the effective sample size, the centering works to combat sensitivity to the distribution location (see section 5 for empirical tests).

4 PAC-Bayesian bounds for heavy-tailed data

An important and influential paper due to D. McAllester gave the following theorem as a motivating result. To get started, we give a slightly modified version of his result.

Theorem 6 (McAllester [12], Preliminary Theorem 2). Let ν be a prior probability distribution over H, assumed countable, and to be such that ν(h) > 0 for all h ∈ H. Consider the pattern recognition task with z = (x, y) ∈ X × {−1, 1}, and the classification error l(h; z) = I{h(x) ≠ y}.
Then with probability no less than 1 − δ, for any choice of h ∈ H, we have

$R(h) \le \frac{1}{n}\sum_{i=1}^{n} l(h; z_i) + \sqrt{\frac{\log(1/\nu(h)) + \log(1/\delta)}{2n}}.$

One quick glance at the proof of this theorem shows that the bounded nature of the observations plays a crucial role in deriving excess risk bounds of the above form, as it is used to obtain concentration inequalities for the empirical risk about the true risk. While analogous concentration inequalities hold under slightly weaker assumptions, when considering the potentially heavy-tailed setting, one simply cannot guarantee that the empirical risk is tightly concentrated about the true risk, which prevents direct extensions of such theorems. With this in mind, we take a different approach, one that does not require the empirical mean to be well-concentrated.

Our motivating pre-theorem The basic idea of our approach is very simple: instead of using the sample mean, bound the off-sample risk using a more robust estimator which is easy to compute directly, and which allows risk bounds even under unbounded, potentially heavy-tailed losses. Define a new approximation of the risk by

$\hat{R}_\psi(h) := \frac{s}{n}\sum_{i=1}^{n}\psi\left(\frac{l(h; z_i)}{s}\right),$  (10)

for s > 0. Note that this is just a direct application of the robust estimator defined in (3) to the case of a loss which depends on the choice of candidate h ∈ H. As a motivating result, we basically re-prove McAllester's result (Theorem 6) under much weaker assumptions on the loss, using the statistical properties of the new risk estimator (10), rather than relying on classical Chernoff inequalities.

Theorem 7 (Pre-theorem). Let ν be a prior probability distribution over H, assumed countable. Assume that ν(h) > 0 for all h ∈ H, and that m₂(h) := E l(h; z)² < ∞ for all h ∈ H.
Setting the scale in (10) to $s_h^2 = n\, m_2(h) / (2\log(\delta^{-1}))$, then with probability no less than 1 − 2δ, for any choice of h ∈ H, we have

$R(h) \le \hat{R}_\psi(h) + \sqrt{\frac{2\, m_2(h)\left(\log(1/\nu(h)) + \log(1/\delta)\right)}{n}}.$

Remark 8. We note that all quantities on the right-hand side of Theorem 7 are easily computed based on the sample, except for the second moment m₂, which in practice must be replaced with an empirical estimate. With an empirical estimate of m₂ in place, the upper bound can easily be used to derive a learning algorithm.

Uncountable model case Next we extend the previous motivating theorem to a more general result on a potentially uncountable H, using stochastic learning algorithms, as has become standard in the PAC-Bayes literature. We need a few technical conditions, listed below:

1. Bounds on lower-order moments. For all h ∈ H, we require E_µ l(h; z)² ≤ M₂ < ∞ and E_µ l(h; z)³ ≤ M₃ < ∞.

2. Bounds on the risk. For all h ∈ H, we require $R(h) \le \sqrt{nM_2/(4\log(\delta^{-1}))}$.

3. Large enough confidence. We require δ ≤ exp(−1/9) ≈ 0.89.

These conditions are quite reasonable, and easily realized under heavy-tailed data, with just lower-order moment assumptions on µ and, say, a compact class H. The new terms that appear in our bounds that do not appear in previous works are $\hat{G}_{\rho,\psi} := E_\rho \hat{R}_\psi$ and $\nu_n^{*}(\mathcal{H}) = E_\nu \exp(\sqrt{n}(R - \hat{R}_\psi)) / E_\nu \exp(R - \hat{R}_\psi)$. The former is the expectation of the proposed robust estimator with respect to posterior ρ, and the latter is a term that depends directly on the quality of the prior ν.

Theorem 9. Let ν be a prior distribution on model H.
Let the three assumptions listed above hold. Setting the scale in (10) to $s^2 = n M_2 / (2\log(\delta^{-1}))$, then with probability no less than 1 − δ over the random draw of the sample, it holds that

$G_\rho \le \hat{G}_{\rho,\psi} + \frac{1}{\sqrt{n}}\left(K(\rho;\nu) + \frac{\log(8\pi M_2\delta^{-2})}{2} + M_2 + \nu_n^{*}(\mathcal{H}) - 1\right) + O\left(\frac{1}{n}\right)$

for any choice of probability distribution ρ on H, since G_ρ < ∞ by assumption.

Remark 10. As is evident from the statement of Theorem 9, the convergence rate is clear for all terms but $\nu_n^{*}(\mathcal{H})/\sqrt{n}$. In our proof, we use a modified version of the elegant and now-standard strategy formulated by Bégin et al. [3]. A glance at the proof shows that under this strategy, there is essentially no way to avoid dependence on $\nu_n^{*}(\mathcal{H})$. Since the random variable $R - \hat{R}_\psi$ is bounded over the random draw of the sample and h ∼ ν, the bounds still hold and are non-trivial. That said, $\nu_n^{*}(\mathcal{H})$ may indeed increase as n → ∞, potentially spoiling the √n rate, and even consistency in the worst case. Clearly $\nu_n^{*}(\mathcal{H})$ presents no troubles if $R - \hat{R}_\psi \le 0$ on a high-probability event, but note that this essentially amounts to asking for a prior that on average realizes bounds that are better than we can guarantee for any posterior through the above analysis. Such a prior may indeed exist, but if it were known, then that would eliminate the need for doing any learning at all. If the deviations $R - \hat{R}_\psi$ are truly sub-Gaussian [4], then the √n rate can be easily obtained. However, impossibility results from Devroye et al. [8] suggest that under just a few finite moment assumptions, such an estimator cannot be constructed. As such, here we see a clear limitation of the established PAC-Bayes analytical framework under potentially heavy-tailed data.
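The estimator (10) with the scaling of Theorem 9 is straightforward to compute; a sketch (function names are ours, and m2_bound stands for any valid bound M₂ on the second moment, per Remark 11 below):

```python
import math

SQRT2 = math.sqrt(2.0)

def psi(u):
    # Catoni-Giulini truncation (2).
    if u > SQRT2:
        return 2.0 * SQRT2 / 3.0
    if u < -SQRT2:
        return -2.0 * SQRT2 / 3.0
    return u - u ** 3 / 6.0

def robust_risk(losses, m2_bound, delta):
    """Robust risk estimate (10) for one candidate h, with scale
    s^2 = n * M2 / (2 log(1/delta)) as in Theorem 9."""
    n = len(losses)
    s = math.sqrt(n * m2_bound / (2.0 * math.log(1.0 / delta)))
    return (s / n) * sum(psi(l / s) for l in losses)

# Hypothetical loss sample for a single hypothesis.
r_hat = robust_risk([0.3, 1.4, 0.2, 9.7], m2_bound=25.0, delta=0.05)
```

A single enormous loss contributes at most s·2√2/(3n) here, so the estimate degrades gracefully under heavy-tailed losses.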
Since the change of measures step in the proof is fundamental to the basic argument, it appears that concessions will have to be made, either in the form of slower rates, deviations larger than the relative entropy, or weaker dependence on 1/δ.

Remark 11. Note that while in its tightest form the above bound requires knowledge of E_µ l(h; z)², we may set the s > 0 used to define $\hat{R}_\psi$ using any valid upper bound M₂, under which the above bound still holds as-is, using known quantities. Furthermore, for reference, the content of the O(1/n) term in the above bound takes the form

$\frac{1}{n}\left(2\sqrt{V\log(\delta^{-1})} + \frac{M_3\log(\delta^{-1})}{3M_2\sqrt{n}}\right)$

where V is an upper bound on the variance, var_µ l(h; z) ≤ V < ∞, over h ∈ H.

As a principled approach to deriving stochastic learning algorithms, one naturally considers the choice of posterior ρ in Theorem 9 that minimizes the upper bound. This is typically referred to as the optimal Gibbs posterior [9], and takes a form which is easily characterized, as we prove in the following proposition.

Proposition 12 (Robust optimal Gibbs posterior).
The upper bound of Theorem 9 is optimized by a data-dependent posterior distribution $\hat{\rho}$, defined in terms of its density function with respect to the prior ν as

$\frac{d\hat{\rho}}{d\nu}(h) = \frac{\exp\left(-\sqrt{n}\,\hat{R}_\psi(h)\right)}{E_\nu \exp\left(-\sqrt{n}\,\hat{R}_\psi\right)}.$

Furthermore, the risk bound under the optimal Gibbs posterior takes the form

$G_{\hat{\rho}} \le \frac{1}{\sqrt{n}}\left(-\log E_\nu \exp\left(-\sqrt{n}\,\hat{R}_\psi\right) + \frac{\log(8\pi M_2\delta^{-1})}{2} + M_2 + \nu_n^{*}(\mathcal{H}) - 1\right) + O\left(\frac{1}{n}\right)$

with probability no less than 1 − δ over the draw of the sample.

Remark 13 (Comparison with traditional Gibbs posterior). In traditional PAC-Bayes analysis [9, Equation 8], the optimal Gibbs posterior, let us write $\hat{\rho}_{\mathrm{emp}}$, is defined by

$\frac{d\hat{\rho}_{\mathrm{emp}}}{d\nu}(h) = \frac{\exp\left(-n\hat{R}(h)\right)}{E_\nu \exp\left(-n\hat{R}\right)}$

where $\hat{R}(h) = n^{-1}\sum_{i=1}^{n} l(h; z_i)$ is the empirical risk. We have $n\hat{R}$ and $\sqrt{n}\,\hat{R}_\psi$, but since scaling in the latter case should be done with s ∝ √n, in both cases the 1/n factor cancels out. In the special case of the negative log-likelihood loss, Germain et al. [9] demonstrate that the optimal Gibbs posterior coincides with the classical Bayesian posterior. As noted by Alquier et al. [2], the optimal Gibbs posterior has shown strong empirical performance in practice, and variational approaches have been proposed as efficient alternatives to more traditional MCMC-based implementations.
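Over a finite grid of candidates, the robust Gibbs posterior of Proposition 12 reduces to exponential weighting of the prior by $-\sqrt{n}\,\hat{R}_\psi(h)$; a minimal sketch (names ours, with the robust risk values assumed precomputed):

```python
import math

def robust_gibbs_posterior(robust_risks, prior, n):
    """Discrete version of Proposition 12: posterior mass proportional
    to prior(h) * exp(-sqrt(n) * R_psi_hat(h)), normalized."""
    weights = [p * math.exp(-math.sqrt(n) * r)
               for p, r in zip(prior, robust_risks)]
    z = sum(weights)
    return [w / z for w in weights]

# Three hypotheses, uniform prior, n = 100 observations.
posterior = robust_gibbs_posterior([0.5, 0.1, 0.9], [1.0 / 3] * 3, n=100)
```

Lower robust risk translates into exponentially more posterior mass, exactly as in the traditional Gibbs posterior of Remark 13, but with $\sqrt{n}\,\hat{R}_\psi$ in place of $n\hat{R}$.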
Comparison of both the computational and learning efficiency of our proposed “robust Gibbs posterior” with the traditional Gibbs posterior is a point of significant interest moving forward.

5 Empirical analysis

In this section, we use tightly controlled simulations to investigate how the performance of $\hat{x}$ (cf. (3) and Proposition 4) compares with the sample mean and other robust estimators. We pay particular attention to how performance depends on the underlying distribution family, the value of the second moments, and the sample size.

Experimental setup For each experimental setting and each independent trial, we generate a sample x₁, …, xₙ of size n, compute some estimator $\{x_i\}_{i=1}^{n} \mapsto \hat{x}$, and record the deviation $|\hat{x} - E_\mu x|$. The sample sizes range over n ∈ {10, 20, 30, …, 100}, and the number of trials is 10⁴. We draw data from two distribution families: the Normal family with mean a and variance b², and the log-Normal family, with log-mean a_log and log-variance b_log², under multiple parameter settings. In particular, we consider the impact of shifting the distribution location over [−40.0, 40.0], with small and large variance settings. Regarding the variance, we have “low,” “mid,” and “high” settings, which correspond to b = 0.5, 5.0, 50.0 in the Normal case, and b_log = 1.1, 1.35, 1.75 in the log-Normal case. Over all settings, the log-location parameter of the log-Normal data is fixed at a_log = 0. Shifting the Normal data is trivially accomplished by taking the desired a ∈ [−40.0, 40.0].
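One way to generate location-shifted log-Normal samples with a prescribed true mean, in the spirit of this setup, is to subtract the distribution's closed-form mean and add the desired location; a sketch under the a_log = 0 setting (function name ours):

```python
import math
import random

def shifted_lognormal_sample(n, b_log, target_mean, rng):
    """Draw n log-Normal observations (log-mean 0, log-s.d. b_log),
    centered by the closed-form mean exp(b_log**2 / 2) and then
    shifted so that the population mean equals target_mean."""
    pre_shift_mean = math.exp(b_log ** 2 / 2.0)
    return [rng.lognormvariate(0.0, b_log) - pre_shift_mean + target_mean
            for _ in range(n)]

rng = random.Random(0)
sample = shifted_lognormal_sample(10, b_log=1.75, target_mean=-40.0, rng=rng)
```

The resulting data are heavy-tailed to the right but have population mean exactly target_mean, so deviations $|\hat{x} - E_\mu x|$ can be measured directly.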
Shifting the log-Normal data is accomplished by subtracting the (pre-shift) true mean exp(a_log + b_log²/2) to center the data, and subsequently adding the desired location.

The methods being compared are as follows: mean denotes the empirical mean, med the empirical median,⁴ mult_g is the estimator of Holland [10] using smoothed Gaussian noise, mult_b the proposed estimator $\hat{x}$ defined in (3) using smoothed Bernoulli noise, and finally mult_bc the centered version of $\hat{x}$; see the discussion culminating in (9). The latter methods are given access to the true variance or second moment as needed for scaling purposes, and all algorithms are run with confidence parameter δ = 0.01.

Impact of distribution family In Figure 2, we give histograms of the deviations for each method of interest under high variance settings. Colored vertical rules correspond to the error bounds for $\hat{x}$ under Gaussian noise and Bernoulli noise (bound via Proposition 4), with probability δ. When the standard deviation is not much larger than the mean, we can see substantial improvement over traditional estimators. The bias introduced by the different $\hat{x}$ choices is clearly far smaller on average than that of the median, with substantially improved sensitivity to outliers when compared with the mean. The centered version of $\hat{x}$ has a deviation distribution somewhere between that of the empirical mean and that of the other $\hat{x}$ choices.

Impact of distribution location In Figure 3 (a), we plot the graph of average/median deviations over trials, taken as a function of the true location E_µ x. From these results, two clear observations can be made.
First, note that the performance of the Gaussian-type (mult_g) and Bernoulli-type (mult_b) estimators tends to differ greatly as a function of the true mean; in particular, the bias of the Gaussian case is far more sensitive to the true location, providing strong evidence for our proposed Bernoulli version, which is less expensive, essentially uniformly better than the Gaussian version (as we would expect from the tighter bounds), and has error growing more slowly as a function of the true mean value. Second, the centering procedure clearly works well to mitigate the effect of the second moment value, although a price is paid in overall accuracy due to the naive sample-splitting technique discussed above.

4 After sorting, this is computed as the middle point when n is odd, or the average of the two middle points when n is even.

Figure 2: Histograms of deviations |x̂ − E_µ x| for different distributions and estimators, with accompanying error bounds. Sample size is n = 10. Distributions centered such that the mean is equal to the "low" level standard deviation. Top: Normal data. Bottom: log-Normal data.

Figure 3: (a) Deviations |x̂ − E_µ x| as a function of the true mean E_µ x. (b) Deviations |x̂ − E_µ x| as a function of the sample size n. In both sub-figures, left is Normal data, right is log-Normal data.

Impact of sample size In Figure 3 (b), we show the graph of average/median deviations taken over all trials, viewed as a function of the sample size n. The most distinct observation that can be made here is that the estimator x̂ (3) has learning efficiency far superior to that of the empirical mean and median, though as expected the centered version of x̂ has poorer efficiency, a direct result of the sample-splitting scheme used in its definition.
As discussed before, this comes with the caveat that the mean cannot be too much larger than the standard deviation; when the second moment is exceedingly large, this leads to a rather large bias, as seen in Figure 3 (a) previously.

6 Conclusions

The main contribution of this paper was to develop a novel approach to obtaining PAC-Bayesian learning guarantees which admits deviations with exponential tails under weak moment assumptions on the underlying loss distribution, while still being computationally amenable. In this work, our chief interest was the fundamental problem of obtaining strong guarantees for stochastic learning algorithms which can reflect prior knowledge about the data-generating process, from which we derived a new robust Gibbs posterior. Moving forward, a deeper study of the statistical nature of this new stochastic learning algorithm, as well as the computational considerations to be made in practice, is of significant interest.

Acknowledgments

This work was partially supported by the JSPS KAKENHI Grant Number 18H06477.

References

[1] Alquier, P. and Guedj, B. (2018). Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902.

[2] Alquier, P., Ridgway, J., and Chopin, N. (2016).
On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(1):8374–8414.

[3] Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. (2016). PAC-Bayesian bounds based on the Rényi divergence. In Proceedings of Machine Learning Research, volume 51, pages 435–444.

[4] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

[5] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization: École d'Été de Probabilités de Saint-Flour XXXI-2001, volume 1851 of Lecture Notes in Mathematics. Springer.

[6] Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185.

[7] Catoni, O. and Giulini, I. (2017). Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747.

[8] Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Annals of Statistics, 44(6):2695–2725.

[9] Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems 29, pages 1884–1892.

[10] Holland, M. J. (2019). Robust descent using smoothed multiplicative noise. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 703–711.

[11] Maurer, A. (2004). A note on the PAC Bayesian theorem. arXiv preprint arXiv:cs/0411099.

[12] McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363.

[13] McAllester, D. A. (2003). PAC-Bayesian stochastic model selection.
Machine Learning, 51(1):5–21.

[14] McAllester, D. A. (2013). A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118.

[15] Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269.

[16] Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., and Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093.

[17] Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1996). A framework for structural risk minimisation. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 68–76. ACM.

[18] Tolstikhin, I. and Seldin, Y. (2013). PAC-Bayes-Empirical-Bernstein inequality. In Advances in Neural Information Processing Systems 26, pages 109–117.

[19] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.