Review for NeurIPS paper: PAC-Bayes Analysis Beyond the Usual Bounds

NeurIPS 2020

PAC-Bayes Analysis Beyond the Usual Bounds

Review 1

Summary and Contributions: The paper presents a KL-based PAC-Bayes generalization bound for any convex function between the expected and empirical Gibbs risks. While the initial results are for data-dependent priors, the authors also provide a general way to change a PAC-Bayes bound from a data-free prior to a data-dependent prior, by assuming a differential privacy property on the distributions. The paper also presents an example of unbounded loss, i.e., squared loss in a regression problem.

Strengths: The theoretical results seem sound.

Weaknesses: I am unclear about novelty and relevance. (Please see my additional feedback below.)

Correctness: The main claims seem correct.

Clarity: The paper is mostly clear. Few comments to improve clarity: Line 167 "DP(\epsilon) property" should better refer to "\epsilon-differential privacy" in the main text for clarity. I could not believe that Equation 8 holds with probability one, until I noticed that \lambda_i and \hat{lambda_i} depend on S. I suggest using the notation \lambda^S_i and \hat{lambda^S_i} to make the dependence clear. I know that this is somewhat clarified later in Lines 203-207. Line 18 "(e.g. \mathcal{Z} = R^d \times \mathcal{Y})" does not explain what \mathcal{Y} is in the example.

Relation to Prior Work: Relation to prior work is partially addressed. (Please see my additional feedback below.)

Reproducibility: Yes

Additional Feedback: The authors discussed several prior works for unbounded losses in Lines 320-328. While previous results rely on concentration of the losses (e.g., bounded moments), the current result also relies on concentration of the covariance eigenvalues. Note that the loss 1/2 E[(w^T X - y)^2] = 1/2 w^T E[X X^T] w - E[X y]^T w + 1/2 E[y^2], and thus, the loss includes the covariance matrix E[X X^T]. Is there any relationship between the previous assumptions regarding concentration of the losses (e.g., bounded moments) and the concentration of the covariance eigenvalues? I am unclear about novelty and relevance since the authors provide only one example (regression). I know that the authors made an effort to explain throughout the paper and in Section 4. I will appreciate if the authors can elaborate more on novelty and relevance. For instance: What new results can we obtain now that were not possible before. What better rates can we obtain for problems where we already had results. === AFTER REBUTTAL I am a bit disappointed by the authors response: "We will add discussions about the new kinds of results one can obtain now that were not possible before, including rates where relevant." It would have been more beneficial to use the rebuttal to provide such a case. As the authors agree, there is a relationship between the concentration of the loss and the concentration of the covariance eigenvalues. This decreases the novelty of the example provided by the authors. Given the above, I lower my evaluation from 6 to 5.

Review 2

Summary and Contributions: The paper extends the PAC-Bayesian analysis of data-dependent priors.

Strengths: This work study very important aspects of the PAC-Bayes theory. The introduced general framework to study data-dependant prior appears to be a promising theoretical tool to obtain new results, The proposed specialization to three types of data-dependant prior shows the versatility of the approach.

Weaknesses: Some results should be discussed in more details. (1) One asset of the "classical" PAC-Bayes bounds is the fact that they are valid uniformly for all posterior Q (as for the bounds of Eq. 1, 2 and 3). It does not seem to be the case for Theorem 1(i). Am I right? And what about the other bounds of the paper? This point deserve to be properly mentioned for all results. (2) Theorems 1 and 2 proposes general theorems that are possibly data-dependant, but borrows similarities from previous works. Theorem 1(i) looks akin the PAC-Bayes "pointwise" bound of Blanchard and Fleuret ("Occam’s Hammer", COLT 2007), Theorems 1(ii) and 2 resemble to many PAC-Bayesian general statements of the literature cited elsewhere in the manuscript. The function «F» domain might be in R^k instead of R^2, but it does not seem to be used or discussed in the paper.

Correctness: The theoretical statements are sounds and the mathematical results seem to have been obtained rigorously. However, I haven't check all the details in supplementary material.

Clarity: The clarity can be improved. The first pages contain many important definitions in footnotes, which I find unpleasant to read. Space could be saved by shortening, or moving to appendix, the final "Additional discussion and related literature" section, that is a very general and mostly uninformative list of general PAC-Bayes references. This should also allow to comment more on the results (see above " Weaknesses" section).

Relation to Prior Work: Previous work are cited but not always in an oprtimal manner (see comments above).

Reproducibility: Yes

Additional Feedback: Minor comments: Line 84: I don't think "by a stronger form of Pinsker's inequality" is appropriate to say, as (p-q)^2/(2q) < 2(p-q)^2 for q > 0.25. Thus, it only provides tighter risk bounds in some appropriate scenarios. Line 502: Z_i -> Z'_1 Line 510: DP is used before being defined (next page). ================== Following the rebuttal and the discussion, I raised my score from 6 to 7 because, as mentioned by R3, "the distinction between distribution-dependent bounds and data-dependent bounds was often poorly understood in the literature", and the paper could help the community to go further in this direction. I strongly encourage the authors to carefully revise their paper to include all the changes they commit to do in their rebuttal. Note, however, that they must be careful regarding the claim "each theorem holds for an arbitrary kernel, so in that sense it holds 'for all posteriors' as per the usual formulation in the literature." I doubt that the bound is valid *uniformly* for all kernel, i.e., I doubt that the following holds: P( \forall kernel : risk < bound) < 1-\delta. In contrast, the usual PAC-Bayes bounds are valid *uniformly* "for all posterior": P( \forall posterior: risk < bound) < 1-\delta.

Review 3

Summary and Contributions: The paper starts with an introduction to PAC-Bayes bounds: 1) what they are useful for: to upper-bound the generalization error of randomized or aggregated predictors 2) the various type of bounds (data dependent, distribution dependent...) Then, the authors proposes a new key lemma (Theorem 1) in PAC-Bayes analysis. Under the "usual assumptions", that is, a) bounded loss b) data-free prior and c) i.i.d obserbations, this key lemma allows to recover known PAC-Bayes bounds by McAllester (1998), Seeger (2002), Catoni (2007)... However, the important point is that the lemma still holds without these assumptions: there is a key quantity in the upper bound, \xi(Q^0). In any setting where one is able to upper-bound \xi(Q^0), one obtains a new PAC-Bayes inequality. The authors apply Theorem 1 to data-dependent priors (that is, b) above is violated), and then, to least-square regression (where a) above is violated). Note that in the first case, they obtain known essentially known results: one by Catoni (2007), the other by Dziugaite and Roy (2018) -- with an improvement in the constants, though.

Strengths: PAC-Bayes bounds are known to be quite "difficult to read", e.g. the monograph by Catoni (2007) is very technical. Despite the recent regain of interest in PAC-Bayes bounds, a good and easy-to-follow introduction is still needed. The introduction (pages 1 to 3) is the clearest I've never read on the topic. All the quantities are defined rigorously, and explained intuitively. Moreover, the distiction between distribution-dependent bounds and data-dependent bounds was often poorly understood in the literature: the explanations by the authors is great. Both types of bounds are useful, distribution-dependent bounds showing that "some data-generating distributions might give faster convergence rates than others" while data-dependent bounds are to be seen as "a risk certificate that is valid on unseen examples". For this beautiful introduction, I want to first congratulate the authors. The main contribution is Theorem 1, that is new to my knowledge. It provides a unified approach to many PAC-Bayes bounds, and clarifies the role of the "usual assumptions" mentioned above. It also provides a way to avoid them and to derive new bounds.

Weaknesses: One can argue that, even in the quadratic loss and data dependent prior case, the results were essentially already known. So, the paper mostly gives a unified approach to prove existing results. Indeed, I would not say that the paper is groundbreaking. Even though, this criticism should be mitigated: 1) the constants in the bound by Dziugaite and Roy (2018) are improved. Even when the improvement in the constants is small, I think this is always important in data-dependent bounds to obtain the sharpest possible constants. 2) the approach proposed by the authors can be applied to other examples, and lead to newer bounds.

Correctness: The main arguments in the proofs seem to be correct.

Clarity: As already written above (in the "strength" section), the clarity level of the paper is excellent.

Relation to Prior Work: Overall, the references to prior works is very good. The authors made a great effort to mention all classical PAC-Bayes works: McAllester (1998), Seeger (2002), Catoni (2007)... and to mention the most recent works (2018/2019), explaining well the contributions of each. I would like to raise 4 important points, though. I hope that the authors can take them into account: 1) The introduction (line 106-115) mention many works with data-dependent priors, but do no mention Catoni (2007). However, the fact that Catoni used data-dependent priors is menioned later, in Section 2.1. It might be worth mentioning it in the introduction, too. One more thing about data-dependent priors: the application of PAC-Bayes bounds with data-free priors, like (1), (2) and (3), leads to bounds in sqrt(log(n)/n) in classification. Using the Gibbs data-dependent prior, one obtains (6), where the bound is in sqrt(1/n). Thus, one gets rid of the log(n) term, which is to my opinion one of the most important contributions in Catoni (2007). Given the pedagogical qualities of the paper, it would be worth mentioning this explicitely. 2) I mentioned above 3 "classical assumptions": a) bounded loss b) data-free prior and c) i.i.d obserbations. The authors claim that their work allows to extend over past results by removing a) or b), but they never mention c). If one checks the proofs in [Alquier and Wintenberger, Model selection for weakly dependent time series forecasting, Bernoulli, 2012], it appears that the proofs are based on PAC-Bayes bounds for non i.i.d data. Namely, the authors apply Rio's version of Hoeffding's inequality for time series, derive PAC-Bayes bounds, and use them to prove model selection results for time series. The interesting point is that: I) Theorem 1 does not require assumption a) nor b), but it does not either require assumption c)!! and II) one can plug Rio's inequality to upper bound \xi(Q^0) when the data is not independent. To mention this would considerably extend the scope of the paper. 3) The authors write "To the best of our knowledge, ours is the first work to extend PAC-Bayes analysis to stochastic kernels". This statement is wrong. In Catoni (2007), the notion of kernel is defined page 5/6 and the following PAC-Bayes bound, Theorem 1.2.1, is stated for a stochastic kernel. (Note that there, it's not called a "kernel" but a "regular conditional probability measure", but the definition is exactly the same). The same framework was used for example in [Alquier, PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers, Mathematical Methods of Statistics, 2008] Definition 1.5 page 282. 4) In Section 2.3, the authors study least-square regression and provide a very general result. However, line 205/206, it seems that one has to require that the noise is sub-Gaussian to obtain explicit rates. The authors mention Vershynin (2011) to remove this assumption, is it possible to have more details? I think another possibility is to use the work of Holland (2019), already cited by the authors. The idea of Holland is to use in the proofs an alternative loss function, that allows to deal with heavy-tailed noise. My guess is that the authors could use Holland's idea in their Theorem 1, by adapting the function f(S,H) so that it matches the loss function of Holland. If this is the case, this would be another interesting application of Theorem 1. (Of course, I understand that space is limited, so I'm not asking for complete statements of new results, but at least a short discussion on these points would be nice).

Reproducibility: Yes

Additional Feedback: None. ************** I thank the authors for their detailed feedback. My positive opinion on the paper remains unchanged, so I don't change my scores...

Review 4

Summary and Contributions: The paper handles PAC-Bayes bounds where the prior is data-dependent, or when the loss function is unbounded. The authors use the proposed method to provide several examples where this bound can be useful.

Strengths: The paper develops new bounds, some of them eliminate previous assumptions, thus allowing more holistic bounds. Specifically, this is the first attempt to addresses stochastic kernels using PAC-Bayes bounds. The main claims are sound. The derivation from the main claim to specific cases strengthens their usefulness.

Weaknesses: The paper is missing a thorough comparison to previous works. While the paper states previous bounds, it is unclear how the new bounds improve them, if at all. While the benefits of comparing bounds with data-dependent priors to non-data-dependent priors can be argued, it is at least meaningful to compare to previous data-dependent works. The paper might be organized differently for better readability: Previous works can be aggregated into one section, instead of being separated along with the paper; Notations and preliminaries could be defined separately outside the introduction; The implication of the two main theorems (sections 2.1-2.3) could be after the main theorems (section 3).

Correctness: The proofs inside the paper seem correct. I did not review the proofs in the appendix.

Clarity: The paper is well written. However, more organizations can be made for better readability (see details above).

Relation to Prior Work: The authors provide a comprehensive survey of previous works and explain how their assumptions differ from the previous works. Yet, a comparison between the results is still needed to put the bounds in context.

Reproducibility: Yes

Additional Feedback: In theorem 2, please better explain why the convexity is necesary and if and where it limits the general claim. Very minor issues: - Please explain the difference between “kl” (lower case) and KL (upper case) in the preliminaries. - Line 327: You wrote “or results” instead of “our results.”