NeurIPS 2020

Can Implicit Bias Explain Generalization? Stochastic Convex Optimization as a Case Study


Meta Review

## Summary

This paper investigates whether implicit bias can explain generalization in stochastic convex optimization (SCO). It responds to recent work positing that SGD generalizes because it possesses some implicit regularization property. Through explicit constructions, the authors show that: (1) no distribution-independent implicit regularizer can be responsible for generalization; and (2) there are distributions for which no distribution-dependent implicit regularizer can explain generalization. Thus, implicit regularization can be ruled out for many instances of SCO.

## Strengths

* The paper is well written and organized. The authors have clearly taken care to help the reader understand the results, supplying intuition and high-level takeaways as needed.
* The results are relevant to contemporary discourse and will further our understanding of SCO and implicit regularization; they may change the direction of these discussions.

## Weaknesses

* Understanding SCO may or may not bring us closer to understanding stochastic non-convex optimization, which is the more interesting setting in light of deep learning's success. Nonetheless, it is important to understand convex problems.
* The paper's main contributions are negative results, with no recommendations for more promising directions to explain generalization. Perhaps that is a matter for future work, but it would have been nice if the results offered some guidance.
* The proofs are extraordinarily long and complicated, which makes them difficult to understand and verify. Perhaps with a few days (or weeks) of dedicated effort, one could thoroughly digest the proofs, but given the realistic constraints of life/work, most readers will just have to take the authors' word for it. (This will become important later in my meta-review.)
* Writing nit: there is no conclusion/discussion section, which makes the ending very abrupt. I suspect there simply wasn't enough space. I recommend shortening some of the background sections (2 & 3) to make room at the end for conclusions.

## Justification for Recommendation

The only negative review came from R3, whose primary objection was that SGD and GD should be specified as _projected_ SGD and GD, respectively; otherwise, their iterates may end up outside the feasible set. The authors responded that this is a simple fix which does not change their results, but R3 was not convinced. R3's argument goes as follows (paraphrasing the discussion):

> In the setup, all models are assumed to belong to the unit ball, denoted W_1. However, without projection, there is no guarantee that SGD or GD will keep the iterates in W_1. Th 1 claims that there exists a point inside W_1 showing that SGD is Pareto-inefficient. The proof of Th 1 relies on Th 8, in which it is shown (p. 15 of the supplemental) that the iterates are contained in a ball of radius 5 (W_5), not 1, and no projection is used. Over W_1, the regularized ERM was shown to have a lower regularization penalty and a lower empirical error than the SGD solution, but no such claim can be made over W_5: a regularized ERM solution over W_5 may have a larger regularization penalty than the W_1 solution, so the SGD solution may no longer be Pareto-suboptimal w.r.t. the regularized ERM over W_5.

The proofs of all of these results are unfortunately very long and technical, and I must admit that I have not verified them in detail.
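As an aside, for readers less familiar with the distinction R3 is drawing: projected (S)GD differs from the plain variant only in that each iterate is mapped back onto the feasible set after the gradient step. The sketch below is purely illustrative, my own toy code with made-up names (`grad_fn`, `samples`, `lr`), not the paper's construction:

```python
import numpy as np

def project_onto_ball(w, radius):
    """Project w onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def sgd(grad_fn, samples, lr, dim, radius=None):
    """Run SGD from the origin, optionally with projection.

    grad_fn(w, z) should return a (sub)gradient of the loss at w on sample z.
    If radius is None, no projection is applied (the variant R3 says the
    proofs analyze); otherwise each iterate is projected back onto the ball
    of that radius (the projected variant, e.g. radius=1 for W_1).
    """
    w = np.zeros(dim)
    iterates = [w]
    for z in samples:
        w = w - lr * grad_fn(w, z)
        if radius is not None:
            w = project_onto_ball(w, radius)
        iterates.append(w)
    # Return the average iterate (one common choice for SGD in SCO; the
    # paper's exact output rule may differ).
    return np.mean(iterates, axis=0)
```

With `radius=None` one gets the unprojected dynamics that R3 says the proofs actually analyze; with `radius=1` one gets the algorithm constrained to W_1 that the setup seems to call for.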
Nonetheless, I believe that the issues R3 raises are not all that severe, and I will now attempt to explain why (using arguments borrowed from the discussion). First, it is clear that the iterates of SGD (as considered in the paper) never escape a constant-sized ball; in the construction used to prove Th 8, that radius is 5. So, if the feasible set is defined as a ball of radius 5, the iterates of projected SGD will never require projection; they naturally stay inside the feasible set. From this, we conclude that modifying the paper to explicitly state "projected SGD" is merely a technicality. However, since we have changed the problem domain from a 1-ball to a 5-ball, it is unclear whether the rest of the proof goes through.

To this I ask: can we not simply change the construction used in Th 8 so that the iterates never escape the 1-ball? 5 is not a magic number; surely some rescaling of the construction (e.g., restricting \theta_1 to a different range) would yield the desired result. We are quibbling over constants that do not appear to be fundamentally important. The problem is that verifying this would take me a very long time and much effort, so, lacking that effort, I am left with some uncertainty. Notably, even R3 is not certain that the results fail under the proposed fix; they have cast doubt on the proof, but have not shown that the result is actually flawed.

I will also note that, as a result of the discussion, R7 lowered their score (from 7 to 5), though I believe their original assessment was correct. (The other two reviewers were not swayed to lower their scores.) I am thus going against the final scores and recommending acceptance. I think this is a good paper, and it would be a shame to reject it due to a possible misunderstanding.

## Feedback for the Authors

I _strongly_ encourage the authors to incorporate all of the feedback when revising the paper. In particular, I wish they could find a way to simplify the proofs in the appendix to make them more easily digestible. At the very least, adding more high-level intuitions (and maybe figures?), as they did in the main text, would go a long way.