NeurIPS 2020

### Review 1

Summary and Contributions: This paper adjusts the batch norm statistics during test time and yields large gains on many distribution shifts. The adjustment depends on the number of examples, which is new. Evaluation is mainly on corrupted images of various types (ImageNet-C).

Strengths: The empirical evaluation is sound, thorough, and conducted on large-scale images. The performance improvements are substantial, and the method makes efficient use of previously unexploited information. The paper also shows when this technique might not be as helpful (ResNeXt-WSL). This is relevant to NeurIPS given the increased interest in OOD generalization.

Weaknesses: Using Adaptive BN for distribution shift robustness has been proposed in several _parallel_ works, though the sample-size dependent adjustment is distinct even from these parallel works.

Correctness: "Ford et al. [2] report a decrease in performance when the compressed JPEG files are used as opposed to applying the corruptions directly in memory without compression artifacts." This was overstated, and the gap arose because they accidentally used 299x299 images. The performance difference is small.

Clarity: The paper is well-organized and figures look fine.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Is Eqn (4) a good way to integrate new statistics? Would it be better if the reliance on the original statistics decreases exponentially as new samples increase? (You're proposing a weighted average instead of an exponential moving average.) It would be good to mention _Identifying Statistical Bias in Dataset Replication_ in situating the ImageNet-V2 results. The appendix figure comparing mCE to # of parameters was informative. This type of high-level analysis is useful. ImageNet-R comparisons would be good to see in a camera-ready. Update: I have read the paper and appreciate the expanded results.
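
To make the contrast in the question above concrete, here is a minimal sketch of the two weighting schemes. Eqn (4) is not reproduced in this review, so the exact form of the weighted average is an assumption: the target weight is taken to be `n / (pseudo_count + n)`, where `pseudo_count` is a hypothetical hyperparameter. The function names and the `decay` parameter are also illustrative, not taken from the paper.

```python
def weighted_average_stat(source_stat, target_stat, n, pseudo_count):
    """Blend a source and a target statistic (assumed form, not Eqn (4) verbatim).

    The target weight is n / (pseudo_count + n), so reliance on the source
    statistic shrinks hyperbolically as the number of target samples n grows.
    """
    if n < 0 or pseudo_count < 0 or n + pseudo_count == 0:
        raise ValueError("n and pseudo_count must be non-negative and not both zero")
    w_target = n / (pseudo_count + n)
    return (1.0 - w_target) * source_stat + w_target * target_stat


def exponential_decay_stat(source_stat, target_stat, n, decay):
    """Alternative raised in the review: the source weight shrinks as decay ** n."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    w_source = decay ** n
    return w_source * source_stat + (1.0 - w_source) * target_stat


if __name__ == "__main__":
    # Compare how quickly each scheme moves from the source value (0.0)
    # toward the target value (1.0) as more target samples become available.
    for n in (0, 1, 8, 64):
        print(n,
              weighted_average_stat(0.0, 1.0, n, pseudo_count=16),
              exponential_decay_stat(0.0, 1.0, n, decay=0.9))
```

With the weighted average, the source statistic still carries half the weight at `n == pseudo_count`, whereas with exponential decay the source weight falls off geometrically in `n`.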

### Review 2

Summary and Contributions: The authors posit that "common corruptions" (= ImageNet-C) to images translate to shifts in the first and second moments of the activations for models trained with batch norm. They suggest correcting for this by estimating the batch statistics from the test set (or interpolating between those and the training ones) and confirm that this helps across several ImageNet SOTA models. They further show that the gains vanish when one uses models that have been pre-trained on large weakly supervised datasets. Finally, they draw a connection between the Wasserstein metric and model performance on ImageNet-C by showing that they are correlated. The idea of adapting the batch-norm statistics under a distributional shift is not new, as the authors point out (paragraph starting on l. 249). Hence, the main contribution of this paper is the hypothesis that synthetic perturbations result in such shifts, and the analysis of the proposed technique (which fixes only the first two moments). It's a bit unfortunate that the authors test their hypothesis on a single dataset only. Nevertheless, I believe this is an interesting technique that can be useful for constructing strong baselines on this ever more important problem. After response: Thanks for your response! It is encouraging to see that you show benefits on another dataset, and thanks for fixing the presentation errors / typos. I wish you had touched upon the Wasserstein-related question; I was really curious about its importance for this problem.

Strengths: - Analyzing model behavior under covariate shift is an important problem that the NeurIPS community is excited about. - The technique is simple and can be a very strong baseline for future developments. - They successfully correct several different models to obtain SOTA results.

Weaknesses: - The authors did not try to analyze whether their hypothesis holds on other datasets, e.g. CIFAR-C, or any dataset that is derived using synthetic perturbations. - There are several issues with the clarity of the paper and the precision of the theoretical claims. - The technique itself for correcting the BN statistics is already known. This limits the novelty, as it shifts the focus to the claim about common corruptions, which is evaluated on a single dataset only.

Correctness: - The proposal to optimize a lower bound on the Wasserstein loss (line 227) is questionable: why not simply focus on the upper bound and use the lower bound only to quantify uncertainty? - Proposition 1 holds only for Gaussian distributions; otherwise, the Wasserstein metric has no closed-form solution as a function of the first two moments alone. - Equation (3) is wrong: you have to change the argument to p_s(x) and the right side rather than the left side of both conditionals. You could also define a bijection A=Covariance^{-1/2}(x-Mean) and work with x'=Ax.
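
For reference, the Gaussian special case mentioned in the second point has a standard closed form in one dimension. The symbols below are generic and are not taken from the paper's Proposition 1:

$$
W_2^2\big(\mathcal{N}(\mu_1, \sigma_1^2),\ \mathcal{N}(\mu_2, \sigma_2^2)\big) = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2
$$

The right-hand side depends only on the first two moments, which is exactly why the identity cannot be expected to hold for arbitrary non-Gaussian activation distributions.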

Clarity: The paper itself is clearly written and easy to follow. One thing that could be improved is to note that the mean / variance scalings are not made for each layer independently, as they have cascading effects. This is clear from the appendix, but it should be stressed in the text. There are, however, several issues with the figures, e.g. - Figure 2 can be hard to read; I'd suggest changing the marker style to something other than filled-in and blank. - The colors in Figure 4 (middle) are completely inconsistent. Further, please note in the paper itself why there are three ImageNet-v2 variants, as this can be very confusing to some readers. Also, shouldn't the baseline match the "Matched" ImageNet-v2 variant? I'm either confused by the colors, or it's matching "Top". - Why are there two horizontal lines in Fig 4 (right), but only one in the legend? - Why is the baseline marked differently in the left and middle panels of Fig 4? - I'd spell out ResNet and AssembleNet fully in Fig 1 (or in the caption).

Relation to Prior Work: The authors positioned both the problem, current approaches, as well as their proposal for batch norm correction well with regard to prior work.

Reproducibility: Yes

Additional Feedback: - Why did you focus on the Wasserstein distance? I would be very curious what the plot looks like for other divergences (you mention Jensen-Shannon in the appendix, but not in the main text), as many have closed forms for Gaussians (or are easily estimated, given that they are 1-d distributions).
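
As an illustration of the closed forms alluded to above, the following sketch computes two standard divergences between univariate Gaussians: the KL divergence and the squared Hellinger distance. The function names are hypothetical, and the choice of these two divergences is an assumption made for illustration; the review does not say which ones the authors should report.

```python
import math


def kl_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


def hellinger_sq_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Squared Hellinger distance between two univariate Gaussians (in [0, 1])."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    var_sum = sigma1 ** 2 + sigma2 ** 2
    coeff = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return 1.0 - coeff * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


if __name__ == "__main__":
    # Hypothetical per-channel statistics: source (0, 1) versus a shifted target.
    print(kl_gaussian_1d(0.0, 1.0, 0.5, 1.5))
    print(hellinger_sq_gaussian_1d(0.0, 1.0, 0.5, 1.5))
```

Because both quantities need only a mean and a standard deviation per channel, they could be computed from the same statistics that are already collected for batch norm adaptation.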

### Review 3

Summary and Contributions: The key motivation of this work is the sensitivity of models to covariate shift. The main contribution of the work is around building an empirical understanding of adaptive Batch Normalization across a variety of models and datasets, in the presence of covariate shift.

Strengths: - As pointed out in S3, the core method that the paper studies is very close to prior work on adaptive Batch Norm. The additional idea of interpolating the train and test statistics is useful when the number of target samples is limited, and results suggest that it is possible to adapt with a limited number of samples. - The work does a good job of making a case for the inclusion of evaluation settings where the model is allowed to adapt to the target domain. There seems to be a substantial performance difference with and without adaptation (on IN-C), and the additional complexity of evaluation with adaptation seems minor. I was able to look over the submitted code, and I think it would be an easy addition to existing benchmarks. - The empirical study has some interesting takeaways: (i) adaptation seems to improve performance across the board for IN-C; (ii) pretraining may help mitigate internal covariate shift; (iii) BN with adaptation appears to be more robust than alternatives like GroupNorm or Fixup. Overall the simplicity of the method is appealing, and it provides a substantial improvement for little extra effort.

Weaknesses: - Since the contribution of the work is largely empirical, I think S5/S6 should be much more carefully written to explain results (see detailed comments on clarity and suggestions for improvement). - Parts of the paper appear to be somewhat incautiously written: e.g. while lines 184-190 (correctly) suggest that pretraining alleviates the need for adaptation based on Table 2, lines 285-287 make a stronger claim about the covariate shift under the IG-3.5B model, which is not something that has been explicitly shown empirically.

Correctness: Results seem correct (errata are included in Appendix which I appreciate), and I've already raised other concerns about S6 elsewhere.

Clarity: I think the writing can be improved substantially. - Given that the work is largely empirical, the introduction feels unfocused, and does very little in terms of presenting and signposting the central results of the paper. Currently, the reader has very little idea of what to expect in S5/6 until they reach the end of the paper. - The experiments themselves are somewhat haphazard, and I would ideally like to see more detail on exactly what the experimental hypotheses are (at the beginning of S5), and why the experiments being run are the right way to test those hypotheses. - I think the experiments could benefit from a more detailed explanation of the methodology. As pointed out elsewhere, many of the figures require additional clarification.

Relation to Prior Work: The relationship to prior work was clearly stated.

Reproducibility: Yes

Additional Feedback: I think there is scope for improving the clarity, results and presentation of the paper. - Lines 45-48 suggest that we will see an experiment that validates that distributional shifts under corruption are largely due to the shifts in the first and second moments of the internal activations. However, I don't think this is directly addressed by an experiment in S5/6. While it seems that the BN adaptation can compensate for performance degradation due to these moments, it's not clear that there isn't a difference in higher-order moments that is not addressed by this technique. It would be good to address this claim. - Defn 1 seems to be missing some statement about how p_s and p_t are different. - I don't understand Eqn. (3). Aren't p_s and p_t substantially different due to the covariate shift? - It would be nice to see an experiment that quantifies the degree to which the adaptive BN technique can compensate for covariate shift, as the amount of such shift is varied. Concretely, if the IN-C corruptions were sorted in order of the covariate shift induced, does adaptive BN compensate more for the corruptions with larger shift? - Lines 149-150 refer to a full green line with stars, but Figure 1 (i) contains no such line. What is "N best" in Figure 1? It would also be nice to put $n$ on the x-axis of the plot. - In Figure 3 (i), each corruption has multiple points scattered on the plot: what do these correspond to? - I'm confused by Figure 3 (ii). For layer j > i, if W_j < W_i (i.e. there is less divergence at a later layer), does that not indicate that the model has compensated for the distributional shift? How was this figure generated? I don't see any details for this in the Appendix. - For pretraining, how does the distribution of internal activations change? The pretraining section suggests that these models don't require adaptation, so it would be nice to see more analysis on why this is the case.
- The theoretical analysis and accompanying empirical study (end of S6) seem like an afterthought. I found this portion largely unclear (see below). - Figure 5 merits more explanation; I found it hard to interpret and did not understand its significance. What empirical observation is explained by Fig 5? - Why focus on the W2 distance in Prop 1 rather than the parameter estimation error? - In Prop 1, what is the difference between \hat{\mu}_t and \bar{\mu}_t? - Why would choosing N by minimizing L make sense? Isn't L a lower bound? We would presumably want to minimize U to reduce the divergence between the true statistics and the estimated statistics? - How do you calculate min_N U? Doesn't that require knowledge of the target statistics (mu_t, sigma_t), which is what you're trying to estimate? - Page 2 of the Appendix appears to be incomplete. - The authors clearly made substantial effort in making sure the work is reproducible. I went through the submitted code, as well as the appendix, and I believe that the results should be reproducible in a reasonable amount of time. I encourage the authors to package and release their code for use in future robustness research. -------- Updated (8/20) --------- I've read through the rebuttal. I was mainly concerned by readability issues and quality of presentation, as well as problems with understanding the utility of the theoretical results in the work. Some of these concerns were echoed by other reviewers. I think the authors did a good job addressing concerns about clarity. I was glad to see that they explicitly wrote out the relationship of experiments to a set of concrete experimental hypotheses. They have also fixed the issues around readability of figures and tables. Some of the other reviewers had valid concerns about generalizing to other datasets, which the authors made an effort to address.
I'm still somewhat concerned about the presentation and significance of the theoretical results, which remain largely an afterthought. I had specific questions related to the applicability of the results in practice. However, given the space constraints in the rebuttal, the effort made by the authors, and that this is not the central contribution of the work, I'm willing to overlook these issues. As I stated in my review, the simplicity of the work is appealing. Given this, I've updated my score from 5 -> 7 to reflect my updated evaluation of their work. However, whether the paper is accepted or not, I do urge the authors to consider some of the points raised about the theoretical results and communicate them more clearly.

### Review 4

Summary and Contributions: The paper proposes updating the batch normalization statistics at evaluation time, in an unsupervised manner, in order to improve robustness metrics on different types of corruptions. This simple adaptation method shows that the mean corruption error on ImageNet-C can be largely reduced, and gives boosts to many published types of neural networks. This suggests that the estimate of the mean corruption error, at least in datasets using ad-hoc distortions such as ImageNet-C, is very likely over-estimated.
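
The idea summarized above can be sketched for a single channel as follows. This is a minimal illustration and not the authors' implementation: it omits the learned affine scale and shift of a real batch norm layer, and the stored training statistics and test activations are made-up values.

```python
import math


def batch_stats(values):
    """Mean and (biased) variance of one channel's activations over a batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def normalize(values, mean, var, eps=1e-5):
    """Normalize activations with the given statistics (no affine parameters)."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]


if __name__ == "__main__":
    # Hypothetical statistics stored during training.
    train_mean, train_var = 0.0, 1.0
    # Hypothetical activations on corrupted test inputs, shifted away from zero.
    test_activations = [4.0, 5.0, 6.0, 7.0]

    # Standard evaluation: reuse the training statistics, so the shift remains.
    print(normalize(test_activations, train_mean, train_var))

    # Test-time adaptation: re-estimate the statistics from the unlabeled batch.
    test_mean, test_var = batch_stats(test_activations)
    print(normalize(test_activations, test_mean, test_var))
```

No labels are used at any point, which is what makes the adaptation unsupervised.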

Strengths: The paper applies the proposed simple technique to dozens of models pre-trained on ImageNet and shows that it can greatly reduce the mean corruption error when evaluated on ImageNet-C. They also perform experiments with state-of-the-art methods on the same evaluation dataset, showing similar improvements. For all these experiments, they provide sufficient details to reproduce the results. Note that many training details are not mentioned in the paper, but that is fine, since they downloaded the models directly from torchvision. However, I would suggest adding the commit number, or at least the date on which the models were retrieved, in case these models are updated in the future. The authors also made the effort to evaluate this approach on other robustness benchmarks such as ImageNet-A, ImageNet-V2 and ObjectNet, although the results are not as positive on these (see comments later). The paper includes different ablation studies (in addition to the different benchmark datasets), such as pre-training on large datasets, and using alternatives to BatchNorm that try to alleviate the same problem, such as GroupNorm and Fixup Initialization. At least on ImageNet-C, their approach offers better results than using GroupNorm/FixUp, but when using large pre-training datasets, the improvements are not that large.

Weaknesses: The main weakness is that the work seems to highlight (and partially solve) a specific problem of the ImageNet-C benchmark, rather than showing that it is a good approach for training robust neural network-based image classifiers in general. This is shown by the fact that the improvements on ImageNet-C are not observed on other benchmarks such as ImageNet-V2 and the more recent ObjectNet benchmark. Given that, as the authors highlight, the distortions in the ImageNet-C benchmark are built ad-hoc, and (perhaps) are not representative of the typical distortions found in more realistic scenarios, one wonders how applicable the proposed approach is outside the ImageNet-C benchmark. In addition, the idea of adapting batch normalization statistics is not new in the area. The authors cite four works that use the same or similar ideas for other goals (task adaptation, multi-task, ...), see citations [4-7].

Correctness: The empirical methodology is correct, but as mentioned above, the results on benchmarks other than ImageNet-C suggest that the proposed approach only works in this particular (and unrealistic) scenario. In the comparison against GroupNorm and FixUp, the authors only performed experiments on ImageNet-C, while ignoring the other datasets. For instance, on ObjectNet, a R50 using GroupNorm instead of BatchNorm, trained on ILSVRC2012, performs much better than the best results reported in this paper: ~58% top-1 error (this paper) vs ~48% top-1 error (https://arxiv.org/pdf/1912.11370.pdf).

Clarity: The paper is generally well written and structured. However, the presentation of some figures could be improved. For instance, figure 4 shows two baseline lines for ObjectNet, but the differences between the two are not stated anywhere. Likewise, the ImageNet-V2 plot shows lines for "Matched", "Threshold" and "Top", but there's no reference to these terms in the text. Also, figure 5 is too small and can barely be read without the magnifier tool in the PDF reader application (let alone when the paper is printed). It is unclear why some numbers are missing in Table 1. In particular, there are several results missing for Assemble Net. Finally, when referring to the sections in the appendix, it would be preferable to cite the particular subsection, since some sections contain many of those. For instance, the proof (sketch) of Proposition 1 is in Appendix D.8, but it is only referenced as Appendix 8, which contains 11 subsections across 10 pages.

Relation to Prior Work: In the related work section, the paper includes references to several relevant works in robustness and unsupervised domain adaptation, but in some cases it is not explicitly stated how they differ from this work. For instance, line 252 states: "The idea of adapting activation statistics was originally developed in [44]", but how does it differ from the approach presented here? The authors also cite alternatives such as GroupNorm and FixUp initialization, although they do so in the experiments rather than in the related work section. However, as explained above, these alternatives are not fairly compared, since only ImageNet-C is used in this comparison, and good results using these techniques on the other benchmarks are not discussed.

Reproducibility: Yes