NeurIPS 2020

Curriculum By Smoothing

Review 1

Summary and Contributions: Update: The paper has several positives, as noted in all the reviews, but one of my initial concerns was the lack of strong comparisons. The paper mentions several works that could potentially be compared against, and the initial reviews also suggested additional methods for comparison. I find the author response on this front partially convincing. The proposed approach can be seen as orthogonal to some of these methods, and the comparison shown in the author response is worth including in the main paper. This is a step in the right direction that the final version of the paper, if accepted, should build upon. A discussion with respect to [R4] would be another useful addition to the paper. In summary, I find the paper interesting and am leaning towards an accept. This paper presents a way to mitigate the impact that noisy features can have when training a CNN, especially during the initial stages of training. This is achieved by proposing a curriculum-based approach that smooths the features with a Gaussian kernel. The kernel moderates the amount of high-frequency information that is propagated during the initial stages of training a CNN. This idea takes inspiration from recent work [21], which proposes a curriculum for learning GANs. The approach presented in the paper is evaluated on several tasks, including transfer learning, generative models, image classification, and feature learning.

Strengths: * The paper presents an effective approach clearly. The ideas are simple, which can potentially make them popular. * The approach is evaluated empirically on several tasks, showing that it is applicable in varied scenarios. * The details in the paper and the source code provided with the submission are sufficient for reproducibility.

Weaknesses: While the paper presents an effective idea, there are a few questions that need to be addressed. * The choice of using a single kernel size for all the layers of the network (which is changed over the training iterations) seems a bit arbitrary. Would a layer-dependent kernel size not be more appropriate? This choice could also be influenced by the architecture used. * The paper mentions several works that have attempted to improve the issues in learning a CNN, but none of them is compared against in the empirical evaluation. This is the biggest issue with this paper, as it falls into the category of papers that require a very strong empirical evaluation. Such a comparison is essential for at least one of the tasks considered in the paper.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: There are a few minor typos in the paper.

Review 2

Summary and Contributions: Convolutional neural networks have met with tremendous success in many imaging and signal-processing tasks. The paper proposes a new training setup in which only low-frequency information is propagated through the network in the earlier stages of optimization. This is achieved by smoothing the CNN features in each layer with a Gaussian kernel. As training proceeds, the width of the kernel is decreased so that higher-frequency information is allowed to pass through the network. The increased stability of the learning process is particularly useful in the context of GANs.
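The mechanism this review describes can be sketched in a few lines: build a normalized Gaussian kernel, shrink its width over training, and convolve each layer's feature map with it. Below is a minimal, stdlib-only sketch assuming a linear sigma decay; the schedule, kernel size, and function names are illustrative, not the paper's exact choices:

```python
import math

def gaussian_kernel(sigma, size=3):
    """Normalized 2D Gaussian kernel of shape (size, size)."""
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def anneal_sigma(step, total_steps, sigma0=1.0):
    """Linearly decay sigma toward ~0 so less smoothing is applied late in training."""
    return max(sigma0 * (1.0 - step / total_steps), 1e-3)

def smooth(feature_map, kernel):
    """'Same'-padded 2D convolution of a single feature map with the kernel."""
    h, w = len(feature_map), len(feature_map[0])
    size = len(kernel)
    c = size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(size):
                for dj in range(size):
                    ii, jj = i + di - c, j + dj - c
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * feature_map[ii][jj]
            out[i][j] = acc
    return out
```

In the training setup the review summarizes, such a smoothing would be applied to every layer's output, with sigma re-annealed each epoch; as sigma shrinks, the kernel approaches an identity and the smoothing vanishes.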

Strengths: The idea seems elegant; it is difficult for me to evaluate the relevance of the empirical validation. No theoretical result is given.

Weaknesses: The authors claim that no trainable parameter is added, which is true; however, the initial choice of sigma and its decrease rate are not theoretically studied, and such a study would make the contribution less heuristic. The impact of curriculum by smoothing is compared to somewhat "vanilla" training (SGD with the parameters of the original paper). However, there exist many techniques beyond SGD to improve training (e.g., as mentioned by the authors, batch normalization), against which the authors do not benchmark.

Correctness: The empirical methodology is limited.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Typos: "backpropogation" (multiple times), "a Gaussian Kernel", "layers perforn", "Adaptpation". (These typos could easily be avoided by running a spellchecker in your text editor.)

Review 3

Summary and Contributions: The paper presents a training approach for CNNs based on a progressive smoothing of the network by Gaussian kernels. The idea is clearly presented, and numerous experiments clearly illustrate the approach.

Strengths: The proposed approach is simple, is clearly presented, and can easily be added to any architecture. No hyperparameter tuning is added, which is a clear benefit of the approach. The experiments are well motivated and illustrate the approach clearly. The idea is well motivated and quite interesting. Numerous papers have considered some kind of smoothing for deep neural architectures; this paper uses a smoothing that is adapted to the nature of the data (convolutions for images or sounds). The code is well written and seems easy to use.

Weaknesses: Future directions are possible, e.g., analyzing other smoothings and comparing the results to understand the choice of the Gaussian kernel. However, the present paper clearly achieves what it set out to do.

Correctness: The experiments present clear comparisons with error bars.

Clarity: The paper is well motivated and well-written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: What is the behavior of the optimization process? Does the smoothing accelerate the optimization, or hurt it? Minor comments: Rather than insisting on the elegance of the approach (which is normally left to the reader's judgment), the authors might emphasize the fact that their method does not add any additional parameters to tune, which is a very important advantage. l. 248, 307: "regiment" -> "regime"?

Review 4

Summary and Contributions: The authors present an approach for training convolutional neural networks using Gaussian kernels, which are gradually modified from a high-level of blur to a low-level of blur, simulating a kind of curriculum learning. The authors present experiments on multiple tasks, showing improvements over the standard convolutional neural networks.

Strengths: + The authors present superior results compared to the standard training approach on various tasks, ranging from detection and segmentation to zero-shot domain adaptation and image generation.

Weaknesses: - The authors compared their method to the baseline approach only. However, there are plenty of curriculum learning methods that could have been used as relevant state-of-the-art competing methods to compare with, e.g. [R1, R2, R3, R4]. Comparison with such competing methods is mandatory, in my opinion. - In Eq. (2.1), I believe that the non-linearity is typically applied before the pooling operation. - In terms of novelty, the idea of adding some Gaussian kernels to the network is quite straightforward and simple. Even so, it is not clear why it works so well. The provided motivation is not enough. I would have liked to see some visualizations of low-level, mid-level, and high-level filters and how these evolve during training, in order to figure out what is happening. All the experiments are performed on images, so I would consider this a vision paper. A vision paper without figures is not a properly written vision paper. - Does the approach apply to data other than images? Until proven otherwise, it should be clearly stated in the title that the approach applies to images only, e.g. "Curriculum by Smoothing for Images". - Are the improvements statistically significant? A statistical test should be performed to test the null hypothesis. Missing references: [R1] Saxena, S., Tuzel, O. and DeCoste, D., 2019. Data parameters: A new family of parameters for learning a differentiable curriculum. In Advances in Neural Information Processing Systems (pp. 11093-11103). [R2] Soviany, P., Ardei, C., Ionescu, R.T. and Leordeanu, M., 2020. Image difficulty curriculum for generative adversarial networks (CuGAN). In The IEEE Winter Conference on Applications of Computer Vision (pp. 3463-3472). [R3] Penha, G. and Hauff, C., 2020, April. Curriculum Learning Strategies for IR. In European Conference on Information Retrieval (pp. 699-713). [R4] Karras, T., Aila, T., Laine, S. and Lehtinen, J., 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Correctness: Seems correct, but a comparison with competing methods is missing.

Clarity: There are some typos and English mistakes to be corrected: - "Gaussian kernels are a deterministic functions" => "Gaussian kernels are deterministic functions"; - ". n the sample pseudo-code" => ". In the sample pseudo-code"; - "[31, 30, 45, 8]" => "[8, 30, 31, 45]" (references should be provided in order).

Relation to Prior Work: There are some missing references, e.g. [R1, R2, R3, R4].

Reproducibility: Yes

Additional Feedback: The main drawback is that the authors have only compared with the standard CNN training. Hence, it is unclear how the method compares to other curriculum learning methods. Another important issue is the lack of visualizations: it is not clear how and why the method works. I have the following observations regarding the authors' response: 1. Regarding the requirement to compare with competing methods, the authors mentioned that related works based on curriculum are orthogonal. However, there are works that are not orthogonal. For example, [R4] is a method that applies curriculum using a very similar idea. In [R4], the authors progressively increase the size of the input, starting with low-resolution images and increasing their size until they are able to generate realistic high-resolution images. I believe that the idea of smoothing the kernels is very similar to [R4], i.e. applying smoothed kernels on large images is equivalent to applying sharp (non-smoothed) kernels on smaller images. Since there are non-orthogonal approaches, e.g. [R4], I believe that a comparison with other curriculum learning methods is still mandatory. 2. The authors did not address my comment regarding the visualization of kernels during training. I believe it is important to see how the kernels from some layer converge with and without smoothing. It could explain why and when the proposed idea is useful. This could at least be included in the supplementary material. 3. It is not a problem that the authors present results on a single data modality: images. However, the contribution should be stated accordingly. For example, I am not sure that smoothing kernels applied to text data would have the same effect.