NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2850
Title: Incremental Few-Shot Learning with Attention Attractor Networks

Reviewer 1

The paper is nicely written, motivates the incremental few-shot learning problem, and makes a couple of interesting architectural/algorithmic contributions. The results are interesting/convincing, and the ablation study helps to better understand the properties of the proposed method. I believe the paper would be of interest to the community, but more clarifications are necessary.

=== Comments and questions:

- Inconsistencies in softmaxes. From 3.1, it looks like the softmaxes used for training the fast and slow weights have different normalization constants (see line 135 and line 139), and hence the logits for b-classes might have entirely different scales than the logits for a-classes. I suspect that the reason the proposed attractor-based regularization is needed in the first place is to compensate for this scaling issue. Alternatively, you could use the same softmax from line 139 when computing the loss in eq. (1) to avoid this discrepancy (since the base classes are available and W_a is pretrained and fixed, this should be possible). Why not do that? I would like to see a comparison with the vanilla architecture (i.e., no attractor-based regularization) that uses a consistent softmax normalization.
- Computational considerations. RBP requires the inverse of the Jacobian, which scales cubically. What is the computational overhead of this method? How would it scale with an increased number of classes? Analysis and discussion of this are necessary.
- Results. The authors mention that tiered-ImageNet is a harder task (which intuitively makes sense), but somehow results on that dataset are better than on mini-ImageNet. How would you explain that?
- Will the code for reproducing the results be released?

=== Minor:

- Line 124: I believe \theta_E denotes the parameters of the feature extractor, but it has not been properly defined.
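The softmax-normalization point above can be illustrated numerically. This is a minimal sketch (not the paper's actual model; all logits here are randomly generated stand-ins): separately normalized softmaxes over the a- and b-class blocks produce two independent distributions, whereas a single joint softmax over the union puts both blocks on one comparable scale.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits for 5 base (a) classes and 5 novel (b) classes.
rng = np.random.default_rng(0)
logits_a = rng.normal(size=5)
logits_b = rng.normal(size=5)

# Separate softmaxes: each block is normalized independently, so the
# resulting probabilities are not comparable across the two blocks
# (the concatenation sums to 2, not 1).
p_separate = np.concatenate([softmax(logits_a), softmax(logits_b)])

# Joint softmax over the union a+b: one normalization constant, so the
# b-class logits must compete with the a-class logits on the same scale.
p_joint = softmax(np.concatenate([logits_a, logits_b]))
```

Under the separate scheme, rescaling only `logits_b` leaves `p_separate`'s a-block untouched, which is exactly the scale mismatch the reviewer suspects the attractor regularizer compensates for; under the joint scheme it would shift probability mass between the blocks.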

Reviewer 2

Overall a very nice and interesting read. In terms of originality, I believe the proposed method to be sufficiently novel, and at no point did I feel this was merely an incremental improvement. In terms of significance, it should be said that I feel this idea is fairly specific to the incremental classification setting and would not be general enough to be directly applicable in another domain (e.g., RL). However, I still believe this work should be accepted and would expect recognition within the domain. With regards to the clarity of the submission, I believe Sections 3.1 and 3.2 could be improved. Detailed comments below:

Introduction:
- L36: "We optimize a regularizer that reduces catastrophic forgetting" - Perhaps it would be a good idea to delineate this from the many other works on regularization-based methods for reducing catastrophic forgetting where the regularizer isn't learnt? Examples are [1] or [2].
- L37: "can be thought of as a memory of the base classes, adapted to the new classes" - This is very unclear and not particularly helpful in the introduction. I could only make sense of this sentence after reading Section 3.

Figure 1: Very helpful, thank you for providing this.

Section 2: Nice overview of related work. Perhaps some more discussion of work tackling the catastrophic forgetting problem would be useful here.

Section 3:
- L112: "there are K' novel classes disjoint from the base classes" - When we use the same dataset D_a also used during pretraining, this seems only possible when data augmentation is introduced (as the authors explain in Section 4 (L239)). It would be good to already mention this here (possibly with a footnote). Also, I understand data augmentation as training on the union of the original data and a distorted version thereof. In order to ensure that the K' classes are indeed disjoint, are the authors ensuring that during episodic sampling from D_a there is always some form of distortion applied? If, for instance, we sample random rotations from {90, 180, 270, 360}, one could run the risk of training on identical data already used during pre-training.
- L120: "from from" -> "from"
- L126: Where do the parameters \theta_E fit in Figure 1? They appear as arguments to R_(W_b, \theta_E) but never appear in the definition of equation (2). This is not made clear until line 172, which was rather confusing. Also, the subscript E seems like a strange letter to choose.
- L139: W_a are called "slow weights" (also in L154), whereas they have previously been referred to as "Base class Weights" (Figure 1). I found myself repeatedly looking at Figure 1 to keep track of what was happening in the model. Using consistent notation would have made this a lot easier.

Section 3.3: Very clear.

Section 4:
- L245: Missing whitespace after the full stop.
- L248: "RBP" -> "Recurrent back-propagation", as readers might be unfamiliar with the abbreviation and might want to only briefly skim the experimental section.

Section 4.3: 2) I don't think I understand what \delta_a and \delta_b are supposed to be.

Section 4.6: Nice to see the ablation study I was hoping for when reading Section 3. Interesting also to see that in some cases a simple LR model for W_b works better or just as well. Also great to see the comparison between RBP and T-BPTT. Figure 3 is really nice.

[1] Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences 114.13 (2017): 3521-3526.
[2] Nguyen, Cuong V., et al. "Variational continual learning." arXiv preprint arXiv:1710.10628 (2017).
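Reviewer 2's disjointness concern about rotation augmentation can be made concrete with a small sketch (hypothetical helper, not from the paper's code): a rotation of 360 degrees is the identity transform, so sampling it would yield an "augmented" class whose images are pixel-identical to pre-training data. Restricting the sample set, or asserting that the drawn rotation is never a multiple of 360, guarantees a genuine distortion.

```python
import random

# 360 degrees is a no-op: the "rotated" image equals the original, so the
# resulting class would not be disjoint from the base classes.
UNSAFE_ROTATIONS = [90, 180, 270, 360]
SAFE_ROTATIONS = [90, 180, 270]  # every choice is a genuine distortion

def sample_rotation(rotations, rng=random):
    """Sample a rotation angle, rejecting identity rotations."""
    deg = rng.choice(rotations)
    assert deg % 360 != 0, "identity rotation would duplicate base-class data"
    return deg
```

With `SAFE_ROTATIONS` the assertion can never fire; with `UNSAFE_ROTATIONS` it fires exactly in the degenerate case the reviewer describes.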

Reviewer 3

Originality: The proposed Attention Attractor Network is novel.

Quality: The intuition for using the attention attractor network is not that sound. The motivation of the model is to prevent forgetting the base classes, but the method (lines 160-166) takes the cosine similarity between the average representation of the novel classes and the base-class weights. This is purely intuitive.

Clarity: There are some places that are not very clear.

Significance: The empirical results seem significant.
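The cosine-similarity step Reviewer 3 questions can be sketched as follows. This is a simplified, hedged reconstruction from the review's description alone (the paper's actual attractor uses additional learned components not shown here); `h_novel` and `W_a` are illustrative names.

```python
import numpy as np

def cosine_attention(h_novel, W_a):
    """Attend over base-class weights via cosine similarity.

    h_novel: (d,)   average representation of a novel class's support set
    W_a:     (d, K) columns are the fixed base-class weight vectors

    Returns an attention-weighted combination of the base-class weights,
    i.e. a "memory of the base classes" adapted to the novel class.
    """
    h = h_novel / np.linalg.norm(h_novel)
    W = W_a / np.linalg.norm(W_a, axis=0, keepdims=True)
    sims = W.T @ h                              # cosine similarities, (K,)
    e = np.exp(sims - sims.max())
    attn = e / e.sum()                          # softmax attention over base classes
    return W_a @ attn                           # attended base-class "memory", (d,)
```

The reviewer's objection is precisely that this choice of similarity is justified only by intuition: nothing in the forgetting-prevention motivation singles out cosine similarity over, say, a learned metric.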