NeurIPS 2020

Mitigating Forgetting in Online Continual Learning via Instance-Aware Parameterization

Review 1

Summary and Contributions: The authors propose a method for overcoming forgetting, even in online settings, using recent ideas from architecture search. They use a controller module, trained in a pretraining stage, that then selects paths through the network.

Strengths:
- Interesting combination of ideas.
- Experimental results on challenging datasets are promising.

Weaknesses: A few things that might be weaknesses or need to be clarified:
- Computational complexity of the updates: in the online setting this is critical, so it should be discussed and compared to the other methods.
- The pretraining phase could be described more clearly. Currently it is only mentioned in L279-285. It is my understanding that offline pretraining of the controller is done using only the first task; does the controller then adapt online for subsequent tasks?
- Experiments: Why do the authors not compare to ER [1]? Why are the results shown, for example for A-GEM on CIFAR100, weaker than those reported in [1]? It would also be good to report performance after each task as well as a forgetting metric.
- (minor) The authors note that the memory needed to store the controller can be large, although the reviewer is not too concerned about this. Do the authors have thoughts about how to adapt this to the shared-head / class-incremental settings?
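To make the request for per-task performance and a forgetting metric concrete, here is a minimal sketch. It assumes an accuracy matrix `acc` where `acc[i][j]` is the test accuracy on task `j` after training on task `i`, and it measures forgetting for a task as the drop from its best accuracy (recorded at or after the step where the task was learned, and before the final step) to its final accuracy. The function names and the toy numbers are illustrative and are not taken from the paper.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy.

    Only tasks seen before the last one can be forgotten, so the last
    task is excluded from the average.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j between learning it and the final step.
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops)


# Toy accuracy matrix for three tasks (made-up numbers for illustration).
toy_acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]
print(round(average_accuracy(toy_acc), 3))
print(round(average_forgetting(toy_acc), 3))
```

Reporting each row of such a matrix would also give the performance after each task.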

Correctness: Yes

Clarity: The paper is overall well written. (minor) I suggest the authors make it clearer in Fig 1 that those are convolutional (or, more generally, multi-parameter) blocks. The figure and the position of the FC layer text make it seem initially like those are MLP layers.

Relation to Prior Work: I found the literature review thorough.

Reproducibility: Yes

Additional Feedback: Overall I found the paper and direction promising, with an interesting insight about the connection between recent architecture search methods, continual learning, and conditional computation. I have a few potential concerns, mentioned above, but will consider further increasing my score.

Review 2

Summary and Contributions: This paper studies online continual learning in a multi-headed setting, meaning that the agent knows the task it is supposed to predict at test time. An input is fed into a controller that determines what parts of the network should be used for training.

Strengths: While sparse updates and modules have been discussed in the community for mitigating catastrophic forgetting, their approach is interesting and novel.

Weaknesses:
- The justification for not studying the "single head setting" is weak. The multi-headed setting, where task labels are used during testing for inference, makes the task much easier: (see Table 4) (see Table 1)
- They ignore other recent work on continual online learning from 2019:
- A comparison of the number of additional parameters/storage needed for each method is needed. The authors argue against replay methods, but the replay buffer size needed for those methods is likely much smaller in terms of memory than the 43M+ additional parameters they add.
- I'm not sure the multi-headed paradigm is of much value to study today. A deployed agent would rarely have access to the information necessary to select the output head, and it isn't needed by many recent methods:
-- iCaRL (CVPR 2017)
-- EEIL (ECCV 2018)
-- Unified Classifier (CVPR 2019)
-- BiC (CVPR 2019)
-- IL2M (ICCV 2019)
-- ScaIL (WACV 2020)
-- Deep SLDA (arXiv 2020)
-- REMIND (arXiv 2020)
- The experimental setup is confusing and I am not exactly sure how they are training the controller. They are storing data, so I don't know if that makes it any better than a replay buffer.
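As a rough back-of-the-envelope check of the storage comparison raised above, the sketch below converts a parameter count into bytes and asks how many raw images would fit in the same budget. It assumes 32-bit floats for parameters and uncompressed 8-bit RGB images (32x32 for CIFAR, 64x64 for Tiny-ImageNet); these storage formats and the helper names are assumptions for illustration, not details confirmed by the paper.

```python
def parameter_bytes(num_params, bytes_per_param=4):
    """Memory needed to store the parameters (float32 assumed by default)."""
    return num_params * bytes_per_param


def images_in_budget(budget_bytes, height, width, channels=3, bytes_per_value=1):
    """How many raw images (uint8 assumed by default) fit in the budget."""
    image_bytes = height * width * channels * bytes_per_value
    return budget_bytes // image_bytes


# 43M is the lower bound quoted in the review ("43M+").
extra_param_bytes = parameter_bytes(43_000_000)

cifar_equivalent = images_in_budget(extra_param_bytes, 32, 32)
tiny_imagenet_equivalent = images_in_budget(extra_param_bytes, 64, 64)

print(extra_param_bytes, cifar_equivalent, tiny_imagenet_equivalent)
```

Under these assumptions the extra parameters occupy about 172 MB, roughly the footprint of 56k raw CIFAR images or 14k raw Tiny-ImageNet images, which is the kind of figure a per-method storage table would let readers compare directly.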

Correctness: I think so, but see weaknesses.

Clarity: The paper has many grammar problems. I am confused about the experimental setup.

Relation to Prior Work: No -- the discussion of the prior work on continual learning methods needs work.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes to use dynamic architectures (InstaNAS [7] specifically) along with tricks from other related work for task-incremental online learning (no revisiting of samples, task identity known for all samples). Similar-looking samples are (theoretically) routed similarly, while dissimilar samples use different parts of the network to help mitigate catastrophic forgetting. The paper shows 4-5% performance boosts for CIFAR10, CIFAR100, and TinyImageNet.

Strengths:
- Motivation: Despite similar approaches in the literature, using one-shot NAS methods (which are relatively new) for incremental learning is novel and a good idea. Routing based on instance similarity is neat (although also present in related work).
- Claims: The experiments are relatively well designed and take fairness into account most of the time. The method design is reasonable, without too many ad-hoc components that are under-explored.
- Relevance: The problem, task-incremental online learning, is quite relevant to NeurIPS despite the relatively narrow scope (the intersection of task-IL and online learning).

Weaknesses:
- Significance and novelty: The method is quite similar to closely related work [13,14] (and maybe PathNet), but the paper does not compare to them.
-- The discussion says "the major difference between [13,14] and ours is that they treat their network depends on the task, which might not be suitable for online continual learning setting." So the only difference is that they may underperform. This paper either has to compare to [13,14] (especially [13], which is very similar and open-source), or report efficiency numbers (FLOPs, test time, etc.) to show that they cannot be used efficiently.
- Claims and evaluation:
-- Fairness of number of parameters: As far as I can tell, HAT has 7.1M params, and A-GEM uses ResNet18, which has 11M params. This paper uses 43M params, which is not fair. It is also not fair to say all methods have similar efficiency at test time, since multiple blocks in the same layer can be on.
-- The exploration trick improved performance quite a lot. It is great that this paper did an ablation study, but since the trick has a lot of hyperparameters and design choices (why H^1/2 and not H, why sigmoid, how are gamma/kappa/epsilon chosen), it is unclear how the design choices were made and how the hyperparameters were tuned, as this would shift the balance between learning and remembering. An analysis of sensitivity to hyperparameters would be preferred.
-- The paper is not *entirely* an online method, since the model is pre-trained on the first task (with multiple epochs) to make NAS training work.
--- Was "epochs" on line 253 referring to this? Or did this paper train anything else with more than 1 epoch?
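One simple form the requested sensitivity analysis could take is a one-at-a-time sweep: vary a single hyperparameter over a grid while holding the others at their defaults, and report the spread of the resulting scores. The sketch below is generic. The names `gamma`, `kappa`, and `epsilon` come from the review, but the default values, the grids, and `toy_evaluate` are placeholders; in practice `evaluate` would train the model and return its final average accuracy.

```python
def one_at_a_time_sweep(evaluate, defaults, grid):
    """Vary each hyperparameter alone and record the score spread."""
    results = {}
    for name, values in grid.items():
        scores = []
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            scores.append((value, evaluate(**config)))
        only_scores = [score for _, score in scores]
        results[name] = {
            "scores": scores,
            "spread": max(only_scores) - min(only_scores),
        }
    return results


def toy_evaluate(gamma, kappa, epsilon):
    """Placeholder score function; NOT the paper's model or results."""
    return 1.0 - (gamma - 0.5) ** 2 - 0.1 * abs(kappa - 1.0) - 0.0 * epsilon


# Hypothetical defaults and grids, chosen only to exercise the sweep.
defaults = {"gamma": 0.5, "kappa": 1.0, "epsilon": 0.1}
grid = {
    "gamma": [0.0, 0.5, 1.0],
    "kappa": [0.5, 1.0, 2.0],
    "epsilon": [0.01, 0.1, 1.0],
}

report = one_at_a_time_sweep(toy_evaluate, defaults, grid)
for name, entry in report.items():
    print(name, round(entry["spread"], 3))
```

A large spread for a hyperparameter would indicate that the reported gains depend on careful tuning of that value.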

Correctness: The claims, method, and empirical analysis are mostly correct (see weaknesses).

Clarity: There are several spots of vagueness:
- How are the blocks in the meta-graph connected? Do they use different connection layers between blocks in neighboring layers? How is ResNet18 split into blocks?
- Did this paper end up using the L2 or the EWC loss for the weights?
- Line 218 contradicts Eq. 8: was A used or was C used?

Relation to Prior Work: Yes, but the reason for not comparing is lacking.

Reproducibility: No

Additional Feedback: The idea is fine and the results are nice, but the experiments are not convincing enough for me to say that the proposed method outperforms prior work or closely related work on fair ground. Please address the weaknesses section (everything except the "not entirely online" comment) with arguments or additional experiments.
Update: The rebuttal adequately addresses most of my concerns except for hyperparameter sensitivity. However, please double-check that your score really increases from 54.5 to 55.5 when the number of parameters is decreased. Also please include the details (e.g. about how to do single-Path NAS) and address the requested clarifications. I'm increasing the score.

Review 4

Summary and Contributions: The authors adopt the concept of "instance awareness" in the neural network to alleviate the catastrophic forgetting problem. They propose a method that protects paths by restricting the gradient updates of one instance from overriding past updates calculated from previous instances if these instances are not similar. Finally, they achieve the best results on the CIFAR-10, CIFAR-100, and Tiny-ImageNet tests.

Strengths: The algorithm is novel. Also, the method does not use replay buffers, yet it shows competitive results.

Weaknesses: The proposed method is good, but the experimental results are not convincing because the authors compare the proposed algorithm against only four methods: EWC, HAT, A-GEM, and GSS-Greedy. They need to compare it to other state-of-the-art algorithms.

Correctness: Yes.

Clarity: It is easy to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I will keep my score after reading the rebuttal.