NeurIPS 2020

Meta-Consolidation for Continual Learning


Review 1

Summary and Contributions: The paper proposes an online continual learning method, MERLIN, that learns a distribution over task-specific model parameters given a context (task identifiers, etc.). A VAE is used to model this distribution over model parameters. More specifically, given the dataset of a task t, the idea is to train ‘B’ separate models. A VAE is then trained using these ‘B’ sets of model parameters as training points, learning an encoder (mapping the parameters to the latent space) and a decoder (mapping the latent back to model parameters). The standard VAE ELBO is maximized during training. One notable change is that the (parametric) prior over the latent distribution is task-specific and is learned along with the VAE parameters (a sketch of my reading of this training step follows at the end of this summary). After each task, the updated VAE is consolidated for previous tasks by sampling from the task-specific learned priors, generating parameters from those samples, and updating all the VAE parameters using those generated samples as supervisory signals. At inference time, the latent is sampled from the task-specific prior or from all the priors (depending on whether the task information is available), and a set of ‘E’ models is sampled from the decoder. This set is then fine-tuned on the replay buffer and the predictions are ensembled over the set. Experiments are reported on the standard continual learning benchmarks for image classification.

-------------------------
Post-rebuttal: The authors adequately addressed some of my concerns. While I still believe that Bayesian continual learning type baselines would be better suited to this work, and I encourage the authors to add those in their final draft, the comparison with CN-DPM, if done correctly, suggests that MERLIN can outperform other Bayesian baselines (although I am not sure whether the authors used CN-DPM correctly or in the right setting). I do not agree with the authors' assertion that VCL does not learn a distribution over model parameters. Anyhow, the rebuttal is strong and addressed most of my concerns. Therefore, I am increasing my score to marginally above the acceptance threshold. I am not giving a clear accept because I still believe that the method is unnecessarily cumbersome and some of its components could be simplified.
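As referenced above, here is a minimal sketch of my reading of the per-task training step: a VAE over flattened parameter vectors with a learnable Gaussian prior per task. All names here (ParamVAE, param_dim, etc.) are my own illustrative choices, not the authors' code.

import torch
import torch.nn as nn

class ParamVAE(nn.Module):
    def __init__(self, param_dim, latent_dim, num_tasks):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(param_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, param_dim))
        # task-specific learnable priors p_t(z) = N(mu_t, diag(exp(logvar_t)))
        self.prior_mu = nn.Parameter(torch.zeros(num_tasks, latent_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(num_tasks, latent_dim))

    def loss(self, theta, task_id):
        # theta: (batch, param_dim) flattened model-parameter vectors of task `task_id`
        h = self.encoder(theta)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization
        recon = self.decoder(z)
        rec_loss = ((recon - theta) ** 2).sum(dim=1)            # Gaussian reconstruction term
        p_mu, p_logvar = self.prior_mu[task_id], self.prior_logvar[task_id]
        # KL( N(mu, sigma^2) || N(p_mu, p_sigma^2) ) for diagonal Gaussians
        kl = 0.5 * (p_logvar - logvar
                    + (logvar.exp() + (mu - p_mu) ** 2) / p_logvar.exp()
                    - 1).sum(dim=1)
        return (rec_loss + kl).mean()                            # negative ELBO, to be minimized

Training points for task t would be the ‘B’ flattened model-parameter vectors; e.g. loss = vae.loss(theta_batch, task_id=t), followed by a standard optimizer step.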

Strengths: Positives: - By and large, the paper is well-written. Although the method seems overly complicated (more on this in the negatives section), the overall writing of the paper is very good. - Barring Bayesian continual learning, the paper is well-grounded in the recent literature.

Weaknesses: Negatives:

1) Why not a posterior over model parameters: It is not clear to me what the advantage of this framework is over standard variational continual learning type approaches (https://arxiv.org/abs/1710.10628, https://openreview.net/pdf?id=SJxSOJStPr, etc.). Both this work and the VCL-type approaches aim to model a distribution over network parameters. Could the authors point out why using model parameters as training data for a VAE (as they do) is better than standard VAE training in a continual setting? It seems like a lot of machinery has been used in this work without properly grounding the study in the literature. The best baselines against which to study this work would have been VCL and the like. There are no comparisons with them in the experiments section.

2) Task-specific learned priors: Why do the authors choose these? What would happen if you took a standard normal prior, or any other sensible prior, and fine-tuned the decoder on the exemplars (see the sketch after this list)? I cannot find an experiment where this ablation has been done, which, frankly, makes the choice rather ad hoc and unnecessarily cumbersome. You would want to experimentally show that this choice makes sense.

3) Experiments: There are a few issues with the experiments.
3.1) For Split CIFAR-100 and miniImageNet, 10 classes per task would correspond to 5000 samples per task, not 2500. The authors of the cited works used 5 classes per task, making 2500 samples per task.
3.2) Network architectures not being the same: It is standard practice in continual learning to make the network architectures the same size. Overly parameterized architectures are more prone to forgetting, so the use of bigger architectures in the baselines may already hurt their performance. Please make a fair comparison using an equal number of parameters.
3.3) Baselines in the task-aware setting: Some of the baselines (GEM, etc.) are known to work well in the task-aware setting. The authors did not report the numbers of these baselines in that setting in the main paper; please report them. Also, there is no information on the amount of episodic memory used for these baselines; please provide it.
3.4) As pointed out earlier, the best baselines to compare this work against are the ones based on Bayesian continual learning (VCL, etc.).

4) Efficiency: The authors claim that at inference time their method works in real time, which seems strange given the number of steps performed at inference (Alg. 3). Could you please provide a comparison of train/test times with the baselines?
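For concreteness, the ablation I have in mind for point 2 simply replaces the learned per-task Gaussian prior with a fixed standard normal in the KL term of the ELBO; a minimal sketch (the function name and shapes are my own, not the authors' code):

import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) per sample;
    # a drop-in replacement for the KL against the task-specific learned prior.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)

If this fixed prior, combined with fine-tuning the decoder on the exemplars, performs comparably, the task-specific priors add complexity without a clear benefit.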

Correctness: See the weakness section.

Clarity: Yes.

Relation to Prior Work: See the weakness section.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper introduces a novel method for continual learning that can be applied in both task-free and task-aware scenarios. It trains a network for prediction and a variational auto-encoder that learns a latent space over the encountered tasks, which enables generating the appropriate model parameters for each task (meta-consolidation). At inference, the learned decoder generates the model parameters, and the network is refined with coreset samples randomly drawn from previous tasks. By aggregating the priors of the encountered tasks, the method generates the model parameters well.
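To summarize my understanding of the inference procedure, a rough sketch is given below. Here `vae` is assumed to expose learned per-task prior parameters (`prior_mu`, `prior_logvar`) and a `decoder`, and `make_classifier` is assumed to load a flat parameter vector into a classifier network; all names are my own, not from the paper's code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_parameters(vae, task_id=None):
    # Draw one flattened parameter vector from the decoder, using the learned
    # prior of the given task if known, otherwise an aggregate over all task priors.
    if task_id is not None:
        mu, logvar = vae.prior_mu[task_id], vae.prior_logvar[task_id]
    else:
        mu, logvar = vae.prior_mu.mean(0), vae.prior_logvar.mean(0)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    return vae.decoder(z)

def predict(vae, make_classifier, coreset, x, task_id=None, num_models=5, steps=5):
    # Ensemble over `num_models` sampled classifiers, each briefly fine-tuned on
    # the coreset (an iterable of (inputs, labels) batches).
    outputs = []
    for _ in range(num_models):
        model = make_classifier(sample_parameters(vae, task_id))
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        for _ in range(steps):                      # brief fine-tuning on exemplars
            for xb, yb in coreset:
                opt.zero_grad()
                F.cross_entropy(model(xb), yb).backward()
                opt.step()
        with torch.no_grad():
            outputs.append(model(x))
    return torch.stack(outputs).mean(0)             # average the ensemble predictions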

Strengths: Because the method learns a latent space across arriving tasks through a variational auto-encoder, it successfully reconstructs task-specific knowledge at inference time with only a few coreset instances, without allocating task-specific parameters. Also, the method can be applied to the task-free continual learning scenario, which is a more realistic direction for the area.

Weaknesses: However, I am not sure how well the latent space describes the tasks when the tasks are complex or the number of tasks is large. While the method is validated across various datasets, such as MNIST variants and CIFAR variants, I strongly recommend experimenting with and analyzing the method on the larger or multiple heterogeneous datasets from HAT [1]. Also, the authors assign too small margins in the paper, which largely reduces readability. [1] Serra, Joan, et al. "Overcoming catastrophic forgetting with hard attention to the task." ICML 2018.

Correctness: The method looks correct.

Clarity: The paper is written well and properly describes their methods and experimental procedures.

Relation to Prior Work: The method is clearly differentiated from prior work.

Reproducibility: Yes

Additional Feedback: The paper presents an interesting idea and outstanding performance compared to strong baselines. However, the validation is performed on conventional but simple datasets, and the margins between text, pages, and tables are so small that they harm the quality of the paper.

======== Post-rebuttal: After reading the other reviews and the author response with additional experiments, I believe that the work deserves to be accepted. I raise my score to 7.


Review 3

Summary and Contributions: In this paper, the authors propose a novel approach that performs consolidation in a meta-space of model parameters. The proposed method can handle both class-incremental and domain-incremental settings, and can work with or without task information at test time.

----------------------- Final Decision: The authors have addressed most of my concerns, but I still think the voting strategy seems like a trick to improve performance. Thus, I decide to keep my score unchanged.

Strengths: 1) The writing of this paper is clear and easy to understand. 2) The performance of the proposed method is impressive. 3) The experiments are sufficient to prove the effectiveness of the proposed method.

Weaknesses: 1) The size of the subsets used for training the base models is not mentioned or discussed in the paper, which in my opinion is an important part of the method. 2) The voting over base models seems like a trick to improve performance: when the number of models is set to 1, the performance is not better than that of other methods.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper outlines a novel method for continual learning in deep neural networks. The authors model the meta-distribution of the model parameters and propose a variational auto-encoder (VAE) method to continually learn in this parameter space while performing model consolidation. Learning continually in the parameter space allows the approach to work in a task-aware setting when appropriate, and it operates in an online setting where only a single pass is made over the training data. The authors state that, to the best of their knowledge, this is the first method for incremental learning in the parameter space. After author feedback: Our initial recommendation of marginal acceptance of the paper remains unchanged.

Strengths: (1) The meta-learning approach adapts to new tasks arriving over time, and consolidates the tasks over time. (2) It models task-specific parameter distributions and does a meta-consolidation as necessary. (3) It does inference in both task-aware and task-agnostic modes, as necessary. (4) Experiments and user studies showed the effectiveness of the proposed approach called MERLIN.

Weaknesses: (1) No discussion is provided about why the forgetting measure is worse for MERLIN on Split MERLIN and Split CIFAR-100. (2) The results can be validated on more domains -- all the current results are shown on the image analysis domain.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback: