NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5810
Title: Superposition of many models into one

Reviewer 1


Post-rebuttal update: I have now read the rebuttal and the other reviews. I appreciate that the authors re-implemented the CIFAR benchmark I had requested. However, I am still unconvinced of the significance or the originality of the proposed approach. For me, two fundamental issues remain:

1) The proposed approach is conceptually very similar to the masking proposed in Masse et al. (2018) that I mentioned in my review. The only difference is essentially masking with a {1,-1} vector vs. masking with a {0,1} vector (see the sketch below). For sufficiently sparse masks (as used in Masse et al.), the latter approach will also produce largely non-overlapping feature subsets for different tasks, so I don't see this as a huge difference. The authors respond that their approach is more general. That may be so, but is this enough of a conceptual advance? We don't even know how well the proposed approach works compared to {0,1} masking, because the authors have chosen to largely ignore the prior literature in their evaluations. Relatedly, Masse et al. also show that masking has to be combined with a synaptic stabilization mechanism such as EWC in order to get the best results. Is this also the case in the current paper? Again, we don't know, because the evaluations in the current paper are not thorough enough.

2) The authors are essentially claiming that doing something random (i.e., random masking) works better than doing something intelligent, such as considering the importance of different parameters for prior tasks as in EWC, and I feel very uneasy about this claim. I feel like there is an important catch that is not currently being made clear in the paper. In their rebuttal, the authors say that "methods like EWC, GEM constraint the amount of change in parameters for a new task in a manner that ensures performance on past tasks is maintained. With increase in tasks, constraints accumulate, making it impossible to add new tasks once optimization slows. Our method, however, shows a different property – its much more flexible in the sense that it forgets the older tasks when network capacity is reached in order to accommodate newer tasks." So maybe this is the catch that needs to be made really clear (but currently isn't): for a large enough number of tasks, the proposed method will fail much more rapidly on prior tasks than methods like EWC or GEM. But again, we aren't really sure if and when this will happen (or, relatedly, whether a combination of masking and stabilization as in Masse et al. might strike the best balance by guaranteeing both learning and retention), because the experiments in the paper aren't thorough enough.

In conclusion, given these two lingering fundamental issues, I have decided to keep my score as it is.
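To make the {1,-1} vs. {0,1} comparison in point 1) concrete, here is a minimal NumPy sketch. This is my own illustration, not the authors' code; `sign_ctx`, `gate_a`, and `gate_b` are hypothetical names, and the 20% keep-rate for the sparse gate is an assumption in the spirit of Masse et al.:

```python
# Toy comparison: {1,-1} sign masking vs. sparse {0,1} gating.
import numpy as np

rng = np.random.default_rng(0)
d, keep = 1000, 0.2  # layer width; fraction of units kept active per task

x = rng.standard_normal(d)  # activations entering a layer

# {1,-1} context vector: every unit stays active, only the signs differ per task.
sign_ctx = rng.choice([-1.0, 1.0], size=d)
x_signed = x * sign_ctx

# Sparse {0,1} gates: an independent random ~20% subset of units per task.
gate_a = (rng.random(d) < keep).astype(float)
gate_b = (rng.random(d) < keep).astype(float)

# Overlap of the active subsets of two tasks is ~keep^2 of all units,
# i.e. the sparse gates carve out largely non-overlapping feature subsets,
# which is the sense in which the two schemes seem conceptually close.
overlap = np.mean(gate_a * gate_b)
print(f"expected overlap {keep**2:.2f}, observed {overlap:.2f}")
```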
--------------------------------------------------------------------------------------------------------

- In general, the comparison with earlier methods is not thorough at all in this paper. Several important earlier works are not discussed:
Schwarz et al. (2018) Progress & Compress: A scalable framework for continual learning. ICML.
Lopez-Paz & Ranzato (2017) Gradient episodic memory for continual learning. NIPS.
- The proposed approach is actually quite similar to the one proposed in the following paper (on arXiv since Feb. 2018), yet this paper is also not discussed, cited, or compared against:
Masse et al. (2018) Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. PNAS.
- Conceptually, the proposed method is also quite similar to the "sparsely-gated mixture-of-experts" model (Shazeer et al., 2017, ICLR). The difference is that the gating is learned in that paper (which may actually be more sensible than using a fixed, random gating) and is applied on a per-example basis rather than a per-task basis.
- The CIFAR-100 experiment is not implemented in a standard way (please see Lopez-Paz & Ranzato, 2017, cited above). Earlier papers use a larger number of tasks (20 tasks in Lopez-Paz & Ranzato, 2017). Please implement this experiment consistently with earlier papers and compare your numbers with earlier methods.
- There is, similarly, a question about the scalability of the proposed method to more challenging tasks. For example, the EWC paper (and the Progress & Compress paper above) had Atari benchmarks (including a setup where task identity was automatically inferred instead of being manually set). I understand that the main advantage of the proposed method is its simplicity, but if it doesn't scale up to more challenging tasks, its significance will be limited. So, I would encourage the authors to test their method on Atari benchmarks and compare with earlier methods.
- Because of the random nature of the context variables, it seems to me that the proposed method does not allow transfer learning, which is not ideal. I would encourage the authors to think about extending their method to enable transfer learning while retaining its simplicity as much as possible. For example, at the beginning of each new task, one could fix the weights of the network and train the context variables first (to encourage the network to utilize the already learned features), then fix the context variables and fine-tune the network weights, etc. (see the sketch at the end of this review).
- In several places, the authors talk about the low intrinsic dimensionality of natural signals, which is necessary to make the context-gated inputs for different tasks distinct from each other. But since the authors apply the context gating at every layer of the network, the low intrinsic dimensionality of the raw inputs is not enough: the input to each layer would have to be low-dimensional. This is clearly not the case for the higher layers of a conv-net, for example (or at least, they are nowhere near as redundant as the raw input). The authors should discuss this important caveat.

More minor:
- There are several typos: line 34 ("... training a neural networks in ..."); line 44 (should be "separate tasks"); line 135 (should be "evaluate the performance of the ..."); line 216 (should be "it is worse in performance than ..."); line 238 (should be "change").
- The description of the proposed method can be improved. For example, the context variables are fixed and random; they are not trainable. But this important detail is somehow never explicitly stated in describing the method. Worse, the authors keep calling these "parameters" as if they were like the trainable parameters of the network. I'm sure a lot of readers will be confused by this.
- Line 209: the authors say that proposing a solution to the problem of automatic inference of task identity is beyond the scope of the paper, but I personally think this would make the paper more compelling. For example, in the rotation example, the circular nature of the transformations is assumed to be known already, which is clearly unrealistic in the general case. So I would again encourage the authors to think about this issue.
- In Figure 5 (right), what is pspLocalMix? This is not defined anywhere in the text or the caption (pspFastLocalMix is defined).
- Please make the figure fonts bigger; they are currently very hard to read without zooming in.
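As promised above, here is a minimal PyTorch sketch of the two-phase transfer-learning scheme I suggested. This is my proposal, not the authors' method: the paper keeps the context vectors fixed and random, whereas here they are made trainable. `model.weight_params()` and `model.context_params()` are hypothetical accessors for the two parameter groups, and the step counts are arbitrary:

```python
import torch
import torch.nn.functional as F

def train_phase(model, loader, params, steps):
    opt = torch.optim.Adam(params, lr=1e-3)
    for _, (x, y) in zip(range(steps), loader):
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def start_new_task(model, loader):
    # Phase 1: freeze the weights and learn the context vectors, so the
    # new task is encouraged to reuse already-learned features.
    for p in model.weight_params():
        p.requires_grad_(False)
    for c in model.context_params():
        c.requires_grad_(True)
    train_phase(model, loader, model.context_params(), steps=500)

    # Phase 2: freeze the contexts and fine-tune the weights.
    for c in model.context_params():
        c.requires_grad_(False)
    for p in model.weight_params():
        p.requires_grad_(True)
    train_phase(model, loader, model.weight_params(), steps=2000)
```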

Reviewer 2


I have a few comments which I hope the authors can address to make this paper stronger and more complete:

1- Related work is minimal. Although there is a short paragraph comparing your results with previous work, it is not enough. The reader should get more insight into previous approaches and a high-level understanding of why previous approaches differ from yours. Please expand this section.

2- There are some confusions and inconsistencies in the experimental results, section 4.1.3, paragraph "choosing context parameters":
2.1- I see in Figure 5-right that you present results for `pspLocalMix`, but I couldn't find which approach it refers to.
2.2- The text says: "... Figure 5 right shows that while pspFast is better than standard model ...", but the `standard model` results are not in Figure 5-right. Is it a combination of Figure 4-b and Figure 5-right?
2.3- Is Figure 5-left discussed anywhere in the text?

3- I think the question of how many models can be stored in superposition with each other is very interesting and shouldn't be left out of this paper. I know the authors mention this in Future Work and Discussion, but it is a fundamental question directly related to the effectiveness of the proposed solution, and at least some preliminary empirical results should be included in the paper. I understand the authors have observed a degradation in performance while keeping 1000 models superposed with each other, but it would be very interesting to get more insight into the limitations of the approach in terms of the number of models stored together (see the toy sketch below).
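To illustrate what such a preliminary capacity experiment might look like, here is a toy NumPy sketch. It is my own illustration, not the authors' code, and it assumes the binary superposition rule described in the paper: parameters are stored as W = sum_k W_k * c_k with random {1,-1} context vectors c_k, and model k is read back as W * c_k:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # number of parameters per model

for K in (1, 10, 100, 1000):
    models = rng.standard_normal((K, d))
    contexts = rng.choice([-1.0, 1.0], size=(K, d))
    W = np.sum(models * contexts, axis=0)  # superposed parameters

    # Retrieve model 0: since c_0 * c_0 = 1, the crosstalk from the other
    # K-1 models acts as zero-mean noise whose std grows like sqrt(K-1).
    retrieved = W * contexts[0]
    noise = retrieved - models[0]
    snr = models[0].std() / noise.std() if K > 1 else float("inf")
    print(f"K={K:5d}  retrieval SNR ~ {snr:.2f}")
```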

Reviewer 3


# Strengths:
- The paper copes with an interesting problem. Both model compression and lifelong learning are active domains of research, and proposing an approach that bridges the two problems and offers a dual view on them is great.
- The exposition of the paper is very clear. The paper reads nicely and the problem is very well motivated.
- The proposed algorithm is very simple and therefore easy to understand. Its simplicity should definitely be considered a feature.
- The approach is evaluated in multiple experiments, some of them on "controlled" data, and one experiment on a real model (ResNet-18) on real data (CIFAR-10 and CIFAR-100). The experiments are quite convincing.
- The proposed model compares favorably with previous work (see Fig. 3).

# Weaknesses:
- When reading papers about catastrophic forgetting, I can't help asking for actual practical applications. All experiments in this paper are on datasets constructed from standard benchmarks by concatenating manipulated samples. There are probably real tasks with available data for which it would make sense to use such a model. I would like the authors to comment on that.
- What lessons learned from this work could influence or guide research in model compression?

Overall, I enjoyed reading this paper, whose motivation is clear and whose execution is very good. The tackled problem is interesting and the experiments convincing. Please note, however, that I am not an expert in lifelong learning and may have missed important details or relevant work that could change my opinion. I therefore await the authors' response, the other reviews, and the discussion to make my final decision.