NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6371
Title: Online Continual Learning with Maximal Interfered Retrieval

Reviewer 1

The authors propose a new way to use replay to avoid catastrophic forgetting: after each new example, train on the previously seen examples whose loss increased the most after the parameter update. This is a very simple idea, and it is clear why it should be better than replaying uniformly at random. To make their algorithm feasible, the authors first select a uniformly random subset of C examples, compute how the loss on each of those C examples changed due to the parameter update, and then take a training step on the K examples whose loss increased most. To make this more efficient, the authors propose instead doing nearest neighbor search in the latent space of an autoencoder to find similar examples.

This method reminds me of Memory-based Parameter Adaptation by Sprechmann et al., where the authors also do continual learning via nearest neighbor search in a latent space. I am somewhat worried about this approach: when the new task is very dissimilar to the old task, the examples retrieved by lookup in the autoencoder latent space could well end up being quite meaningless, and in general it seems to me that nearest neighbor search in an autoencoder latent space would easily return examples which are not very diverse. Have the authors observed this in practice? Could you comment on whether you see this as a problem?

The experiments seem fairly toyish to me; however, I think the idea is nice and should be published.
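For concreteness, the retrieval criterion described in this review can be sketched as below. This is a minimal illustration, not the authors' implementation: `loss_fn`, `params`, and `virtual_params` are hypothetical stand-ins for a per-example loss and the model parameters before and after the virtual update on the incoming batch.

```python
import random

def mir_retrieve(buffer, loss_fn, params, virtual_params, C=50, K=10):
    """Sketch of Maximal Interfered Retrieval: from a random subset of C
    stored examples, return the K whose loss increases most under the
    virtual parameter update (params -> virtual_params)."""
    # Draw a uniformly random candidate subset from the replay buffer.
    candidates = random.sample(buffer, min(C, len(buffer)))
    # Interference score: loss after the virtual update minus loss before.
    scored = [(loss_fn(virtual_params, x) - loss_fn(params, x), x)
              for x in candidates]
    # Replay the K most interfered candidates.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [x for _, x in scored[:K]]
```

With C equal to the buffer size this reduces to exact top-K interference search; smaller C trades accuracy of the selection for speed.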

Reviewer 2

This paper describes an approach to improve rehearsal-based continual learning techniques (either replay-based or with a generative model) by identifying the samples that are most useful for avoiding forgetting. This is achieved by computing the increase in loss on the replayed samples and using this to determine which samples should be used during learning. It is a simple and intuitive idea, the paper is clearly written, and the experiments on multiple datasets are compelling. I think it could make a nice addition to the conference, but it needs a few improvements first.

My main criticism is that the approach requires a separate virtual gradient step for each actual step, to compute the change in loss on the replay samples. I think the discussion around this could be stronger, e.g. explicitly mentioning the overhead in computation/memory for this approach versus others. Is there a way to alleviate this additional step (e.g. keeping a running estimate of 'utility' for each sample based on previous gradient steps)? If not, a fairer comparison with the baselines should include twice as many gradient steps (though still online). Does this make any difference?

A few other comments and questions:
- Why is the hybrid approach in the appendix rather than in the main experiments? It seems like one of the central methods proposed in this paper, and given that it is introduced in the main body, the experiments should appear there as well.
- The performance of the baselines on Split MNIST seems poorer than that reported in related work (e.g. iid offline is 97, and Deep Generative Replay, which is not included in this paper as a comparison, is at 92). Is this because of multiple vs. a single pass? Additional baselines and clarification would be useful here.
- The only external comparison is with GEM and, effectively, DGR (the GEN case), with the rationale that prior-based approaches like EWC do not perform as well, but I think other performant methods should be included (e.g. VCL). This may not require reimplementing the approach, only comparing under the same conditions as previously published results.
- In the single-pass setting, things like the learning rate become more important: how were the hyperparameter sweeps for the baselines performed?
- Some of the reproducibility checklist items are missing and are easy to fix (e.g. a description of compute).

POST-REBUTTAL: After reading the response and the other reviews, I think the authors have done a great job of addressing my concerns. The additional baselines (both on the side of additional gradient steps and on other approaches like VCL) strengthen the paper. The discussion around gradient steps and computation provided by the authors in the rebuttal is convincing, and I think it should appear in the camera-ready version, i.e. a comparison of the computation/gradient steps taken by other approaches, and the 2-step baseline showing that this does not help a standard approach. I think this is a good paper, and I have increased my score to a 7.
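The running 'utility' estimate suggested in this review could look roughly like the sketch below. All names here (`utility`, `sample_ids`, `loss_increases`) are hypothetical, and the exponential moving average is only one possible estimator; the point is that ranking by a cached score would avoid the extra virtual gradient step per iteration.

```python
def update_utility(utility, sample_ids, loss_increases, decay=0.9):
    """Hypothetical running-utility tracker: maintain an exponential
    moving average of each stored sample's observed loss increase, so
    retrieval can rank samples by utility[sid] instead of recomputing
    interference with a virtual gradient step every iteration."""
    for sid, inc in zip(sample_ids, loss_increases):
        prev = utility.get(sid, 0.0)  # unseen samples start at 0
        utility[sid] = decay * prev + (1 - decay) * inc
    return utility
```

The trade-off is staleness: a cached score reflects interference under past parameters, which is exactly why a comparison against the 2-step baseline matters.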

Reviewer 3

Originality: This paper considers generative modeling in the online continual learning setting for the first time. While previous continual learning algorithms rely on random sampling to store past samples, the authors propose a more sophisticated method for constructing the memory. Moreover, the authors present a hybrid approach combining experience replay and generative replay to obtain the benefits of both.

Quality: The authors do not provide theoretical guarantees for the proposed algorithms. Although the experimental results show effectiveness compared to existing state-of-the-art algorithms, the proposed methods are not applied to task sequences longer than 10 tasks, nor to real-world data beyond relatively easy benchmark datasets.

Clarity: Although this paper is well-organized, there are some concerns about clarity. What is the definition of the sample loss l in Section 3.1? Should it not take two arguments, like l(f_\theta(x), y)? Similarly, should \mathcal{L} in equation (1) not take two arguments? I also do not understand the meaning of "Single trains the model ..." in line 181.

Significance: This paper has great significance because, as stated above, it considers generative modeling in the online continual learning setting for the first time. In this sense, the paper addresses a difficult task not treated in previous studies. Moreover, the authors propose a better sampling method for the memory.