{"title": "Continual Learning with Deep Generative Replay", "book": "Advances in Neural Information Processing Systems", "page_first": 2990, "page_last": 2999, "abstract": "Attempts to train a comprehensive artificial intelligence capable of solving multiple tasks have been impeded by a chronic problem called catastrophic forgetting. Although simply replaying all previous data alleviates the problem, it requires large memory and even worse, often infeasible in real world applications where the access to past data is limited. Inspired by the generative nature of the hippocampus as a short-term memory system in primate brain, we propose the Deep Generative Replay, a novel framework with a cooperative dual model architecture consisting of a deep generative model (\u201cgenerator\u201d) and a task solving model (\u201csolver\u201d). With only these two models, training data for previous tasks can easily be sampled and interleaved with those for a new task. We test our methods in several sequential learning settings involving image classification tasks.", "full_text": "Continual Learning with Deep Generative Replay\n\nMassachusetts Institute of Technology\n\nSK T-Brain\n\nHanul Shin\n\nSK T-Brain\n\nskyshin@mit.edu\n\nJung Kwon Lee\u2217, Jaehong Kim\u2217, Jiwon Kim\n\n{jklee,xhark,jk}@sktbrain.com\n\nAbstract\n\nAttempts to train a comprehensive arti\ufb01cial intelligence capable of solving multiple\ntasks have been impeded by a chronic problem called catastrophic forgetting.\nAlthough simply replaying all previous data alleviates the problem, it requires\nlarge memory and even worse, often infeasible in real world applications where the\naccess to past data is limited. Inspired by the generative nature of the hippocampus\nas a short-term memory system in primate brain, we propose the Deep Generative\nReplay, a novel framework with a cooperative dual model architecture consisting\nof a deep generative model (\u201cgenerator\u201d) and a task solving model (\u201csolver\u201d). With\nonly these two models, training data for previous tasks can easily be sampled and\ninterleaved with those for a new task. We test our methods in several sequential\nlearning settings involving image classi\ufb01cation tasks.\n\n1\n\nIntroduction\n\nOne distinctive ability of humans and large primates is to continually learn new skills and accumulate\nknowledge throughout the lifetime [6]. Even in small vertebrates such as rodents, established con-\nnections between neurons seem to last more than an year [13]. Besides, primates incorporate new\ninformation and expand their cognitive abilities without seriously perturbing past memories. This\n\ufb02exible memory system results from a good balance between synaptic plasticity and stability [1].\nContinual learning in deep neural networks, however, suffers from a phenomenon called catastrophic\nforgetting [22], in which a model\u2019s performance on previously learned tasks abruptly degrades when\ntrained for a new task. In arti\ufb01cial neural networks, inputs coincide with the outputs by implicit\nparametric representation. Therefore training them towards a new objective can cause almost complete\nforgetting of former knowledge. Such problem has been a key obstacle to continual learning for deep\nneural network through sequential training on multiple tasks.\nPrevious attempts to alleviate catastrophic forgetting often relied on episodic memory system that\nstores past data [31]. 
Previous attempts to alleviate catastrophic forgetting often relied on an episodic memory system that stores past data [31]. In particular, recorded examples are regularly replayed alongside real samples drawn from the new task, and the network parameters are jointly optimized. While a network trained in this manner performs as well as separate networks trained solely on each task [29], a major drawback of the memory-based approach is that it requires large working memory to store and replay past inputs. Moreover, such data storage and replay may not be viable in some real-world situations.

Notably, humans and large primates learn new knowledge even from limited experiences and still retain past memories. While several biological mechanisms contribute to this at multiple levels, the most apparent distinction between primate brains and artificial neural networks is the existence of separate, interacting memory systems [26]. The Complementary Learning Systems (CLS) theory illustrates the significance of dual memory systems involving the hippocampus and the neocortex. The hippocampal system rapidly encodes recent experiences, and the memory trace that lasts for a short period is reactivated during sleep or conscious and unconscious recall [8]. The memory is consolidated in the neocortex through activation synchronized with multiple replays of the encoded experience [27], a mechanism that inspired the use of experience replay [23] in training reinforcement learning agents.

Recent evidence suggests that the hippocampus is more than a simple experience replay buffer. Reactivation of the memory traces yields rather flexible outcomes. Altering the reactivation causes a defect in the consolidated memory [35], while co-stimulating certain memory traces in the hippocampus creates a false memory that was never experienced [28]. These properties suggest that the hippocampus is better paralleled with a generative model than with a replay buffer. Specifically, deep generative models such as deep Boltzmann machines [32] or variational autoencoders [17] can generate high-dimensional samples that closely match observed inputs.

We now propose an alternative approach to sequentially train deep neural networks without referring to past data. In our deep generative replay framework, the model retains previously acquired knowledge through the concurrent replay of generated pseudo-data. In particular, we train a deep generative model in the generative adversarial networks (GANs) framework [10] to mimic past data. Generated data are then paired with the corresponding responses from the past task solver to represent old tasks. Called the scholar model, the generator-solver pair can produce as many fake inputs and desired target pairs as needed, and when presented with a new task, these produced pairs are interleaved with new data to update the generator and solver networks. Thus, a scholar model can both learn a new task without forgetting its own knowledge and teach other models with generated input-target pairs, even when the network configuration is different.

As deep generative replay supported by the scholar network retains knowledge without revisiting actual past data, the framework can be employed in various practical situations, such as those involving privacy issues. Recent advances in training generative adversarial networks suggest that trained models can reconstruct the real data distribution in a wide range of domains. Although we tested our models on image classification tasks, our method can be applied to any task as long as the trained generator reliably reproduces the input space.
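
To make the scholar interface concrete, here is a minimal PyTorch sketch of a generator-solver pair producing replay pairs (x′, y′). The Scholar class, the toy network shapes, and the latent dimension are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class Scholar(nn.Module):
    """A scholar pairs a generator G with a solver S."""
    def __init__(self, generator: nn.Module, solver: nn.Module):
        super().__init__()
        self.generator = generator
        self.solver = solver

    @torch.no_grad()
    def sample(self, n: int, z_dim: int = 64):
        """Produce n replay pairs (x', y'): generated inputs labeled
        by the solver's own responses."""
        z = torch.randn(n, z_dim)          # latent noise for the generator
        x = self.generator(z)              # fake inputs mimicking past data
        y = self.solver(x).argmax(dim=1)   # past solver's predicted targets
        return x, y

# Hypothetical toy networks, just to make the sketch executable.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
S = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
scholar = Scholar(G, S)
x_replay, y_replay = scholar.sample(128)   # as many pairs as needed
```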

2 Related Works

The term catastrophic forgetting, or catastrophic interference, was first introduced by McCloskey and Cohen in the 1980s [22]. They claimed that catastrophic interference is a fundamental limitation of neural networks and a downside of their high generalization ability. While the cause of catastrophic forgetting has not been studied analytically, it is known that neural networks parameterize the internal features of inputs, and that training the networks on new samples alters the already established representations. Several works illustrate the empirical consequences in sequential learning settings [7, 29] and provide a few primitive solutions [16, 30], such as replaying all previous data.

2.1 Comparable methods

One branch of work assumes a particular situation in which access to previous data is limited to the current task [12, 18, 20]. These works focus on optimizing network parameters while minimizing alterations to already consolidated weights. It has been suggested that regularization methods such as dropout [33] and L2 regularization help reduce the interference from new learning [12]. Furthermore, elastic weight consolidation (EWC), proposed in [18], demonstrates that protecting certain weights based on their importance to previous tasks tempers the performance loss.

Other attempts to sequentially train a deep neural network capable of solving multiple tasks reduce catastrophic interference by augmenting the network with task-specific parameters. In general, layers close to the inputs are shared to capture universal features, and independent output layers produce task-specific outputs. Although separate output layers are free of interference, alterations to the earlier layers still cause some performance loss on older tasks. Lowering the learning rates on some parameters is also known to reduce forgetting [9]. A recently proposed method called Learning without Forgetting (LwF) [21] addresses sequential learning in image classification tasks while minimizing alterations to the shared network parameters. In this framework, the network's response to new-task inputs prior to fine-tuning indirectly represents knowledge about the old tasks and is maintained throughout the learning process.

2.2 Complementary Learning Systems (CLS) theory

A handful of works are devoted to designing complementary network architectures to alleviate catastrophic forgetting. When the training data for previous tasks are not accessible, only pseudo-inputs and pseudo-targets produced by a memory network can be fed into the task network. Called a pseudorehearsal technique, this method is claimed to maintain old input-output patterns without accessing real data [31]. When the tasks are as elementary as coupling two binary patterns, simply feeding random noise and the corresponding responses suffices [2]. A more recent work proposes an architecture that resembles the structure of the hippocampus to facilitate continual learning on more complex data such as small binary pixel images [15]. However, none of these methods demonstrates scalability to high-dimensional inputs similar to those that appear in the real world, owing to the difficulty of generating meaningful high-dimensional pseudo-inputs without further supervision.

Our generative replay framework differs from the aforementioned pseudorehearsal techniques in that the fake inputs are generated from the learned past input distribution. Generative replay has several advantages over other approaches because the network is jointly optimized on an ensemble of generated past data and real current data. The performance is therefore equivalent to joint training on the accumulated real data as long as the generator recovers the input distribution. The idea of generative replay also appears in Mocanu et al. [24], who trained a restricted Boltzmann machine to recover the past input distribution.

2.3 Deep Generative Models

A generative model is any model that generates observable samples. Specifically, we consider deep generative models based on deep neural networks that maximize the likelihood of the generated samples matching the given real distribution [11]. Some deep generative models, such as variational autoencoders [17] and GANs [10], are able to mimic complex samples like images.

The GANs framework defines a zero-sum game between a generator G and a discriminator D. While the discriminator learns to distinguish generated samples from real samples by comparing the two data distributions, the generator learns to mimic the real distribution as closely as possible. The objective of the two networks is thereby defined as:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
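
In practice, this value function is optimized by alternating two losses, as in the sketch below. It assumes a discriminator that outputs one logit per sample and uses the non-saturating generator loss suggested in [10] in place of the literal min-max term; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_real, z):
    # Ascend V(D, G): push D(x) toward 1 on real data and D(G(z)) toward 0.
    logits_real = D(x_real)
    logits_fake = D(G(z).detach())  # detach: do not update G on this step
    return (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
            + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))

def generator_loss(D, G, z):
    # Non-saturating surrogate: maximize log D(G(z)) instead of
    # minimizing log(1 - D(G(z))), which gives stronger early gradients.
    logits_fake = D(G(z))
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```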

3 Generative Replay

We first define some terminology. In our continual learning framework, we define the sequence of tasks to be solved as a task sequence T = (T_1, T_2, ..., T_N) of N tasks.

Definition 1. A task T_i is to optimize a model towards an objective on a data distribution D_i, from which the training examples (x_i, y_i) are drawn.

Next, we call our model a scholar, as it is capable of both learning a new task and teaching its knowledge to other networks. Note that the term scholar differs from the standard notion of the teacher-student framework of ensemble models [5], in which each network either teaches or learns only.

Definition 2. A scholar H is a tuple ⟨G, S⟩, where the generator G is a generative model that produces real-like samples and the solver S is a task-solving model parameterized by θ.

The solver has to perform all tasks in the task sequence T. The full objective is thereby to minimize the unbiased sum of losses over all tasks in the task sequence, E_{(x,y)∼D}[L(S(x; θ), y)], where D is the entire data distribution and L is a loss function. While being trained for task T_i, the model is fed with samples drawn from D_i.

3.1 Proposed Method

We consider sequential training of our scholar model. Training a single scholar model while referring to its most recent copy is equivalent to training a sequence of scholar models (H_i)_{i=1}^N, where the n-th scholar H_n (n > 1) learns the current task T_n and the knowledge of the previous scholar H_{n−1}. We therefore describe our full training procedure as in Figure 1(a).

Training a scholar model from another scholar involves two independent procedures: training the generator and training the solver. First, the new generator receives current task input x and replayed inputs x′ from previous tasks. Real and replayed samples are mixed at a ratio that depends on the desired importance of the new task relative to the older ones. The generator learns to reconstruct the cumulative input space, and the new solver is trained to couple the inputs and targets drawn from the same mix of real and replayed data. Here, the replayed target is the past solver's response to the replayed input. Formally, the loss function of the i-th solver is given as

L_train(θ_i) = r E_{(x,y)∼D_i}[L(S(x; θ_i), y)] + (1 − r) E_{x′∼G_{i−1}}[L(S(x′; θ_i), S(x′; θ_{i−1}))]   (1)

where θ_i are the network parameters of the i-th scholar and r is the ratio of real data in the mix. As we aim to evaluate the model on the original tasks, the test loss differs from the training loss:

L_test(θ_i) = r E_{(x,y)∼D_i}[L(S(x; θ_i), y)] + (1 − r) E_{(x,y)∼D_past}[L(S(x; θ_i), y)]   (2)

where D_past is the cumulative distribution of past data. The second loss term is ignored in both functions when i = 1, because there is no replayed data for the first solver.

We build our scholar model with a solver that has a suitable architecture for solving the task sequence and a generator trained in the generative adversarial networks framework. However, our framework can employ any deep generative model as the generator.
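
A minimal sketch of one evaluation of the training loss in Eq. (1), assuming L is a cross-entropy classification loss and the replayed targets are hard labels taken from the previous solver's responses (soft targets would be an equally valid reading); all names are illustrative.

```python
import torch
import torch.nn.functional as F

def solver_loss(solver, prev_solver, prev_generator, x, y, r, z_dim=64):
    """Eq. (1): real current-task loss mixed at ratio r with a replay loss
    on inputs x' sampled from the previous scholar's generator."""
    loss_new = F.cross_entropy(solver(x), y)            # E_{(x,y)~D_i}[L(S(x; theta_i), y)]
    with torch.no_grad():
        x_rep = prev_generator(torch.randn(x.size(0), z_dim))  # x' ~ G_{i-1}
        y_rep = prev_solver(x_rep).argmax(dim=1)               # S(x'; theta_{i-1})
    loss_replay = F.cross_entropy(solver(x_rep), y_rep)  # L(S(x'; theta_i), S(x'; theta_{i-1}))
    return r * loss_new + (1.0 - r) * loss_replay
```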

Figure 1: Sequential training of scholar models. (a) Training a sequence of scholar models is equivalent to continuous training of a single scholar while referring to its most recent copy. (b) A new generator is trained to mimic a mixed data distribution of real samples x and replayed inputs x′ from the previous generator. (c) A new solver learns from real input-target pairs (x, y) and replayed input-target pairs (x′, y′), where the replayed response y′ is obtained by feeding the generated inputs into the previous solver.

3.2 Preliminary Experiment

Prior to our main experiments, we show that a trained scholar model alone suffices to train an empty network. We tested our model on classifying the MNIST handwritten digit database [19]. A sequence of scholar models was trained from scratch through generative replay from the previous scholar. The accuracy on classifying the full test data is shown in Table 1. We observed that the scholar model transfers knowledge without losing information.

Table 1: Test accuracy of the sequentially learned solvers, measured on the full test data of the MNIST database. The first solver learned from real data; each subsequent solver learned from the previous scholar network.

                Solver1 → Solver2 → Solver3 → Solver4 → Solver5
Accuracy (%)     98.81     98.64     98.58     98.53     98.56
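
Under the same illustrative assumptions as the Scholar sketch above, this preliminary experiment reduces to training each empty solver purely on pairs sampled from its predecessor, roughly as follows; the step count, batch size, and learning rate are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def train_solver_from_scholar(new_solver, old_scholar, steps=2000, batch_size=128, lr=1e-3):
    """Train an empty solver using only replay pairs (x', y') drawn from a
    trained scholar, never touching real data (the Section 3.2 setting)."""
    opt = torch.optim.Adam(new_solver.parameters(), lr=lr)
    for _ in range(steps):
        x_rep, y_rep = old_scholar.sample(batch_size)  # Scholar.sample from the earlier sketch
        loss = F.cross_entropy(new_solver(x_rep), y_rep)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_solver
```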

4 Experiments

In this section, we show the applicability of the generative replay framework in various sequential learning settings. Generative replay based on a trained scholar network is superior to other continual learning approaches in that the quality of the generative model is the only constraint on task performance. In other words, training the networks with generative replay is equivalent to joint training on the entire data when the generative model is optimal. To obtain the best possible results, we used the WGAN-GP technique [14] in training the generator.

As a base experiment, we test whether generative replay enables sequential learning without compromising performance on either the old tasks or a new task. In Section 4.1, we sequentially train the networks on independent tasks to examine the extent of forgetting. In Section 4.2, we train the networks on two different yet related domains, and demonstrate that generative replay not only enables continual learning of our scholar network design but is also compatible with other known structures. In Section 4.3, we show that our scholar network can gather knowledge from different tasks to perform a meta-task, by training the network on disjoint subsets of the training data.

We compare the performance of solvers trained with variants of the replay method. Our model with generative replay is denoted in the figures as GR. We specify the upper bound by assuming the situation in which the generator is perfect: we replay actual past data paired with the predicted targets from the old solver network, denoted ER for exact replay. We also consider the opposite case, in which the generated samples do not resemble the real distribution at all, denoted Noise. A baseline of a naively trained solver network is denoted None. We use the same notation throughout this section.

4.1 Learning independent tasks

The most common experimental formulation in the continual learning literature [34, 18] is a simple image classification problem in which the inputs are images from the MNIST handwritten digit database [19], but the pixel values of the inputs are shuffled by a random permutation sequence unique to each task. The solver has to classify the permuted inputs into the original classes. Since most, if not all, pixels are switched between the tasks, the tasks are technically independent of each other, making this a good measure of the memory retention strength of a network.
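
The permutation tasks can be constructed as in the following sketch, which assumes flattened 28×28 images and fixes each task's pixel permutation with a seed.

```python
import torch

def make_permuted_tasks(images, n_tasks, seed=0):
    """Permuted-MNIST task sequence: task i applies one fixed random pixel
    permutation to every flattened image (images: (N, 784) tensor)."""
    g = torch.Generator().manual_seed(seed)
    perms = [torch.randperm(images.size(1), generator=g) for _ in range(n_tasks)]
    return [images[:, p] for p in perms]

# Example with dummy data standing in for real MNIST images:
tasks = make_permuted_tasks(torch.rand(100, 784), n_tasks=5)
```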

Figure 2: Results on the MNIST pixel permutation tasks. (a) Test performance on each task during sequential training. Performance on previous tasks dropped without replaying real or meaningful fake data. (b) Average test accuracy on the learnt tasks. Higher accuracy is achieved when the replayed inputs better resemble the real data.

We observed that generative replay maintains past knowledge by recalling former task data. In Figure 2(a), the solver with generative replay (orange) maintained its former task performance throughout sequential training on multiple tasks, in contrast to the naively trained solver (violet). The average accuracy measured on cumulative tasks is illustrated in Figure 2(b). While the solver with generative replay achieved almost full performance on the trained tasks, sequential training of the solver alone incurred catastrophic forgetting (violet). Replaying random Gaussian noise paired with recorded responses did not help to temper the performance loss (pink).

4.2 Learning new domains

Training independent tasks on the same network is inefficient because no information can be shared. We therefore demonstrate the merit of our model in more reasonable settings where the model benefits from solving multiple tasks.

A model operating in multiple domains has several advantages over a model that works only in a single domain. First, knowledge of one domain can support better and faster understanding of other domains if the domains are not completely independent. Second, generalization over multiple domains may result in more universal knowledge that is applicable to unseen domains. Such a phenomenon is also observed in infants learning to categorize objects [3, 4]. Encountering similar but diverse objects, young children can infer the properties shared within a category, and can guess which category a new object may belong to.

We tested whether the model can incorporate the knowledge of a new domain with generative replay. In particular, we sequentially trained our model on classifying the MNIST and Street View House Numbers (SVHN) [25] datasets, and vice versa. Experimental details are provided in the supplementary materials.

Figure 3: Accuracy on classifying samples from two different domains. (a) The models are trained on MNIST and then on the SVHN dataset, or (b) vice versa. When the previous data are recalled by generative replay (orange), knowledge of the first domain is retained as if the real inputs with predicted responses were replayed (green). Sequential training of the solver alone incurs forgetting of the former domain, thereby resulting in low average performance (violet).

Figure 4: Samples from the trained generator in the MNIST-to-SVHN experiment after training on the SVHN dataset for 1000, 2000, 5000, 10000, and 20000 iterations. The samples diverge into ones that mimic either SVHN or MNIST input images.

Figure 3 illustrates the performance on the original task (thick curves) and the new task (dim curves). A solver trained alone lost its performance on the old task when no data were replayed (purple). Since MNIST and SVHN input data share a similar spatial structure, the performance on the former task did not drop to zero, yet the decline was critical. In contrast, the solver with generative replay (orange) maintained its performance on the first task while accomplishing the second one. The results were no worse than replaying past real inputs paired with the predicted responses from the old solver (green). In both cases, the model trained without any replay data achieved slightly better performance on the new task, as the network was solely optimized to solve the second task.

Generative replay is compatible with other continual learning models as well. For instance, Learning without Forgetting (LwF), which replays current task inputs to revoke past knowledge, can be augmented with generative models that produce samples similar to the former task inputs. Because LwF requires the context information of which task is being performed in order to use task-specific output layers, we tested the performance separately on each task. Note that our scholar model with generative replay does not need the task context. A sketch of this combination follows.
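
A hedged sketch of what LwF-GR computes: the usual LwF knowledge-distillation term, evaluated on inputs drawn from the saved generator rather than on current-task inputs. The temperature, loss weight, KL formulation, and head indexing are our assumptions; frozen_old stands for a saved copy of the pre-fine-tuning network (shared trunk plus old head).

```python
import torch
import torch.nn.functional as F

def lwf_gr_loss(trunk, heads, frozen_old, old_generator, x_new, y_new,
                new_task=1, old_task=0, T=2.0, lam=1.0, z_dim=64):
    """LwF-GR: supervised loss on the new task plus distillation of the
    old head toward the frozen network's responses on generated inputs."""
    loss_new = F.cross_entropy(heads[new_task](trunk(x_new)), y_new)
    with torch.no_grad():
        x_old = old_generator(torch.randn(x_new.size(0), z_dim))  # old-task-like inputs
        soft_targets = F.softmax(frozen_old(x_old) / T, dim=1)    # responses to preserve
    log_probs = F.log_softmax(heads[old_task](trunk(x_old)) / T, dim=1)
    loss_old = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    return loss_new + lam * loss_old
```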

In Figure 5, we compare the performance of the LwF algorithm with the variant LwF-GR, in which task-specific generated inputs are fed to maintain the older network's responses. We used the same training regime as proposed in the original literature, namely warming up the new network head for some time and then fine-tuning the whole network. The solver trained with the original LwF algorithm loses performance on the first task when fine-tuning begins, due to the alteration of the shared network (green). With generative replay, however, the network maintains most of the past knowledge (orange).

Figure 5: Performance of LwF and LwF augmented with generative replay (LwF-GR) on classifying samples from each domain. The networks were trained on SVHN and then on the MNIST database. Test accuracy on the SVHN classification task (thick curves) dropped when the shared parameters were fine-tuned, but generative replay greatly tempered the loss (orange). Both networks achieved high accuracy on MNIST classification (dim curves).

4.3 Learning new classes

To illustrate that generative replay can recollect past knowledge even when the inputs and targets are highly biased between tasks, we propose a new experiment in which the network is sequentially trained on disjoint data. In particular, we assume a situation where the agent can access examples of only a few classes at a time. The agent eventually has to correctly classify examples from all classes after being sequentially trained on mutually exclusive subsets of classes. We tested the networks on the MNIST handwritten digit database.

Note that training artificial neural networks on classes independently is difficult in standard settings, as the network responses may change to match the new target distribution. Hence, replaying inputs and outputs that represent the former input and target distributions is necessary to train a balanced network. We thus compare the variants described earlier in this section from the perspective of whether the input and target distributions of the cumulative real data are recovered. For the ER and GR models, both the input and target distributions represent the cumulative distribution. The Noise model maintains the cumulative target distribution, but its input distribution mirrors only the current distribution. The None model has the current distribution for both.

Figure 6: The models were sequentially trained on 5 tasks, where each task is defined as classifying MNIST images belonging to 2 out of the 10 labels. The networks are given examples of 0 and 1 during the first task, 2 and 3 during the second, and so on. Only our networks achieved test performance close to the upper bound.

In Figure 6, we divided the MNIST dataset into 5 disjoint subsets, each of which contains samples from only 2 classes. When the networks were sequentially trained on the subsets, we observed that a naively trained classifier completely forgot the previous classes and only learned the new subset of data (purple). Recovering only the past output distribution without a meaningful input distribution did not help to retain knowledge, as evidenced by the model with a noise generator (pink). When both the input and output distributions were reconstructed, generative replay evoked the previously learnt classes, and the model was able to discriminate between all encountered classes (orange).

Figure 7: Generated samples from the trained generator after tasks 1, 2, 3, 4, and 5. The generator is trained to reproduce the cumulative data distribution.

Because we assume that the past data are completely discarded, we trained the generator to mimic both the current inputs and the generated samples from the previous generator. The generator thus reproduces the cumulative input distribution of all examples encountered so far. As shown in Figure 7, the generated samples from the trained generator include examples equally from all encountered classes.
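
Under the same illustrative assumptions as before, each training batch for the new generator can be assembled from real current inputs and previous-generator samples, for instance:

```python
import torch

def mixed_generator_batch(x_current, prev_generator, r=0.5, z_dim=64):
    """Training batch for the new generator: a fraction r of real current
    inputs and 1 - r of previous-generator samples, so the generator learns
    the cumulative input distribution over all tasks seen so far.
    Assumes flattened inputs matching the generator's output shape."""
    n = x_current.size(0)
    n_real = max(1, round(r * n))
    with torch.no_grad():
        x_replay = prev_generator(torch.randn(n - n_real, z_dim))
    return torch.cat([x_current[:n_real], x_replay], dim=0)
```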

5 Discussion

We introduced the deep generative replay framework, which allows sequential learning on multiple tasks by generating and rehearsing fake data that mimic former training examples. The trained scholar model, comprising a generator and a solver, serves as a knowledge base for a task. Although we described a cascade of knowledge transfer between a sequence of scholar models, a small change in the formulation yields solutions to other topically relevant problems. For instance, if the previous scholar model is simply a past copy of the same network, it can learn multiple tasks without explicitly partitioning the training procedure.

As comparable approaches, regularization methods such as EWC and careful training of the shared parameters as in LwF have shown that catastrophic forgetting can be alleviated by protecting the network's former knowledge. However, regularization approaches constrain the network with additional loss terms for protecting weights, so they potentially suffer from a tradeoff between the performance on new and old tasks. To guarantee good performance on both, one has to train a huge network that is much larger than normally needed. Also, the network has to maintain the same structure throughout all tasks when the constraint is specific to each parameter, as in EWC. The drawbacks of the LwF framework are also twofold: the performance depends highly on the relevance of the tasks, and the training time for one task increases linearly with the number of former tasks.

The deep generative replay mechanism benefits from the fact that it maintains the former knowledge solely with input-target pairs produced by the saved networks, which allows easy balancing of the former and new task performances and flexible knowledge transfer. Most importantly, the network is jointly optimized towards the task objectives, and hence is guaranteed to achieve full performance when the former input spaces are recovered by the generator. One defect of the generative replay framework is that its efficacy heavily depends on the quality of the generator. Indeed, we observed some performance loss while training the model on the SVHN dataset in the same setting as employed in Section 4.3. A detailed analysis is provided in the supplementary materials.

We acknowledge that EWC, LwF, and our method are not mutually exclusive, as they contribute to memory retention at different levels. Nevertheless, each method poses some constraints on the training procedure or network configuration, and there is no straightforward mixture of any two frameworks. We believe a good mix of the three frameworks would give a better solution to this chronic problem in continual learning.

Future work on generative replay may extend to the reinforcement learning domain or to the form of a continuously evolving network that maintains knowledge from a past copy of itself. We also expect that improvements in training deep generative models will directly aid the performance of the generative replay framework in more complex domains.

Acknowledgement

We would like to thank Hyunsoo Kim, Risto Vuorio, Joon Hyuk Yang, Junsik Kim, and our reviewers for their valuable feedback and discussion that greatly assisted this research.

References

[1] W. C. Abraham and A. Robins. Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):73–78, 2005.
[2] B. Ans and S. Rousset. Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l'Académie des Sciences - Series III - Sciences de la Vie, 320(12):989–997, 1997.

[3] D. A. Baldwin, E. M. Markman, and R. L. Melartin. Infants' ability to draw inferences about nonobvious object properties: Evidence from exploratory play. Child Development, 64(3):711–728, 1993.

[4] M. H. Bornstein and M. E. Arterberry. The development of object categorization in young children: Hierarchical inclusiveness, age, perceptual attribute, and group versus individual analyses. Developmental Psychology, 46(2):350, 2010.

[5] T. G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer, 2000.

[6] J. Fagot and R. G. Cook. Evidence for large long-term memory capacities in baboons and pigeons and its implications for learning and the evolution of cognition. Proceedings of the National Academy of Sciences, 103(46):17564–17567, 2006.

[7] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[8] H. Gelbard-Sagiv, R. Mukamel, M. Harel, R. Malach, and I. Fried. Internally generated reactivation of single neurons in human hippocampus during free recall. Science, 322(5898):96–101, 2008.

[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[11] I. J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.

[12] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[13] J. Grutzendler, N. Kasthuri, and W.-B. Gan. Long-term dendritic spine stability in the adult cortex. Nature, 420(6917):812–816, 2002.

[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[15] M. Hattori. A biologically inspired dual-network memory model for reduction of catastrophic forgetting. Neurocomputing, 134:262–268, 2014.

[16] G. E. Hinton and D. C. Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.

[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[18] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[20] S.-W. Lee, J.-H. Kim, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. arXiv preprint arXiv:1703.08475, 2017.
[21] Z. Li and D. Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.

[22] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.

[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[24] D. C. Mocanu, M. T. Vega, E. Eaton, P. Stone, and A. Liotta. Online contrastive divergence with generative replay: Experience replay without storing data. CoRR, abs/1610.05555, 2016.

[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.

[26] R. C. O'Reilly and K. A. Norman. Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences, 6(12):505–510, 2002.

[27] J. O'Neill, B. Pleydell-Bouverie, D. Dupret, and J. Csicsvari. Play it again: Reactivation of waking experience and memory. Trends in Neurosciences, 33(5):220–229, 2010.

[28] S. Ramirez, X. Liu, P.-A. Lin, J. Suh, M. Pignatelli, R. L. Redondo, T. J. Ryan, and S. Tonegawa. Creating a false memory in the hippocampus. Science, 341(6144):387–391, 2013.

[29] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990.

[30] A. Robins. Catastrophic forgetting in neural networks: The role of rehearsal mechanisms. In Proceedings of the First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, pages 65–68. IEEE, 1993.

[31] A. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

[32] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455, 2009.

[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[34] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems, pages 2310–2318, 2013.

[35] R. Stickgold and M. P. Walker. Sleep-dependent memory consolidation and reconsolidation. Sleep Medicine, 8(4):331–343, 2007.

", "award": [], "sourceid": 1715, "authors": [{"given_name": "Hanul", "family_name": "Shin", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Jung Kwon", "family_name": "Lee", "institution": "SK T-Brain"}, {"given_name": "Jaehong", "family_name": "Kim", "institution": "SK T-Brain"}, {"given_name": "Jiwon", "family_name": "Kim", "institution": "SK T-Brain"}]}