{"title": "Random Path Selection for Continual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 12669, "page_last": 12679, "abstract": "Incremental life-long learning is a main challenge towards the long-standing goal of Artificial General Intelligence. In real-life settings, learning tasks arrive in a sequence and machine learning models must continually learn to increment already acquired knowledge. The existing incremental learning approaches fall well below the state-of-the-art cumulative models that use all training classes at once. In this paper, we propose a random path selection algorithm, called RPS-Net, that progressively chooses optimal paths for the new tasks while encouraging parameter sharing and reuse. Our approach avoids the overhead introduced by computationally expensive evolutionary and reinforcement learning based path selection strategies  while achieving considerable performance gains. As an added novelty, the proposed model integrates knowledge distillation and retrospection along with the path selection strategy to overcome catastrophic forgetting. In order to maintain an equilibrium between previous and newly acquired knowledge, we propose a simple controller to dynamically balance the model plasticity.  Through extensive experiments, we demonstrate that the proposed method surpasses the state-of-the-art performance on incremental learning and by utilizing parallel computation this method can run in constant time with nearly the same efficiency as a conventional deep convolutional neural network.", "full_text": "Random Path Selection for Incremental Learning\n\nJathushan Rajasegaran\n\nMunawar Hayat\n\nSalman Khan\n\nFahad Shahbaz Khan\n\nLing Shao\n\nInception Institute of Arti\ufb01cial Intelligence\n\nfirst.last@inceptioniai.org\n\nAbstract\n\nIncremental life-long learning is a main challenge towards the long-standing goal\nof Arti\ufb01cial General Intelligence. 
In real-life settings, learning tasks arrive in a sequence, and machine learning models must continually build on already acquired knowledge. Existing incremental learning approaches fall well below the state-of-the-art cumulative models that use all training classes at once. In this paper, we propose a random path selection algorithm, called RPS-Net, that progressively chooses optimal paths for new tasks while encouraging parameter sharing. Since the reuse of previous paths enables forward knowledge transfer, our approach incurs a considerably lower computational overhead. As an added novelty, the proposed model integrates knowledge distillation and retrospection along with the path selection strategy to overcome catastrophic forgetting. To maintain an equilibrium between previous and newly acquired knowledge, we propose a simple controller that dynamically balances the model plasticity. Through extensive experiments, we demonstrate that the proposed method surpasses the state of the art on incremental learning and, by exploiting parallel computation, runs in constant time with nearly the same efficiency as a conventional deep convolutional neural network.

1 Introduction

The ability to incrementally learn novel tasks and acquire new knowledge is necessary for life-long machine learning. Deep neural networks suffer from 'catastrophic forgetting' [18], a phenomenon that occurs when a network is sequentially trained on a series of tasks and the learning acquired on new tasks interferes with previously learned concepts. As an example, in a typical transfer learning scenario, when a model pre-trained on a source task is adapted to another task by fine-tuning its weights, its performance significantly degrades on the source task, whose weights are overridden by the newly learned parameters [13]. 
It is, therefore, necessary to develop continual learning models capable of incrementally adding newly available classes without the need to retrain models from scratch using all previous class-sets (a cumulative setting).

An ideal incremental learning model must meet the following criteria: (a) As the model is trained on new tasks, it should maintain its performance on old ones, thus avoiding catastrophic forgetting. (b) The knowledge acquired on old tasks should help accelerate learning on new tasks (a.k.a. forward transfer), and vice versa. (c) As class-incremental learning progresses, the network must share and reuse previously tuned parameters to realize a bounded computational complexity and memory footprint. (d) At all learning phases, the model must maintain a tight equilibrium between the existing knowledge base and newly presented information (the stability-plasticity dilemma).

Code available at https://github.com/brjathu/RPSnet

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Despite several attempts, existing incremental learning models only partially address the above requirements. For example, [16] employs a distillation loss to preserve knowledge across multiple tasks but requires prior knowledge, at inference time, of the task a test sample belongs to. An incremental classifier and representation learning approach [21] jointly uses distillation and prototype rehearsal but retrains the complete network for new tasks, thus compromising model stability. The progressive network [22] lacks scalability as it grows paths linearly (and parameters quadratically) with the number of tasks. 
The elastic weight consolidation scheme [15] computes synaptic importance offline using the Fisher information metric, which restricts its scalability; while it works well for permutation tasks, its performance suffers on class-incremental learning [12].

Here, we argue that the most important characteristic of a true incremental learner is to maintain the right trade-off between 'stability' (an excess of which leads to intransigence) and 'plasticity' (an excess of which results in forgetting). We achieve this requisite via a dynamic path selection approach, called RPS-Net, that proceeds with random candidate paths and discovers the optimal one for a given task. Once a task is learned, we fix the parameters associated with it; they can then only be shared, not altered, by future tasks. To complement the previously learned representations, we propose a stacked residual design that focuses on learning the supplementary features suitable for new tasks. Besides, our learning scheme leverages exemplar-based retrospection and introduces an explicit controller module to maintain the equilibrium between stability and plasticity across all tasks. During training, our approach always operates within a constant parameter budget that at most equals that of a conventional linear model (e.g., a resnet [6]). Furthermore, it can be straightforwardly parallelized during both the train and test stages. 
With these novelties, our approach obtains state-of-the-art class-incremental learning results, surpassing the previous best model [21] by 7.38% and 10.64% on the CIFAR-100 and ImageNet datasets, respectively.

Our main contributions are:

• A random path selection approach that provides faster convergence through path sharing and reuse.
• A residual learning framework that incrementally learns residual paths, which allows network reuse and accelerates the learning process, resulting in faster training.
• A hybrid approach that combines the respective strengths of knowledge distillation (via regularization), retrospection (via exemplar replay) and dynamic architecture selection methodologies to deliver strong incremental learning performance.
• A novel controller that guides the plasticity of the network to maintain an equilibrium between previously learned knowledge and newly presented tasks.

2 Related Work

The catastrophic interference problem was first noted to hinder the learning of connectionist networks by [18]. This highlights the stability-plasticity dilemma in neural networks [1], i.e., a rigid and stable model will not be able to learn new concepts, while an easily adaptable model is susceptible to forgetting old concepts due to major parameter changes. Existing continual learning schemes can be divided into three broad categories: (a) regularization schemes, (b) memory-based retrospection and replay, and (c) dynamic sub-network training and expansion.

A major trend in continual learning research has been proposing novel regularization schemes that avoid catastrophic forgetting by controlling the plasticity of network weights. [16] proposed a knowledge distillation loss [7] which forces the network to retain its predictions on old tasks. Kirkpatrick et al. 
[15] proposed an elastic weight consolidation mechanism that quantifies the relevance of parameters to a particular task and correspondingly adjusts the learning rate. In a similar spirit, [28] designed intelligent synapses which measure their relevance to a particular task and consequently adjust plasticity during learning to minimize interference with old tasks.

Rebuffi et al. [21] proposed a distillation scheme intertwined with exemplar-based retrospection to retain previously learned concepts. [8] considered a similar approach for cross-dataset continual learning [16]. The combination of episodic (short-term) and semantic (long-term) memory was studied in [11, 5, 10] to perform memory consolidation and retrieval. In particular, [10, 11] avoid explicitly storing exemplars in memory, instead using a generative process to recall memories.

Figure 1: An overview of our RPS-Net: The network architecture utilizes a parallel residual design where the optimal path is selected among a set of randomly sampled candidate paths for new tasks. The residual design allows forward knowledge transfer and faster convergence for later tasks. The random path selection approach is trained with a hybrid objective function that ensures the right trade-off between network stability and plasticity, thus avoiding catastrophic forgetting.

The third stream of works explores dynamically adapting network architectures to cope with growing learning tasks. [22] proposed a network architecture that progressively adds new branches for novel tasks, laterally connected to the fixed existing branches. Similarly, [26] proposed a network that not only grows incrementally but also expands hierarchically. Specific paths through the network were selected for each learning task using a genetic algorithm in PathNet [4]. 
Afterwards, task-relevant paths were fixed and reused for new tasks to speed up subsequent learning.

The existing adaptive network architectures come with their respective limitations: the complexity of [22] grows linearly with the number of tasks; [26] has an expensive training procedure and a somewhat rigid architecture; and [4] does not allow incrementally learning new classes, due to a detached output layer and a relatively expensive genetic learning algorithm. In comparison, we propose a random path selection methodology that provides a significant performance boost and enables faster convergence. Furthermore, our approach combines the respective strengths of the above two types of methods by introducing a distillation procedure alongside exemplar-based memory replay to avoid catastrophic forgetting.

3 Method

We consider the recognition problem in an incremental setting where new tasks are sequentially added. We assume a total of K tasks, each comprising U classes. Our goal is to sequentially learn a deep neural network that not only performs well on new tasks but also retains its performance on old tasks. To address this problem, we propose a random path selection approach (RPS-Net) that progressively builds on previously acquired knowledge to facilitate faster convergence and better performance. In the following, we explain our network architecture, the path selection strategy, a hybrid objective function and the training procedure for incremental learning.

3.1 RPS-Net Architecture

Our network consists of L distinct layers (see Figure 1). Each layer ℓ ∈ [1, L] constitutes a set of basic building blocks, called modules Mℓ. 
For simplicity, we consider each layer to contain an equal number of modules, stacked in parallel, i.e., Mℓ = {Mℓ_m}_{m=1}^{M}, along with a skip connection module Mℓ_skip that carries the bypass signal. The skip connection module Mℓ_skip is an identity function when the feature dimensions do not change, and a learnable module when the dimensions vary between consecutive layers. A module Mℓ_m is a learnable sub-network that maps the input features to the outputs. In our case, we consider a simple combination of (conv-bn-relu-conv-bn) layers for each module, similar to a single resnet block [6]. In contrast to a residual block, which consists of a single identity connection and a residual branch, we have one skip connection and M residual blocks stacked in parallel. The intuition behind this parallel architecture is to ensure that multiple tasks can be continually learned without causing catastrophic interference with other paths, while simultaneously providing parallelism to ensure efficiency.

Towards the end of each layer in RPS-Net, all the residual connections, as well as the skip connection, are combined using element-wise addition to aggregate complementary task-specific features obtained from different paths. Remarkably, for the base case M = 1, the network is identical to a conventional resnet model. 
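The layer computation just described (M parallel modules plus a skip path, with all active branches aggregated by element-wise addition) can be sketched as follows. This is a minimal, dependency-free illustration: the tiny two-layer MLP standing in for a (conv-bn-relu-conv-bn) module, and all dimensions, are our own assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_module(d_in, d_out):
    """Stand-in for one (conv-bn-relu-conv-bn) module; here a tiny
    two-layer MLP so the sketch stays dependency-free."""
    W1 = rng.standard_normal((d_in, d_out)) * 0.1
    W2 = rng.standard_normal((d_out, d_out)) * 0.1
    return lambda x: np.maximum(x @ W1, 0) @ W2

class RPSLayer:
    """One RPS-Net layer: M parallel residual modules plus a skip path.
    Only modules enabled in the binary path row contribute; all active
    branches are aggregated by element-wise addition."""
    def __init__(self, d_in, d_out, M):
        self.modules = [make_module(d_in, d_out) for _ in range(M)]
        # Skip is identity when dimensions match, learnable otherwise.
        self.skip = (lambda x: x) if d_in == d_out else make_module(d_in, d_out)

    def forward(self, x, path_row):
        out = self.skip(x)
        for m, on in enumerate(path_row):
            if on:  # only modules with P(l, m) = 1 are evaluated
                out = out + self.modules[m](x)
        return out

layer = RPSLayer(d_in=16, d_out=16, M=8)
x = rng.standard_normal((4, 16))
y = layer.forward(x, path_row=[1, 0, 0, 0, 0, 0, 0, 0])
print(y.shape)  # (4, 16)
```

With an all-zero path row the layer reduces to its skip connection, which matches the base-case observation above: with M = 1 and the single module always enabled, the layer behaves like a standard residual block.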
After the Global Average Pooling (GAP) layer collapses the input feature maps into a final feature f ∈ R^D, we use a fully connected classifier with weights W_fc ∈ R^{D×C} (C being the total number of classes) that is shared among all tasks.

For a given RPS-Net with M modules and L layers, we can define a path P_k ∈ R^{L×M} for a task k:

P_k(ℓ, m) = 1 if the module Mℓ_m is added to the path, and 0 otherwise.    (1)

The path P_k is basically arranged as a stack of one-hot encoded row vectors e(i) (the i-th standard basis vector):

P_k = { P_k(ℓ) ∈ {0, 1}^M : P_k(ℓ) = e(i) ≡ Σ_{m=1}^{M} P_k(ℓ, m) = 1 },  s.t.  i ∼ U(Z ∩ [1, M]),    (2)

where i is the selected module index, uniformly sampled using U(·) over the set of integers [1, M]. We define two sets of paths, P^tr_k and P^ts_k, that denote the train and inference paths, respectively. Both are formulated as binary matrices: P^{tr,ts}_k ∈ {0, 1}^{L×M}. When training the network, any m-th module in the ℓ-th layer with P^tr_k(ℓ, m) = 1 is activated, and all such modules together constitute a training path P^tr_k for task k. As we will elaborate in Sec. 3.2, the inference path is evolved during training by sequentially adding newly discovered training paths, and ends up as a "common" inference path for all inputs; therefore our RPS-Net does not require knowledge about the task an input belongs to. Some previous methods (e.g., [16]) need such information, which limits their applicability to real-world incremental class-learning settings where one does not know in advance the corresponding task for an input sample. 
Similarly, only the modules with P^ts_k(ℓ, m) = 1 are used at the inference stage.

3.2 Path Selection

With a total of K tasks, we assume a constant number of U classes observed in each k-th task, such that U = C/K. Without loss of generality, the proposed path selection strategy can also be applied to a variable number of classes per task. The path selection scheme enables incremental and bounded resource allocation, with progressive learning that ensures knowledge exchange between old and new tasks, resulting in positive forward and backward transfer.

To promote resource reuse during training, which in turn improves training speed and minimizes computational requirements, we propose to perform path selection only after every J tasks, where 1 < J < K. As a result, path selection is performed only ⌈K/J⌉ times in total during the complete training process. Our experiments show that J can be set to a higher value without sacrificing incremental learning performance (see Sec. 4.3). For every J tasks, N paths are randomly sampled and trained; the best path is then selected from this group of N sub-models and is shared among the next J tasks. Further, we stop the training of old modules (i.e., fix their paths and parameters) once training for a particular group of tasks is completed. Hence, at any point, only L layers with at most one module each are being trained.

The random path selection strategy is illustrated in Fig. 2. Our choice of random path generation as a mechanism to select an optimal path is mainly inspired by the recent works of [27, 30, 20]. These works show that random search for an optimal network architecture performs almost on par with other computationally demanding approaches, e.g., genetic algorithms and reinforcement learning (RL) based methods. 
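The sampling in Eq. (2) amounts to drawing one module index per layer and one-hot encoding it. A minimal sketch (the L, M and N values are illustrative, not the paper's settings):

```python
import numpy as np

def sample_path(L, M, rng):
    """Sample a random path P in {0,1}^{L x M}: each layer's row is a
    one-hot vector e(i) with the module index i drawn uniformly
    from {1, ..., M}, as in Eq. (2)."""
    P = np.zeros((L, M), dtype=int)
    P[np.arange(L), rng.integers(0, M, size=L)] = 1
    return P

rng = np.random.default_rng(42)
# N candidate paths for the next group of J tasks; the best of the
# N trained sub-models is then kept and shared.
candidates = [sample_path(L=9, M=8, rng=rng) for _ in range(8)]
assert all((p.sum(axis=1) == 1).all() for p in candidates)
```

Each candidate is a valid path because every row sums to exactly one, i.e., exactly one module per layer is enabled.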
Besides, some incremental learning approaches resort to adding new resources to the network, resulting in network expansion [22, 26]. In contrast, our path selection algorithm does not result in linear expansion of resources, since a new path is created only after every J tasks and overlapping modules are reused when the new path intersects old paths. Further, even when all the modules are exhausted (saturated), the skip connections are always trained. We show via an extensive ablation study that even when all paths are saturated, our RPS-Net can still learn useful representations, as the skip connections and classification layer remain tunable in every case.

Figure 2: Path Selection Approach: Given a task k, N random paths are initialized. For each path, only the modules different from the previous inference path P^ts_{k−1} are used to form the training path P^tr_k. Among N such paths, the optimal P_k is selected and combined with P^ts_{k−1} to obtain P^ts_k. Notably, path selection is only performed after every J tasks. During training, the complexity remains bounded by a standard single-path network and the resources are shared between tasks.

At any point in time, we train a single path (equivalent to a resnet) while the rest of the inference path is fixed. Due to this, the path we use for a task k essentially learns a residual signal relative to the fixed paths previously trained for old tasks. For example, if we are training P^tr_k, the weights of P^ts_{⌊k/J⌋} ⊻ P^tr_k are fixed, where ⊻ denotes exclusive disjunction (the logical XOR operation). Essentially, the complete P^tr_k is not used for training; rather, its disjoint portion that has not already been trained for any of the old tasks is learned, i.e., P^tr_k ⊻ (P^tr_k ∧ P^ts_{⌊k/J⌋}). In this way, previous knowledge is shared across the network via overlapping paths and skip connections. 
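The trainable-portion rule above is a pair of bitwise operations on the binary path matrices. A small sketch with hypothetical 3-layer, 4-module paths:

```python
import numpy as np

def trainable_mask(P_tr, P_ts_prev):
    """Modules of the new training path that were NOT already trained
    for old tasks: P_tr XOR (P_tr AND P_ts_prev). Overlapping modules
    are reused with frozen weights."""
    return P_tr ^ (P_tr & P_ts_prev)

# Toy example (hypothetical paths, not from the paper).
P_ts_prev = np.array([[1, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 1, 0]])
P_tr      = np.array([[1, 0, 0, 0],   # overlaps -> reused, frozen
                      [0, 0, 1, 0],   # new module -> trained
                      [0, 0, 1, 0]])  # overlaps -> reused, frozen
print(trainable_mask(P_tr, P_ts_prev))
# only layer 2's newly chosen module remains trainable
```

Only the middle layer picked a module outside the old inference path, so only that module (plus the always-trainable skip connections and classifier) receives gradient updates.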
When the network is already trained for several tasks, a new path for the current task only needs to learn higher-order residuals of the network. This has the added advantage that convergence becomes faster as we learn more tasks, since each new task is learned by taking advantage of previous information. The optimal path among the N path configurations, selected on the basis of performance, becomes P_k. All such task-specific paths are progressively combined to evolve a common inference path P^ts_k:

P^ts_k = P^tr_1 ∨ P^tr_2 ∨ . . . ∨ P^tr_k,    (3)

where ∨ denotes the inclusive disjunction (logical OR) operation. At each task k, the inference path P^ts_k is used to evaluate all previously seen classes.

3.3 Incremental Learning Objective

Loss function: We use a hybrid loss function that combines a regular cross-entropy loss with a distillation loss to incrementally train the network. For a task k ∈ [1, K], with each task having U classes, we calculate the cross-entropy loss as

L_ce = −(1/n) Σ_i t_i[1 : k ∗ U] log(softmax(q_i[1 : k ∗ U])),    (4)

where i denotes the example index, t(x) is the one-hot encoded true label, q(x) are the logits obtained from the network's last layer, and n is the mini-batch size. 
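The masked cross-entropy of Eq. (4) restricts the softmax to the first k·U logits (the classes seen so far), leaving the remaining slots of the shared classifier untouched. A minimal sketch with illustrative toy dimensions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_seen_classes(logits, labels, k, U):
    """Cross-entropy of Eq. (4): only the first k*U logits (classes
    seen so far) take part in the softmax; later class slots of the
    shared classifier are ignored."""
    n = logits.shape[0]
    probs = softmax(logits[:, :k * U])
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

# Toy setting: C = 10 total class slots, k = 2 tasks of U = 3 classes seen.
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 10))
labels = np.array([0, 2, 4, 5])  # all within the 6 seen classes
loss = ce_seen_classes(logits, labels, k=2, U=3)
assert loss > 0
```

Note that perturbing the logits of unseen classes leaves the loss unchanged, which is exactly the point of the masking.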
To keep the network robust to catastrophic forgetting, we also use a distillation loss in the objective function,

L_dist = (1/n) Σ_i KL( log σ(q_i[1 : (k − 1) ∗ U] / t_e), σ(q'_i[1 : (k − 1) ∗ U] / t_e) ),    (5)

where σ is the softmax function, t_e is the temperature used in [7], and q'(x) are the logits obtained from the network's previous state.

Controller: It is crucial to maintain a balance between previously acquired learning and the knowledge available from the newly presented task. If the learning is biased towards either of the two objectives, it will result in either catastrophic forgetting (losing old-task learning) or interference (obstructed learning of the new task). Since our network is trained with a combined objective function comprising L_ce and L_dist, it is necessary to adequately control the plasticity of the network. We propose the following controller, which seeks to maintain an equilibrium between L_ce and L_dist:

L = L_ce + φ(k, γ) · L_dist,    (6)

where φ(k, γ) is a scalar coefficient function with γ as a scaling factor, introduced to increase the distillation contribution to the total loss. Intuitively, as we progress through training, φ(k, γ) increases to ensure that the network remembers old information:

φ(k, γ) = { 1, if k ≤ J;  (k − J) ∗ γ, otherwise. }    (7)

4 Experiments and Results

4.1 Implementation Details

Dataset and Protocol: For our experiments, we use evaluation protocols similar to iCARL [21]. We incrementally learn 100 classes on CIFAR-100 in groups of 10, 20 and 50 classes at a time. For ImageNet, we use the same subset as [21], comprising 100 classes, and incrementally learn them in groups of 10. 
After training on a new group of classes, we evaluate the trained model on test samples of all seen classes (including current and previous tasks). Following iCARL [21], we restrict the exemplar memory budget to 2k samples for the CIFAR-100 and ImageNet datasets. Note that, unlike iCARL, we select our exemplars randomly and do not employ any herding or exemplar selection mechanism.

We also evaluate our model on the MNIST and SVHN datasets. For this, we resize all images to 32×32 and keep a random exemplar set of 4.4k samples, as in [9]. We group two consecutive classes into one task and incrementally learn five tasks. For evaluation, we report the average accuracy over all classes (A5).

Training: For the CIFAR-100 dataset, we use resnet-18 along with max pooling after the 5th and 7th blocks and global average pooling (GAP) after the 9th block. For the ImageNet dataset, we use the standard resnet-18 architecture as in [21]. After the GAP layer, a single fully connected layer with weights W_fc ∈ R^{512×100} is used as the classifier. For MNIST, a simple two-layer MLP (with 400 neurons each) is used, whereas for SVHN we use resnet-18, similar to [9].

For each task, we train our model for 100 epochs using Adam [14] with t_e = 2, with the learning rate starting from 10^{−3} and divided by 2 after every 20 epochs. We set the controller's scaling factor to γ = 2.5 and γ = 10 for the CIFAR and ImageNet datasets, respectively. We use the ratio between the number of training samples for a task and the fixed number of exemplars as the value for γ. We fix M = 8 and J = 2, except for the 50-classes-per-task setting, where J = 1. We do not use any weight or network regularization scheme, such as dropout, in our model. For augmentation, training images are randomly cropped, flipped and rotated (< 10°). For each task, we train N = 8 models in parallel using an NVIDIA DGX-1 machine. 
These models come from the randomly sampled paths in our approach and may have some parts frozen due to overlap with previous tasks. Our code is available at https://github.com/brjathu/RPSnet.

4.2 Results and Comparisons

Figure 3: Results on CIFAR-100 with 10, 5 and 2 tasks (from left to right). We surpass state-of-the-art results.

Table 1: Comparison on the MNIST and SVHN datasets. Ours is a memory-based approach (denoted by '∗'), and outperforms the state-of-the-art.

Methods          MNIST (A5)   SVHN (A5)
Joint training     97.53%       93.23%
EWC [15]           19.80%       18.21%
online-EWC [23]    19.77%       18.50%
SI [28]            19.67%       17.33%
MAS [2]            19.52%       17.32%
LwF [16]           24.17%          -
GEM∗ [17]          92.20%       75.61%
DGR∗ [24]          91.24%          -
RtF∗ [25]          92.56%          -
RPS-Net∗           96.16%       88.91%

Figure 4: Results on the ImageNet dataset for learning 10 classes at a time. We surpass state-of-the-art results by 10.3%.

We extensively compare the proposed technique with existing state-of-the-art methods for incremental learning. These include Elastic Weight Consolidation (EWC) [15], Riemannian Walk (RWalk) [3], Learning without Forgetting (LwF) [16], Synaptic Intelligence (SI) [28], Memory Aware Synapses (MAS) [2], Deep Model Consolidation (DMC) [29] and Incremental Classifier and Representation Learning (iCARL) [21]. We further evaluate three baseline approaches: Fixed Representation (FixedRep), where the convolutional part of the model is frozen and only the classifier is trained for newly added classes; FineTune, where the complete previously learnt model is tuned on the new data; and Oracle, where the model is trained on all samples from previous and current tasks.

Fig. 3 compares different methods on the CIFAR-100 dataset, where we incrementally learn groups of 10, 20 and 50 classes at a time. The results indicate superior performance of the proposed method in all settings. 
For the case of learning 10 classes at a time, we outperform iCARL [21] by an absolute margin of 7.3%. Compared with the second-best method, our approach achieves a relative gain of 5.3% and 9.7%, respectively, for the cases of incrementally learning 20 and 50 classes on the CIFAR-100 dataset. For the case of 50 classes per task, our performance is only 3.2% below the Oracle approach, where all current and previous class samples are used for training. Fig. 4 compares different methods on the ImageNet dataset. The results show that, for experimental settings consistent with iCARL [21], our proposed method achieves a significant absolute performance gain of 10.3% over the existing state of the art [21]. Our experimental results indicate that the commonly used technique of fine-tuning a model on new classes is a clearly inferior approach that results in catastrophic forgetting. Table 1 compares different methods on the MNIST and SVHN datasets, following the experimental setting of [9]. The results show that RPS-Net surpasses all previous methods by margins of 4.3% and 13.3%, respectively, on the MNIST and SVHN datasets. The results further indicate that methods which do not use a memory perform considerably worse.

4.3 Ablation Studies and Analysis

Contribution from Each Component of RPS-Net: Fig. 5a studies the impact of progressively integrating the individual components of our RPS-Net. We begin with a simple baseline model with a single path, which achieves 37.97% classification accuracy on the CIFAR-100 dataset. When a distillation loss is used alongside the baseline model, the performance increases to 44.93%. The addition of our proposed controller φ(k, γ) in the loss function gives a further significant boost of +6.83%, resulting in an overall accuracy of 51.76%. Finally, the proposed multi-path selection algorithm, along with the above components, increases the classification accuracy up to 58.48%. 
This demonstrates that our two contributions, the controller and multi-path selection, provide a combined gain of 13.6% over baseline + distillation.

Increase in the #Parameters: Fig. 5b compares the total number of parameters across tasks for Progressive Nets [22], iCARL [21] and our RPS-Net on CIFAR-100. Our model effectively reuses previous parameters, and the model size does not increase significantly with tasks. After 10 tasks, RPS-Net has 72.26M parameters on average, compared with iCARL (21.3M) and Progressive Nets (932.84M). In RPS-Net the number of parameters and FLOPs increase logarithmically, while for Progressive Nets they increase quadratically.

Scaling Factor γ: It controls the equilibrium between the cross-entropy and distillation losses (i.e., the balance between new and old tasks). As Fig. 6 shows, for smaller γ the network tends to forget old information while learning the new tasks well, and vice versa. For example, when γ = 1 (the same loss function as used in iCaRL [21]) the performance drops after 5 tasks, showing that the model is not at its equilibrium state. On the other hand, γ = 8 achieves the best performance on earlier tasks (2, 3, 4 and 5), with a drop in performance towards the later tasks (51% at task 10). 
Empirically, we found the\noptimal value for \u03b3 = 2.5, to keep the equilibrium till last tasks.\nVarying Blocks and Paths: One of the important restriction in RPS-Net design is the networks\u2019\ncapacity, upper-bounded by M\u00d7L modules. As proposed in the learning strategy, a module is trained\nonly once for a path. Hence, it is interesting to study whether the network saturates for a high number\nof tasks. To analyze this effect, we change the parameter M and J. Our results with varying M\nare reported in Fig. 6, which demonstrate that the network can perform well even when all paths\nare saturated. This effect is a consequence of our residual design where skip connections and last\nclassi\ufb01cation layer are always trained, thus helping to continually learn new tasks even if the network\nis saturated. If saturation occurs, the model has already learned the generalization of input distribution,\nhence, a residual signal (carrying complementary information) via skip connections is enough to\nadjust to a new task. Further, once the network has seen many tasks, it learns generalizable features\nthat can work well for future tasks with adaptation of the \ufb01nal classi\ufb01cation layer weights.\nIn Fig. 6, we illustrate results with varying paths (paths \u221d 1\nJ ) in the network. We note that learning a\nhigh number of paths degrades performance as the previously learned parameters are less likely to be\neffectively reused. 
On the other hand, we obtain comparable performance with fewer paths (e.g., 2 for CIFAR-100).

Figure 5: From left to right: (a) Contribution of each component of RPS-Net, (b) increase in the number of parameters with the number of tasks, (c) RPS-Net performance for different memory sizes, and (d) forward transfer showing faster convergence as the tasks increase.

Figure 6: From left to right: Ablation analysis for the parameters γ, J & M, and the number of FLOPS.

Figure 7: Confusion matrices over 10 incremental tasks on CIFAR-100, showing backward knowledge transfer.

Difference from Genetic Algorithms: We compare our random selection with a genetic algorithm, i.e., Binary Tournament Selection (BTS) with a maximum of 25 generations, on MNIST with 5 tasks (each of 2 classes), using a simple 2-layer MLP (100 neurons) with M = 8, J = 1.
Over 5 runs, our proposed random selection achieves an average accuracy of 96.52%, versus 96.32% for BTS. At the same time complexity as ours, BTS achieves an average accuracy of only 71.24% with its first-generation models. For BTS to match the performance of our random selection, it needs an average of 10.2 generations (more than the number of random paths); hence BTS has a higher compute complexity. Sophisticated genetic algorithms may beat random selection by a small margin, but likely at a high compute cost, which is not suitable for an incremental classifier learning setting with multiple tasks.
Forward Transfer: The convergence trends in Fig. 5d demonstrate the forward knowledge transfer of RPS-Net. We can see that task-2 takes relatively longer to converge than task-10. Precisely, for the final task the model achieves 95% of its total performance within a single epoch, while for the second task it starts at 65% and takes up to 20 epochs to reach 95% of its final accuracy. This trend shows the faster convergence of our model on newer tasks. The effect is due to residual learning as well as overlapping module sharing in the RPS-Net design, demonstrating its forward transfer capability.
Backward Transfer: Fig. 7 shows the evolution of our model with new tasks. The performance on the current task (k) is lower than on the previous tasks (<k); yet, as the model evolves, the performance on task k gradually increases. This demonstrates the model's capability of backward knowledge transfer, which is also reflected in the biology of the human brain. Specifically, the hippocampus accomplishes fast learning, which is later slowly consolidated via the slow learning in the neocortex [19]. In Fig. 7, we can see this pattern of slow learning, with the performance on new tasks gradually maturing. We also quantitatively validate backward transfer with the BWT metric (Eq. 3 in GEM [17]; the larger, the better).
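The BWT metric from GEM can be computed from an accuracy matrix R, where R[i][j] is the test accuracy on task j after training up to task i. A minimal sketch (the function name is ours):

```python
def backward_transfer(R):
    """GEM's BWT: for each earlier task j, compare the final
    accuracy R[T-1][j] against the accuracy R[j][j] measured right
    after task j was learned, averaged over the T-1 earlier tasks.
    Values near zero mean little forgetting; negative values mean
    performance on old tasks degraded."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
```

For a toy two-task run with R = [[0.9, 0.0], [0.7, 0.8]], BWT is (0.7 - 0.9)/1 = -0.2.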
After the last task, the BWT values are -0.1462 (RPS-Net) vs. -0.4602 (iCARL), which shows the better backward transfer capability of our model.
FLOPS comparison: As the number of tasks increases, the network's complexity grows. As shown in Fig. 6, across different configurations of modules and paths, the computational complexity of our approach scales logarithmically, i.e., the complexity of RPS-Net is bounded by O(log(#tasks)). This is because the overlap between modules increases as training progresses. Further, in our setting we choose a new path only after every J > 1 tasks; hence, in practice our computational complexity stays well below the worst-case logarithmic curve. For example, with the setting M = 2, J = 2, the computational requirements reduce by 63.7% while still achieving the best performance. We also show that even when a single path is used for all tasks (M = 1), our model achieves almost the same performance as the state of the art with constant computational complexity.

5 Conclusion

Learning tasks appear in a sequential order in real-world problems, and a learning agent must continually increment its existing knowledge. Deep neural networks excel in the cumulative learning setting where all tasks are available at once, but their performance deteriorates rapidly in incremental learning. In this paper, we propose a scalable approach to class-incremental learning that aims to keep the right balance between previously acquired knowledge and newly presented tasks. We achieve this using an optimal path selection approach that supports parallelism and knowledge exchange between old and new tasks. Further, a controlling mechanism is introduced to maintain an equilibrium between the stability and plasticity of the learned model. Our approach delivers strong performance gains on the MNIST, SVHN, CIFAR-100 and ImageNet datasets for incremental learning problems.

References
[1] W. C.
Abraham and A. Robins. Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):73–78, 2005.

[2] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.

[3] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.

[4] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[5] A. Gepperth and C. Karaoguz. A bio-inspired incremental learning architecture for applied perceptual problems. Cognitive Computation, 8(5):924–934, 2016.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[7] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[8] S. Hou, X. Pan, C. Change Loy, Z. Wang, and D. Lin. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 437–452, 2018.

[9] Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. 2018.

[10] N. Kamra, U. Gupta, and Y. Liu. Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368, 2017.

[11] R. Kemker and C. Kanan. FearNet: Brain-inspired model for incremental learning.
International Conference on Learning Representations, 2018.

[12] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun. A guide to convolutional neural networks for computer vision. Synthesis Lectures on Computer Vision, 8(1):1–207, 2018.

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. In Proceedings of the National Academy of Sciences, volume 114, pages 3521–3526, 2017.

[16] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.

[17] D. Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

[18] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.

[19] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569, 2018.

[20] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[21] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[22] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[23] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.

[24] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.

[25] G. M. van de Ven and A. S. Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.

[26] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177–186. ACM, 2014.

[27] S. Xie, A. Kirillov, R. Girshick, and K. He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.

[28] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3987–3995. JMLR.org, 2017.

[29] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. J. Kuo. Class-incremental learning via deep model consolidation. arXiv preprint arXiv:1903.07864, 2019.

[30] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.