{"title": "Continual Unsupervised Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7647, "page_last": 7657, "abstract": "Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled.\nThe proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. \nWe demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. \nFurther, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.", "full_text": "Continual Unsupervised Representation Learning\n\nDushyant Rao, Francesco Visin, Andrei A. Rusu,\nYee Whye Teh, Razvan Pascanu, Raia Hadsell\u2217\n\nDeepMind\nLondon, UK\n\nAbstract\n\nContinual learning aims to improve the ability of modern learning systems to\ndeal with non-stationary distributions, typically by attempting to learn a series\nof tasks sequentially. 
Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.

1 Introduction

Humans have the impressive ability to learn many different concepts and perform different tasks in a sequential lifelong setting. For example, infants learn to interact with objects in their environment without clear specification of tasks (task-agnostic), in a sequential fashion without forgetting (non-stationary), from temporally correlated visual inputs (non-i.i.d), and with minimal external supervision (unsupervised). For a learning system such as a robot deployed in the real world, it is highly desirable to satisfy these desiderata as well. In contrast, learning algorithms often require input samples to be shuffled in order to satisfy the i.i.d.
assumption, and have been shown to\nperform poorly when trained on sequential data, with newer tasks or concepts overwriting older\nones; a phenomenon known as catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow\net al., 2013). As a result, there has been renewed research focus on the continual learning problem\nin recent years (e.g. Kirkpatrick et al., 2017; Nguyen et al., 2017; Zenke et al., 2017; Shin et al.,\n2017), with several approaches addressing catastrophic forgetting as well as backwards or forwards\ntransfer\u2014using the current task to improve performance on past or future tasks. However, most of\nthese techniques have focused on a sequence of tasks in which both the identity of the task (task\nlabel) and boundaries between tasks are provided; moreover, they often focus on the supervised\nlearning setting, where class labels for each data point are given. Thus, many of these methods fail\nto capture some of the aforementioned properties of real-world continual learning, with unknown\ntask labels or poorly de\ufb01ned task boundaries, or when abundant class-labelled data is not available.\nIn this paper, we propose to address the more general unsupervised continual learning setting (also\nsuggested separately by Smith et al. (2019)), in which task labels and boundaries are not provided\n\n\u2217Correspondence to: {dushyantr, visin}@google.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Graphical model for\nCURL. The categorical task\nvariable y is used to instantiate\na latent mixture-of-Gaussians\nz, which is then decoded to x.\n\nFigure 2: Diagram of the proposed approach, showing the\ninference procedure and architectural components used.\n\nto the learner, and hence the focus is on unsupervised task learning. 
The tasks could correspond to\neither unsupervised representation learning, or learning skills without extrinsic reward if applied to\nthe reinforcement learning domain. In this sense, the problem setting is \u201cunsupervised\u201d in two ways:\nin terms of the absence of task labels (or indeed well-de\ufb01ned tasks themselves), and in terms of the\nabsence of external supervision such as class labels, regression targets, or external rewards. The\ntwo aspects may seem independent, but considering the unsupervised learning problem encourages\nsolutions that aim to capture all fundamental properties of the data, which in turn might encourage,\nor reinforce, particular ways of addressing the task boundary problem. Hence the two aspects are\nconnected through the type of solutions they necessitate, and it is bene\ufb01cial to consider them jointly.\nWe argue that this is an important and challenging open problem, as it enables continual learning\nin environments without clearly de\ufb01ned tasks and goals, and with minimal external supervision.\nRelaxing these constraints is crucial to performing lifelong learning in the real world.\n\nOur approach, named Continual Unsupervised Representation Learning (CURL), learns a task-\nspeci\ufb01c representation on top of a larger set of shared parameters, and deals with task ambiguity by\nperforming task inference within the model. We endow the model with the ability to dynamically\nexpand its capacity to capture new tasks, and suggest methods to minimise catastrophic forgetting.\nThe model is experimentally evaluated in a variety of unsupervised settings: when tasks or classes\nare presented sequentially, when training data are shuf\ufb02ed, and with ambiguous task boundaries\nwhen transitions are continuous rather than discrete. We also demonstrate that despite focusing on\nunsupervised learning, the method can be trivially adapted to supervised learning while removing the\nreliance on task knowledge and class labels. 
The experiments demonstrate competitive performance with respect to previous work, with the additional ability to learn without supervision in a continual learning setting, and indicate the efficacy of the different components of the proposed method.

2 Model

We begin by defining the CURL model and training loss, then introduce methods to perform dynamic expansion, and propose a generative replay mechanism to combat forgetting.

2.1 Inference over tasks

To address the problem, we utilise the following generative model (Figure 1):

y ∼ Cat(π),
z ∼ N(µz(y), σ²z(y)),
x ∼ Bernoulli(µx(z)),    (1)

with the joint probability factorising as p(x, y, z) = p(y)p(z | y)p(x | z). Here, the categorical variable y indicates the current task, which is then used to instantiate the task-specific Gaussian parameters for latent variable z, which is then decoded to produce the input x. p(y) is a fixed uniform prior, with component weights specified by π. In the representation learning scenario, y can be interpreted as representing some discrete clusters in the data, with z then representing a mixture of Gaussians which encodes both the inter- and intra-cluster variation. Posterior inference of p(y, z | x) in this model is intractable, so we employ an approximate variational posterior q(y, z | x) = q(y | x)q(z | x, y).

Each of these components is parameterised by a neural network: the input is encoded to a shared representation, the mixture probabilities q(y | x) are determined by an output softmax "task inference" head, and the Gaussian parameters for q(z | x, y = k) are produced by the output of a component-specific latent encoding head (one for each component k).
The component-specific prior parameters µz(y) and σz(y) are parameterised as a linear layer (followed by a softplus nonlinearity for the latter) using a one-hot representation of y as the input. Finally, the decoder is a single network that maps from the mixture-of-Gaussians latent space z to the reconstruction x̂. The architecture is shown in Figure 2, where for simplicity, we denote the parameters of the kth Gaussian by {µ(k), σ(k)}. The loss for this model is the evidence lower bound (ELBO) given by:

log p(x) ≥ L = E_{q(y,z | x)}[log p(x, y, z) − log q(y, z | x)]
             = E_{q(y | x)q(z | x,y)}[log p(x | z)] − E_{q(y | x)}[KL(q(z | x, y) || p(z | y))] − KL(q(y | x) || p(y))    (2)

The expectation over q(y | x) can be computed exactly by marginalising over the K categorical options, but the expectation over q(z | x, y) is intractable, and requires sampling. The resulting Monte Carlo approximation comprises a set of familiar terms, some of which correspond clearly to the single-component VAE (Kingma & Welling, 2013; Rezende et al., 2014):

L ≈ Σ_{k=1}^{K} q(y = k | x) [log p(x | z̃(k)) − KL(q(z | x, y = k) || p(z | y = k))] − KL(q(y | x) || p(y))    (3)

Here the first factor is the component posterior, the bracketed terms are the component-wise reconstruction loss and the component-wise regulariser, the final term is the categorical regulariser, and z̃(k) ∼ q(z | x, y = k) is sampled using the reparametrisation trick. Of course, this can be generalised to multiple samples in a similar fashion to the Importance-Weighted Autoencoder (IWAE) (Burda et al., 2015).

Intuitively, this loss encourages the model to reconstruct the data and perform clustering where possible.
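To make the marginalised objective concrete, the following is a minimal numpy sketch of the single-sample Monte Carlo estimate in Eqn. 3. It is an illustration only: the toy diagonal-Gaussian parameterisation, the function names, and the `decode` argument are ours, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bernoulli_log_lik(x, mu_x):
    """log p(x | z) for a Bernoulli decoder with mean mu_x in (0, 1)."""
    return np.sum(x * np.log(mu_x) + (1.0 - x) * np.log(1.0 - mu_x))

def curl_elbo(x, q_y, posteriors, priors, decode):
    """Single-sample Monte Carlo estimate of the loss in Eqn. 3.

    q_y:        posterior q(y | x) over the K components
    posteriors: posteriors[k] = (mu, var) of q(z | x, y = k)
    priors:     priors[k] = (mu, var) of p(z | y = k)
    decode:     maps z to Bernoulli means mu_x(z)
    """
    K = len(q_y)
    elbo = 0.0
    for k in range(K):
        mu_q, var_q = posteriors[k]
        mu_p, var_p = priors[k]
        # reparametrisation trick: z~(k) ~ q(z | x, y = k)
        z_k = mu_q + np.sqrt(var_q) * rng.standard_normal(mu_q.shape)
        # component posterior * (reconstruction - component-wise regulariser)
        elbo += q_y[k] * (bernoulli_log_lik(x, decode(z_k))
                          - kl_diag_gauss(mu_q, var_q, mu_p, var_p))
    # categorical regulariser: KL(q(y | x) || uniform p(y))
    elbo -= np.sum(q_y * np.log(q_y * K + 1e-12))
    return elbo
```

Note that the expectation over y is computed exactly by the sum over the K components; only z is sampled, with a single draw per component in this sketch.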
For a given data point, the model can choose to have high entropy over q(y | x), in which case all of the component-wise losses must be low, or assign high q(y = k | x) for some k, and use that component to model the datum well. By exploiting diversity in the input data, the model can learn to utilise different components for different discrete structures (such as classes) in the data.

2.2 Component-constrained learning

While our main aim is to operate in an unsupervised setting, there may be cases in which one may wish to train a specific component, or when labels can be generated in a self-supervised fashion. In such cases where labels yobs are available, we can use a supervised loss, adapted from Eqn. 3:

Lsup = log p(x | z̃(yobs), y = yobs) − KL(q(z | x, y = yobs) || p(z | y = yobs)) + log q(y = yobs | x).    (4)

Here, instead of marginalising over y as in Equation 3, the component-wise ELBO (the first two terms) is computed only for the known label yobs. Furthermore, the final term in the original ELBO is replaced with a supervised cross-entropy term encouraging q(y | x) to match the label, which reduces to the log posterior probability of the observed label. This loss will be utilised and further discussed in Sections 2.3 and 2.4.

2.3 Dynamic expansion

To determine the number of mixture components, we opt for a dynamic expansion approach in which capacity is added as needed, by maintaining a small set of poorly-modelled samples and then initialising and fitting a new component to this set when it reaches a critical size. In a similar fashion to existing techniques such as the Forget-Me-Not process (Milan et al., 2016) and the Dirichlet process (Teh, 2010), we rely on a threshold to determine when to instantiate a new component. More concretely, we denote a subset of parameters θ(k) = {θ(k)_qy, θ(k)_qz, θ(k)_pz} corresponding to the parameters unique to each component k (i.e.
the kth softmax output in q(y | x) and the kth Gaussian component in p(z | y) and q(z | y, x)). During training, any sample with a log-likelihood less than a threshold cnew is added to the set Dnew (where the log-likelihood is approximated by the ELBO). Then, when the set Dnew reaches size Nnew, we initialise the parameters of the new component to those of the current component k∗ that has the greatest probability over Dnew:

θ(K+1) = θ(k∗),   k∗ = arg max_{k ∈ {1,2,...,K}} Σ_{x ∈ Dnew} q(y = k | x).    (5)

The new component is then tuned to Dnew, by performing a small fixed number of iterations of gradient descent on all parameters θ, using the component-constrained ELBO (Eqn. 4) with label K + 1.

Intuitively, this process encourages forward transfer, by initialising new concepts to the "closest" existing concept learned by the model and then finetuning to a small number of instances. The additional capacity used for each expansion is only in the top-most layer of the encoder, with ∼10⁴ parameters, compared to ∼2.5 × 10⁶ for the rest of the shared model. That is, while dynamic expansion incorporates a new high-level concept, the underlying low-level representations in the encoder, and the entire decoder, are both shared among all tasks.

2.4 Combatting forgetting via mixture generative replay

A shared low-level representation can mean that learning new tasks interferes with previous ones, leading to forgetting. One relevant technique to address this is Deep Generative Replay (DGR) (Shin et al., 2017), in which samples from a learned generative model are reused in learning.
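Before turning to replay, the dynamic-expansion bookkeeping of Section 2.3 can be sketched as follows. This is an illustrative sketch under our own naming: the parameter representation (dicts), the threshold values, and `q_y_fn` are stand-ins, and the fine-tuning of the new component with Eqn. 4 is omitted.

```python
import numpy as np

class ExpansionTrigger:
    """Sketch of the dynamic-expansion trigger from Section 2.3
    (illustrative; fine-tuning of the new component is omitted)."""

    def __init__(self, c_new, n_new):
        self.c_new = c_new    # ELBO threshold for "poorly modelled"
        self.n_new = n_new    # buffer size that triggers expansion
        self.d_new = []       # the buffer D_new

    def observe(self, x, elbo, components, q_y_fn):
        """components: list of per-component parameter sets theta^(k).
        q_y_fn(x) -> vector q(y | x) over the current components.
        Returns the copied component index k* on expansion, else None."""
        if elbo < self.c_new:
            self.d_new.append(x)
        if len(self.d_new) < self.n_new:
            return None
        # Eqn. 5: copy-initialise theta^(K+1) from the component with the
        # greatest total responsibility over D_new.
        resp = sum(q_y_fn(xi) for xi in self.d_new)
        k_star = int(np.argmax(resp))
        components.append(dict(components[k_star]))
        self.d_new = []
        return k_star
```

In the full method the appended component would then be tuned to the buffered samples for a small fixed number of gradient steps before the buffer is cleared.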
We propose to\nadapt and extend DGR to the mixture setting to perform unsupervised learning without forgetting.\nIn contrast to the original DGR work, our approach is inherently generative, such that a generative\nreplay-based approach can be incorporated holistically into the framework at minimal cost. We note\nthat many other existing methods (e.g., Kirkpatrick et al. (2017)) could straightforwardly be adapted\nto our approach, but our experiments demonstrated generative replay to be simple and effective.\n\nTo be more precise, during training, the model alternates between batches of real data, with samples\nxdata \u223c D drawn from the current training distribution, and generated data, with samples xgen\nproduced by the previous snapshot of the model (with parameters \u03b8prev):\n\nygen \u223c \u03c0(y), zgen \u223c p\u03b8prev (z | ygen), xgen \u223c p\u03b8prev (x | zgen),\n\n(6)\n\nwhere \u03c0 represents a choice of prior distribution for the categorical y. While the uniform prior p(y)\nis a natural choice, this fails to consider the degree to which different components are used, and can\ntherefore result in poor sample quality. To address this, the model maintains a count over components\nby accumulating the mean of posterior q(y | x) over all previous timesteps, thereby favouring the\ncomponents that have been used the most. We refer to this process as mixture generative replay\n(MGR).\n\nWhile MGR ensures tasks or concepts that have been previously learned by the model are reused\nfor learning, it places no constraint on which components are used to model them. Given that each\ngenerated datum xgen is conditioned on a sampled ygen, we can use ygen as a self-supervised\nlearning signal and encourage mixture components to remain consistent with respect to the model\nsnapshot, by using the component-constrained loss from Eqn. 4.\n\nThe only remaining question is when to update the previous model snapshot \u03b8prev. 
For this, we explore two cases, with snapshots taken at periodic fixed intervals, or immediately before performing dynamic expansion. The intuition behind the latter is that dynamic expansion is performed when there is a sufficient shift in the input distribution, and consolidating previously learned information is beneficial prior to adding a newly observed concept. This is also advantageous as it eliminates the additional snapshot period hyperparameter.

3 Related Work

Generative models  A number of related approaches aim to learn a discriminative latent space using generative models. Building on the original VAE (Kingma & Welling, 2013), Nalisnick et al. (2016) utilise a latent mixture of Gaussians, aiming to capture class structure in an unsupervised fashion, and propose a Bayesian non-parametric prior, further developed in (Nalisnick & Smyth, 2017). Similarly, Joo et al. (2019) suggest a Dirichlet posterior in latent space to avoid some of the previously observed component-collapsing phenomena. Lastly, Jiang et al. (2017) propose Variational Deep Embedding (VaDE), focused on the goal of clustering in an i.i.d setting. While VaDE has the same generative process as CURL, it assumes a mean-field approximation, with y and z conditionally independent given the input. In the case of CURL, conditioning z on y ensures we can adequately capture the inter- and intra-class uncertainty of a sample within the same structured latent space z.

Continual learning  A large body of work has addressed the continual learning problem (Parisi et al., 2019). Regularisation-based methods minimise changes to parameters that are crucial for earlier tasks, with some parameter-wise weight to measure importance (Kirkpatrick et al., 2017; Nguyen et al., 2017; Zenke et al., 2017; Aljundi et al., 2018; Schwarz et al., 2018).
Related techniques\nseek to ensure the performance on previous data does not decrease, by employing constrained\noptimisation (Lopez-Paz et al., 2017; Chaudhry et al., 2018) or distilling the information from old\nmodels or tasks (Li & Hoiem, 2018). In a similar vein, other methods encourage new tasks to\nutilise previously unused parameters, either by \ufb01nding \u201cfree\u201d linear parameter subspaces (He &\nJaeger, 2018); learning an attention mask over parameters (Serra et al., 2018); or using an agent\nto \ufb01nd new activation paths through a network (Fernando et al., 2017). Expansion-based models\ndynamically increase capacity to allow for additional tasks (Rusu et al., 2016; Yoon et al., 2017;\nDraelos et al., 2017), and optionally prune the network to constrain capacity (Zhou et al., 2012;\nGolkar et al., 2019). Another popular approach is that of rehearsal-based methods (Robins, 1995),\nwhere the data distribution from earlier tasks is captured by samples from a generative model trained\nconcurrently (Shin et al., 2017; van de Ven & Tolias, 2018; Ostapenko et al., 2018). Farquhar & Gal\n(2018) combine such methods with regularisation-based approaches under a Bayesian interpretation.\nAlternatively, Rebuf\ufb01 et al. (2017) learn class-speci\ufb01c exemplars instead of a generative model.\nHowever, these methods usually require task identities, rely on well-de\ufb01ned task boundaries, and are\noften evaluated on a sequence of supervised learning tasks.\n\nTask-agnostic continual learning Some recent work has investigated continual learning without\ntask labels or boundaries. Hsu et al. (2018) and van de Ven & Tolias (2019) identify the scenarios of\nincremental task, domain, and class learning; which operate without task labels in the latter cases,\nbut all focus on supervised learning tasks. Aljundi et al. 
(2019) propose a task-free approach to\ncontinual learning related to ours, which mitigates forgetting using the regularisation-based Memory\nAware Synapses (MAS) approach (Aljundi et al., 2018), maintains a hard example buffer to better\nestimate the regularisation weights, and detects when to update these weights (usually performed\nat known task boundaries in previous work). Zeno et al. (2018) propose a Bayesian task-agnostic\nlearning update rule for the mean and variance of each parameter, and demonstrate its ability to\nhandle ambiguous task boundaries. However, it is only applied to supervised tasks, and can exploit\nthe \u201clabel\u201d trick, inferring the task based on the class label. In contrast, Achille et al. (2018) address\nthe problem of unsupervised learning in a sequential setting by learning a disentangled latent space\nwith task-speci\ufb01c attention masks, but the main focus is on learning across datasets, and the method\nrelies on abrupt shifts in data distribution between datasets. Our approach builds upon this existing\nbody of work, addressing the full unsupervised continual learning problem, where task labels and\nboundaries are unknown, and the tasks themselves are without class supervision. 
We argue that\naddressing this problem is critical in order to tackle continual learning in challenging, real-world\nscenarios.\n\n4 Experiments\n\nIn the following sections, we empirically evaluate a) whether our method learns a meaningful\nclass-discriminable latent space in the unsupervised sequential learning setting, without forgetting,\neven when task boundaries are unclear; b) the importance of the dynamic expansion and generative\nreplay techniques to performance; and c) how CURL performs on external benchmarks when\n\n5\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 3: a) Cluster accuracy for CURL variants on MNIST, measuring the contribution of mixture\ngenerative replay (\u201cMGR\u201d) and dynamic expansion (\u201cexp\u201d); b) Accuracy per class, over time; c)\nClass confusion matrix at the end of learning, for CURL w/ MGR & exp.\n\ntrained i.i.d or adapted to learn in a supervised fashion. Code for all experiments can be found at\nhttps://github.com/deepmind/deepmind-research/.\n\n4.1 Evaluation settings and datasets\n\nOne desired outcome of our approach is the ability to learn class-discriminative latent representations\nfrom non-stationary input data. We evaluate this using cluster accuracy (the accuracy obtained when\nassigning each mixture component to its most represented class), and with the accuracy of a k-Nearest\nNeighbours (k-NN) classi\ufb01er in latent space. The former measures the amount of class-relevant\ninformation encoded into the categorical variable y, while the latter measures the discriminability of\nthe entire latent space without imposing structure (such as a linear boundary).\n\nFor the evaluation we extensively utilise the MNIST (LeCun et al., 2010) and Omniglot (Lake et al.,\n2011) datasets, and further information can be found in Appendix B. 
We investigate a number of different evaluation settings: i.i.d, where the model sees shuffled training data; sequential, where the model sees classes sequentially; and continuous drift, similar to the sequential case, but with classes gradually introduced by slowly increasing the number of samples from the new class within a batch.

4.2 Continual class-discriminative representation learning

We begin by analysing our approach, and follow this with evaluation on external benchmarks in later sections. First, we measure the ability to perform class-discriminative representation learning in the sequential setting on MNIST, where each of the classes is observed for 10000 training steps (further experimental details can be found in Appendix C.1). Figure 3a shows the cluster accuracy for a number of variants of CURL. We observe the importance of both dynamic expansion and mixture generative replay (MGR) for learning a coherent representation without forgetting. Figure 3b shows the class-wise accuracies during training, for the model with MGR and expansion. Interestingly, while many existing continual learning approaches appear to forget earlier classes (see e.g. Nguyen et al. (2017)), these classes are well modelled by CURL, and confusion is observed more between similar classes (such as 3s and 5s; or 7s and 9s). Indeed, this is reflected in the class-confusion matrix after training (Figure 3c). This implies the model adequately addresses catastrophic forgetting, but could improve in terms of plasticity, i.e., learning new concepts. Further analysis can be found in Appendix A.1, showing generated samples; and Appendix A.2, analysing the dynamic expansion buffers.

4.3 Ablation studies

Next, we perform an ablation study to gauge the impact of the expansion threshold for continual learning, in terms of cluster accuracy and number of components used, as shown in Figure 4.
As the threshold value is increased, samples are more frequently stored into the "poorly-modelled" buffer, and the model expands more aggressively throughout learning. Consequently, for sequential learning, the number of components ranges from 12 to 71, the cluster accuracy varies up to a maximum of 84%, and the k-NN error also marginally decreases over this range.

Figure 4: Ablation study for dynamic expansion on MNIST, showing (a) cluster accuracy; (b) 10-NN error; and (c) number of components used; when varying the expansion threshold cexp. For comparison, we also show the performance without expansion ("no exp"), but using the same number of components as in the cexp = −200 case.

| Scenario | # clusters (MNIST) | Cluster acc (%) ↑ | 10-NN error (%) ↓ | # clusters (Omniglot) | Cluster acc (%) ↑ | 10-NN error (%) ↓ |
| MGR (fixed, T) | 25.20±2.23 | 77.74±1.37 | 6.29±0.50 | 101.20±8.45 | 13.21±0.53 | 76.34±1.10 |
| MGR (fixed, 0.1T) | 37.60±2.15 | 49.14±3.95 | 14.95±0.73 | 131.60±15.74 | 12.13±1.54 | 81.21±2.06 |
| MGR (dyn) | 35.20±2.79 | 57.76±1.43 | 12.08±1.19 | 127.20±16.67 | 12.74±0.60 | 80.56±1.39 |
| SMGR (fixed, T) | 28.20±0.40 | 69.27±1.46 | 7.50±0.57 | 105.20±5.56 | 11.32±0.52 | 76.62±1.49 |
| SMGR (fixed, 0.1T) | 39.80±6.05 | 48.18±1.72 | 15.48±0.81 | 137.40±9.75 | 9.01±2.17 | 85.73±5.84 |
| SMGR (dyn) | 36.00±2.45 | 53.97±3.52 | 11.72±1.16 | 152.20±25.02 | 10.48±1.10 | 84.44±4.10 |
| CURL (no MGR) | 55.80±1.94 | 45.35±1.50 | 17.46±1.25 | 189.60±9.75 | 13.36±1.06 | 81.91±1.36 |

Table 1: Ablation study for mixture generative replay (MGR and SMGR), indicating the performance and number of components used. All variants perform dynamic expansion.
Furthermore, without any dynamic expansion, the result is significantly poorer at 51% accuracy, while dynamic expansion with the same number of components (25, obtained with an expansion threshold of −200) achieves 77%. Thus, the dynamic expansion threshold conveniently provides a tuning parameter to perform capacity estimation, trading off cluster accuracy against the memory cost of using additional components in the latent mixture. Interestingly, if we perform the same analysis for i.i.d. data (also in Figure 4), we observe a similar trade-off; though the final performance is slightly poorer than when starting with an equivalent, fixed number of mixture components (22).

We also further analyse mixture generative replay (MGR) with an ablation study in Table 1. We evaluate standard and self-supervised MGR (SMGR), and compare the cases where snapshots are taken on expansion (i.e., no task information is needed) or at fixed intervals (either at T, the duration of training on each class, or 0.1T, ten times more frequently). Intuitively, the period is important as it determines how quickly a shifting data distribution is consolidated into the model: if too short, the generated data will drift with the model, leading to forgetting. The results in Table 1 point to a number of interesting observations. First, both MGR and SMGR are sensitive to the fixed snapshot period: the performance is unsurprisingly optimal when snapshots are taken as the training class changes, but drops significantly when they are taken more frequently, which also uses a greater number of clusters in the process. Second, by taking snapshots before dynamic expansion instead, this performance can largely be recovered, without any knowledge of the task boundaries. Third, perhaps surprisingly, SMGR harms performance compared to MGR.
This may be due to the fact that mixture components already tend to be consistent in latent space throughout learning, and SMGR may be reducing plasticity; further analysis can be found in Appendix A.3. Lastly, the benefits of MGR are also apparent: without it, the MNIST case exhibits far poorer performance and utilises many more components in the process. Interestingly, the Omniglot case without MGR performs well, but at the cost of significantly more components: expansion itself is able to partly address catastrophic forgetting by effectively oversegmenting the data.

| Scenario | # clusters (MNIST) | Cluster acc (%) ↑ | 10-NN error (%) ↓ | # clusters (Omniglot) | Cluster acc (%) ↑ | 10-NN error (%) ↓ |
| Seq. w/ MGR (fixed) | 25.20±2.23 | 77.74±1.37 | 6.29±0.50 | 101.20±8.45 | 13.21±0.53 | 76.34±1.10 |
| Seq. w/ MGR (dyn) | 35.20±2.79 | 57.76±1.43 | 12.08±1.19 | 127.20±16.67 | 12.74±0.60 | 80.56±1.39 |
| Cont. w/ MGR (fixed) | 44.60±2.65 | 79.38±4.26 | 6.56±0.42 | 111.40±3.77 | 13.17±0.37 | 75.80±1.19 |
| Cont. w/ MGR (dyn) | 50.40±1.85 | 64.93±2.09 | 9.88±1.43 | 129.20±2.14 | 13.54±0.35 | 78.78±0.39 |

Table 2: Performance comparison between the sequential learning setting (with discrete changes in class), versus the continuous drift setting (with class ratios gradually changing).

| Method | Incr. Task | Incr. Class |
| EWC | 98.64±0.22 | 20.01±0.06 |
| SI | 99.09±0.15 | 19.99±0.06 |
| MAS | 99.22±0.21 | 19.52±0.29 |
| LwF | 99.60±0.03 | 24.17±0.33 |
| GEM | 98.42±0.10 | 92.20±0.12 |
| DGR | 99.50±0.03 | 91.24±0.33 |
| iCARL | - | 94.57±0.11 |
| CURL | 99.10±0.06 | 92.59±0.66 |

Table 3: Supervised learning benchmark on splitMNIST, for incremental task and incremental class learning.²

Figure 5: Mixture probabilities of the 5 components used most throughout training, with discrete class changes (left), and with continuous class drift (right).

4.4 Learning with poorly-defined task boundaries

Next, we evaluate CURL in the continuous drift setting, and compare to the standard sequential setting. The overall performance on MNIST and Omniglot is shown in Table 2, using MGR with either fixed or dynamic snapshots. We observe that despite having unclear task boundaries, with classes gradually introduced, the continuous case generally exhibits better performance than the case with well-defined task boundaries. We also closely investigate the mixture component dynamics during learning, by obtaining the top 5 components (most used over the course of learning) and plotting their posterior probabilities over time (Figure 5). In the discrete task-change domain (left), we observe that probabilities change sharply at the hard task boundaries (every 10000 steps); and many mixture components are quite sparsely activated, modelling either a single class or a few classes. Some of the mixture components also exhibit "echoes", where the sharp change to a new class in the data distribution activates the component temporarily before dynamic expansion is performed. In the continuous drift case (right of Figure 5), the mixture probabilities exhibit similar behaviours, but are much smoother in response to the gradually changing data distribution.
Further, without a sharp distributional shift, the "echoes" are not observed.

4.5 External benchmarks

Supervised continual learning   While focused on task-agnostic continual learning in unsupervised settings, CURL can also be trivially adapted to supervised tasks simply by training with the supervised loss in Eqn. 4. We evaluate on the split MNIST benchmark, where the data are split into five tasks, each classifying between two classes, and the model is trained on each task sequentially. If we evaluate the overall accuracy after training, this is called incremental class learning; if we instead provide the model with the appropriate task label and evaluate the binary classification accuracy for each task, this is incremental task learning (Hsu et al., 2018; van de Ven & Tolias, 2019). Experimental details can be found in Appendix C.2. The results in Table 3 demonstrate that the proposed unsupervised approach can easily and effectively be adapted to supervised tasks, achieving competitive results in both scenarios. While all methods perform quite well on incremental task learning, CURL is outperformed only by iCARL (Rebuffi et al., 2017) on incremental class learning, which was specifically proposed for this task. Interestingly, the result is also better than DGR, suggesting that by holistically incorporating the generative process and classifier into the same model, and focusing on the broader unsupervised, task-agnostic perspective, CURL remains effective in the supervised domain.

² Performances of existing approaches are taken from studies by Hsu et al. (2018) and van de Ven & Tolias (2019), using the better of the two.

Method            |         MNIST (nz = 50)              |        Omniglot (nz = 100)
                  | 3-NN error  5-NN error  10-NN error  | 3-NN error  5-NN error  10-NN error
VAE³              | 27.16±0.48  20.20±0.93  14.89±0.40   | 92.34±0.25  91.21±0.18  88.79±0.35
SBVAE³            | 10.01±0.52  9.58±0.47   9.39±0.54    | 86.90±0.82  85.10±0.89  82.96±0.64
DirVAE³           | 5.98±0.06   5.29±0.06   5.06±0.06    | 76.55±0.23  73.81±0.29  70.95±0.29
CURL (i.i.d)      | 4.40±0.34   4.22±0.28   4.23±0.30    | 78.18±0.47  75.41±0.34  72.51±0.46
VaDE (bigger net) | 2.20        2.14        2.22         | -           -           -
CURL w/ MGR (seq) | 4.58±0.26   4.35±0.32   4.50±0.34    | 83.95±0.72  81.56±0.75  78.80±0.74
Raw pixels³       | 3.00        3.21        3.44         | 69.94       69.41       70.10

Table 4: Unsupervised learning benchmark comparison with sampled latents. We compare with a number of approaches trained i.i.d, as well as CURL trained in the sequential setting.

Unsupervised i.i.d learning   We also demonstrate the ability of the underlying model to learn in a more traditional setting with the entire dataset shuffled, and compare with existing work in clustering and representation learning: the VAE (Kingma & Welling, 2013), DirichletVAE (Joo et al., 2019), SBVAE (Nalisnick & Smyth, 2017), and VaDE (Jiang et al., 2017). We utilise the same architecture and hyperparameter settings as in Joo et al. (2019) for consistency, with latent spaces of dimension 50 and 100 for MNIST and Omniglot respectively; full details of the experimental setup can be found in Appendix C.3.
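The k-NN errors in these tables follow the standard protocol for evaluating representations: fit a k-nearest-neighbour classifier on labelled training latents and report the error on held-out latents. A minimal sketch of that protocol (all function and variable names are ours, not from the paper's code):

```python
import numpy as np

def knn_error(train_z, train_y, test_z, test_y, k=10):
    """Classification error of a k-NN classifier in latent space."""
    errors = 0
    for z, y in zip(test_z, test_y):
        # Euclidean distances from this test latent to all training latents
        dists = np.linalg.norm(train_z - z, axis=1)
        neighbours = train_y[np.argsort(dists)[:k]]
        # Majority vote over the k nearest neighbours
        pred = np.bincount(neighbours).argmax()
        errors += int(pred != y)
    return errors / len(test_y)

# Toy example: two clearly separated clusters are classified perfectly.
rng = np.random.default_rng(0)
train_z = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(3.0, 0.1, size=(50, 2)),
])
train_y = np.array([0] * 50 + [1] * 50)
test_z = np.array([[0.0, 0.0], [3.0, 3.0]])
test_y = np.array([0, 1])
print(knn_error(train_z, train_y, test_z, test_y, k=10))  # 0.0
```

A lower k-NN error thus indicates that same-class samples cluster together in the latent space, which is why it serves as a proxy for representation quality here.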
We note that the k-NN error values are much better here than in Section 4.3; this is due to a higher dimensional latent space, and hence they cannot be directly compared (see Appendix A.4).

The uppermost group in Table 4 shows the results on i.i.d MNIST and Omniglot. The CURL generative model trained i.i.d (without MGR, and with dynamic expansion) is competitive with the state of the art on MNIST (bettered only by VaDE, which incorporates a larger architecture) and Omniglot (bettered only by DirVAE). While not the main focus of this paper, this demonstrates the ability of the proposed generative model to learn a structured, discriminable latent space, even in more standard learning settings with shuffled data. Table 4 also shows the performance of CURL trained in the sequential setting. We observe that, despite learning from sequential data, these results are competitive with the state-of-the-art approaches that operate on i.i.d. data.

5 Conclusions

In this work, we introduced an approach to address the unsupervised continual learning problem, in which task labels and boundaries are unknown, and the tasks themselves lack class labels or other external supervision. Our approach, named CURL, performs task inference via a mixture-of-Gaussians latent space, and uses dynamic expansion and mixture generative replay (MGR) to instantiate new concepts and minimise catastrophic forgetting. Experiments on MNIST and Omniglot showed that CURL was able to learn meaningful class-discriminative representations without forgetting in a sequential class setting (even with poorly defined task boundaries). External benchmarks also demonstrated the method to be competitive with respect to previous work when adapted to unsupervised learning from i.i.d data, and to supervised incremental class learning.
Future directions will investigate additional techniques to alleviate forgetting, and the extension to the reinforcement learning domain.

³ Performance numbers are obtained from Joo et al. (2019), with consistent architectures and hyperparameters.

References

Achille, Alessandro, Eccles, Tom, Matthey, Loic, Burgess, Chris, Watters, Nicholas, Lerchner, Alexander, and Higgins, Irina. Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Information Processing Systems, pp. 9873–9883, 2018.

Aljundi, Rahaf, Babiloni, Francesca, Elhoseiny, Mohamed, Rohrbach, Marcus, and Tuytelaars, Tinne. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154, 2018.

Aljundi, Rahaf, Tuytelaars, Tinne, et al. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Chaudhry, Arslan, Ranzato, Marc'Aurelio, Rohrbach, Marcus, and Elhoseiny, Mohamed. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.

Draelos, Timothy J, Miner, Nadine E, Lamb, Christopher C, Cox, Jonathan A, Vineyard, Craig M, Carlson, Kristofor D, Severa, William M, James, Conrad D, and Aimone, James B. Neurogenesis deep learning: Extending deep networks to accommodate new classes. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533. IEEE, 2017.

Farquhar, Sebastian and Gal, Yarin. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.

Fernando, Chrisantha, Banarse, Dylan, Blundell, Charles, Zwols, Yori, Ha, David, Rusu, Andrei A, Pritzel, Alexander, and Wierstra, Daan. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

Golkar, Siavash, Kagan, Michael, and Cho, Kyunghyun. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.

Goodfellow, Ian J, Mirza, Mehdi, Xiao, Da, Courville, Aaron, and Bengio, Yoshua. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

He, Xu and Jaeger, Herbert. Overcoming catastrophic interference using conceptor-aided backpropagation. 2018.

Hsu, Yen-Chang, Liu, Yen-Cheng, and Kira, Zsolt. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.

Jiang, Zhuxi, Zheng, Yin, Tan, Huachun, Tang, Bangsheng, and Zhou, Hanning. Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1965–1972. AAAI Press, 2017.

Joo, Weonyoung, Lee, Wonsung, Park, Sungrae, and Moon, Il-Chul. Dirichlet variational autoencoder. arXiv preprint arXiv:1901.02739, 2019.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.

LeCun, Yann, Cortes, Corinna, and Burges, CJ. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.

Li, Zhizhong and Hoiem, Derek. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.

Lopez-Paz, David et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.

McCloskey, Michael and Cohen, Neal J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Milan, Kieran, Veness, Joel, Kirkpatrick, James, Bowling, Michael, Koop, Anna, and Hassabis, Demis. The forget-me-not process. In Advances in Neural Information Processing Systems, pp. 3702–3710, 2016.

Nalisnick, Eric and Smyth, Padhraic. Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.

Nalisnick, Eric, Hertel, Lars, and Smyth, Padhraic. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.

Nguyen, Cuong V, Li, Yingzhen, Bui, Thang D, and Turner, Richard E. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.

Ostapenko, Oleksiy, Puscas, Mihai, Klein, Tassilo, and Nabi, Moin. Learning to remember: Dynamic generative memory for continual learning. 2018.

Parisi, German I, Kemker, Ronald, Part, Jose L, Kanan, Christopher, and Wermter, Stefan. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Rebuffi, Sylvestre-Alvise, Kolesnikov, Alexander, Sperl, Georg, and Lampert, Christoph H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Robins, Anthony. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

Rusu, Andrei A, Rabinowitz, Neil C, Desjardins, Guillaume, Soyer, Hubert, Kirkpatrick, James, Kavukcuoglu, Koray, Pascanu, Razvan, and Hadsell, Raia. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

Schwarz, Jonathan, Czarnecki, Wojciech, Luketina, Jelena, Grabska-Barwinska, Agnieszka, Teh, Yee Whye, Pascanu, Razvan, and Hadsell, Raia. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pp. 4535–4544, 2018.

Serra, Joan, Suris, Didac, Miron, Marius, and Karatzoglou, Alexandros. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pp. 4555–4564, 2018.

Shin, Hanul, Lee, Jung Kwon, Kim, Jaehong, and Kim, Jiwon. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999, 2017.

Smith, James, Baer, Seth, Kira, Zsolt, and Dovrolis, Constantine. Unsupervised continual learning and self-taught associative memory hierarchies. arXiv preprint arXiv:1904.02021, 2019.

Teh, Yee Whye. Dirichlet process. Encyclopedia of Machine Learning, pp. 280–287, 2010.

van de Ven, Gido M and Tolias, Andreas S. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.

van de Ven, Gido M and Tolias, Andreas S. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.

Yoon, Jaehong, Yang, Eunho, Lee, Jeongtae, and Hwang, Sung Ju. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

Zenke, Friedemann, Poole, Ben, and Ganguli, Surya. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987–3995. JMLR.org, 2017.

Zeno, Chen, Golan, Itay, Hoffer, Elad, and Soudry, Daniel. Task agnostic continual learning using online variational bayes. arXiv preprint arXiv:1803.10123, 2018.

Zhou, Guanyu, Sohn, Kihyuk, and Lee, Honglak. Online incremental feature learning with denoising autoencoders. In Artificial Intelligence and Statistics, pp. 1453–1461, 2012.