{"title": "Zero-shot Learning via Simultaneous Generating and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 56, "abstract": "To overcome the absence of training data for unseen classes, conventional zero-shot learning approaches mainly train their model on seen datapoints and leverage the semantic descriptions for both seen and unseen classes.\nBeyond exploiting relations between classes of seen and unseen, we present a deep generative model to provide the model with experience about both seen and unseen classes.\nBased on the variational auto-encoder with class-specific multi-modal prior, the proposed method learns the conditional distribution of seen and unseen classes.\nIn order to circumvent the need for samples of unseen classes, we treat the non-existing data as missing examples.\nThat is, our network aims to find optimal unseen datapoints and model parameters, by iteratively following the generating and learning strategy.\nSince we obtain the conditional generative model for both seen and unseen classes, classification as well as generation can be performed directly without any off-the-shell classifiers.\nIn experimental results, we demonstrate that the proposed generating and learning strategy makes the model achieve the outperforming results compared to that trained only on the seen classes, and also to the several state-of-the-art methods.", "full_text": "Zero-shot Learning via Simultaneous Generating and\n\nLearning\n\nHyeonwoo Yu\n\nBeomhee Lee\n\nAutomation and Systems Research Institute (ASRI)\n\nDept. of Electrical and Computer Engineering\n\nSeoul National University\n\n{bgus2000,bhlee}@snu.ac.kr\n\nAbstract\n\nTo overcome the absence of training data for unseen classes, conventional zero-shot\nlearning approaches mainly train their model on seen datapoints and leverage\nthe semantic descriptions for both seen and unseen classes. 
Beyond exploiting relations between seen and unseen classes, we present a deep generative model that provides the model with experience of both seen and unseen classes. Based on a variational auto-encoder with a class-specific multi-modal prior, the proposed method learns the conditional distribution of seen and unseen classes. To circumvent the need for samples of unseen classes, we treat the non-existing data as missing examples. That is, our network aims to find optimal unseen datapoints and model parameters by iteratively following a generating-and-learning strategy. Since we obtain a conditional generative model for both seen and unseen classes, classification as well as generation can be performed directly, without any off-the-shelf classifiers. In experiments, we demonstrate that the proposed generating-and-learning strategy lets the model outperform its counterpart trained only on the seen classes, as well as several state-of-the-art methods.

1 Introduction

The combination of large amounts of data and deep learning has found use in various fields such as machine learning and artificial intelligence. However, deep learning, as a statistics-based non-linear regression tool, suffers from insufficient or non-existing training data, which is the usual case and must be overcome for autonomous learning systems. The advantage of deep learning, that it learns reliable models from plenty of labeled training datapoints, becomes a curse in this scenario, since the model loses its generalization ability when training data are lacking. This severely limits scalability to unseen classes, for which training samples simply do not exist.
Zero-shot learning (ZSL) is a learning paradigm that proposes an elegant way to fulfill this desideratum by utilizing semantic descriptions of seen and unseen classes [8, 30].
These descriptions are usually assumed to be given in the form of class embedding vectors or textual descriptions of each class. By assuming that seen and unseen classes share the same class attribute space, knowledge can be transferred from seen to unseen classes by training models on seen samples and plugging in the embedding vectors of unseen classes. Based on this concept, previous works find a relation between class embedding vectors and the datapoints of each class by learning a projection from feature vectors to the class attribute space [16, 22, 19]. Similar work learns a visual-semantic mapping using either shallow or deep embeddings, thereby handling unseen datapoints in an indirect manner [35, 36, 17, 21, 3, 34]. These approaches have shown promising results. However,

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

intra-class variation is hardly considered, since these methods assume that each class is represented by a single deterministic vector; such variation is inevitable in more realistic situations.
Thanks to the advent of deep generative models, which enable us to unravel data with complex structure, one can overcome the scarcity of unseen examples by directly generating samples from a learned distribution. With the generated datapoints, ZSL can be viewed as a traditional classification problem. This scenario thus becomes an excellent testbed for evaluating the generalization of generative models [29], and several approaches have been presented that directly generate datapoints for unseen classes by exploiting semantic descriptions [18, 27, 29, 28, 15, 37].
Under the assumption that a model which generates high-quality samples for seen classes is also expected to perform similarly on unseen classes, these approaches mainly train conditional generative models on seen samples and plug the unseen class attribute vectors into the model to generate unseen samples. They subsequently train an off-the-shelf classifier such as an SVM or a softmax classifier. So far, however, the proposed models are trained mainly on the seen classes. Obtaining a generative model for both seen and unseen classes is quite far from their consideration, since scarcity of unseen samples is apparently a fundamental problem for ZSL.
We therefore propose a training strategy to obtain a generative model that experiences both seen and unseen classes. We treat unseen datapoints as missing data, and hence as variables that can be optimized like model parameters. The optimal model parameters require the optimal training data, and optimal unseen samples can be sampled from the distribution expressed by the optimal model parameters. To relieve this chicken-and-egg problem, we lean on the Expectation-Maximization (EM) method, which enables the model to Simultaneously be Generating And Learning (SGAL). That is, while training, we iteratively generate samples from the current model and update the networks on those generated samples. For our model, a variational auto-encoder (VAE) [12] with a category-specific multi-modal prior is leveraged. Since we aim to have a multi-modal VAE (mmVAE) that covers both seen and unseen classes, no additional classifier is needed and the encoder can directly serve as a classifier. In our case, model uncertainty can be an obstacle while generating samples and training the model, since the model never sees real unseen datapoints, and the estimated samples for training are generated from the model itself.
We thus exploit dropout, which makes the model take into account distributions over model parameters [9], and neutralize the model uncertainty by activating dropout when sampling estimated datapoints during training.

2 Related Work

2.1 Conditional VAE and Category Clustering in Latent Space

In order to exploit labeled datasets in generative models, several methods based on the VAE have been introduced. By modifying the Bayesian graphical model of the vanilla VAE, [23] and [13] utilize the labels of datapoints as input to both encoder and decoder. Since they mainly focus on conditionally generating datapoints from the trained model, especially with the decoder, they assume an isotropic Gaussian prior to simplify the formulation and network structure.
Several methods structure the latent space with explicitly designed prior distributions. Beyond a fixed Gaussian prior, which suffers from little variability, [32, 33, 27, 7, 13] use a Gaussian mixture model (GMM) prior, whose modes are set to capture multiple categories. In particular, [7] proposes an unsupervised clustering method with this latent prior, by learning the multi-modal prior and the VAE together. To categorize training data with a conditional generative model, [33] and [28] exploit a category-specific multi-modal prior. With distinct clusters according to category or instance, they perform classification using the trained encoder as a feature extractor. In addition, [33] further uses the model as an observation model for data association rather than as a classifier only, and presents applications to probabilistic semantic SLAM.

2.2 Zero-shot Learning and Generative Model

ZSL poses a challenging setting in which the training and test datasets are disjoint in their categories, so traditional non-linear regression is hardly applicable. Therefore, several indirect methods have been proposed. [16] handles the problem by solving related sub-problems.
[19, 35, 6] exploit the relations between classes, and express unseen classes as mixtures of proportions of seen classes. [1, 8, 21, 22, 24, 3] train their models to find a similarity between datapoints and classes.
In order to overcome the scarceness of unseen samples directly, conditional generative models in several variations have been exploited. [18, 15] exploit the conditional VAE (CVAE) to generate conditional samples; [15] adds a regressor and a restrictor to make the model more robust when generating unseen datapoints. [28] proposes a VAE-based method with a category-specific prior distribution. Generative adversarial networks (GANs) are also exploited and show promising results, since the sharpness and realism of the generated samples are high enough [29]. Commonly, these methods based on deep generative models first train their models, generate enough samples for the unseen classes, and subsequently train an additional classifier, rather than training conditional generative models for both seen and unseen classes. We instead present a deep generative model for both seen and unseen classes, which enables us to use the model as a classifier as well as a generator. Our model is a single VAE, and end-to-end training is possible without training an additional off-the-shelf classifier.

3 Proposed Method

3.1 Problem Scenario

Suppose we have a dataset $\{X^{s*}, Y^{s*}\}$ of $S$ seen classes: a set of datapoints $X^{s*} = \{x^{s*}_i\}_{i=1}^{N^s}$ and their corresponding labels $Y^{s*} = \{y^{s*}_i\}_{i=1}^{N^s}$, sampled from the true distribution $p(x^s|y^s)$. $N^s$ is the number of sampled datapoints, and $y^{s*} \in L^s = \{1, \dots, S\}$. In the ZSL problem, we aim to have a model that can classify the datapoints of unseen classes $X^{u*} = \{x^{u*}_j\}_{j=1}^{N^u}$ labeled as $Y^{u*} = \{y^{u*}_j\}_{j=1}^{N^u}$, where $y^{u*} \in L^u = \{S+1, \dots, S+U\}$. Clearly, $L^s \cap L^u = \emptyset$, and at training time we have no datapoints for the unseen classes. In surrogate, however, we have class semantic embedding (or class attribute) vectors $A^* = \{a^*_k\}_{k=1}^{S+U}$ for both seen and unseen classes, which describe the corresponding class and further imply the relations between classes. Note that each class has a distinct attribute vector, for example $A^{s*} = \{a^*_k\}_{k=1}^{S}$ for the seen classes, and we can express the corresponding classes of $X^{s*}$ with attribute vectors as $A^{s*}_y = \{a^*_{y^{s*}_i}\}_{i=1}^{N^s}$.

3.2 Category-Specific Multi-Modal Prior and Classification

In order to capture a complex distribution, the VAE is a useful tool. Especially with labeled datapoints, the CVAE can be utilized, which approximates the conditional likelihood $p(x|y)$ with the following lower bound [23]:

$$\mathcal{L}(\theta, \phi; x, y) = -\mathrm{KL}\big(q_\phi(z|x, y) \,\|\, p(z|y)\big) + \mathbb{E}_{z \sim q}\big[\log p_\theta(x|z, y)\big]. \qquad (1)$$

However, since this model is designed to generate samples having certain desired properties such as category $y$, the encoder $q_\phi(z|x, y)$ and decoder $p_\theta(x|z, y)$ both need $y$ for training and testing. Hence, for a classification task performed on datapoints whose labels are missing, neither encoder nor decoder is easily exploited, and the decoder is only of use when generating datapoints under a given condition. Often, to relax the conditional constraint, the latent prior $p(z|y)$ in (1) is assumed to be a $p(z)$ independent of the input variables; exploiting the latent variables for classification then becomes another challenge.
We therefore assume that categories, represented by class embedding vectors $a^*$, cast a Bayesian dice via the latent variable $z$ to generate $x$.
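This generative chain (the attribute picks a latent cluster, and the latent variable is decoded to a feature) can be sketched in a few lines of NumPy. The weights, sizes, and function names below are illustrative assumptions of ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D_ATTR, D_LAT, D_FEAT = 4, 3, 6

# Stand-in weights for the prior network f_psi and the decoder p_theta;
# both are learned in the paper, here they are fixed for illustration.
W_prior = rng.normal(0.0, 1.0, (D_LAT, D_ATTR))
W_dec = rng.normal(0.0, 1.0, (D_FEAT, D_LAT))

def sample_x_given_a(a):
    # a -> z -> x: the class attribute a selects the latent cluster
    # N(mu(a), I), and the decoder maps the sampled z to a feature x.
    mu_a = np.tanh(W_prior @ a)          # mu(a) = f_psi(a), with Sigma(a) = I
    z = mu_a + rng.normal(size=D_LAT)    # z ~ p_psi(z | a)
    return W_dec @ z                     # mean of p_theta(x | z)

x = sample_x_given_a(np.ones(D_ATTR))
```

Because the same chain applies to any attribute vector, plugging in an unseen class's embedding yields samples for that class once the networks are trained.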
For $X^{s*}$ and $A^{s*}_y$, the total marginal likelihood comprises a sum over individual datapoints, $\log p(X^{s*}|A^{s*}_y) = \sum_i \log p(x^{s*}_i|a^*_{y^{s*}_i})$; we then have:

$$\mathcal{L}(\Theta; x^{s*}, a^*_{y^{s*}}) = -\mathrm{KL}\big(q_\phi(z|x^{s*}) \,\|\, p_\psi(z|a^*_{y^{s*}})\big) + \mathbb{E}_{z \sim q}\big[\log p_\theta(x^{s*}|z)\big], \qquad (2)$$

where $\Theta = (\theta, \phi, \psi)$. In contrast to the traditional VAE, since our purpose is classification, we assume the conditional prior to be a category-specific Gaussian distribution [27, 28, 32, 33]. The prior can then be expressed as the multi-modal $p(z) = \sum_i p(a^*_{y^{s*}_i})\, p_\psi(z|a^*_{y^{s*}_i})$, with $p_\psi(z|a) = \mathcal{N}(z; \mu(a), \Sigma(a))$, where $(\mu(a), \Sigma(a)) = f_\psi(a)$ is a non-linear function implemented by a prior network. In order to make the conditional priors simple and distinct according to category, we follow the basic settings of [32, 33]: we simply let $\Sigma(a) = I$, and adopt a prior regularization loss that pushes each cluster of $p_\psi(z|a)$ to be at least a certain distance away from every other cluster in latent space. The KL-divergence in (2) encourages the variational likelihood $q_\phi$ to overlap the corresponding conditional prior, which is distinct for each category, so the encoded features are naturally clustered [28].
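Under the choice $\Sigma(a) = I$, the KL term of (2) has a simple closed form. The following NumPy function is a minimal sketch with our own naming, assuming a diagonal-covariance encoder:

```python
import numpy as np

def kl_to_class_prior(mu_q, logvar_q, mu_a):
    # Closed-form KL( N(mu_q, diag(exp(logvar_q))) || N(mu(a), I) ):
    # the KL term of Eq. (2) under the paper's choice Sigma(a) = I.
    var_q = np.exp(logvar_q)
    return 0.5 * np.sum(var_q + (mu_q - mu_a) ** 2 - 1.0 - logvar_q)
```

The term vanishes exactly when the encoder output matches the class cluster (mean equal to $\mu(a)$, unit variance), which is what pulls encoded features onto their category's mode.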
Since (2) approximates the true conditional likelihood, Maximum Likelihood Estimation (MLE) of the optimal label $\hat{y}$ can be formulated as follows [32]:

$$\hat{y} = \operatorname*{argmax}_{y^{s*}} p(x^{s*}|a^*_{y^{s*}}) \simeq \operatorname*{argmax}_{y^{s*}} p_\psi\big(z = \mu(x^{s*}) \,\big|\, a^*_{y^{s*}}\big), \qquad (3)$$

where $\mu(x^{s*})$ is the mean of the approximated variational likelihood $q_\phi(z|x^{s*})$. Classification results can thus be obtained by simply calculating Euclidean distances between the modes of the category-specific prior and the encoded variable $\mu(x^{s*})$. In other words, as shown in Fig. 2(a), the encoded features and conditional priors can easily be utilized for classification, rather than simply abandoning the encoder after training. The optimal parameters $\hat{\Theta}$ can be obtained by maximizing the lower bound in (2) on the datapoints of the seen classes. Note that once training has converged, the conditional priors and variational likelihoods of unseen classes can be obtained by plugging in their associated class embedding vectors $A^{u*} = \{a^*_k\}_{k=S+1}^{S+U}$. In this way, we can perform the classification task for both seen and unseen classes with (3), or generate datapoints for unseen classes by sampling from $p(x|a^{u*}_y) \simeq \int_z p_{\hat{\theta}}(x|z)\, p_{\hat{\psi}}(z|a^{u*}_y)\, dz$ and train an additional classifier similarly to [18].

3.3 Generative Model for both Seen and Unseen Classes

Even though the model is trained on the seen classes $A^{s*}$, we can try to use the generative model by simply inputting the embedding vectors of the unseen classes $A^{u*}$. However, the optimal parameters $\hat{\Theta}$ obtained by maximizing (2) are still fitted to the datapoints of the seen classes, and hardly guarantee exact regression results for the unseen classes.
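Because all prior modes share a unit covariance, the MLE rule in (3) reduces to a nearest-mean lookup in latent space. A minimal NumPy sketch (function name and array layout are our assumptions):

```python
import numpy as np

def classify(mu_x, prior_means):
    # Eq. (3) with Sigma(a) = I: the MLE label is the class whose prior
    # mean mu(a_k) is nearest to the encoded mean mu(x) in Euclidean
    # distance. Rows of prior_means stack mu(a_k) for all S+U classes,
    # so seen and unseen classes are handled by the same rule.
    return int(np.argmin(np.linalg.norm(prior_means - mu_x, axis=1)))
```

For example, with three class means `[[0, 0], [5, 5], [0, 9]]`, an encoded point `[4.8, 5.1]` is assigned label 1, the cluster it sits closest to.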
In effect, the model represented by the parameters has no experience with the unseen classes. To approximate the distribution of both seen and unseen classes, it is certainly necessary to find optimal parameters that take into account datapoints sampled from all classes. Since the absence of datapoints $X^{u*}$ for the unseen classes is the fundamental problem in ZSL, we treat these missing datapoints as variables to be optimized, just like the model parameters. Usually, datapoints for training are sampled from a true distribution, and when a generative model successfully approximates the target distribution, we can randomly generate datapoints from the model. Therefore, in the ideal case that the lower bound successfully catches the target distribution for both seen and unseen classes, the optimal parameters $\hat{\Theta}$ and optimal unseen datapoints $X^{u*}$ should satisfy the following equations simultaneously:

$$X^{u*}|A^{u*}_y \sim p(x|a^{u*}_y) = \int_z p_{\hat{\theta}}(x|z)\, p_{\hat{\psi}}(z|a^{u*}_y)\, dz \qquad (4)$$

$$\hat{\Theta} \simeq \operatorname*{argmax}_{\Theta} \mathcal{L}(\Theta; X^{s*}, A^{s*}_y, X^{u*}, A^{u*}_y) \qquad (5)$$

As in (4), the missing datapoints $X^u$ can be optimized by sampling from the generative model that optimally approximates the target distribution. This optimal generative model can in turn be obtained with (5), by training on the sampled datapoints $X^{u*}$ of the unseen classes together with the existing datapoints of the seen classes. Consequently, we can have a generative model that covers both seen and unseen classes by obtaining optimal parameters and sampled datapoints that satisfy (4) and (5).
In general, however, the optimal solution of this chicken-and-egg problem is challenging to obtain in closed form. To relax the problem, we can obtain an approximate solution by iteratively solving (4) and (5), namely the Simultaneously Generating And Learning (SGAL) strategy. When collecting training data is possible, as in the case of the seen classes, the traditional training scheme for the optimal model parameters can be expressed as:

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} \sum_{k=1}^{S} \frac{1}{N} \sum_{x_n \sim p(x|a^{s*}_k)} \log p(x_n|a^{s*}_k; \Theta). \qquad (6)$$

However, collecting data from the target likelihood $p(x|a^{u*})$ of the unseen classes is impossible in this case. Instead, we can lean on Expectation-Maximization [4], approximating the distribution of the auxiliary variable $x^I$, which follows the graphical model shown in Fig. 1. In our case, $x$ and $x^I$ are assumed to be a feature vector and its corresponding image, respectively.

Figure 1: Graphical model for the EM formulation. The feature vector $x$ is generated from the class attribute vector $a$, and in turn generates the corresponding image $x^I$. We assume that generating $x$ is affected only by $a$, and that $x^I$ depends only on $x$.

The EM formulation can then be started with the following:

$$\log p(x^I|a^{u*}) = \mathrm{KL}\big(q(x) \,\|\, p(x|a^{u*}; \Theta)\big) + \mathcal{L}(\Theta, q; a^{u*}), \qquad (7)$$

where $\mathcal{L}(\Theta, q; a^{u*}) = \int_x q(x) \log \frac{p(x^I|x)\, p(x|a^{u*}; \Theta)}{q(x)}\, dx$ and $\mathrm{KL}\big(q(x) \,\|\, p(x|a^{u*}; \Theta)\big) = -\int_x q(x) \log \frac{p(x|a^{u*}; \Theta)}{q(x)}\, dx$. For the Expectation step, we let $q(x) = p(x|a^{u*}; \Theta^{\mathrm{old}})$ to first drive the KL term to zero. Note that $\Theta^{\mathrm{old}}$ denotes the model parameters obtained in the previous step. Substituting this $q(x)$ into (7) and maximizing $\mathcal{L}(\Theta, q) = \sum_{k=S+1}^{S+U} \mathcal{L}(\Theta, q; a^{u*}_k)$ for the Maximization step, we have:

$$
\begin{aligned}
\operatorname*{argmax}_{\Theta} \mathcal{L}(\Theta, q)
&= \operatorname*{argmax}_{\Theta} \sum_k \int_x p(x|a^{u*}_k; \Theta^{\mathrm{old}}) \log \frac{p(x^I|x)\, p(x|a^{u*}_k; \Theta)}{p(x|a^{u*}_k; \Theta^{\mathrm{old}})}\, dx \\
&= \operatorname*{argmax}_{\Theta} \sum_k \Big[ \underbrace{-\mathrm{KL}\big(p(x|a^{u*}_k; \Theta^{\mathrm{old}}) \,\|\, p(x^I|x)\big)}_{\mathrm{const}} + \mathbb{E}_{x \sim p(x|a^{u*}_k; \Theta^{\mathrm{old}})}\big[\log p(x|a^{u*}_k; \Theta)\big] \Big] \\
&= \operatorname*{argmax}_{\Theta} \sum_k \mathbb{E}_{x \sim p(x|a^{u*}_k; \Theta^{\mathrm{old}})}\big[\log p(x|a^{u*}_k; \Theta)\big] \\
&\simeq \operatorname*{argmax}_{\Theta} \sum_{k=S+1}^{S+U} \frac{1}{N} \sum_{x_n \sim p(x|a^{u*}_k; \Theta^{\mathrm{old}})} \log p(x_n|a^{u*}_k; \Theta).
\end{aligned} \qquad (8)
$$

Note that $p(x^I|x)$ is independent of $\Theta$, as the relation between $x^I$ and $x$ is predetermined by a pre-trained network such as VGGNet or GoogLeNet; $x^I$ does not join the actual training of the proposed method.
Compared to (6), the last line of (8) can be seen as a process of sampling data from the previous model $p(x|a^{u*}; \Theta^{\mathrm{old}})$ and maximizing the current log-likelihood $\log p(x|a^{u*}; \Theta)$, which can be achieved by training the VAE with (2). In other words, we gradually update the parameters $\Theta = (\theta, \phi, \psi)$ while simultaneously generating the datapoints $X^u$ as training data from the incomplete distribution represented by the decoder and prior network of the previous step. See Algorithm 1 for a basic approach to approximating the generative model for both seen and unseen classes.
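As a toy illustration of the last line of (8), consider a 1-D Gaussian likelihood $p(x|a; \theta) = \mathcal{N}(x; \theta, 1)$, with the scalar $\theta$ standing in for all network weights (an assumption of ours for this sketch). Sampling from the old model and maximizing the Monte Carlo average log-likelihood then reduces to taking the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Generate-then-learn iteration for a toy 1-D Gaussian model
# p(x | a; theta) = N(x; theta, 1): each round "generates" from the old
# parameters and "learns" by maximizing the Monte Carlo average
# log-likelihood, whose maximizer here is simply the sample mean.
theta = 2.0                                      # seen-class pre-training result
for _ in range(5):
    x_gen = rng.normal(theta, 1.0, size=4000)    # generating step (from theta_old)
    theta = x_gen.mean()                         # learning step (MLE on x_gen)
```

The toy is deliberately close to a fixed point: iterating alone cannot add information. In the full method the update is informative because the gradient step of Algorithm 1 also mixes real seen-class minibatches through the shared encoder, decoder, and prior network.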
An overview of the network structure and training process of our model is also displayed in Fig. 2(b). In the actual implementation of the proposed method, we initialize the model parameters $\Theta$ with a converged network trained on the labeled datapoints of the seen classes, in order to ensure convergence and to exploit the seen classes as much as possible.
In (4), we assume that the model parameters are deterministic variables. However, unlike $X^{s*}$, which is sampled from the true distribution, $X^u$ is generated from an incomplete model that is still in the training process. In this case, model uncertainty can disturb the datapoint generation. We therefore handle the uncertainty and create datapoints in a more general way by assuming the model parameters to be Bayesian random variables. The conditional probability of the unseen classes in (4) is then approximately expressed as:

$$p(x|a^{u*}) = \iiint p(x|z, \theta)\, p(z|a^{u*}, \psi)\, p(\theta)\, p(\psi)\, dz\, d\theta\, d\psi \simeq \frac{1}{L L'} \sum_{l=1}^{L} \sum_{l'=1}^{L'} \int_z p(x|z, \theta_l)\, p(z|a^{u*}, \psi_{l'})\, dz, \qquad (9)$$

where $\theta_l \sim p(\theta)$ and $\psi_{l'} \sim p(\psi)$. The prior distributions of the parameters can be approximated with variational likelihoods represented as Bernoulli distributions implemented with dropout [9]. Therefore, by activating dropout when generating datapoints, the parameter sampling expressed by the summations in (9) can easily be achieved. In other words, while sampling datapoints of unseen classes using the decoder $p_\theta(x|z)$ and prior network $p_\psi(z|a^{u*})$, model uncertainty can be taken into account by activating dropout in each network.

Algorithm 1 Simultaneously Generating-And-Learning Algorithm
Require: $X^{s*}$, $A^{s*}_y$ and $A^{u*}$
1: $\Theta \leftarrow$ initialize parameters with $\hat{\Theta} = \operatorname{argmax}_\Theta \mathcal{L}(\Theta; X^{s*}, A^{s*}_y)$
2: while $\Theta$ has not converged do
3:   $X^{s*}_M, A^{s*}_M \leftarrow$ sample $M$ datapoints from $X^{s*}, A^{s*}_y$ as a minibatch
4:   $A^{u*}_N = \{a^{u*}_{y_n}\}_{n=1}^{N} \leftarrow$ randomly choose unseen class vectors from $A^{u*}$ $N$ times
5:   $X^u_N = \{x^u_n\}_{n=1}^{N} \leftarrow$ sample $x^u_n$ from $p(x|a^{u*}_{y_n}) \simeq \int p_\theta(x|z)\, p_\psi(z|a^{u*}_{y_n})\, dz$
6:   $g \leftarrow \nabla_\Theta \mathcal{L}(\Theta; X^{s*}_M, A^{s*}_M, X^u_N, A^{u*}_N)$
7:   $\Theta \leftarrow$ update parameters using gradients $g$ (e.g. Adam [11])
8: end while
9: return $\Theta$

Figure 2: Overview of the proposed method. (a) Encoder as a classifier. A test datapoint is projected into the latent space by the encoder, where the multi-modal prior represented by the prior network exists. The category is determined by calculating the Euclidean distance between the projected datapoint and the multi-modal clusters. (b) $P_\psi$, $D_\theta$ and $E_\phi$ denote the prior network for $p_\psi(z|a)$, the decoder for $p_\theta(x|z)$ and the encoder for $q_\phi(z|x)$, respectively. For training, we iteratively perform two steps. Step 1: generating datapoints for the unseen classes using the current model, $p(x|a^{u*}_y) = \int_z p_\theta(x|z)\, p_\psi(z|a^{u*}_y)\, dz$. Step 2: learning the model on both seen (existing training dataset) and unseen (generated dataset) classes using the variational lower bound.

4 Experiments

4.1 Datasets and Settings

We first use the two benchmark datasets: AwA (Animals with Attributes) [16], which contains 30,745 images of 40/10 (train/test) classes, and CUB (Caltech-UCSD Birds-200-2011) [26], comprising 11,788 images of 150/50 (train/test) species. Even though these benchmarks have been selected by many existing ZSL approaches [30], some unseen classes exist in the ImageNet 1K dataset. Since the ImageNet dataset is exploited to pre-train the various image embedding networks that are used as image-feature extractors for these datasets, this conventional setting breaks the assumption of the zero-shot setting. We thus additionally choose 4 datasets [30] following the generalized ZSL (GZSL) setting, which guarantees that none of the unseen classes appear in the ImageNet benchmark: AwA1, AwA2, CUB and SUN. AwA1 and AwA are the same dataset, but AwA1 is rearranged to follow the GZSL setting. AwA2 is an extended version of AwA and contains 37,322 images of 40/10 (train/test) classes. SUN is a scene-image dataset and consists of 14,340 images with 645/72/65 (train/test/validation)

Table 1: Comparison of the zero-shot classification accuracy (%) on AwA and CUB with the conventional setting. F: how the image feature vector is obtained for non-neural-network approaches; FG for GoogLeNet and FV for VGGNet.
For deep models, NG for Inception-V2 (GoogLeNet with batch normalization) and NV for VGGNet. SS: semantic space; A: attribute space; W: semantic word-vector space. mmVAE and SGAL denote our models trained as a normal multi-modal VAE on the seen classes and trained in the generating-and-learning manner, respectively.

Methods | F | SS | AwA (10-way 0-shot) | CUB (50-way 0-shot)
SJE [2] | FG | A | 66.7 | 50.1
ESZSL [21] | FG | A | 76.3 | 47.2
SSE-RELU [35] | FV | A | 76.3 | 30.4
JLSE [36] | FV | A | 80.5 | 42.1
SYNC-STRUCT [6] | FG | A | 72.9 | 54.5
SEC-ML [5] | FV | A | 77.3 | 43.3
DEVISE [8] | NG | A/W | 56.7/50.4 | 33.5
SOCHER et al. [22] | NG | A/W | 60.8/50.3 | 39.6
MTMDL [31] | NG | A/W | 63.7/55.3 | 32.3
BA et al. [17] | NG | A/W | 69.3/58.7 | 34.0
SAE [14] | NG | A | 84.7 | 61.4
DEM [34] | NG | A/W | 86.7/78.8 | 58.3
RELATIONNET [24] | NG | A | 84.5 | 62.0
VZSL [28] | NV | A | 85.3 | 57.4
mmVAE | NG | A | 74.2 | 58.4
SGAL | NG | A | 84.1 | 62.5

classes. These datasets, under the GZSL setting, are more suitable for realistic zero-shot problems in practice.

4.2 Network Structure and Training

Similar to previous works [17, 20, 24, 34], we use image embedding networks for ZSL: Inception-V2 [25] for the conventional setting and ResNet101 [10] for the GZSL setting. Since the proposed method exploits a VAE with a multi-modal latent prior, our network is composed of an encoder, a decoder and a prior network, as shown in Fig. 2(b). All parts of our model are basically constructed with dense (fully connected) layers. Regarding computational complexity and memory requirements, the network structure and parameters can serve as a standard of comparison, and we compare ours with other generative-model-based methods: for ours on AwA2, 1 hidden layer with 512 units is used for both encoder and decoder. In [15], 2 and 1 hidden layers, each with 512 units, are used for the encoder and decoder, respectively.
In [18], 2 hidden layers with 512 units and 1 with 1024 units are used for the encoder and decoder, respectively. [29] uses 1 hidden layer with 4096 units for the generator and 1 with 1024 units for the discriminator. We will add this evaluation to our paper. Details of the network structures and parameter settings can be found in our supplementary.
Before applying the proposed SGAL strategy, we first pre-train our model on the seen classes, as shown in Algorithm 1. We found that learning diverges when training proceeds on both seen and unseen classes from the beginning. Once the pre-training has converged, we subsequently fine-tune on both seen and unseen classes, iteratively sampling and learning on minibatches while generating datapoints for the unseen classes. The numbers of iterations for the benchmarks (for mmVAE and SGAL(EM), respectively) are: 170,000 and 1,300 for AwA1; 64,000 and 900 for AwA2; 17,000 and 2,000 for CUB; and 1,450,000 and 1,500 for SUN. In order to take model uncertainty into account, we also train the model adopting (9) when generating unseen datapoints. For each latent variable sampled from the prior network, a total of 5 samples are generated while activating dropout in the decoder. Unlike (9), in the actual implementation all dropouts of the prior network are deactivated for training stabilization.

Table 2: Zero-shot classification comparison results under the GZSL setting. Methods are evaluated using Top-1 accuracy (%) on u: unseen classes and s: seen classes. H: the harmonic mean of u and s is also reported.
mmVAE, SGAL and SGAL-dropout denote our models trained as a plain multi-modal VAE on the seen classes, trained in the generating-and-learning manner on both seen and unseen classes, and trained with dropout activated when generating unseen datapoints, respectively.

Methods | AwA1 (u/s/H) | AwA2 (u/s/H) | CUB (u/s/H) | SUN (u/s/H)
CONSE [19] | 0.4/88.6/0.8 | 0.5/90.6/1.0 | 1.6/72.2/3.1 | 6.8/39.9/11.6
DEVISE [8] | 13.4/68.7/22.4 | 17.1/74.7/27.8 | 23.8/53.0/32.8 | 16.9/27.4/20.9
ESZSL [21] | 6.6/75.6/12.1 | 5.9/77.8/11.0 | 12.6/63.8/21.0 | 11.0/27.9/15.8
ALE [1] | 16.8/76.1/27.5 | 14.0/81.8/23.9 | 23.7/62.8/34.4 | 21.8/33.1/26.3
SYNC [6] | 8.9/87.3/16.2 | 10.0/90.5/18.0 | 11.5/70.9/19.8 | 7.9/43.3/13.4
SAE [14] | 1.8/77.1/3.5 | 1.1/82.2/2.2 | 7.8/54.0/13.6 | 8.8/18.0/11.8
DEM [34] | 32.8/84.7/47.3 | 30.5/86.4/45.1 | 19.6/57.9/29.2 | 20.5/34.3/25.6
RELATION [24] | 31.4/91.3/46.7 | 30.0/93.4/45.3 | 38.1/61.1/47.0 | -/-/-
SRZSL [3] | -/-/- | 20.7/73.8/32.3 | 24.6/54.3/33.9 | 20.8/37.2/26.7
CVAE-ZSL [18] | -/-/47.2 | -/-/51.2 | -/-/34.5 | -/-/26.7
f-CLSWGAN [29] | 57.9/61.4/59.6 | -/-/- | 43.7/57.7/49.7 | 42.6/36.6/39.4
SE-GZSL [15] | 56.3/67.8/61.5 | 58.3/68.1/62.8 | 41.5/53.3/46.7 | 40.9/30.5/34.9
mmVAE | 39.4/86.8/54.2 | 15.7/92.6/26.9 | 28.5/63.1/39.3 | 14.2/43.6/21.4
SGAL | 52.7/74.0/61.5 | 52.5/86.3/65.3 | 40.9/55.3/47.0 | 35.5/34.4/34.9
SGAL-dropout | 52.7/75.7/62.2 | 55.1/81.2/65.6 | 47.1/44.7/45.9 | 42.9/31.2/36.1

Figure 3: Structure visualization of the learned datasets AwA1 and AwA2. Each color denotes an unseen class. Results of (a) mmVAE on AwA1, (b) SGAL on AwA1, (c) mmVAE on AwA2 and (d) SGAL on AwA2. While the harmonic mean score increases from 54.2% to 62.2% on AwA1, the changes between (a) and (b) are less drastic.
On the other hand, as the harmonic mean increases from 26.9% to 65.6% on AwA2, the clusters in (d) are more clearly separated from each other than those in (c).

4.3 Evaluation Results with Conventional and GZSL Settings

To evaluate the proposed method, we first compare against several alternative approaches in the conventional setting, and display the results in Table 1. Note that most works on ZSL with the conventional setting assume that the test data come only from the unseen classes. Our method obtains competitive results when evaluated on AwA, and state-of-the-art performance on the more challenging CUB benchmark dataset. We also test our method in the GZSL setting under the disjoint assumption proposed by [30]. As a measure of performance for this generalized setting, we obtain the classification accuracy for both seen and unseen classes, and report the harmonic mean of the two accuracies. Results are shown in Table 2. Our model outperforms the other non-generative methods, and shows competitive results compared to the models based on generative approaches [28, 18, 29, 15]. Note that other generative-based methods mainly use an additional off-the-shelf classifier after generating estimated samples of unseen classes with their model. In our case, however, the encoder serves as a classifier, since the proposed model covers seen and unseen classes by itself.

4.4 Effects of Generating and Learning, and Dropout Activation

The proposed approach is based on the VAE with a multi-modal prior trained on the seen classes, and learns the unseen classes through the SGAL strategy. Additionally, model uncertainty can be handled by dropouts while generating the missing datapoints for unseen classes.
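The dropout-activated generation step can be illustrated with a minimal NumPy sketch (not the authors' implementation; the two-layer decoder, the dimensions, and the dropout rate are hypothetical stand-ins): for each latent variable drawn from the class-conditional prior, several datapoints are generated while the decoder's dropout masks stay active, so the spread of the generated samples reflects model uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z, W1, W2, dropout_rate=0.5, mc_dropout=True):
    """Toy two-layer decoder; dropout is kept active at generation time
    (MC dropout), so repeated calls yield different samples for the same z."""
    h = np.maximum(0.0, z @ W1)                 # ReLU hidden layer
    if mc_dropout:                              # dropout stays ON while generating
        mask = rng.random(h.shape) > dropout_rate
        h = h * mask / (1.0 - dropout_rate)     # inverted-dropout scaling
    return h @ W2

# hypothetical sizes: latent 16, hidden 64, feature 128
W1 = rng.standard_normal((16, 64)) * 0.1
W2 = rng.standard_normal((64, 128)) * 0.1

# one latent variable sampled from a stand-in for the class-conditional prior ...
mu, sigma = np.zeros(16), np.ones(16)
z = rng.normal(mu, sigma)[None, :]

# ... expanded into 5 generated datapoints with dropout active, as in the paper
samples = np.stack([decoder(z, W1, W2) for _ in range(5)])
print(samples.shape)  # (5, 1, 128)
```

Because the dropout mask is resampled on every forward pass, the five generated datapoints disagree exactly where the decoder is uncertain, which is what makes the generated training set for unseen classes more robust.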
This series of steps can be applied in order, and we show the evaluation results for each step's model in the bottom two rows of Table 1 and the bottom three rows of Table 2: mmVAE indicates the VAE with a multi-modal prior trained only on the seen classes as in Section 3.2, SGAL is the model trained with the SGAL strategy, and SGAL-dropout denotes the SGAL model with activated dropouts in the decoder when generating unseen datapoints. mmVAE shows low performance on unseen classes, since the model learns the target distribution of only the seen classes. SGAL, however, generates missing datapoints using the class embeddings and the model itself, and the entire model is trained iteratively on these generated datapoints together with the seen-class datapoints. As SGAL aims to learn the distributions of both seen and unseen classes in this manner, robust classification performance on unseen classes is achieved. One can observe that SGAL shows decreased performance on the seen classes compared to mmVAE. We believe that, since the proposed method is a generative model that covers the distribution of all classes, a performance trade-off between seen and unseen classes occurs. To visualize the effects of the proposed method, several learned datasets are displayed in Fig. 3 using t-SNE.
The proposed model shows state-of-the-art results in harmonic mean on the AwA1 and AwA2 datasets, and in classification accuracy of unseen classes on the CUB and SUN datasets; CUB and SUN contain almost 5 and 12 times more classes than the AwA datasets, and the multi-modal distribution of the seen classes is distorted more easily when fine-tuning to insert new clusters for the unseen classes.
That is, unseen clusters can be deduced from plenty of seen clusters, so the model achieves superior results for unseen classes, but the performance drops more easily for seen classes due to this distortion.
In general, a generative model learns from a training dataset sampled from the real world, but under the SGAL strategy the model learns the target distribution from datapoints sampled from the distribution that the model itself represents. Since the generated datapoints fluctuate depending on the current model, the model uncertainty can affect the model performance. To relieve this problem, SGAL-dropout activates dropouts when sampling unseen datapoints, and shows more robust classification results compared to those of SGAL. That is, by sampling the unseen datapoints while reducing the model uncertainty, the model better describes the target distribution of unseen classes. In this case, however, the performance on the seen classes is further reduced by the generalization over both seen and unseen classes, similar to the case between mmVAE and SGAL.

5 Conclusion

We have introduced a novel strategy for zero-shot learning (ZSL) using a VAE with a multi-modal prior distribution. The absence of datapoints for unseen classes is the fundamental problem of ZSL, which makes it challenging to obtain a generative model for both seen and unseen classes. We therefore treat the missing datapoints as variables to be optimized like model parameters, and train our network with the Simultaneously Generating-And-Learning strategy, similar to the EM manner. In other words, while training, our model iteratively generates unseen samples and uses them as training datapoints to gradually update the model parameters. Consequently, our model favorably attains an understanding of both seen and unseen classes. With the encoder and the prior network, classification can be performed directly without additional classifiers.
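The classifier-free decision rule can be sketched in a few lines (a toy illustration under assumed components, not the released code; the diagonal-Gaussian priors, dimensions, and seeded parameters are stand-ins for the prior network's per-class outputs): a test point is encoded into the latent space, and the predicted label is the class whose prior mode assigns the encoded latent the highest log-density.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical class-specific prior parameters (one Gaussian mode per class),
# standing in for what the prior network produces from each class embedding.
n_classes, d = 4, 8
mus = rng.standard_normal((n_classes, d))
log_vars = np.zeros((n_classes, d))  # unit variances for simplicity

def log_gaussian(z, mu, log_var):
    """Log density of a diagonal Gaussian: log N(z; mu, diag(exp(log_var)))."""
    return -0.5 * np.sum(log_var + np.log(2.0 * np.pi)
                         + (z - mu) ** 2 / np.exp(log_var))

def classify(z):
    """Assign the class whose prior mode best explains the encoded latent z,
    so no separate off-the-shelf classifier is needed."""
    scores = [log_gaussian(z, mus[c], log_vars[c]) for c in range(n_classes)]
    return int(np.argmax(scores))

# A latent encoded near class 2's mode is labeled class 2.
z = mus[2] + 0.01 * rng.standard_normal(d)
print(classify(z))  # 2
```

In the full model the encoder plays the role of the mapping from datapoint to latent z, and the same scoring covers seen and unseen classes alike, since both have prior modes after SGAL training.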
Further, by capturing the model uncertainty with dropouts, we show that a more robust model for unseen classes is achievable. The proposed method attains results competitive with the state of the art on various benchmarks, while outperforming it on several datasets.

Acknowledgments

We would like to thank Jihoon Moon and Hanjun Kim, who gave us insightful advice. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1A2B2002608), in part by the Automation and Systems Research Institute (ASRI), and in part by the Brain Korea 21 Plus Project.

References
[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2016.
[2] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
[3] Yashas Annadani and Soma Biswas. Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7603–7612, 2018.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In European Conference on Computer Vision, pages 730–746. Springer, 2016.
[6] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5327–5336, 2016.
[7] Nat Dilokthanakul, Pedro A. M. Mediano, Marta Garnelo, Matthew C. H. Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[8] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015.
[12] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
[13] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[14] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3174–3183, 2017.
[15] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai.
Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4281–4289, 2018.
[16] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
[17] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255, 2015.
[18] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2188–2196, 2018.
[19] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, 2014.
[20] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
[21] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
[22] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
[23] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models.
In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
[24] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[26] Catherine Wah, Steve Branson, Pietro Perona, and Serge Belongie. Multiclass recognition and part localization with humans in the loop. In 2011 International Conference on Computer Vision, pages 2524–2531. IEEE, 2011.
[27] Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In Advances in Neural Information Processing Systems, pages 5756–5766, 2017.
[28] Wenlin Wang, Yunchen Pu, Vinay Kumar Verma, Kai Fan, Yizhe Zhang, Changyou Chen, Piyush Rai, and Lawrence Carin. Zero-shot learning via class-conditioned deep generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[29] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.
[30] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.
[31] Yongxin Yang and Timothy M. Hospedales.
A unified perspective on multi-domain and multi-task learning. arXiv preprint arXiv:1412.7489, 2014.
[32] Hyeonwoo Yu and Beomhee Lee. A variational feature encoding method of 3D object for probabilistic semantic SLAM. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3605–3612. IEEE, 2018.
[33] Hyeonwoo Yu and Beomhee Lee. A variational observation model of 3D object for probabilistic semantic SLAM. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pages 5866–5872. IEEE, 2019.
[34] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021–2030, 2017.
[35] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision, pages 4166–4174, 2015.
[36] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.
[37] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1004–1013, 2018.