{"title": "Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 3063, "page_last": 3072, "abstract": "This paper studies the task of one-shot fine-grained recognition, which suffers from the problem of data scarcity of novel fine-grained classes. To alleviate this problem, a off-the-shelf image generator can be applied to synthesize additional images to help one-shot learning. However, such synthesized images may not be helpful in one-shot fine-grained recognition, due to a large domain discrepancy between synthesized and original images. To this end, this paper proposes a meta-learning framework to reinforce the generated images by original images so that these images can facilitate one-shot learning. Specifically, the generic image generator is updated by few training instances of novel classes; and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. The model is trained in an end-to-end manner, and our experiments demonstrate consistent improvement over baseline on one-shot fine-grained image classification benchmarks.", "full_text": "Meta-Reinforced Synthetic Data for One-Shot\n\nFine-Grained Visual Recognition\n\nSatoshi Tsutsui\nIndiana University\n\nUSA\n\nYanwei Fu\u21e4\n\nFudan University\n\nChina\n\nstsutsui@indiana.edu\n\nyanweifu@fudan.edu.cn\n\ndjcran@indiana.edu\n\nDavid Crandall\nIndiana University\n\nUSA\n\nAbstract\n\nOne-shot \ufb01ne-grained visual recognition often suffers from the problem of training\ndata scarcity for new \ufb01ne-grained classes. To alleviate this problem, an off-the-shelf\nimage generator can be applied to synthesize additional training images, but these\nsynthesized images are often not helpful for actually improving the accuracy of\none-shot \ufb01ne-grained recognition. 
This paper proposes a meta-learning framework to combine generated images with original images, so that the resulting “hybrid” training images can improve one-shot learning. Specifically, the generic image generator is updated by a few training instances of novel classes, and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. The model is trained in an end-to-end manner, and our experiments demonstrate consistent improvement over baselines on one-shot fine-grained image classification benchmarks.

1 Introduction

The availability of vast labeled datasets has been crucial for the recent success of deep learning. However, there will always be learning tasks for which labeled data is sparse. Fine-grained visual recognition is one typical example: when images are to be classified into many very specific categories (such as species of birds), it may be difficult to obtain training examples for rare classes, and producing the ground truth labels may require significant expertise (e.g., ornithologists). One-shot learning is thus very desirable for fine-grained visual recognition.
A recent approach to address data scarcity is meta-learning [7, 10, 24, 35], which trains a parameterized function called a meta-learner that maps labeled training sets to classifiers. The meta-learner is trained by sampling small training and test sets from a large dataset of base classes. Such a meta-learned model can be adapted to recognize novel categories with a single training instance per class. Another way to address data scarcity is to synthesize additional training examples, for example by using off-the-shelf Generative Adversarial Networks (GANs) [3, 13].
However, classifiers trained from GAN-generated images are typically inferior to those trained with real images, possibly because the distribution of generated images may be biased towards frequent patterns (modes) of the original image distribution [26]. This is especially true in one-shot fine-grained recognition, where a tiny difference (e.g., the beak of a bird) can make a large difference in class.
To address these issues, we develop an approach to apply off-the-shelf generative models to synthesize training data in a way that improves one-shot fine-grained classifiers. We begin by conducting a pilot study to transfer a generator pre-trained on ImageNet in a one-shot scenario. We show that the generated images can indeed improve the performance of a one-shot classifier when used with a carefully-designed rule to combine the generated images with the originals. Based on these preliminary results, we propose a meta-learning approach to learn these rules to reinforce the generated images effectively for few-shot classification.

*Y. Fu is with the School of Data Science, and MOE Frontiers Center for Brain Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our approach has two steps. First, an off-the-shelf generator trained from ImageNet is updated towards the domain of novel classes by using only a single image (Sec. 4.1). Second, since previous work and our pilot study (Sec. 3) suggest that simply adding synthesized images to the training data may not improve one-shot learning, the synthesized images are “mixed” with the original images in order to bridge the domain gap between the two (Sec. 4.2). The effective mixing strategy is learned by a meta-learner, which essentially boosts the performance of fine-grained categorization with a single training instance per class.
Lastly, we experimentally validate that our approach can achieve improved performance over baselines on fine-grained classification datasets in one-shot situations (Sec. 5).
To summarize, the contributions of this paper are: (1) a method to transfer a pre-trained generator with a single image, (2) a method to learn to complement real images with synthetic images in a way that benefits one-shot classifiers, and (3) an experimental demonstration that these methods improve one-shot classification accuracy on fine-grained visual recognition benchmarks.

2 Related Work

Image Generation. Learning to generate realistic images has many potential applications, but is challenging with typical supervised learning. Supervised learning minimizes a loss function between the predicted output and the desired output, but for image generation it is not easy to design such a perceptually-meaningful loss between images. Generative Adversarial Networks (GANs) [13] address this issue by learning not only a generator but also a loss function (the discriminator) that helps the generator to synthesize images indistinguishable from real ones. This adversarial learning is intuitive but is known to often be unstable in practice [14]. Recent progress includes better CNN architectures [3, 21], training stabilization tips [2, 14, 19], and interesting applications (e.g., [38]). In particular, BigGAN [3] trained on ImageNet has shown visually impressive generated images with stable performance on generic image generation tasks. Several studies [20, 33] have explored generating images from few examples, but their focus has not been on one-shot classification.
Several papers [8, 9, 20] use the idea of adjusting batch normalization layers, which helped inspire our work. Finally, some work has investigated using GANs to help image classification [1, 12, 26, 27, 37]; our work differs in that we apply an off-the-shelf generator pre-trained on a large and generic dataset.
Few-shot Meta-learning. Few-shot classification [4] is a sub-field of meta-learning (or “learning-to-learn”) problems, in which the task is to train a classifier with only a few examples per class. Unlike the typical classification setup, in few-shot classification the labels in the training and test sets have no overlapping categories. Moreover, the model is trained and evaluated by sampling many few-shot tasks (or episodes). For example, when training a dog breed classifier, an episode might train to recognize five dog species with only a single training image per class, a 5-way-1-shot setting. A meta-learning method trains a meta-model by sampling many episodes from training classes and is evaluated by sampling many episodes from other unseen classes. With this episodic training, we can choose among several possible approaches to learning to learn. For example, “learning to compare” methods learn a metric space (e.g., [28, 29, 31]), while other approaches learn to fine-tune (e.g., [10, 11, 22, 23]) or learn to augment data (e.g., [6, 12, 15, 25, 34]). An advantage of the latter type is that, since it is data augmentation, we can use it in combination with any other approach. Our approach also explores data augmentation by mixing the original images with synthesized images produced by a fine-tuned generator, but we find that the naive approach of simply adding GAN-generated images to the training dataset does not improve performance.
But by carefully combining generated images with the original images, we find that we can effectively synthesize examples that contribute to increasing the performance. Thus meta-learning is employed to learn the proper combination strategy.

3 Pilot Study

To explain how we arrived at our approach, we describe some initial experimentation which motivated the development of our methods.

Table 1: CUB 5-way-1-shot classification accuracy (%) using ImageNet features. Simply adding generated images to the training set does not help, but adding hybrid images, as in Fig. 1 (h), can.

Training Data        | Nearest Neighbor | Logistic Regression | Softmax Regression
Original             | 69.6             | 75.0                | 74.1
Original + Generated | 70.1             | 74.6                | 73.8
Original + Mixed     | 70.6             | 75.5                | 74.8

Figure 1: Samples described in Sec. 3. (a) Original image. (b) Result of tuning noise only. (c) Result of tuning the whole network. (d) Result of tuning batch norm only. (e) Result of tuning batch norm with perceptual loss. (f) Result of slightly disturbing noise from (e). (g) A 3 × 3 block weight matrix w. (h) Result of mixing (a) and (f) as w × (f) + (1 − w) × (a).

How can we transfer generative knowledge from pre-trained GANs? We aim to quickly generate training images for few-shot classification. Performing adversarial learning (i.e., training a generator and discriminator initialized with pre-trained weights) is not practical when we only have one or two examples per class. Instead, we want to develop a method that does not depend on the number of images at all; in fact, we consider the extreme case where only a single image is available, and want to generate variants of the image using a pre-trained GAN.
We tried fixing the generator weights and optimizing the noise so that it generates the target image, under the assumption that slightly modifying the optimized noise would produce a variant of the original. However, naively implementing this idea with BigGAN did not reconstruct the image well, as shown in the sample in Fig. 1(b). We then tried fine-tuning the generator weights also, but this produced even worse images stuck in a local minimum, as shown in Fig. 1(c).
We speculate that the best approach may be somewhere between the two extremes of tuning noise only and tuning both noise and weights. Inspired by previous work [8, 9, 20], we propose to fine-tune only the scale and shift parameters in the batch normalization layers. This strategy produces better images, as shown in Fig. 1(d). Finally, again inspired by previous work [20], we minimize not only the pixel-level distance but also the distance between pre-trained CNN representations (i.e., perceptual loss [17]), and we show the slightly improved results in Fig. 1(e). We can also generate slightly different versions by adding random perturbations to the tuned noise (e.g., the “fattened” version of the same bird in Fig. 1(f)). The entire training process needs fewer than 500 iterations and takes less than 20 seconds on an NVidia Titan Xp GPU. We explain the resulting generation strategy developed based on this pilot study in Sec. 4.

Are generated images helpful for few-shot learning? Our goal is not to generate images, but to augment the training data for few-shot learning. A naive way to do this is to apply the above generation technique to each training image, in order to double the training set. We tested this idea on a validation set (split the same as [4]) from the Caltech-UCSD bird dataset [32] and computed average accuracy on 100 episodes of 5-way-1-shot classification.
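A schematic of this episodic evaluation with a nearest-neighbor classifier is sketched below. This is an illustration only, not the paper's pipeline: the features and the hypothetical `gen_feats` table (standing in for features of generated images in the naive-augmentation condition) are assumptions of the sketch.

```python
import random
import numpy as np

def episode_accuracy(feats, gen_feats=None, n_way=5, n_query=16, rng=random):
    """Run one n-way-1-shot episode: sample n_way classes, build a one-shot
    support set (optionally doubled with generated examples), and classify
    queries by nearest neighbor in feature space."""
    classes = rng.sample(sorted(feats), n_way)
    support, support_y = [], []
    for y, c in enumerate(classes):
        support.append(rng.choice(feats[c]))
        support_y.append(y)
        if gen_feats is not None:  # naive augmentation: add a generated example
            support.append(rng.choice(gen_feats[c]))
            support_y.append(y)
    S = np.stack(support)
    correct = total = 0
    for y, c in enumerate(classes):
        for q in rng.choices(feats[c], k=n_query):
            pred = support_y[int(np.argmin(np.linalg.norm(S - q, axis=1)))]
            correct += int(pred == y)
            total += 1
    return correct / total
```

Averaging `episode_accuracy` over many episodes, with and without `gen_feats`, mirrors the comparison reported in Table 1.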
We used pre-trained ImageNet features from ResNet18 [16] with nearest neighbor, one-vs-all logistic regression, and softmax regression (or multi-class logistic regression). As shown in Table 1, the accuracy actually drops for two of the three classifiers when we double the size of our training set by generating synthetic training images, suggesting that the generated images are harmful for training classifiers.
What is the proper way of synthesizing images to help few-shot learning? Given that the synthetic images appear meaningful to humans, we conjecture that they can benefit few-shot classification when properly mixed with originals to create hybrid images. To empirically test this hypothesis, we devised a random 3 × 3 grid to combine the images. As shown in Fig. 1(h), images (a) and (f) were combined by taking a linear combination within each cell of the grid of (g). Finally, we added mixed images like (h) into the training data, and discovered that this produced a modest increase in accuracy (last row of Table 1).

Figure 2: Our Meta Image Reinforcing Network (MetaIRNet) has two modules: an image fusion network, and a one-shot classification network. The image fusion network reinforces generated images to try to make them beneficial for the one-shot classifier, while the one-shot classifier learns representations that are suitable to classify unseen examples with few examples. Both networks are trained end-to-end, so the loss back-propagates from the classifier to the fusion network.

While the increase is marginal, these mixing weights were binary and manually selected, and thus likely not optimal. In Sec.
4.2, we show how to learn this mixing strategy in an end-to-end manner using a meta-learning framework.

4 Method

The results of the pilot study in the last section suggested that producing synthetic images could be useful for few-shot fine-grained recognition, but only if it is done in a careful way. In this section, we use these findings to propose a novel technique for doing this effectively. We propose a GAN fine-tuning method that works with a single image (Sec. 4.1), and an effective meta-learning method to not only learn to classify with few examples, but also to learn to effectively reinforce the generated images (Sec. 4.2).

4.1 Fine-tuning Pre-trained Generator for Target Images

GANs typically have a generator G and a discriminator D. Given an input signal z ∼ N(0, 1), a well-trained generator synthesizes an image G(z). In our tasks, we adapt an off-the-shelf GAN generator G that is pre-trained on the ImageNet-2012 dataset in order to generate more images in a target, data-scarce domain. Note that we do not use the discriminator, since adversarial training with a few images is unstable and may lead to model collapse. Formally, we fine-tune z and the generator G such that G generates an image Iz from an input vector z by minimizing the distance between G(z) and Iz, where the vector z is randomly initialized. Inspired by previous work [2, 5, 20], we minimize a loss function LG combining the L1 distance and a perceptual loss Lperc with an earth mover regularization LEM,

LG(G, Iz, z) = L1(G(z), Iz) + λp Lperc(G(z), Iz) + λz LEM(z, r),    (1)

where LEM is an earth mover distance between z and random noise r ∼ N(0, 1) that regularizes z to be sampled from a Gaussian, and λp and λz are coefficients of each term.
Since only a few training images are available in the target domain, only the scale and shift parameters of the batch normalization layers of G are updated in practice. Specifically, only the γ and β of each batch normalization layer are updated:

x̂ = (x − E(x)) / √(Var(x) + ε),    h = γ x̂ + β,    (2)

where x is the input feature from the previous layer, and E and Var indicate the mean and variance functions, respectively. Intuitively and in principle, updating γ and β only is equivalent to adjusting the activation of each neuron in a layer. Once updated, G(z) is synthesized to reconstruct the image Iz. Empirically, a small random perturbation ε is added to z, as G(z + ε). Examples of Iz, G(z), and G(z + ε) are illustrated in Fig. 1 (a), (e), and (f), respectively.

4.2 Meta Reinforced Synthetic Data for Few-shot Learning

We propose a meta-learning method to add synthetic data to the originals.
One-shot Learning. One-shot classification is a meta-learning problem that divides a dataset into two sets: a meta-training (or base) set and a meta-testing (or novel) set. The classes in the base and novel sets are disjoint. In other words, we have Dbase = {(Ii, yi), yi ∈ Cbase} and Dnovel = {(Ii, yi), yi ∈ Cnovel}, where Cbase ∩ Cnovel = ∅. The task is to train a classifier on Dbase that can quickly generalize to unseen classes in Cnovel with one or few examples. To do this, a meta-learning algorithm performs meta-training by sampling many one-shot tasks from Dbase, and is evaluated by sampling many similar tasks from Dnovel. Each sampled task (called an episode) is an n-way-m-shot classification problem with q queries, meaning that we sample n classes with m training and q test examples for each class. In other words, an episode has a support (or training) set S and a query (or test) set Q, where |S| = n × m and |Q| = n × q.
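Sampling one such episode from a labeled pool can be sketched in plain Python (the `pool` dict mapping each class label to its list of examples is an assumption of this illustration, not the paper's data interface):

```python
import random

def sample_episode(pool, n=5, m=1, q=16, rng=random):
    """Sample one n-way-m-shot episode: a support set S of n*m labeled
    examples and a disjoint query set Q of n*q labeled examples."""
    classes = rng.sample(sorted(pool), n)
    S, Q = [], []
    for c in classes:
        picks = rng.sample(pool[c], m + q)  # support and queries never overlap
        S += [(x, c) for x in picks[:m]]
        Q += [(x, c) for x in picks[m:]]
    return S, Q
```

Meta-training repeatedly draws such (S, Q) pairs from Dbase, and evaluation draws them from Dnovel.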
One-shot learning means m = 1. The notation Sc means the support examples that belong to the class c, so |Sc| = m.
Meta Image Reinforcing Network (MetaIRNet). We propose a Meta Image Reinforcing Network (MetaIRNet), which not only learns a few-shot classifier, but also learns to reinforce generated images by combining real and generated images. MetaIRNet is composed of two modules: an image fusion network F, and a one-shot classification network C.
The Image Fusion Network F combines a real image I and a corresponding generated image Ig into a new image Isyn = F(I, Ig) that is beneficial for training a one-shot classifier. Among the many possible ways to synthesize images, we were inspired by a block augmentation method [6] and use a grid-based linear combination. As shown in Figure 1(g), we divide the images into a 3 × 3 grid and linearly combine the cells with the weights w produced by a CNN conditioned on the two images. That is,

Isyn = w ⊙ I + (1 − w) ⊙ Ig,    (3)

where ⊙ is element-wise multiplication, and w is resized to the image size keeping the block structure. The CNN that produces w extracts the feature vectors of I and Ig, concatenates them, and uses a fully-connected layer to produce a weight corresponding to each of the nine cells in the 3 × 3 grid. Finally, for each real image Ii, we generate naug synthetic images and assign the same class label yi to each synthesized image I^{i,j}_syn to obtain an augmented support set,

S̃ = { (Ii, yi), { (I^{i,j}_syn, yi) }_{j=1}^{naug} }_{i=1}^{n×m}.    (4)

The One-Shot Classification Network C maps an input image I into feature maps C(I), and performs one-shot classification. Although any one-shot classifier can be used, we choose the non-parametric prototype classifier of Snell et al. [28] due to its superior performance and simplicity.
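Both modules admit a compact sketch. The numpy illustration below is a simplification under stated assumptions: the 3 × 3 weights w are given directly (in MetaIRNet they are produced by a CNN conditioned on the two images), image sides are divisible by 3, and an identity embedding stands in for the classification network C:

```python
import numpy as np

def fuse(img, gen, w3x3):
    """Grid-based fusion: I_syn = w * I + (1 - w) * I_g, with the 3x3 block
    weights upsampled to image size while keeping the block structure.
    Assumes img height/width are divisible by 3."""
    h, w = img.shape[:2]
    W = np.kron(w3x3, np.ones((h // 3, w // 3)))[..., None]  # 3x3 -> HxWx1
    return W * img + (1.0 - W) * gen

def prototype_predict(support, labels, queries, embed=lambda x: x):
    """Prototype classification: class prototypes are mean embeddings of the
    (possibly augmented) support set; each query gets a softmax over negative
    Euclidean distances to the prototypes."""
    feats = np.stack([embed(x) for x in support])
    classes = sorted(set(labels))
    protos = np.stack([feats[np.array([l == c for l in labels])].mean(0)
                       for c in classes])
    preds = []
    for x in queries:
        d = np.linalg.norm(embed(x) - protos, axis=1)
        p = np.exp(-d) / np.exp(-d).sum()  # softmax over negative distances
        preds.append(classes[int(p.argmax())])
    return preds
```

`fuse` corresponds to Eq. (3) with the block-structured upsampling of w, and `prototype_predict` corresponds to the classification rule of Eqs. (5)-(6).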
During each episode, given the sampled S and Q, the image fusion network produces an augmented support set S̃. The classifier computes the prototype vector pc for each class c in S̃ as an average feature vector,

pc = (1 / |S̃c|) Σ_{(Ii, yi) ∈ S̃c} C(Ii).    (5)

For a query image Ii ∈ Q, the probability of belonging to a class c is estimated as

P(yi = c | Ii) = exp(−‖C(Ii) − pc‖) / Σ_{k=1}^{n} exp(−‖C(Ii) − pk‖),    (6)

where ‖·‖ is the Euclidean distance. Then, for a query image, the class with the highest probability becomes the final prediction of the one-shot classifier.

Training. In the meta-training phase, we jointly train F and C in an end-to-end manner, minimizing the cross-entropy loss function

min_{θF, θC}  −(1 / |Q|) Σ_{(Ii, yi) ∈ Q} log P(yi | Ii),    (7)

where θF and θC are the learnable parameters of F and C.

5 Experiments

To investigate the effectiveness of our approach, we perform 1-shot-5-way classification following the meta-learning experimental setup described in Sec. 4.2. We perform 1000 episodes in meta-testing, with 16 query images per class per episode, and report average classification accuracy and 95% confidence intervals. We use the fine-grained classification dataset of Caltech-UCSD Birds (CUB) [32] for our main experiments, and another fine-grained dataset of North American Birds (NAB) [30] for secondary experiments. CUB has 11,788 images with 200 classes, and NAB has 48,527 images with 555 classes.

5.1 Implementation Details

While our fine-tuning method introduced in Sec. 4.1 can generate images at each step of meta-training and meta-testing, it takes around 20 seconds per image, so we apply the generation method ahead of time to make our experiments more efficient. We use a BigGAN pre-trained on ImageNet, using the publicly-available weights.
We set λp = 0.1 and λz = 0.1, and perform 500 gradient descent updates with the Adam [18] optimizer, with learning rate 0.01 for z and 0.0005 for the fully-connected layers that produce the scale and shift parameters of the batch normalization layers. We manually chose these hyper-parameters by trying random values from 0.1 to 0.0001 and visually checking the quality of a few generated images. We only train once for each image, generate 10 random images by perturbing z, and randomly use one of them for each episode (naug = 1). For image classification, we use ResNet18 [16] pre-trained on ImageNet for the two CNNs in F and the one in C. We train F and C with Adam with a default learning rate of 0.001. We select the best model based on the validation accuracy, and then compute the final accuracy on the test set. For CUB, we use the same train/val/test split used in previous work [4], and for NAB we randomly split with a proportion of train:val:test = 2:1:1; see the supplementary material for details. Further implementation details are available as supplemental source code.2

5.2 Comparative and Ablative Study on the CUB Dataset

Baselines. We compare our MetaIRNet with three types of baselines. (1) Non-meta-learning classifiers: We directly train the same ImageNet pre-trained CNN used in F to classify images in Dbase, and use it as a feature extractor for Dnovel. We then use the off-the-shelf classifiers nearest neighbor, logistic regression (one-vs-all classifier), and softmax regression (also called multi-class logistic regression). (2) Meta-learning classifiers: We try the meta-learning method of the prototypical network (ProtoNet [28]). ProtoNet computes an average prototype vector for each class and performs nearest neighbor with the prototypes. We note that our MetaIRNet adopts ProtoNet as the choice of C, so this is an ablative version of our model (MetaIRNet without the image fusion module).
(3) Data augmentation: Because our MetaIRNet learns data augmentation as a sub-module, we also compare with three data augmentation strategies: Flip, Gaussian, and FinetuneGAN. Flip horizontally flips the images. Gaussian adds Gaussian noise with standard deviation 0.01 into the CNN features. FinetuneGAN (introduced in Sec. 4.1) generates augmented images by fine-tuning the ImageNet-pretrained BigGAN with each support set. Note that we perform these augmentations in the meta-testing stage to increase the support set. For fair comparison, we use ProtoNet as the base classifier of these data augmentation baselines.

2http://vision.soic.indiana.edu/metairnet/

Table 2: 5-way-1-shot accuracy (%) on the CUB/NAB datasets with ImageNet pre-trained ResNet18

Method              | Data Augmentation | CUB Acc.     | NAB Acc.
Nearest Neighbor    | -                 | 79.00 ± 0.62 | 80.58 ± 0.59
Logistic Regression | -                 | 81.17 ± 0.60 | 82.70 ± 0.57
Softmax Regression  | -                 | 80.77 ± 0.60 | 82.38 ± 0.57
ProtoNet            | -                 | 81.73 ± 0.63 | 87.91 ± 0.52
ProtoNet            | FinetuneGAN       | 79.40 ± 0.69 | 85.40 ± 0.59
ProtoNet            | Flip              | 82.66 ± 0.61 | 88.55 ± 0.50
ProtoNet            | Gaussian          | 81.75 ± 0.63 | 87.90 ± 0.52
MetaIRNet (Ours)    | FinetuneGAN       | 84.13 ± 0.58 | 89.19 ± 0.51
MetaIRNet (Ours)    | FinetuneGAN, Flip | 84.80 ± 0.56 | 89.57 ± 0.49

Table 3: 5-way-1-shot accuracy (%) on the CUB dataset with Conv4 without ImageNet pre-training

MAML [10]        | MatchingNet [31] | RelationNet [29] | ProtoNet [28] | MetaIRNet
55.92 ± 0.95 [4] | 61.16 ± 0.89 [4] | 62.45 ± 0.98 [4] | 63.50 ± 0.70  | 65.86 ± 0.72

Results. As shown in Table 2, our MetaIRNet is superior to all baselines, including the meta-learning classifier ProtoNet (84.13% vs. 81.73%), on the CUB dataset. It is notable that while ProtoNet has worse accuracy when simply using the generated images as data augmentation, our method shows an accuracy increase from
ProtoNet, which is equivalent to MetaIRNet without the image fusion module. This indicates that our image fusion module can effectively complement the original images while removing harmful elements from generated ones.
Interestingly, horizontal flip augmentation yields nearly a 1% accuracy increase for ProtoNet. Because flipping augmentation cannot be learned directly by our method, we conjectured that our method could also benefit from it. The final line of the table shows an additional experiment with our MetaIRNet combined with random flip augmentation, showing an additional accuracy increase from 84.13% to 84.80%. This suggests that our method provides an improvement that is orthogonal to flip augmentation.

Case Studies. We show some sample visualizations in Fig. 4. We observe that image generation often works well, but sometimes completely fails. An advantage of our technique is that even in these failure cases, our fused images often maintain some of the object's shape, even if the images themselves do not look realistic. In order to investigate the quality of generated images in more detail, we randomly pick two classes, sample 100 images for each class, and show a t-SNE visualization of real images (•), generated images (▲), and augmented fused images (+) in Fig. 3, with classes shown in red and blue. It is reasonable that the generated images are closer to the real ones, because our loss function (Equation 1) encourages this to be so. Interestingly, perhaps due to artifacts of the 3 × 3 patches, the fused images are distinct from the real/generated images, extending the decision boundary.

Figure 3: t-SNE plot of real, generated, and fused images for two classes.

Comparing with state-of-the-art meta-learning classifiers. It is a convention in the machine learning community to compare any new technique with the performance of many state-of-the-art methods reported in the literature.
This is somewhat difficult for us to do fairly, however: we use ImageNet-pre-trained features as a starting point (which is a natural design decision considering that our focus is how to use ImageNet pre-trained generators for improving fine-grained one-shot classification), but much of the one/few-shot learning literature focuses on algorithmic improvements and thus trains from scratch (often with non-fine-grained datasets). The Delta Encoder [25], which uses the idea of learning data augmentation in the feature space, reports 82.2% on one-shot classification on the CUB dataset with ImageNet-pre-trained features, but this is an average of only 10 episodes. To provide a more stable comparison, we cite a benchmark study [4] reporting the accuracy of other meta-learners [10, 29, 31] on the CUB dataset with 600 episodes. To compare with these scores, we experimented with our MetaIRNet and the ProtoNet baseline using the same four-layered CNN. As shown in Table 3, our MetaIRNet performs better than the other methods, with more than 2% absolute improvement.

Figure 4: Samples of original image, generated image, fused image, and mixing weight w. Higher weight (red) means more original image used, and lower weight (blue) means more generated image used. We show three types of samples based on the quality of generated images: very good (top row), relatively good (middle row), and very bad or broken (last row).

We note that this comparison is not totally fair because we use images generated from a generator pre-trained from ImageNet.
However, our contribution is not to establish a new state-of-the-art score but to present the idea of transferring an ImageNet pre-trained GAN for improving one-shot classifiers, so we believe this comparison is still informative.

5.3 Results on the NAB Dataset

We also performed similar experiments on the NAB dataset, which is more than four times larger than CUB; the results are shown in the last column of Table 2. We observe results similar to those on CUB: our method improves classification accuracy over a ProtoNet baseline (89.19% vs. 87.91%).

6 Conclusion

We introduce an effective way to employ an ImageNet-pre-trained image generator for the purpose of improving fine-grained one-shot classification when data is scarce. As a way to fine-tune the pre-trained generator, our pilot study finds that adjusting only the scale and shift parameters in batch normalization can produce visually realistic images. This technique works with a single image, making the method less dependent on the number of available images. Furthermore, although naively adding the generated images into the training set does not improve performance, we show that it can improve performance if we mix generated with original images to create hybrid training exemplars. In order to learn the parameters of this mixing, we adapt a meta-learning framework. We implement this idea and demonstrate a consistent and significant improvement over several classifiers on two fine-grained benchmark datasets.

Acknowledgments

We would like to thank Yi Li for drawing Figure 2, and Minjun Li and Atsuhiro Noguchi for helpful discussions. Part of this work was done while Satoshi Tsutsui was an intern at Fudan University. Yanwei Fu was supported in part by the NSFC project (#61572138), and the Science and Technology Commission of Shanghai Municipality Project (#19511120700).
David Crandall was supported in part by the National Science Foundation (CAREER IIS-1253549), and the Indiana University Office of the Vice Provost for Research, the College of Arts and Sciences, and the Luddy School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.” Yanwei Fu is the corresponding author.

References

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Augmenting image classifiers using data augmentation generative adversarial networks. In International Conference on Artificial Neural Networks, 2018.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.

[4] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations (ICLR), 2019.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.

[6] Zitian Chen, Yanwei Fu, Kaiyu Chen, and Yu-Gang Jiang. Image block augmentation for one-shot learning. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

[7] Zitian Chen, Yanwei Fu, Yu-Xiong Wang, Lin Ma, Wei Liu, and Martial Hebert. Image deformation meta-networks for one-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[8] Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville.
Modulating early visual processing by language. In Advances in Neural Information Processing Systems (NIPS), 2017.

[9] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In International Conference on Learning Representations (ICLR), 2017.

[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.

[11] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[12] Hang Gao, Zheng Shou, Alireza Zareian, Hanwang Zhang, and Shih-Fu Chang. Low-shot learning via covariance-preserving adversarial augmentation networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[14] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[15] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In IEEE International Conference on Computer Vision (ICCV), 2017.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[17] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), 2016.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[19] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018.

[20] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation. In IEEE International Conference on Computer Vision (ICCV), 2019.

[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.

[22] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[23] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations (ICLR), 2018.

[24] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.

[25] Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Abhishek Kumar, Rogerio Feris, Raja Giryes, and Alex Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[26] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. How good is my GAN? In European Conference on Computer Vision (ECCV), 2018.

[27] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[28] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[29] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[30] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[31] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[33] Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring GANs: generating images from limited data. In European Conference on Computer Vision (ECCV), 2018.

[34] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[35] Yuxiong Wang and Martial Hebert. Learning from small sample sets by combining unsupervised meta-training with CNNs. In Advances in Neural Information Processing Systems (NIPS), 2016.

[36] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization.
In International Conference on Learning Representations (ICLR), 2018.

[37] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.