{"title": "Improved Techniques for Training GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 2234, "page_last": 2242, "abstract": "We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: Our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.", "full_text": "Improved Techniques for Training GANs\n\nTim Salimans\n\ntim@openai.com\n\nIan Goodfellow\nian@openai.com\n\nWojciech Zaremba\nwoj@openai.com\n\nVicki Cheung\n\nvicki@openai.com\n\nAlec Radford\n\nalec@openai.com\n\nXi Chen\n\npeter@openai.com\n\nAbstract\n\nWe present a variety of new architectural features and training procedures that we\napply to the generative adversarial networks (GANs) framework. Using our new\ntechniques, we achieve state-of-the-art results in semi-supervised classi\ufb01cation on\nMNIST, CIFAR-10 and SVHN. The generated images are of high quality as con-\n\ufb01rmed by a visual Turing test: our model generates MNIST samples that humans\ncannot distinguish from real data, and CIFAR-10 samples that yield a human error\nrate of 21.3%. We also present ImageNet samples with unprecedented resolution\nand show that our methods enable the model to learn recognizable features of\nImageNet classes.\n\n1\n\nIntroduction\n\nGenerative adversarial networks [1] (GANs) are a class of methods for learning generative models\nbased on game theory. 
The goal of GANs is to train a generator network G(z; \u03b8^(G)) that produces\nsamples from the data distribution, pdata(x), by transforming vectors of noise z as x = G(z; \u03b8^(G)).\nThe training signal for G is provided by a discriminator network D(x) that is trained to distinguish\nsamples from the generator distribution pmodel(x) from real data. The generator network G in turn\nis then trained to fool the discriminator into accepting its outputs as being real.\nRecent applications of GANs have shown that they can produce excellent samples [2, 3]. However,\ntraining GANs requires finding a Nash equilibrium of a non-convex game with continuous,\nhigh-dimensional parameters. GANs are typically trained using gradient descent techniques that are\ndesigned to find a low value of a cost function, rather than to find the Nash equilibrium of a game.\nWhen used to seek a Nash equilibrium, these algorithms may fail to converge [4].\nIn this work, we introduce several techniques intended to encourage convergence of the GAN game.\nThese techniques are motivated by a heuristic understanding of the non-convergence problem. They\nlead to improved semi-supervised learning performance and improved sample generation. We hope\nthat some of them may form the basis for future work, providing formal guarantees of convergence.\nAll code and hyperparameters may be found at https://github.com/openai/improved-gan.\n\n2 Related work\n\nSeveral recent papers focus on improving the stability of training and the resulting perceptual quality\nof GAN samples [2, 3, 5, 6]. We build on some of these techniques in this work. For instance, we\nuse some of the \u201cDCGAN\u201d architectural innovations proposed in Radford et al. [3], as discussed\nbelow.\nOne of our proposed techniques, feature matching, discussed in Sec. 
3.1, is similar in spirit to\napproaches that use maximum mean discrepancy [7, 8, 9] to train generator networks [10, 11].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAnother of our proposed techniques, minibatch features, is based in part on ideas used for batch\nnormalization [12], while our proposed virtual batch normalization is a direct extension of batch\nnormalization.\nOne of the primary goals of this work is to improve the effectiveness of generative adversarial\nnetworks for semi-supervised learning (improving the performance of a supervised task, in this case,\nclassification, by learning on additional unlabeled examples). Like many deep generative models,\nGANs have previously been applied to semi-supervised learning [13, 14], and our work can be seen\nas a continuation and refinement of this effort. In concurrent work, Odena [15] proposes to extend\nGANs to predict image labels like we do in Section 5, but without our feature matching extension\n(Section 3.1) which we found to be critical for obtaining state-of-the-art performance.\n\n3 Toward Convergent GAN Training\n\nTraining GANs consists in finding a Nash equilibrium to a two-player non-cooperative game.\nEach player wishes to minimize its own cost function, J^(D)(\u03b8^(D), \u03b8^(G)) for the discriminator and\nJ^(G)(\u03b8^(D), \u03b8^(G)) for the generator. A Nash equilibrium is a point (\u03b8^(D), \u03b8^(G)) such that J^(D) is at a\nminimum with respect to \u03b8^(D) and J^(G) is at a minimum with respect to \u03b8^(G). Unfortunately,\nfinding Nash equilibria is a very difficult problem. 
Algorithms exist for specialized cases, but we are not\naware of any that are feasible to apply to the GAN game, where the cost functions are non-convex,\nthe parameters are continuous, and the parameter space is extremely high-dimensional.\nThe idea that a Nash equilibrium occurs when each player has minimal cost seems to intuitively\nmotivate the idea of using traditional gradient-based minimization techniques to minimize each player\u2019s\ncost simultaneously. Unfortunately, a modification to \u03b8^(D) that reduces J^(D) can increase J^(G), and\na modification to \u03b8^(G) that reduces J^(G) can increase J^(D). Gradient descent thus fails to converge\nfor many games. For example, when one player minimizes xy with respect to x and another player\nminimizes \u2212xy with respect to y, gradient descent enters a stable orbit, rather than converging to\nx = y = 0, the desired equilibrium point [16]. Previous approaches to GAN training have thus\napplied gradient descent on each player\u2019s cost simultaneously, despite the lack of guarantee that this\nprocedure will converge. We introduce the following techniques that are heuristically motivated to\nencourage convergence:\n\n3.1 Feature matching\nFeature matching addresses the instability of GANs by specifying a new objective for the generator\nthat prevents it from overtraining on the current discriminator. Instead of directly maximizing the\noutput of the discriminator, the new objective requires the generator to generate data that matches\nthe statistics of the real data, where we use the discriminator only to specify the statistics that we\nthink are worth matching. Specifically, we train the generator to match the expected value of the\nfeatures on an intermediate layer of the discriminator. 
This is a natural choice of statistics for the\ngenerator to match, since by training the discriminator we ask it to find those features that are most\ndiscriminative of real data versus data generated by the current model.\nLetting f(x) denote activations on an intermediate layer of the discriminator, our new objective for\nthe generator is defined as: ||E_{x\u223cpdata} f(x) \u2212 E_{z\u223cpz(z)} f(G(z))||_2^2. The discriminator, and hence\nf(x), are trained in the usual way. As with regular GAN training, the objective has a fixed point\nwhere G exactly matches the distribution of training data. We have no guarantee of reaching this\nfixed point in practice, but our empirical results indicate that feature matching is indeed effective in\nsituations where regular GAN becomes unstable.\n\n3.2 Minibatch discrimination\nOne of the main failure modes for GAN is for the generator to collapse to a parameter setting where\nit always emits the same point. When collapse to a single mode is imminent, the gradient of the\ndiscriminator may point in similar directions for many similar points. Because the discriminator\nprocesses each example independently, there is no coordination between its gradients, and thus no\nmechanism to tell the outputs of the generator to become more dissimilar to each other. Instead,\nall outputs race toward a single point that the discriminator currently believes is highly realistic.\nAfter collapse has occurred, the discriminator learns that this single point comes from the generator,\nbut gradient descent is unable to separate the identical outputs. The gradients of the discriminator\nthen push the single point produced by the generator around space forever, and the algorithm cannot\nconverge to a distribution with the correct amount of entropy. 
An obvious strategy to avoid this type\nof failure is to allow the discriminator to look at multiple data examples in combination, and perform\nwhat we call minibatch discrimination.\nThe concept of minibatch discrimination is quite general: any discriminator model that looks\nat multiple examples in combination, rather than in isolation, could potentially help avoid\ncollapse of the generator. In fact, the successful application of batch normalization in the\ndiscriminator by Radford et al. [3] is well explained from this perspective. So far, however, we\nhave restricted our experiments to models that explicitly aim to identify generator samples that\nare particularly close together. One successful specification for modelling the closeness between\nexamples in a minibatch is as follows: Let f(x_i) \u2208 R^A denote a vector of features for\ninput x_i, produced by some intermediate layer in the discriminator. We then multiply the vector\nf(x_i) by a tensor T \u2208 R^{A\u00d7B\u00d7C}, which results in a matrix M_i \u2208 R^{B\u00d7C}. We then compute\nthe L1-distance between the rows of the resulting matrix M_i across samples i \u2208 {1, 2, . . . , n}\nand apply a negative exponential (Fig. 1):\n\nc_b(x_i, x_j) = exp(\u2212||M_{i,b} \u2212 M_{j,b}||_{L1}) \u2208 R.\n\nThe output o(x_i) for this minibatch layer for a sample x_i\nis then defined as the sum of the c_b(x_i, x_j)\u2019s to all other\nsamples:\n\no(x_i)_b = \u2211_{j=1}^{n} c_b(x_i, x_j) \u2208 R\n\no(x_i) = [o(x_i)_1, o(x_i)_2, . . . , o(x_i)_B] \u2208 R^B\n\no(X) \u2208 R^{n\u00d7B}\n\nNext, we concatenate the output o(x_i) of the minibatch\nlayer with the intermediate features f(x_i) that were its\ninput, and we feed the result into the next layer of the\ndiscriminator. We compute these minibatch features\nseparately for samples from the generator and from the\ntraining data. 
As before, the discriminator is still required to\noutput a single number for each example indicating how\nlikely it is to come from the training data: The task of the discriminator is thus effectively still to\nclassify single examples as real data or generated data, but it is now able to use the other examples in\nthe minibatch as side information. Minibatch discrimination allows us to generate visually appealing\nsamples very quickly, and in this regard it is superior to feature matching (Section 6). Interestingly,\nhowever, feature matching was found to work much better if the goal is to obtain a strong classifier\nusing the approach to semi-supervised learning described in Section 5.\n\nFigure 1: Figure sketches how minibatch discrimination works. Features\nf(x_i) from sample x_i are multiplied through a tensor T, and cross-sample\ndistance is computed.\n\n3.3 Historical averaging\nWhen applying this technique, we modify each player\u2019s cost to include a term ||\u03b8 \u2212 (1/t) \u2211_{i=1}^{t} \u03b8[i]||^2,\nwhere \u03b8[i] is the value of the parameters at past time i. The historical average of the parameters can\nbe updated in an online fashion so this learning rule scales well to long time series. This approach is\nloosely inspired by the fictitious play [17] algorithm that can find equilibria in other kinds of games.\nWe found that our approach was able to find equilibria of low-dimensional, continuous non-convex\ngames, such as the minimax game with one player controlling x, the other player controlling y, and\nvalue function (f(x) \u2212 1)(y \u2212 1), where f(x) = x for x < 0 and f(x) = x^2 otherwise. 
For\nthese same toy games, gradient descent fails by going into extended orbits that do not approach the\nequilibrium point.\n\n3.4 One-sided label smoothing\nLabel smoothing, a technique from the 1980s recently independently re-discovered by Szegedy et al.\n[18], replaces the 0 and 1 targets for a classifier with smoothed values, like .9 or .1, and was\nrecently shown to reduce the vulnerability of neural networks to adversarial examples [19].\nReplacing positive classification targets with \u03b1 and negative targets with \u03b2, the optimal discriminator\nbecomes D(x) = (\u03b1 pdata(x) + \u03b2 pmodel(x)) / (pdata(x) + pmodel(x)). The presence of pmodel in the numerator is problematic\nbecause, in areas where pdata is approximately zero and pmodel is large, erroneous samples from\npmodel have no incentive to move nearer to the data. We therefore smooth only the positive labels to\n\u03b1, leaving negative labels set to 0.\n\n3.5 Virtual batch normalization\nBatch normalization greatly improves optimization of neural networks, and was shown to be highly\neffective for DCGANs [3]. However, it causes the output of a neural network for an input example\nx to be highly dependent on several other inputs x\u2032 in the same minibatch. To avoid this problem\nwe introduce virtual batch normalization (VBN), in which each example x is normalized based on\nthe statistics collected on a reference batch of examples that are chosen once and fixed at the start\nof training, and on x itself. The reference batch is normalized using only its own statistics. 
VBN is\ncomputationally expensive because it requires running forward propagation on two minibatches of\ndata, so we use it only in the generator network.\n\n4 Assessment of image quality\n\nGenerative adversarial networks lack an objective function, which makes it difficult to\ncompare performance of different models. One intuitive metric of performance can be\nobtained by having human annotators judge the visual quality of samples [2]. We\nautomate this process using Amazon Mechanical Turk (MTurk), using the web\ninterface in Fig. 2 (live at http://infinite-chamber-35121.herokuapp.com/cifar-minibatch/),\nwhich we use to ask annotators to distinguish between generated data\nand real data. The resulting quality assessments of our models are described in Section 6.\n\nFigure 2: Web interface given to annotators. Annotators are asked to distinguish\ncomputer generated images from real ones.\n\nA downside of using human annotators is that the metric\nvaries depending on the setup of the task and the motivation\nof the annotators. We also find that results change\ndrastically when we give annotators feedback about their\nmistakes: By learning from such feedback, annotators are\nbetter able to point out the flaws in generated images, giving\na more pessimistic quality assessment. The left column\nof Fig. 2 presents a screen from the annotation process,\nwhile the right column shows how we inform annotators\nabout their mistakes.\n\nAs an alternative to human annotators, we propose an automatic method to evaluate samples, which\nwe find to correlate well with human evaluation: We apply the Inception model^1 [20] to every\ngenerated image to get the conditional label distribution p(y|x). Images that contain meaningful\nobjects should have a conditional label distribution p(y|x) with low entropy. Moreover, we expect\nthe model to generate varied images, so the marginal \u222b p(y|x = G(z))dz should have high entropy.\nCombining these two requirements, the metric that we propose is: exp(E_x KL(p(y|x)||p(y))), where\nwe exponentiate results so the values are easier to compare. Our Inception score is closely related\nto the objective used for training generative models in CatGAN [14]: Although we had less success\nusing such an objective for training, we find it is a good metric for evaluation that correlates very\nwell with human judgment. We find that it\u2019s important to evaluate the metric on a large enough\nnumber of samples (i.e. 50k) as part of this metric measures diversity.\n\n^1 We use the pretrained Inception model from http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz.\nCode to compute the Inception score with this model will be made\navailable by the time of publication.\n\n5 Semi-supervised learning\n\nConsider a standard classifier for classifying a data point x into one of K possible classes. Such\na model takes in x as input and outputs a K-dimensional vector of logits {l_1, . . . , l_K}, that can\nbe turned into class probabilities by applying the softmax: pmodel(y = j|x) = exp(l_j) / \u2211_{k=1}^{K} exp(l_k). In\nsupervised learning, such a model is then trained by minimizing the cross-entropy between the\nobserved labels and the model predictive distribution pmodel(y|x).\n\nWe can do semi-supervised learning with any standard classifier by simply adding samples from\nthe GAN generator G to our data set, labeling them with a new \u201cgenerated\u201d class y = K + 1, and\ncorrespondingly increasing the dimension of our classifier output from K to K + 1. We may then\nuse pmodel(y = K + 1 | x) to supply the probability that x is fake, corresponding to 1 \u2212 D(x) in\nthe original GAN framework. We can now also learn from unlabeled data, as long as we know that\nit corresponds to one of the K classes of real data by maximizing log pmodel(y \u2208 {1, . . . 
, K}|x).\nAssuming half of our data set consists of real data and half of it is generated (this is arbitrary), our\nloss function for training the classifier then becomes\n\nL = \u2212E_{x,y\u223cpdata(x,y)}[log pmodel(y|x)] \u2212 E_{x\u223cG}[log pmodel(y = K + 1|x)]\n  = L_supervised + L_unsupervised, where\n\nL_supervised = \u2212E_{x,y\u223cpdata(x,y)} log pmodel(y|x, y < K + 1)\nL_unsupervised = \u2212{E_{x\u223cpdata(x)} log[1 \u2212 pmodel(y = K + 1|x)] + E_{x\u223cG} log[pmodel(y = K + 1|x)]},\n\nwhere we have decomposed the total cross-entropy loss into our standard supervised loss function\nL_supervised (the negative log probability of the label, given that the data is real) and an unsupervised\nloss L_unsupervised which is in fact the standard GAN game-value as becomes evident when we substitute\nD(x) = 1 \u2212 pmodel(y = K + 1|x) into the expression:\n\nL_unsupervised = \u2212{E_{x\u223cpdata(x)} log D(x) + E_{z\u223cnoise} log(1 \u2212 D(G(z)))}.\n\nThe optimal solution for minimizing both L_supervised and L_unsupervised is to have\nexp[l_j(x)] = c(x)p(y=j, x) \u2200j < K+1 and exp[l_{K+1}(x)] = c(x)p_G(x) for some undetermined\nscaling function c(x). The unsupervised loss is thus consistent with the supervised loss in\nthe sense of Sutskever et al. [13], and we can hope to better estimate this optimal solution from\nthe data by minimizing these two loss functions jointly. In practice, L_unsupervised will only help if\nit is not trivial to minimize for our classifier and we thus need to train G to approximate the data\ndistribution. One way to do this is by training G to minimize the GAN game-value, using the\ndiscriminator D defined by our classifier. 
This approach introduces an interaction between G and\nour classifier that we do not fully understand yet, but empirically we find that optimizing G using\nfeature matching GAN works very well for semi-supervised learning, while training G using GAN\nwith minibatch discrimination does not work at all. Here we present our empirical results using this\napproach; developing a full theoretical understanding of the interaction between D and G using this\napproach is left for future work.\nFinally, note that our classifier with K + 1 outputs is over-parameterized: subtracting a general\nfunction f(x) from each output logit, i.e. setting l_j(x) \u2190 l_j(x) \u2212 f(x) \u2200j, does not change the\noutput of the softmax. This means we may equivalently fix l_{K+1}(x) = 0 \u2200x, in which case L_supervised\nbecomes the standard supervised loss function of our original classifier with K classes, and our\ndiscriminator D is given by D(x) = Z(x) / (Z(x) + 1), where Z(x) = \u2211_{k=1}^{K} exp[l_k(x)].\n\n5.1 Importance of labels for image quality\nBesides achieving state-of-the-art results in semi-supervised learning, the approach described above\nalso has the surprising effect of improving the quality of generated images as judged by human\nannotators. The reason appears to be that the human visual system is strongly attuned to image\nstatistics that can help infer what class of object an image represents, while it is presumably less\nsensitive to local statistics that are less important for interpretation of the image. This is supported\nby the high correlation we find between the quality reported by human annotators and the Inception\nscore we developed in Section 4, which is explicitly constructed to measure the \u201cobjectness\u201d of a\ngenerated image. By having the discriminator D classify the object shown in the image, we bias it to\ndevelop an internal representation that puts emphasis on the same features humans emphasize. 
This\neffect can be understood as a method for transfer learning, and could potentially be applied much\nmore broadly. We leave further exploration of this possibility for future work.\n\n6 Experiments\n\nWe performed semi-supervised experiments on MNIST, CIFAR-10 and SVHN, and sample generation\nexperiments on MNIST, CIFAR-10, SVHN and ImageNet. We provide code to reproduce the\nmajority of our experiments.\n\n6.1 MNIST\nThe MNIST dataset contains 60,000 labeled\nimages of digits. We perform semi-supervised\ntraining with a small randomly picked fraction\nof these, considering setups with 20, 50, 100,\nand 200 labeled examples. Results are averaged\nover 10 random subsets of labeled data, each\nchosen to have a balanced number of examples\nfrom each class. The remaining training images\nare provided without labels. Our networks have\n5 hidden layers each. We use weight normalization\n[21] and add Gaussian noise to the output\nof each layer of the discriminator. Table 1 summarizes\nour results.\nSamples generated by the generator during\nsemi-supervised learning using feature matching\n(Section 3.1) do not look visually appealing\n(left Fig. 3). By using minibatch discrimination\ninstead (Section 3.2) we can improve their visual quality. On MTurk, annotators were able to\ndistinguish samples in 52.4% of cases (2000 votes total), where 50% would be obtained by random\nguessing. Similarly, researchers in our institution were not able to find any artifacts that would\nallow them to distinguish samples. However, semi-supervised learning with minibatch discrimination\ndoes not produce as good a classifier as does feature matching.\n\nFigure 3: (Left) samples generated by model during\nsemi-supervised training. Samples can be\nclearly distinguished from images coming from\nMNIST dataset. 
(Right) Samples generated with\nminibatch discrimination. Samples are completely\nindistinguishable from dataset images.\n\nNumber of incorrectly predicted test examples for a given number of labeled samples\nModel | 20 | 50 | 100 | 200\nDGN [22] | | | 333 \u00b1 14 |\nVirtual Adversarial [23] | | | 212 |\nCatGAN [14] | | | 191 \u00b1 10 |\nSkip Deep Generative Model [24] | | | 132 \u00b1 7 |\nLadder network [25] | | | 106 \u00b1 37 |\nAuxiliary Deep Generative Model [24] | | | 96 \u00b1 2 |\nOur model | 1677 \u00b1 452 | 221 \u00b1 136 | 93 \u00b1 6.5 | 90 \u00b1 4.2\nEnsemble of 10 of our models | 1134 \u00b1 445 | 142 \u00b1 96 | 86 \u00b1 5.6 | 81 \u00b1 4.3\n\nTable 1: Number of incorrectly classified test examples for the semi-supervised setting on\npermutation invariant MNIST. Results are averaged over 10 seeds.\n\n6.2 CIFAR-10\n\nTest error rate for a given number of labeled samples\nModel | 1000 | 2000 | 4000 | 8000\nLadder network [25] | | | 20.40 \u00b1 0.47 |\nCatGAN [14] | | | 19.58 \u00b1 0.46 |\nOur model | 21.83 \u00b1 2.01 | 19.61 \u00b1 2.09 | 18.63 \u00b1 2.32 | 17.72 \u00b1 1.82\nEnsemble of 10 of our models | 19.22 \u00b1 0.54 | 17.25 \u00b1 0.66 | 15.59 \u00b1 0.47 | 14.87 \u00b1 0.89\n\nTable 2: Test error on semi-supervised CIFAR-10. Results are averaged over 10 splits of data.\n\nCIFAR-10 is a small, well studied dataset of 32 \u00d7 32 natural images. We use this data set to study\nsemi-supervised learning, as well as to examine the visual quality of samples that can be achieved.\nFor the discriminator in our GAN we use a 9 layer deep convolutional network with dropout and\nweight normalization. The generator is a 4 layer deep CNN with batch normalization. 
Table 2\nsummarizes our results on the semi-supervised learning task.\n\nFigure 4: Samples generated during semi-supervised training on CIFAR-10 with feature matching\n(Section 3.1, left) and minibatch discrimination (Section 3.2, right).\n\nWhen presented with 50% real and 50% fake data generated by our best CIFAR-10 model, MTurk\nusers correctly categorized 78.7% of images. However, MTurk users may not be sufficiently\nfamiliar with CIFAR-10 images or sufficiently motivated; we ourselves were able to categorize\nimages with > 95% accuracy. We validated the Inception score described above by observing\nthat MTurk accuracy drops to 71.4% when the data is filtered by using only the top 1% of samples\naccording to the Inception score. We performed a series of ablation experiments to demonstrate that\nour proposed techniques improve the Inception score, presented in Table 3. We also present images\nfor these ablation experiments; in our opinion, the Inception score correlates well with our subjective\njudgment of image quality. Samples from the dataset achieve the highest value. All the models\nthat even partially collapse have relatively low scores. We caution that the Inception score should be\nused as a rough guide to evaluate models that were trained via some independent criterion; directly\noptimizing Inception score will lead to the generation of adversarial examples [26].\n\nModel | Real data | Our methods | -VBN+BN | -L+HA | -LS | -L | -MBF\nScore \u00b1 std. | 11.24 \u00b1 .12 | 8.09 \u00b1 .07 | 7.54 \u00b1 .07 | 6.86 \u00b1 .06 | 6.83 \u00b1 .06 | 4.36 \u00b1 .04 | 3.87 \u00b1 .03\n\nTable 3: Table of Inception scores for samples generated by various models for 50,000 images.\nScore highly correlates with human judgment, and the best score is achieved for natural images.\nModels that generate collapsed samples have relatively low score. This metric allows us to avoid\nrelying on human evaluations. 
\u201cOur methods\u201d includes all the techniques described in this work,\nexcept for feature matching and historical averaging. The remaining experiments are ablation\nexperiments showing that our techniques are effective. \u201c-VBN+BN\u201d replaces the VBN in the generator\nwith BN, as in DCGANs. This causes a small decrease in sample quality on CIFAR. VBN is more\nimportant for ImageNet. \u201c-L+HA\u201d removes the labels from the training process, and adds historical\naveraging to compensate. HA makes it possible to still generate some recognizable objects. Without\nHA, sample quality is considerably reduced (see \u201c-L\u201d). \u201c-LS\u201d removes label smoothing and incurs a\nnoticeable drop in performance relative to \u201cour methods.\u201d \u201c-MBF\u201d removes the minibatch features\nand incurs a very large drop in performance, greater even than the drop resulting from removing the\nlabels. Adding HA cannot prevent this problem.\n\n6.3 SVHN\nFor the SVHN data set, we used the same architecture and experimental setup as for CIFAR-10.\nFigure 5 compares against the previous state-of-the-art, where it should be noted that the model\nof [24] is not convolutional, but does use an additional data set of 531131 unlabeled examples. The\nother methods, including ours, are convolutional and do not use this data.\n\nPercentage of incorrectly predicted test examples for a given number of labeled samples\nModel | 500 | 1000 | 2000\nVirtual Adversarial [23] | | 24.63 |\nStacked What-Where Auto-Encoder [27] | | 23.56 |\nDCGAN [3] | | 22.48 |\nSkip Deep Generative Model [24] | | 16.61 \u00b1 0.24 |\nOur model | 18.44 \u00b1 4.8 | 8.11 \u00b1 1.3 | 6.16 \u00b1 0.58\nEnsemble of 10 of our models | | 5.88 \u00b1 1.0 |\n\nFigure 5: (Left) Error rate on SVHN. 
(Right) Samples from the generator for SVHN.\n\nImageNet\n\n6.4\nWe tested our techniques on a dataset of unprecedented scale: 128 \u00d7 128 images from the\nILSVRC2012 dataset with 1,000 categories. To our knowledge, no previous publication has ap-\nplied a generative model to a dataset with both this large of a resolution and this large a number\nof object classes. The large number of object classes is particularly challenging for GANs due to\ntheir tendency to underestimate the entropy in the distribution. We extensively modi\ufb01ed a publicly\navailable implementation of DCGANs2 using TensorFlow [28] to achieve high performance, using\na multi-GPU implementation. DCGANs without modi\ufb01cation learn some basic image statistics and\ngenerate contiguous shapes with somewhat natural color and texture but do not learn any objects.\nUsing the techniques described in this paper, GANs learn to generate objects that resemble animals,\nbut with incorrect anatomy. Results are shown in Fig. 6.\n\nFigure 6: Samples generated from the ImageNet dataset. (Left) Samples generated by a DCGAN.\n(Right) Samples generated using the techniques proposed in this work. The new techniques enable\nGANs to learn recognizable features of animals, such as fur, eyes, and noses, but these features are\nnot correctly combined to form an animal with realistic anatomical structure.\n\n7 Conclusion\n\nGenerative adversarial networks are a promising class of generative models that has so far been\nheld back by unstable training and by the lack of a proper evaluation metric. This work presents\npartial solutions to both of these problems. We propose several techniques to stabilize training\nthat allow us to train models that were previously untrainable. Moreover, our proposed evaluation\nmetric (the Inception score) gives us a basis for comparing the quality of these models. 
We apply\nour techniques to the problem of semi-supervised learning, achieving state-of-the-art results on a\nnumber of different data sets in computer vision. The contributions made in this work are of a\npractical nature; we hope to develop a more rigorous theoretical understanding in future work.\n\n2 https://github.com/carpedm20/DCGAN-tensorflow\n\nReferences\n\n[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In NIPS, 2014.\n[2] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. arXiv preprint arXiv:1506.05751, 2015.\n[3] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n[4] Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.\n[5] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.\n[6] Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. arXiv preprint arXiv:1603.07442, 2016.\n[7] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch\u00f6lkopf. Measuring statistical dependence with hilbert-schmidt norms. In Algorithmic learning theory, pages 63\u201377. Springer, 2005.\n[8] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Sch\u00f6lkopf. Kernel measures of conditional dependence. In NIPS, volume 20, pages 489\u2013496, 2007.\n[9] Alex Smola, Arthur Gretton, Le Song, and Bernhard Sch\u00f6lkopf. A hilbert space embedding for distributions. In Algorithmic learning theory, pages 13\u201331. Springer, 2007.\n[10] Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. 
CoRR, abs/1502.02761, 2015.\n[11] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.\n[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n[13] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, et al. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.\n[14] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.\n[15] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.\n[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. 2016. MIT Press.\n[17] George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374\u2013376, 1951.\n[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints, December 2015.\n[19] David Warde-Farley and Ian Goodfellow. Adversarial perturbations of deep neural networks. In Tamir Hazan, George Papandreou, and Daniel Tarlow, editors, Perturbations, Optimization, and Statistics, chapter 11. 2016. Book in preparation for MIT Press.\n[20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.\n[21] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.\n[22] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 
Semi-supervised learning with deep generative models. In Neural Information Processing Systems, 2014.\n[23] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677, 2015.\n[24] Lars Maal\u00f8e, Casper Kaae S\u00f8nderby, S\u00f8ren Kaae S\u00f8nderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.\n[25] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.\n[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.\n[27] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.\n[28] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.", "award": [], "sourceid": 1146, "authors": [{"given_name": "Tim", "family_name": "Salimans", "institution": "Algoritmica"}, {"given_name": "Ian", "family_name": "Goodfellow", "institution": "OpenAI"}, {"given_name": "Wojciech", "family_name": "Zaremba", "institution": "OpenAI"}, {"given_name": "Vicki", "family_name": "Cheung", "institution": "OpenAI"}, {"given_name": "Alec", "family_name": "Radford", "institution": "OpenAI"}, {"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}, {"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}]}