{"title": "Mining GOLD Samples for Conditional GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 6170, "page_last": 6181, "abstract": "Conditional generative adversarial networks (cGANs) have gained a considerable attention in recent years due to its class-wise controllability and superior quality for complex generation tasks. We introduce a simple yet effective approach to improving cGANs by measuring the discrepancy between the data distribution and the model distribution on given samples. The proposed measure, coined the gap of log-densities (GOLD), provides an effective self-diagnosis for cGANs while being efficiently, computed from the discriminator. We propose three applications of the GOLD: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. Our experimental results demonstrate that the proposed methods outperform corresponding baselines for all three applications on different image datasets.", "full_text": "Mining GOLD Samples for Conditional GANs\n\nSangwoo Mo\u2217\n\nKAIST\n\nChiheon Kim\nKakao Brain\n\nSungwoong Kim\n\nKakao Brain\n\nswmo@kaist.ac.kr\n\nchiheon.kim@kakaobrain.com\n\nswkim@kakaobrain.com\n\nMinsu Cho\nPOSTECH\n\nJinwoo Shin\nKAIST, AItrics\n\nmscho@postech.ac.kr\n\njinwoos@kaist.ac.kr\n\nAbstract\n\nConditional generative adversarial networks (cGANs) have gained a considerable\nattention in recent years due to its class-wise controllability and superior quality\nfor complex generation tasks. We introduce a simple yet effective approach to\nimproving cGANs by measuring the discrepancy between the data distribution and\nthe model distribution on given samples. The proposed measure, coined the gap\nof log-densities (GOLD), provides an effective self-diagnosis for cGANs while\nbeing ef\ufb01ciently computed from the discriminator. 
We propose three applications of the GOLD: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. Our experimental results demonstrate that the proposed methods outperform corresponding baselines for all three applications on different image datasets.\n\n1 Introduction\n\nThe generative adversarial network (GAN) [15] is arguably the most successful generative model in recent years, having shown remarkable progress across a broad range of applications, e.g., image synthesis [5, 21, 40], data augmentation [49, 18] and style transfer [58, 10, 34]. In particular, as an advanced variant, conditional GANs (cGANs) [31] have gained considerable attention due to their class-wise controllability [9, 42, 10] and superior quality for complex generation tasks [39, 33, 5]. Training GANs (including cGANs), however, is known to be hard and highly unstable [46]. Numerous techniques have thus been proposed to tackle the issue from different angles, e.g., improving architectures [32, 56, 7], losses and regularizers [16, 38, 20] and other training heuristics [46, 51, 8]. One promising direction for improving GANs would be to make GANs diagnose their own training and prescribe proper remedies. This is related to another branch of research on evaluating the performance of GANs, i.e., measuring the discrepancy between the data distribution and the model distribution. One may utilize such a measure to identify better models [29] or directly use it as an objective function to optimize [37, 1]. However, measuring the discrepancy for GANs (and cGANs) is itself a challenging problem, since the data distribution remains unknown and the distribution GANs learn is implicit [35]. 
Common approaches to the discrepancy measurement of GANs include estimating variational bounds of statistical distances [37, 1] and using an external pre-trained network as a surrogate evaluator [46, 17, 45]. Most previous methods in this line focus on classic unconditional GANs (i.e., data-only densities), whereas discrepancy measures specialized for cGANs (i.e., data-attribute joint densities) have rarely been explored.\n\nContribution. In this paper, we propose a novel discrepancy measure for cGANs that estimates the gap of log-densities (GOLD) of data and model distributions on given samples, and is thus called the GOLD estimator. We show that it decomposes into two terms, a marginal and a conditional one, that can be efficiently computed by the two branches of the discriminator of a cGAN. The two terms represent the generation quality and the class accuracy of generated samples, respectively, and the overall estimator measures the quality of conditional generation. We also propose a simple heuristic to balance the two terms, considering the suboptimality levels of the two branches.\n\n\u2217This work was done as an intern at Kakao Brain.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe present three applications of the GOLD estimator: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. All proposed methods require only a few lines of modification of the original code. We conduct our experiments on various datasets including MNIST [25], SVHN [36] and CIFAR-10 [23], and show that the GOLD-based schemes improve over the corresponding baselines for all three applications. For example, the GOLD-based re-weighting and rejection sampling schemes improve the fitting capacity [41] of cGANs trained on SVHN from 74.43 to 76.71 (+3.06%) and from 73.58 to 75.06 (+2.01%), respectively. 
The GOLD-based active learning strategy improves the fitting capacity of cGANs trained on MNIST from 92.65 to 94.60 (+2.10%).\n\nOrganization. In Section 2, we briefly revisit cGAN models. In Section 3, we propose our main method, the gap of log-densities (GOLD), and its applications. In Section 4, we present the experimental results. Finally, in Section 5, we discuss more related work and conclude this paper.\n\n2 Preliminary: Conditional GANs\n\nThe goal of cGANs is to learn a model distribution $p_g(x, c)$ that matches the attribute-augmented data distribution $p_{\text{data}}(x, c)$. To this end, a variety of architectures have been proposed to incorporate additional attributes [31, 46, 39, 57, 33]. The generator $G : (z, c) \mapsto x$ maps a pair of a latent $z$ and an attribute $c$ to a generated sample $x$, whereas the discriminator $D$ guides the generator to learn the joint distribution $p(x, c)$. Typically, there are two ways to use the attribute information: (a) providing it as an additional input to the discriminator (i.e., $D : (x, c) \mapsto \{\text{real/generated}\}$) [31, 33], or (b) using it to train an auxiliary classifier for the attribute (i.e., $D : x \mapsto (\{\text{real/generated}\}, c)$) [46, 39, 57]. The main difference between the two approaches is whether to directly learn the joint distribution $p(x, c)$ or to separately learn the marginal $p(x)$ and the conditional $p(c|x)$.2\n\nIn this paper, we address training cGANs in a semi-supervised setting where a large amount of unlabeled data is available with only a small amount of labeled data. It is more attractive and practical than a fully-supervised setting in the sense that labeling attributes of all samples is often expensive while unlabeled data can be easily obtained. It is thus natural to utilize unlabeled data for improving the model, e.g., via semi-supervised learning and active learning (see Section 3.2). 
While both of the two approaches above, (a) and (b), can be used in a semi-supervised setting, cGANs of type (b) provide a more natural framework for using both labeled and unlabeled data;3 one can use the unlabeled data to learn $p(x)$, and the labeled data to learn both $p(x)$ and $p(c|x)$. Therefore, we focus on evaluating the second type of architectures, e.g., the auxiliary classifier GAN (ACGAN) [39]. We remark that our main idea in this paper is applicable to both types of cGANs in general.\n\nThe ACGAN model consists of the generator $G : (z, c) \mapsto x$ and the discriminator $D : x \mapsto (\{\text{real/generated}\}, c)$, comprising the real/generated branch $D_G : x \mapsto \{\text{real/generated}\}$ and the auxiliary classifier branch $D_C : x \mapsto c$. ACGAN is then trained by optimizing both the GAN loss $\mathcal{L}_{\text{GAN}}$ and the auxiliary classifier loss $\mathcal{L}_{\text{AC}}$:\n\n$$\mathcal{L}_{\text{GAN}} = \mathbb{E}_{(x,c) \sim p_{\text{data}}(x,c)}[-\log D_G(x)] + \mathbb{E}_{(z,c) \sim p_g(z,c)}[\log D_G(G(z, c))],$$\n$$\mathcal{L}_{\text{AC}} = \mathbb{E}_{(x,c) \sim p_{\text{data}}(x,c)}[-\log D_C(c|x)] + \lambda_c \mathbb{E}_{(z,c) \sim p_g(z,c)}[-\log D_C(c|G(z, c))], \quad (1)$$\n\nwhere $\lambda_c \ge 0$ is a hyper-parameter. Here, the generator and the discriminator minimize $-\mathcal{L}_{\text{GAN}} + \mathcal{L}_{\text{AC}}$ and $\mathcal{L}_{\text{GAN}} + \mathcal{L}_{\text{AC}}$, respectively.4 The original work [39] simply sets $\lambda_c = 1$, but we empirically observe that using a smaller value often improves the performance: a large value strengthens the wrong signal to the generator when it produces bad samples with incorrect attributes. Such an issue has also been reported in the related work on AMGAN [57], where the authors thus use $\lambda_c = 0$. On the other hand,\n\n2 The projection discriminator [33] is of type (a), but it decomposes the marginal and conditional terms in its architecture. 
This results in another estimator form of the gap of log-densities.\n3 (a) requires some modifications in the architecture and/or the loss function [52, 30].\n4 In experiments, we use the non-saturating GAN loss [15] to improve the stability of training.\n\nunder a small amount of labeled data, a strictly positive value $\lambda_c > 0$ can be effective, as it provides a data-augmentation effect for training the classifier $D_C$. In our experiments, we indeed observe that using a proper value (e.g., $\lambda_c = 0.1$) improves the performance of ACGAN depending on the dataset.\n\n3 Gap of Log-Densities (GOLD)\n\nIn this section, we introduce a general formula for the gap of log-densities (GOLD) that measures the discrepancy between the data distribution and the model distribution on given samples. We then propose three applications: example re-weighting, rejection sampling, and active learning.\n\n3.1 GOLD estimator: Measuring the discrepancy of cGANs\n\nWhile cGANs can converge to the true joint distribution in theory [15, 37], they are often far from optimal in practice, particularly when trained with limited labels. The degree of suboptimality can be measured by the discrepancy between the true distribution $p_{\text{data}}(x, c)$ and the model distribution $p_g(x, c)$. Here, we consider the gap of log-densities (GOLD),5 $\log p_{\text{data}}(x, c) - \log p_g(x, c)$, which can be rewritten as the sum of two log-ratio terms, a marginal and a conditional one:\n\n$$\log p_{\text{data}}(x, c) - \log p_g(x, c) = \underbrace{\log \frac{p_{\text{data}}(x)}{p_g(x)}}_{\text{marginal}} + \underbrace{\log \frac{p_{\text{data}}(c|x)}{p_g(c|x)}}_{\text{conditional}}. \quad (2)$$\n\nRecall that cGANs are designed to achieve two goals jointly: generating samples drawn from the marginal $p(x)$ while matching the class distribution $p(c|x)$. 
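The decomposition in (2) is simply the chain rule $p(x, c) = p(x)\,p(c|x)$ applied in log space. As a quick numeric sanity check, with purely illustrative density values (hypothetical numbers, not from the paper's experiments):

```python
import math

# Hypothetical densities for a single labeled sample (x, c); the numbers are
# illustrative only, chosen to make the chain-rule identity easy to verify.
p_data_x, p_data_c_given_x = 0.30, 0.90   # data marginal and conditional
p_g_x, p_g_c_given_x = 0.20, 0.60         # model marginal and conditional

# Joint densities via the chain rule p(x, c) = p(x) * p(c|x).
p_data_joint = p_data_x * p_data_c_given_x
p_g_joint = p_g_x * p_g_c_given_x

# Gap of log-densities on the joint ...
gold = math.log(p_data_joint) - math.log(p_g_joint)
# ... equals the marginal log-ratio plus the conditional log-ratio, as in (2).
marginal = math.log(p_data_x / p_g_x)
conditional = math.log(p_data_c_given_x / p_g_c_given_x)
assert abs(gold - (marginal + conditional)) < 1e-12
```

Here the sample is under-estimated on both the marginal and the conditional, so both terms, and hence the GOLD value, are positive.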
The marginal and conditional terms measure the discrepancy on those two effects, respectively.\n\nThe exact computation of (2) is infeasible because we have no direct access to the true distribution and the implicit model distribution. Hence, we propose the GOLD estimator as follows. First, the marginal term $\log \frac{p_{\text{data}}(x)}{p_g(x)}$ is approximated by $\log \frac{D_G(x)}{1 - D_G(x)}$, since the optimal discriminator $D^*_G$ satisfies $D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$ [15]. Second, we estimate the conditional term $\log \frac{p_{\text{data}}(c|x)}{p_g(c|x)}$ using the classifier $D_C$ as follows. When a generated sample $x$ is given with its ground-truth label $c_x$, $p_g(c_x|x)$ is assumed to be 1 and $p_{\text{data}}(c_x|x)$ is approximated by $D_C(c_x|x)$. When a real sample $x$ is given with the ground-truth label $c_x$, $p_{\text{data}}(c_x|x)$ is assumed to be 1 and $p_g(c_x|x)$ is approximated by $D_C(c_x|x)$. To sum up, the GOLD estimator can be defined as\n\n$$d(x, c_x) := \begin{cases} \log \frac{D_G(x)}{1 - D_G(x)} + \log D_C(c_x|x) & \text{if } x \text{ is a generated sample of class } c_x \\ \log \frac{D_G(x)}{1 - D_G(x)} - \log D_C(c_x|x) & \text{if } x \text{ is a real sample of class } c_x \end{cases}. \quad (3)$$\n\nNote that the conditional terms above for generated and real samples have opposite signs. This matches the signs of the marginal and conditional terms for both generated and real samples, as their marginal terms $\log \frac{D_G(x)}{1 - D_G(x)}$ tend to be negative and positive, respectively.6 Hence, (3) is a reasonable measure of the joint quality of the two effects of conditional generation.\n\nFor the derivation of (3), we assume the ideal (or optimal) discriminator $D^* = (D^*_G, D^*_C)$, which does not hold in practice. We often observe that the scale of the marginal term is significantly larger than that of the conditional term, because the density $p(x)$ is harder to learn than the class-predictive distribution $p(c|x)$ (see Figure 1a). This leads the GOLD estimator to be biased toward the generation part (marginal term), ignoring the class-condition part (conditional term). 
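As a minimal sketch (not the authors' implementation), the plain estimator in (3) can be computed from scalar outputs of the two discriminator branches; `d_g` and `d_c` below are hypothetical placeholder outputs:

```python
import math

def gold_estimate(d_g: float, d_c: float, generated: bool) -> float:
    """Sketch of the GOLD estimator in (3).

    d_g: real/generated branch output DG(x) in (0, 1).
    d_c: classifier branch probability DC(cx | x) for the sample's class cx.
    generated: True if x came from the generator, False for a real sample.
    The discriminator outputs here are hypothetical placeholders.
    """
    marginal = math.log(d_g / (1.0 - d_g))  # approximates log pdata(x)/pg(x)
    conditional = math.log(d_c)             # approximates the conditional log-ratio
    return marginal + conditional if generated else marginal - conditional
```

For example, a generated sample the discriminator finds perfectly ambiguous (`d_g = 0.5`) but class-uncertain (`d_c` small) gets a negative estimate, driven entirely by the conditional term.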
To address the imbalance issue, we develop a balanced variant of the GOLD estimator:\n\n$$d_{\text{bal}}(x, c_x) := \begin{cases} \log \frac{D_G(x)}{1 - D_G(x)} + \frac{\sigma_G}{\sigma_C} \log D_C(c_x|x) & \text{if } x \text{ is a generated sample of class } c_x \\ \log \frac{D_G(x)}{1 - D_G(x)} - \frac{\sigma_G}{\sigma_C} \log D_C(c_x|x) & \text{if } x \text{ is a real sample of class } c_x \end{cases}, \quad (4)$$\n\nwhere $\sigma_G$ and $\sigma_C$ are the standard deviations of the marginal and conditional terms (among samples), respectively.\n\n5 We measure the gap of log-densities, since it leads to a computationally efficient estimator.\n6 The discriminator $D_G$ is trained to predict 0 and 1 for generated and real samples, respectively.\n\n3.2 Applications of the GOLD estimator\n\nExample re-weighting. A high value of the GOLD estimator suggests that the sample $(x, c_x)$ is under-estimated with respect to the joint distribution $p(x, c)$, and vice versa. Motivated by this, we propose an example re-weighting scheme for cGAN training that guides the generator to focus on under-estimated samples during training. Formally, we consider the following re-weighted loss:\n\n$$\mathcal{L}'_{\text{GAN}} = \mathbb{E}_{(x,c) \sim p_{\text{data}}(x,c)}[-\log D_G(x)] + \mathbb{E}_{(z,c) \sim p_g(z,c)}[d(G(z, c), c)^\beta \cdot \log D_G(G(z, c))],$$\n$$\mathcal{L}'_{\text{AC}} = \mathbb{E}_{(x,c) \sim p_{\text{data}}(x,c)}[-\log D_C(c|x)] + \lambda_c \mathbb{E}_{(z,c) \sim p_g(z,c)}[-d(G(z, c), c)^\beta \cdot \log D_C(c|G(z, c))], \quad (5)$$\n\nwhere $\beta \ge 0$ is a hyper-parameter to control the level of re-weighting, and we use $x^\beta = -|x|^\beta$ for $x < 0$. Our intuition is that minimizing $\mathcal{L}'_{\text{GAN}} + \mathcal{L}'_{\text{AC}}$ encourages the discriminator $D$ to learn stronger feedback from the under-estimated (generated) samples, thus indirectly guiding the generator $G$ to emphasize their region. 
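One detail worth making explicit is the signed power $x^\beta = -|x|^\beta$ for $x < 0$ used in the re-weighted loss (5), which preserves the sign of the GOLD estimate under the exponent $\beta$. A minimal sketch (the function name is ours, not from the paper):

```python
def signed_power(d: float, beta: float) -> float:
    """Sketch of the re-weighting factor d(G(z, c), c)^beta in (5), using the
    paper's convention x^beta = -|x|^beta for x < 0, so a negative GOLD
    estimate stays negative after exponentiation."""
    return -((-d) ** beta) if d < 0 else d ** beta
```

With `beta = 1` this is the identity, matching the choice used for the discriminator loss in the experiments.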
When the GOLD estimator $d(x, c_x)$ is negative, $D$ is trained to suppress the over-estimated samples, which indirectly regularizes $G$ to focus less on the corresponding region.\n\nSince the GOLD estimator only becomes meaningful with sufficiently trained discriminators, we apply the re-weighting scheme with the loss of (5) after sufficiently training the model with the original loss of (1). We find that the GOLD estimator of generated samples stably converges to zero with the re-weighting scheme, while it does not converge with the original loss alone (see Figure 1b). Note that one may also use the balanced version of the GOLD estimator $d_{\text{bal}}$ in (5). In our experiments, however, we simply use $d$, because $d_{\text{bal}}$ requires computing the standard deviations $\sigma_G$ and $\sigma_C$ along training, which significantly increases the computational burden. Improving the scheduling and/or re-weighting for training would be an interesting future direction.\n\nRejection sampling. Rejection sampling [44] is a useful technique to improve the inference of generative models, i.e., the quality of generated samples. Instead of directly sampling from $p(x, c)$, we first obtain a sample from a (reasonably good) proposal distribution $q(x, c)$, and then accept it with probability $\frac{p(x, c)}{M q(x, c)}$ for some constant $M > 0$ while rejecting otherwise. Given a proper estimator for the discrepancy, this can improve the quality of generated samples by rejecting unrealistic ones. For a given generated sample $x = G(z, c_x)$ with the corresponding class $c_x$, the GOLD rejection sampling uses the following acceptance rate:\n\n$$r(x) := \frac{1}{M} \exp(d_{\text{bal}}(x, c_x)) = \frac{1}{M} \exp\left( \log \frac{D_G(x)}{1 - D_G(x)} + \frac{\sigma_G}{\sigma_C} \log D_C(c_x|x) \right), \quad (6)$$\n\nwhere $M$ is set to the maximum of $\exp(d_{\text{bal}}(x, c_x))$ among samples. 
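A minimal sketch of this acceptance rule, including the sigmoid pullback, shift by a constant gamma, and pushforward used in [2] to avoid vanishing acceptance rates; the `dbal_scores` are hypothetical estimator values and the code is an illustration, not the authors' implementation:

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def acceptance_probs(dbal_scores, gamma: float = 0.0):
    """Sketch of the GOLD rejection rule (6): accept with r(x) = exp(dbal)/M,
    where M is the batch maximum of exp(dbal). The ratio is then pulled back
    through the inverse sigmoid, shifted by gamma, and pushed forward again,
    following the strategy of [2]."""
    m = max(math.exp(s) for s in dbal_scores)
    raw = [math.exp(s) / m for s in dbal_scores]
    # Clip away exact 1.0 so the logit stays finite for the batch maximum.
    raw = [min(r, 1.0 - 1e-6) for r in raw]
    return [sigmoid(logit(r) - gamma) for r in raw]
```

A larger `gamma` uniformly lowers the acceptance probabilities, which matches the precision-recall trade-off discussed below: rejecting more samples favors quality over diversity.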
This helps in recovering the true data distribution $p_{\text{data}}(x, c)$ even when the model distribution $p_g(x, c)$ is suboptimal.7\n\nWhile the recent work [2] studies rejection sampling for unconditional GANs, we focus on improving cGANs, and our formula (6) for the acceptance rate is different. We also remark that, in order to avoid extremely low acceptance rates, following the strategy in [2], we first pull the ratio back through $f^{-1}(r(x))$ ($f$ is the sigmoid function), subtract a constant $\gamma$, and push it forward to $f(f^{-1}(r(x)) - \gamma)$. As in [2], we set the constant $\gamma$ to be the $p$-th percentile of the batch, where $p$ is tuned per dataset. Note that $\gamma$ controls the precision-recall trade-off [45] of samples, as a low acceptance rate (high $\gamma$) improves the quality and a high acceptance rate (low $\gamma$) improves the diversity.\n\nActive learning. The goal of active learning [48] is to reduce the cost of labeling by predicting the best real samples (i.e., queries) to label in order to improve the current model. In training cGANs with active learning, it is natural to find and label samples with high GOLD values, since they can be viewed as under-estimated ones with respect to the current model. For unlabeled samples, however, we do not have access to the ground-truth class $c_x$ and thus to $d(x, c_x)$ (or $d_{\text{bal}}(x, c_x)$). To tackle this issue, we take an expectation over $c_x$ under the class probability given by $D_C$ and estimate the conditional term as\n\n$$-\log D_C(c_x|x) \approx \mathbb{E}_{c \sim D_C(c|x)}[-\log D_C(c|x)] = H[D_C(c|x)], \quad (7)$$\n\n7 One may use an advanced sampling strategy, e.g., the Metropolis-Hastings GAN (MH-GAN) [53]. As MH-GAN requires the density ratio $p_{\text{data}}/p_g$ to run, one can naturally apply the GOLD estimator.\n\nFigure 1: (a) Histogram of the marginal/conditional terms of the GOLD estimator. 
(b) Training curve of the mean of the GOLD estimator (of generated samples) and (c) the fitting capacity, for the baseline model and the model trained with the re-weighting scheme (GOLD) on the MNIST dataset.\n\nwhere $H$ is the entropy function. Using the approximation above, the GOLD estimator for unlabeled real samples can be defined as\n\n$$d_{\text{unlabel}}(x) := \log \frac{D_G(x)}{1 - D_G(x)} + H[D_C(c|x)], \quad (8)$$\n$$d_{\text{unlabel-bal}}(x) := \log \frac{D_G(x)}{1 - D_G(x)} + \frac{\sigma_G}{\sigma_C} \cdot H[D_C(c|x)], \quad (9)$$\n\nwhere $\sigma_G$ and $\sigma_C$ are the standard deviations of the marginal and conditional (i.e., entropy) terms. As in conventional active learning for classifiers, one can view the first term $\log \frac{D_G(x)}{1 - D_G(x)}$ in (8) as a density (or representativeness) score [14, 50], which measures how well the sample $x$ represents the data distribution. The second term $H[D_C(c|x)]$ is an uncertainty (or informativeness) score [13, 3], which measures how informative the label $c$ is for the current model. Hence, our method can be interpreted as a combination of the density and uncertainty scores [19] in a principled, yet scalable way. We finally remark that we also utilize all unlabeled samples in the pool to train our model, i.e., semi-supervised learning, which can be done naturally in the cGAN framework of our interest.\n\n4 Experiments\n\nIn this section, we demonstrate the effectiveness of the GOLD estimator for three applications: example re-weighting, rejection sampling, and active learning. We conduct experiments on one synthetic point dataset and six image datasets: MNIST [25], FMNIST [54], SVHN [36], CIFAR-10 [23], STL-10 [11], and LSUN [55]. The synthetic dataset consists of random samples drawn from a Gaussian mixture with 6 clusters, where we assign the clusters binary labels to obtain 2 groups of 3 clusters (see Figure 3). 
As the choice of cGAN models to evaluate, we use the InfoGAN [9] model for 1-channel images (MNIST and FMNIST), the ACGAN [39] model for 3-channel images (SVHN, CIFAR-10, STL-10, and LSUN), and the GAN model of [16] with an auxiliary classifier for the synthetic dataset. For all experiments, spectral normalization (SN) [32] is used for more stable training. We set the balancing factor to $\lambda_c = 0.1$ in most of our experiments but lower the value when training cGANs on small datasets.8 For all experiments on example re-weighting and rejection sampling, we choose the default value $\lambda_c = 0.1$. For experiments on active learning, we choose $\lambda_c = 0.01$ and $\lambda_c = 0$ for the synthetic/MNIST and FMNIST/SVHN datasets, respectively. The reported results are averaged over 5 trials for the image datasets and 25 trials for the synthetic dataset.\n\nAs the evaluation metric for data generation, we use the fitting capacity recently proposed in [41, 27]. It measures the accuracy on real samples of a classifier trained with generated samples of the cGAN, where we use LeNet [25] as the classifier.9 Intuitively, the fitting capacity should match the 'true' classifier accuracy (trained with real samples) if the model distribution perfectly matches the real distribution. It is a natural evaluation metric for cGANs, as it directly measures the performance of conditional generation. 
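A toy sketch of the fitting-capacity protocol (train a classifier on labeled generated samples, report its accuracy on real samples); a nearest-class-mean rule on 1-D synthetic data stands in for the LeNet classifier, so the data and function name are illustrative assumptions:

```python
import random

def fitting_capacity(generated, real):
    """Sketch of the fitting-capacity metric [41]: fit a classifier on labeled
    *generated* samples and return its accuracy on *real* labeled samples.
    Each sample is an (x, class) pair; x is a 1-D feature here."""
    # "Train": one mean per class, estimated from the generated samples.
    sums, counts = {}, {}
    for x, c in generated:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    means = {c: sums[c] / counts[c] for c in sums}
    # "Test": accuracy of the nearest-mean rule on the real samples.
    correct = sum(
        1 for x, c in real
        if min(means, key=lambda k: abs(x - means[k])) == c
    )
    return correct / len(real)

random.seed(0)
gen = [(random.gauss(0, 0.5), 0) for _ in range(100)] + \
      [(random.gauss(3, 0.5), 1) for _ in range(100)]
real = [(random.gauss(0, 0.5), 0) for _ in range(100)] + \
       [(random.gauss(3, 0.5), 1) for _ in range(100)]
```

When the generated distribution matches the real one, as in this toy setup, the classifier transfers and the fitting capacity approaches the real classifier accuracy.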
Here, one may also suggest other popular metrics, e.g., the Inception score (IS) [46] or the Fréchet Inception distance (FID) [17], but the work of [41] has recently shown that even when the IS/FID of generated samples match those of real ones, the fitting capacity is often much lower than the real classifier accuracy (i.e., IS/FID correlate poorly with the fitting capacity).\n\n8 This is because the generator is more likely to produce bad samples with incorrect attributes for small datasets, which strengthens the wrong signal.\n9 We use training data to train ACGAN and test data to evaluate the fitting capacity, except for LSUN, where we use validation data for both training and evaluation due to the class imbalance of the training data.\n\nTable 1: Fitting capacity (%) [41] for example re-weighting under various datasets.\n\n         | MNIST      | FMNIST     | SVHN       | CIFAR-10   | STL-10     | LSUN\nBaseline | 96.43±0.17 | 77.97±1.24 | 74.43±0.71 | 36.76±0.99 | 36.73±0.64 | 26.35±0.82\nGOLD     | 96.62±0.15 | 78.34±1.11 | 76.71±0.94 | 37.06±1.38 | 37.65±0.71 | 28.21±0.86\n\nTable 2: Fitting capacity (%) for example re-weighting under various levels of supervision.\n\nDataset  | Method   | 1%         | 5%         | 10%        | 20%        | 50%        | 100%\nSVHN     | Baseline | 72.41±1.30 | 72.99±1.65 | 73.15±0.96 | 73.18±1.28 | 74.04±1.26 | 74.33±0.71\nSVHN     | GOLD     | 75.01±1.93 | 75.58±0.86 | 75.78±0.74 | 76.04±1.93 | 76.25±1.40 | 76.71±0.94\nCIFAR-10 | Baseline | 17.99±0.78 | 18.42±0.71 | 21.84±1.14 | 23.13±1.95 | 35.41±1.03 | 36.76±0.99\nCIFAR-10 | GOLD     | 18.28±0.65 | 19.15±0.97 | 21.91±2.56 | 23.89±2.02 | 34.95±1.11 | 37.06±1.38\n\n
Furthermore, IS/FID are not suitable for non-ImageNet-like images, e.g., MNIST or SVHN. Nevertheless, we provide some FID results in the Supplementary Material for the interest of readers.\n\n4.1 Example re-weighting\n\nWe first evaluate the effect of the re-weighting scheme using the loss (5). We train the model for 20 and 200 epochs for 1-channel and 3-channel images, respectively. We use the baseline loss (1) for the first half of epochs and the re-weighting scheme for the second half. We simply choose $\beta = 1$ for the discriminator loss and $\beta = 0$ for the generator loss, because a large $\beta$ for the generator loss destabilizes training by incurring high variance of gradients.10 We train the LeNet classifier (for fitting capacity) for 40 epochs, using 10,000 newly generated samples for each epoch. Figure 1b and Figure 1c report the training curves of the GOLD estimator (of generated samples) and the fitting capacity, respectively, on the MNIST dataset. Figure 1b shows that the GOLD estimator under the re-weighting scheme stably converges to zero, while that of the baseline model monotonically decreases. As a result, in Figure 1c, one can observe that the re-weighting scheme improves the fitting capacity, while that of the baseline model becomes worse as training proceeds. Table 1 and Table 2 report the fitting capacity for fully-supervised settings (i.e., using the full labels of the datasets to train cGANs) and semi-supervised settings (i.e., using only x% supervision of the datasets to train cGANs), respectively. In most reported cases, our method outperforms the baseline model. For example, ours improves the fitting capacity from 74.43 to 76.71 (+3.06%) under the full labels of SVHN.\n\n4.2 Rejection sampling\n\nNext, we demonstrate the effect of the rejection sampling. 
We use the model trained by the original loss (1) with fully labeled datasets.11 To emphasize the sampling effect, we use a fixed set of 50,000 samples instead of re-sampling for each epoch. We use $p = 0.1$ for 1-channel images and $p = 0.5$ for 3-channel images. Table 3 presents the fitting capacity of the rejection sampling under various datasets. Our method shows a consistent improvement over the baseline (random sampling without rejection), e.g., ours improves from 73.58 to 75.06 (+2.01%) for SVHN. We also study the effect of $p$, the control parameter of the acceptance ratio for the rejection sampling (a high $p$ rejects more samples). As a high $p$ harms the diversity and a low $p$ harms the quality, a proper $p$ (e.g., 0.5 for CIFAR-10) shows the best performance. Table 4 and Figure 5 in the Supplementary Material present the fitting capacity and the precision and recall on distributions (PRD) [45] plot, respectively, under CIFAR-10 and various $p$ values. Indeed, both low ($p = 0.1$) and high ($p = 0.9$) values harm the performance, and $p = 0.5$ is the best choice among them.\n\nWe also qualitatively analyze the effect of the rejection sampling. The first row of Figure 2 visualizes the generated samples with high marginal, conditional, and combined (GOLD) values. We observe that the random samples (without rejection) often contain low-quality samples with uncertain and/or wrong classes. On the other hand, samples with high marginal values improve the quality (or vividness), and samples with high conditional values improve the class accuracy (but lose diversity). 
The samples with high GOLD values get the best of both worlds, producing diverse images with only a few wrong classes.\n\n10 We do not make much effort in choosing $\beta$, as the choice $\beta \in \{0, 1\}$ is enough to show the improvement.\n11 One can also use the model trained by the re-weighting scheme of loss (5) for further improvement.\n\nTable 3: Fitting capacity (%) for rejection sampling under various datasets.\n\n         | MNIST      | FMNIST     | SVHN       | CIFAR-10   | STL-10     | LSUN\nBaseline | 96.05±0.41 | 77.94±0.83 | 73.58±0.72 | 35.15±0.51 | 34.33±0.30 | 26.43±0.14\nGOLD     | 96.17±0.63 | 78.25±0.30 | 75.06±0.71 | 35.98±1.15 | 35.21±1.02 | 26.79±0.42\n\nTable 4: Fitting capacity (%) for rejection sampling under CIFAR-10 and various p values.\n\nBaseline   | p = 0.1    | p = 0.3    | p = 0.5    | p = 0.7    | p = 0.9\n35.15±0.51 | 35.80±0.42 | 35.87±0.61 | 35.98±1.15 | 35.85±0.53 | 35.33±0.53\n\nFigure 2: Generated and real samples with high marginal, conditional, and combined (GOLD) values. Generated samples are aligned by class (each row), and the red boxes indicate uncertain and/or wrong classes. See Section 4.2 and Section 4.3 for the detailed explanation.\n\n4.3 Active learning\n\nFinally, we demonstrate the active learning results. We conduct our experiments on a synthetic dataset and 3 image datasets (MNIST, FMNIST, SVHN). We train in the semi-supervised setting, as we have a large pool of unlabeled samples. We run 4 query acquisition steps (i.e., 5 training steps), where the triplets of initial (labeled) training set size, query size, and final (labeled) training set size are (4,1,8), (10,2,18), (20,5,40), and (20,20,100) for synthetic, MNIST, FMNIST, and SVHN, respectively. 
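A minimal sketch of one GOLD-based acquisition step using the unlabeled score (8): rank unlabeled samples by the density term plus the entropy of the classifier branch and label the top-k. The discriminator outputs below are hypothetical placeholders, and the function names are ours:

```python
import math

def entropy(probs):
    """Shannon entropy H[p] of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(unlabeled, d_g, d_c_probs, k):
    """Sketch of GOLD-based query acquisition via Eq. (8): score each
    unlabeled sample by log DG/(1 - DG) plus the entropy of the classifier
    branch DC, and return the top-k samples to label."""
    scores = {
        x: math.log(d_g[x] / (1.0 - d_g[x])) + entropy(d_c_probs[x])
        for x in unlabeled
    }
    return sorted(unlabeled, key=lambda x: scores[x], reverse=True)[:k]
```

A sample that is both representative (high `DG`) and class-ambiguous (near-uniform `DC`) ranks highest, combining the density and uncertainty criteria discussed in Section 3.2.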
We train the model for 100 epochs, and choose the model with the best fitting capacity on the validation set (of size 100) to compute the GOLD estimator for the query acquisition. Interestingly, we found that keeping the parameters of the generator (while re-initializing the discriminator) for the next model in the active learning scenario improves the performance. This is because the discriminator easily overfits and struggles to escape from local optima, while the generator can spread out the generated samples relatively easily. We use this re-initialization scheme (i.e., keep G and re-initialize D) for all active learning experiments. For query acquisition, we use the vanilla version of the GOLD estimator (8) for the image datasets, but the balanced version (9) for the synthetic dataset, as the synthetic dataset suffers from the over-confidence problem.\n\nFigure 3 visualizes the selected queries based on the GOLD estimator under the synthetic dataset. The GOLD estimator has high values in uncovered (i.e., where no samples are obtained) or uncertain regions, which yield high marginal and conditional values, respectively. See the leftmost region of column 2 and the topmost region of column 3 for each case. Indeed, both components of the GOLD estimator contribute to the query selection. Consequently, the GOLD estimator effectively selects queries and learns the true joint distribution. In contrast, random selection often picks redundant or less important regions, which makes the convergence slower. Figure 4 presents the quantitative results. Our method outperforms the random query selection, e.g., the final fitting capacity of our method on MNIST is 94.60, which improves on the baseline's 92.65 by 2.10%.\n\nFigure 3: Visualization of the query selection based on the GOLD estimator. The first and second rows show the selected queries and the generated samples, respectively. 
The third row shows the GOLD estimator values; the sample with the highest value is selected as the query for the next iteration.\n\nFigure 4: Fitting capacity for active learning under various datasets: (a) synthetic, (b) MNIST, (c) FMNIST, (d) SVHN.\n\nIn addition, we qualitatively analyze the effect of the two (marginal and conditional) terms of the GOLD estimator. The second row of Figure 2 presents the real samples with high marginal, conditional, and combined (GOLD) values. We observe that samples picked under high marginal values contain multiple digits (which are hard to generate) and those picked under high conditional values have uncertain classes. On the other hand, the GOLD estimator picks uncertain samples with multiple digits, which takes advantage of both.\n\n5 Discussion and Conclusion\n\nWe have proposed a novel yet simple GOLD estimator, which measures the discrepancy between the data distribution and the model distribution on given samples and can be efficiently computed under the conditional GAN (cGAN) framework. We also proposed three applications of the GOLD estimator: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. To the best of our knowledge, we are the first to study these problems for cGANs, while their counterparts for classification models or the (original, unconditional) GAN have been investigated in the literature. First, example re-weighting [43] and re-sampling [6, 22] have been studied to improve the performance, convergence speed, and/or robustness of convolutional neural networks (CNNs). In this line of research, we show that a re-weighting scheme can also improve the performance of cGANs. To this end, we use higher weights for the samples with the larger discrepancy, which resembles the prior work on hard example mining [49, 28] for classifiers/detectors. 
Designing a better re-weighting scheme or a better scheduling technique [4, 24] would be an interesting future research direction. Second, active learning [48] has also been well studied for classification models [13, 47]. Finally, a recent work proposes rejection sampling [44] for the original (unconditional) GAN [2]. In contrast to this prior work, we focus on conditional generation, i.e., we consider both the generation quality and the class accuracy. We finally remark that investigating other applications of the GOLD estimator, e.g., outlier detection [26] or training under noisy labels [43], would also be an interesting future direction.

Acknowledgments

This research was supported by the Information Technology Research Center (ITRC) support program (IITP-2019-2016-0-00288), the Next-Generation Information Computing Development Program (NRF-2017M3C4A7069369), and an Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence), funded by the Ministry of Science and ICT, Korea (MSIT). We also appreciate GPU support from the Brain Cloud team at Kakao Brain.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] S. Azadi, C. Olsson, T. Darrell, I. Goodfellow, and A. Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.

[3] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9368–9377, 2018.

[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning.
In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.

[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[6] H.-S. Chang, E. Learned-Miller, and A. McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pages 1002–1012, 2017.

[7] T. Chen, M. Lucic, N. Houlsby, and S. Gelly. On self modulation for generative adversarial networks. arXiv preprint arXiv:1810.01365, 2018.

[8] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby. Self-supervised generative adversarial networks. arXiv preprint arXiv:1811.11212, 2018.

[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[10] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.

[11] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[13] Y. Gal, R. Islam, and Z. Ghahramani. Deep Bayesian active learning with image data.
In Proceedings of the 34th International Conference on Machine Learning, pages 1183–1192. JMLR.org, 2017.

[14] D. Gissin and S. Shalev-Shwartz. Discriminative active learning, 2019.

[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[18] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher. Augmented cyclic adversarial learning for domain adaptation. arXiv preprint arXiv:1807.00374, 2018.

[19] S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems, pages 892–900, 2010.

[20] A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.

[21] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[22] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[24] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models.
In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[26] K. Lee, H. Lee, K. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

[27] T. Lesort, J.-F. Goudou, and D. Filliat. Training discriminative models to evaluate generative ones. arXiv preprint arXiv:1806.10840, 2018.

[28] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[29] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pages 700–709, 2018.

[30] M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly. High-fidelity image generation with fewer labels. arXiv preprint arXiv:1903.02271, 2019.

[31] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[32] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[33] T. Miyato and M. Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

[34] S. Mo, M. Cho, and J. Shin. InstaGAN: Instance-aware image-to-image translation. In International Conference on Learning Representations, 2019.

[35] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[36] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning.
2011.

[37] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[38] A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow. Is generator conditioning causally related to GAN performance? arXiv preprint arXiv:1802.08768, 2018.

[39] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, pages 2642–2651. JMLR.org, 2017.

[40] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. arXiv preprint arXiv:1903.07291, 2019.

[41] S. Ravuri and O. Vinyals. Seeing is not necessarily believing: Limitations of BigGANs for data augmentation, 2019.

[42] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.

[43] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

[44] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.

[45] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.

[46] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[47] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

[48] B. Settles.
Active learning literature survey. Technical report, University of Wisconsin–Madison Department of Computer Sciences, 2009.

[49] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.

[50] S. Sinha, S. Ebrahimi, and T. Darrell. Variational adversarial active learning. arXiv preprint arXiv:1904.00370, 2019.

[51] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.

[52] K. Sricharan, R. Bala, M. Shreve, H. Ding, K. Saketh, and J. Sun. Semi-supervised conditional GANs. arXiv preprint arXiv:1708.05789, 2017.

[53] R. Turner, J. Hung, Y. Saatci, and J. Yosinski. Metropolis–Hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.

[54] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[55] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[56] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[57] Z. Zhou, H. Cai, S. Rong, Y. Song, K. Ren, W. Zhang, Y. Yu, and J. Wang. Activation maximization generative adversarial nets. arXiv preprint arXiv:1703.02000, 2017.

[58] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.