{"title": "BourGAN: Generative Networks with Metric Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 2269, "page_last": 2280, "abstract": "This paper addresses the mode collapse for generative adversarial networks (GANs). We view modes as a geometric structure of data distribution in a metric space. Under this geometric lens, we embed subsamples of the dataset from an arbitrary metric space into the L2 space, while preserving their pairwise distance distribution. Not only does this metric embedding determine the dimensionality of the latent space automatically, it also enables us to construct a mixture of Gaussians to draw latent space random vectors. We use the Gaussian mixture model in tandem with a simple augmentation of the objective function to train GANs. Every major step of our method is supported by theoretical analysis, and our experiments on real and synthetic data confirm that the generator is able to produce samples spreading over most of the modes while avoiding unwanted samples, outperforming several recent GAN variants on a number of metrics and offering new features.", "full_text": "BourGAN: Generative Networks with Metric\n\nEmbeddings\n\nChang Xiao\n\nChangxi Zheng\n\nPeilin Zhong\n\nColumbia University\n\n{chang, peilin, cxz}@cs.columbia.edu\n\nAbstract\n\nThis paper addresses the mode collapse for generative adversarial networks (GANs).\nWe view modes as a geometric structure of data distribution in a metric space.\nUnder this geometric lens, we embed subsamples of the dataset from an arbitrary\nmetric space into the `2 space, while preserving their pairwise distance distribution.\nNot only does this metric embedding determine the dimensionality of the latent\nspace automatically, it also enables us to construct a mixture of Gaussians to draw\nlatent space random vectors. We use the Gaussian mixture model in tandem with a\nsimple augmentation of the objective function to train GANs. 
Every major step of\nour method is supported by theoretical analysis, and our experiments on real and\nsynthetic data con\ufb01rm that the generator is able to produce samples spreading over\nmost of the modes while avoiding unwanted samples, outperforming several recent\nGAN variants on a number of metrics and offering new features.\n\n1\n\nIntroduction\n\nIn unsupervised learning, Generative Adversarial Networks (GANs) [1] is by far one of the most\nwidely used methods for training deep generative models. However, dif\ufb01culties of optimizing GANs\nhave also been well observed [2, 3, 4, 5, 6, 7, 8]. One of the most prominent issues is mode collapse,\na phenomenon in which a GAN, after learning from a data distribution of multiple modes, generates\nsamples landed only in a subset of the modes. In other words, the generated samples lack the diversity\nas shown in the real dataset, yielding a much lower entropy distribution.\nWe approach this challenge by questioning two fundamental properties of GANs. i) We question\nthe commonly used multivariate Gaussian that generates random vectors for the generator network.\nWe show that in the presence of separated modes, drawing random vectors from a single Gaussian\nmay lead to arbitrarily large gradients of the generator, and a better choice is by using a mixture of\nGaussians. ii) We consider the geometric interpretation of modes, and argue that the modes of a\ndata distribution should be viewed under a speci\ufb01c distance metric of data items \u2013 different metrics\nmay lead to different distributions of modes, and a proper metric can result in interpretable modes.\nFrom this vantage point, we address the problem of mode collapse in a general metric space. To\nour knowledge, despite the recent attempts of addressing mode collapse [3, 9, 10, 6, 11, 12], both\nproperties remain unexamined.\n\nTechnical contributions. We introduce BourGAN, an enhancement of GANs to avoid mode col-\nlapse in any metric space. 
In stark contrast to all existing mode collapse solutions, BourGAN draws\nrandom vectors from a Gaussian mixture in a low-dimensional latent space. The Gaussian mixture is\nconstructed to mirror the mode structure of the provided dataset under a given distance metric. We\nderive the construction algorithm from metric embedding theory, namely the Bourgain Theorem [13].\nNot only is using metric embeddings theoretically sound (as we will show), it also brings significant\nadvantages in practice.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Multi-mode challenge. We train a generator G that maps a latent-space distribution Z to\nthe data distribution X with two modes. (a) Suppose Z is a Gaussian, and G can fit both modes. If\nwe draw two i.i.d. samples z1, z2 from Z, then with at least a constant probability, G(z1) is close to\nthe center x1 of the first mode, and G(z2) is close to another center x2. By the Mean Value Theorem,\nthere exists a z between z1 and z2 at which the gradient magnitude is |G′(z)| = |(x2 - x1)/(z2 - z1)|, which can\nbe arbitrarily large, as |x2 - x1| can be arbitrarily large. (b) Since G is Lipschitz continuous, using it to\nmap a Gaussian distribution to both modes unavoidably results in unwanted samples between the\nmodes (highlighted by the red dots). (c) Both challenges are resolved if we can construct a mixture\nof Gaussians in latent space that captures the same modal structure as in the data distribution.\n\nMetric embeddings enable us to retain the mode structure in the ℓ2 latent space\nregardless of the metric used to measure modes in the dataset. 
In turn, the Gaussian mixture sampling in the\nlatent space eases the optimization of GANs, and unlike existing GANs that assume a user-specified\ndimensionality of the latent space, our method automatically decides the dimensionality of the latent\nspace from the provided dataset.\nTo exploit the constructed Gaussian mixture for addressing mode collapse, we propose a simple\nextension to the GAN objective that encourages the pairwise ℓ2 distance of latent-space random\nvectors to match the distance of the generated data samples in the metric space. That is, the\ngeometric structure of the Gaussian mixture is respected in the generated samples. Through a series\nof (nontrivial) theoretical analyses, we show that if BourGAN is fully optimized, the logarithmic\npairwise distance distribution of its generated samples closely matches the logarithmic pairwise distance\ndistribution of the real data items. In practice, this implies that mode collapse is averted.\nWe demonstrate the efficacy of our method on both synthetic and real datasets. We show that our\nmethod outperforms several recent GAN variants in terms of generated data diversity. In particular,\nour method robustly handles data distributions with multiple separated modes \u2013 challenging\nsituations where all existing GANs that we have experimented with produce unwanted samples (ones\nthat are not in any mode), whereas our method is able to generate samples spreading over all modes\nwhile avoiding unwanted samples.\n\n2 Related Work\n\nGANs and variants. The main goal of generative models in unsupervised learning is to produce\nsamples that follow an unknown distribution X, by learning from a set of unlabelled data items\n{x_i}_{i=1}^n drawn from X. In recent years, Generative Adversarial Networks (GANs) [1] have attracted\ntremendous attention for training generative models. 
A GAN uses a neural network, called the generator\nG, to map a low-dimensional latent-space vector z ∈ ℝ^d, drawn from a standard distribution Z (e.g.,\na Gaussian or uniform distribution), to data items in a space of interest such as natural images\nand text. The generator G is trained in tandem with another neural network, called the discriminator\nD, by solving a minmax optimization with the following objective.\n\nL_gan(G, D) = E_{x∼X}[log D(x)] + E_{z∼Z}[log(1 - D(G(z)))].   (1)\n\nThis objective is minimized over G and maximized over D. Initially, GANs were shown to\ngenerate locally appreciable but globally incoherent images. Since then, they have been actively\nimproved. For example, DCGAN [8] proposes a class of empirically designed network architectures\nthat improve the naturalness of generated images. By extending the objective (1), InfoGAN [14] is\nable to learn interpretable representations in latent space, and Conditional GAN [15] can produce more\nrealistic results by using additional supervised labels. Several later variants have applied GANs to\na wide array of tasks [16, 17] such as image-style transfer [18, 19], super-resolution [20], image\nmanipulation [21], video synthesis [22], and 3D-shape synthesis [23], to name a few.\n\nAddressing difficulties. Despite tremendous success, GANs are generally hard to train. Prior\nresearch has aimed to improve the stability of training GANs, mostly by altering their objective\nfunction [24, 4, 25, 26, 27, 28]. In a different vein, Salimans et al. [3] proposed a feature-matching\ntechnique to stabilize the training process, and another line of work [5, 6, 29] uses an additional\nnetwork that maps generated samples back to latent vectors to provide feedback to the generator.\nA notable problem of GANs is mode collapse, which is the focus of this work. 
For instance, when\ntrained on ten hand-written digits (using MNIST dataset) [30], each digit represents a mode of data\ndistribution, but the generator often fails to produce a full set of the digits [25]. Several approaches\nhave been proposed to mitigate mode collapse, by modifying either the objective function [4, 12]\nor the network architectures [9, 5, 11, 10, 31]. While these methods are evaluated empirically,\ntheoretical understanding of why and to what extent these methods work is often lacking. More\nrecently, PacGAN [11] introduces a mathematical de\ufb01nition of mode collapse, which they used to\nformally analyze their GAN variant. Very few previous works consider the construction of latent\nspace: VAE-GAN [29] constructs the latent space using variational autoencoder, and GLO [32] tries\nto optimize both the generator network and latent-space representation using data samples. Yet, all\nthese methods still draw the latent random vectors from a multivariate Gaussian.\n\nDifferences from prior methods. Our approach differs from prior methods in several important\ntechnical aspects. Instead of using a standard Gaussian to sample latent space, we propose to use\na Gaussian mixture model constructed using metric embeddings (e.g., see [33, 34, 35] for metric\nembeddings in both theoretical and machine learning fronts). Unlike all previous methods that require\nthe latent-space dimensionality to be speci\ufb01ed a priori, our algorithm automatically determines\nits dimensionality from the real dataset. Moreover, our method is able to incorporate any distance\nmetric, allowing the \ufb02exibility of using proper metrics for learning interpretable modes. In addition\nto empirical validation, the steps of our method are grounded by theoretical analysis.\n\n3 Bourgain Generative Networks\n\nWe now introduce the algorithmic details of BourGAN, starting by describing the rationale behind the\nproposed method. 
The theoretical understanding of our method will be presented in the next section.\n\nRationale and overview. We view modes in a dataset as a geometric structure embodied under a\nspecific distance metric. For example, in the widely tested MNIST dataset, only two modes emerge\nunder the pixel-wise ℓ2 distance (Figure 2-left): images for the digit \u201c1\u201d are clustered in one mode,\nwhile all other digits land in another mode. In contrast, under the classifier distance metric\n(defined in Appendix F.3), it appears that there exist 10 modes, each corresponding to a different digit.\nConsequently, the modes are interpretable (Figure 2-right). In this work, we aim to incorporate any\ndistance metric when addressing mode collapse, leaving the flexibility of choosing a specific metric\nto the user.\nWhen there are multiple separated modes in a data distribution, mapping a Gaussian random variable\nin latent space to the data distribution is fundamentally ill-posed. For example, as illustrated in\nFigure 1-a and 1-b, this mapping imposes arbitrarily large gradients (at some latent space locations)\nin the generator network, and large gradients render the generator unstable to train, as pointed out\nby [37].\nA natural choice is to use a mixture of Gaussians. As long as the Gaussian mixture is able to mirror\nthe mode structure of the given dataset, the problem of mapping it to the data distribution becomes\nwell-posed (Figure 1-c). To this end, our main idea is to use metric embeddings, which map data\nitems under any metric to a low-dimensional ℓ2 space with bounded pairwise distance distortion\n(Section 3.3). After the embedding, we construct a Gaussian mixture in the ℓ2 space, regardless of\nthe distance metric for the data items. 
In this process, the dimensionality of the latent space is also\nautomatically decided.\nOur embedding algorithm, building upon the Bourgain Theorem, requires us to compute the pairwise\ndistances of data items, resulting in an O(n²) complexity, where n is the number of data items. When\nn is large, we first uniformly subsample m data items from the dataset to reduce the computational\ncost of our metric embedding algorithm (Section 3.2). The subsampling step is theoretically sound:\nwe prove that when m is sufficiently large yet still much smaller than n, the geometric structure (i.e.,\nthe pairwise distance distribution) of data items is preserved in the subsamples.\nLastly, when training a BourGAN, we encourage the geometric structure embodied in the latent-space Gaussian mixture to be preserved by the generator network. Thereby, the mode structure of the dataset is learned by the generator. This is realized by augmenting GAN\u2019s objective to foster the preservation of the pairwise distance distribution in the training process (Section 3.4).\n\nFigure 2: (Top) Pairwise distance distribution on MNIST dataset under different distance metrics. Left: ℓ2 distance, Middle: Earth Mover\u2019s distance (EMD) with a quadratic ground metric, Right: classifier distance (defined in Appendix F.3). Under ℓ2 and EMD distances, few separated modes emerge, and the pairwise distance distributions resemble a Gaussian. Under the classifier distance, the pairwise distance distribution becomes bimodal, indicating that there are separated modes. (Bottom) t-SNE visualization [36] of data items after being embedded from their metric space to ℓ2 space. Color indicates labels of MNIST images (\u201c1\u201d-\u201c9\u201d). When ℓ2 distance (left) is used, only two modes are identified: digit \u201c1\u201d and all others, but classifier distance (right) can group data items into 10 individual modes.\n\n3.1 Metrics of Distance and Distributions\nBefore delving into our method, we introduce a few theoretical tools to concretize the geometric structure in a data distribution, paving the way toward understanding our algorithmic details and subsequent theoretical analysis. In the rest of this paper, we borrow a few notational conventions from theoretical computer science: we use [n] to denote the set {1, 2, ..., n}, ℝ_{≥0} to denote the set of all non-negative real numbers, and log(·) to denote log_2(·) for short.\n\nMetric space. A metric space is described by a pair (M, d), where M is a set and d : M × M → ℝ_{≥0} is a distance function such that ∀x, y, z ∈ M, we have i) d(x, y) = 0 ⇔ x = y, ii) d(x, y) = d(y, x), and iii) d(x, z) ≤ d(x, y) + d(y, z). If M is a finite set, then we call (M, d) a finite metric space.\n\nWasserstein-1 distance. Wasserstein-1 distance, also known as the Earth-Mover distance, is one of the distance measures to quantify the similarity of two distributions, defined as W(Pa, Pb) = inf_{π ∈ Π(Pa, Pb)} E_{(x,y)∼π}(|x - y|), where Pa and Pb are two distributions on real numbers, and Π(Pa, Pb) is the set of all joint distributions π(x, y) on two real numbers whose marginal distributions are Pa and Pb, respectively. Wasserstein-1 distance has been used to augment GAN\u2019s objective and improve training stability [4]. We will use it to understand the theoretical guarantees of our method.\n\nLogarithmic pairwise distance distribution (LPDD). We propose to use the pairwise distance distribution of data items to reflect the mode structure in a dataset (Figure 2-top). Since the pairwise distance is measured under a specific metric, its distribution also depends on the metric choice. Indeed, it has been used in [9] to quantify how well Unrolled GAN addresses mode collapse.\nConcretely, given a metric space (M, d), let X be a distribution over M, and (λ, Λ) be two real values satisfying 0 < 2λ ≤ Λ. Consider two samples x, y independently drawn from X, and let η be the logarithmic distance between x and y (i.e., η = log(d(x, y))). We call the distribution of η conditioned on d(x, y) ∈ [λ, Λ] the (λ, Λ)-logarithmic pairwise distance distribution (LPDD) of the\ndistribution X. Throughout our theoretical analysis, LPDD of the distributions generated at various\nsteps of our method will be measured in Wasserstein-1 distance.\nRemark. 
We choose to use logarithmic distance in order to reasonably compare two pairwise distance\ndistributions. The rationale is illustrated in Figure 6 in the appendix. Using logarithmic distance is\nalso beneficial for training our GANs, which will become clear in Section 3.4. The (λ, Λ) values in the\nabove definition are just for the sake of theoretical rigor and are irrelevant to our practical implementation.\nThey are meant to avoid the theoretical situation where two samples are identical, in which case taking the\nlogarithm makes no sense. In this section, the reader can skip these values and refer back when\nreading our theoretical analysis (in Section 4 and the supplementary material).\n\n3.2 Preprocessing: Subsample of Data Items\nWe now describe how to train BourGAN step by step. Provided with a multiset of data items\nX = {x_i}_{i=1}^n drawn independently from an unknown distribution X, we first subsample m (m < n)\ndata items uniformly at random from the multiset X. This subsampling step is essential, especially when n is\nlarge, for reducing the computational cost of metric embeddings as well as the number of dimensions\nof the latent space (both described in Section 3.3). From now on, we use Y to denote the multiset of\ndata items subsampled from X (i.e., Y ⊆ X and |Y| = m). Elements in Y will be embedded in ℓ2\nspace in the next step.\nThe subsampling strategy, while simple, is theoretically sound. Let P be the (λ, Λ)-LPDD of the data\ndistribution X, and P′ be the LPDD of the uniform distribution on Y. We will show in Section 4 that\ntheir Wasserstein-1 distance W(P, P′) is tightly bounded if m is sufficiently large but much smaller\nthan n. In other words, the mode structure of the real data can be captured by considering only the\nsubsamples in Y. In practice, m is chosen automatically by a simple algorithm, which we describe in\nAppendix F.1. 
In all our examples, we find m = 4096 sufficient.\n\n3.3 Construction of Gaussian Mixture in Latent Space\nNext, we construct a Gaussian mixture model for generating random vectors in latent space. First, we\nembed data items from Y to an ℓ2 space, one in which the latent random vectors reside. We want the\nlatent vector dimensionality to be small, while ensuring that the mode structure is well reflected in the\nlatent space. This requires the embedding to introduce minimal distortion on the pairwise distances of\ndata items. For this purpose, we propose an algorithm that leverages Bourgain\u2019s embedding theorem.\n\nMetric embeddings. Bourgain [13] introduced a method that can embed any finite metric space\ninto a small ℓ2 space with minimal distortion. The theorem is stated as follows:\nTheorem 1 (Bourgain\u2019s theorem). Consider a finite metric space (Y, d) with m = |Y|. There exists a\nmapping g : Y → ℝ^k for some k = O(log² m) such that ∀y, y′ ∈ Y, d(y, y′) ≤ ‖g(y) - g(y′)‖_2 ≤\nα · d(y, y′), where α is a constant satisfying α ≤ O(log m).\nThe mapping g is constructed using a randomized algorithm also given by Bourgain [13]. Directly\napplying Bourgain\u2019s theorem results in a latent space of O(log² m) dimensions. We can further\nreduce the number of dimensions down to O(log m) through the following corollary.\nCorollary 2 (Improved Bourgain embedding). Consider a finite metric space (Y, d) with m = |Y|.\nThere exists a mapping f : Y → ℝ^k for some k = O(log m) such that ∀y, y′ ∈ Y, d(y, y′) ≤\n‖f(y) - f(y′)‖_2 ≤ α · d(y, y′), where α is a constant satisfying α ≤ O(log m).\nProved in Appendix B, this corollary is obtained by combining Theorem 1 with the Johnson-Lindenstrauss\n(JL) lemma [38]. The mapping f is computed through a combination of the algorithms\nfor Bourgain\u2019s theorem and the JL lemma. 
This algorithm of computing f is detailed in Appendix A.\nRemark. Instead of using Bourgain embedding, one can find a mapping f : Y → ℝ^k with bounded\ndistortion, namely, ∀y, y′ ∈ Y, d(y, y′) ≤ ‖f(y) - f(y′)‖_2 ≤ α · d(y, y′), by solving a semidefinite\nprogramming problem (e.g., see [39, 33]). This approach can find an embedding with the least distortion\nα. However, solving a semidefinite programming problem is much more costly than computing\nBourgain embeddings. Even if the optimal distortion factor α is found, it can still be as large as\nO(log m) in the worst case [40]. Indeed, Bourgain embedding is optimal in the worst case.\nUsing the mapping f, we embed data items from Y (denoted as {y_i}_{i=1}^m) into the ℓ2 space of k dimensions\n(k = O(log m)). Let F be the multiset of the resulting vectors in ℝ^k (i.e., F = {f(y_i)}_{i=1}^m).\nAs we will formally state in Section 4, the Wasserstein-1 distance between the (λ, Λ)-LPDD of the\nreal data distribution X and the LPDD of the uniform distribution on F is tightly bounded. Simply\nspeaking, the mode structure in the real data is well captured by F in ℓ2 space.\n\nLatent-space Gaussian mixture. Now, we construct a distribution using F to draw random vectors\nin latent space. A simple choice is the uniform distribution over F, but such a distribution is not\ncontinuous over the latent space. Instead, we construct a mixture of Gaussians, each of which is\ncentered at a vector f(y_i) in F. In particular, we generate a latent vector z ∈ ℝ^k in two steps: We first\nsample a vector μ ∈ F uniformly at random, and then draw a vector z from the Gaussian distribution\nN(μ, σ²), where σ is a smoothing parameter that controls the smoothness of the distribution of the\nlatent space. In practice, we choose σ empirically (σ = 0.1 for all our examples). We discuss our\nchoice of σ in Appendix F.1.\nRemark. By this definition, the Gaussian mixture consists of m Gaussians (recall F = {f(y_i)}_{i=1}^m).\nBut this does not mean that we construct m \u201cmodes\u201d in the latent space. If two Gaussians are close\nto each other in the latent space, they should be viewed as if they are from the same mode. It is\nthe overall distribution of the m Gaussians that reflects the distribution of modes. In this sense, the\nnumber of modes in the latent space is implicitly defined, and the m Gaussians are meant to enable\nus to sample the modes in the latent space.\n\n3.4 Training\nThe Gaussian mixture distribution Z in the latent space guarantees that the LPDD of Z is close\nto the (λ, Λ)-LPDD of the target distribution X (shown in Section 4). To exploit this property for\navoiding mode collapse, we encourage the generator network to match the pairwise distances of\ngenerated samples with the pairwise ℓ2 distances of latent vectors in Z. This is realized by a simple\naugmentation of the GAN\u2019s objective function, namely,\n\nL(G, D) = L_gan + β L_dist,   (2)\n\nwhere L_dist(G) = E_{z_i,z_j∼Z}[(log(d(G(z_i), G(z_j))) - log(‖z_i - z_j‖_2))²],   (3)\n\nL_gan is the objective of the standard GAN in Eq. (1), and β is a parameter to balance the two terms.\nIn L_dist, z_i and z_j are two i.i.d. samples from Z conditioned on z_i ≠ z_j. Here the advantages of\nusing logarithmic distances are threefold: i) When there exist \u201coutlier\u201d modes that are far away\nfrom others, logarithmic distance prevents those modes from being overweighted in the objective.\nii) Logarithm turns a uniform scale of the distance metric into a constant addend that has no effect\non the optimization. This is desired as the structure of modes is invariant under a uniform scale of\ndistance metric. iii) Logarithmic distances ease our theoretical analysis, which, as we will formalize\nin Section 4, states that when Eq. 
(3) is optimized, the distribution of generated samples will closely\nresemble the real distribution X. That is, mode collapse will be avoided.\nIn practice, when experimenting with real datasets, we find that a simple pre-training step using the\ncorrespondence between {y_i}_{i=1}^m and {f(y_i)}_{i=1}^m helps to improve the training stability. Although\nnot a focus of this paper, this step is described in Appendix C.\n\n4 Theoretical Analysis\n\nThis section offers a theoretical analysis of our method presented in Section 3. We will state the main\ntheorems here while referring to the supplementary material for their rigorous proofs. Throughout,\nwe assume a property of the data distribution X: if two samples, a and b, are drawn independently\nfrom X, then with a high probability (> 1/2) they are distinct (i.e., Pr_{a,b∼X}(a ≠ b) ≥ 1/2).\nRange of pairwise distances. We first formalize our definition of (λ, Λ)-LPDD in Section 3.1.\nRecall that the multiset X = {x_i}_{i=1}^n is our input dataset regarded as i.i.d. samples from the distribution X. We\nwould like to find a range [λ, Λ] such that the pairwise distances of samples from X are in this range\nwith a high probability (see Example-7 and -8 in Appendix D). Then, when considering the LPDD of\nX, we account only for the pairwise distances in the range [λ, Λ] so that the logarithmic pairwise\ndistance is well defined. 
The values λ and Λ are chosen by the following theorem, which we prove in\nAppendix G.2.\nTheorem 3. Let λ = min_{i∈[n-1]: x_i≠x_{i+1}} d(x_i, x_{i+1}) and Λ = max_{i∈[n-1]} d(x_i, x_{i+1}). ∀δ, γ ∈ (0, 1), if n ≥ C/(δγ) for some sufficiently large constant C > 0, then with probability at least 1 - δ, Pr_{a,b∼X}(d(a, b) ∈ [λ, Λ] | λ, Λ) ≥ Pr_{a,b∼X}(a ≠ b) - γ.\nSimply speaking, this theorem states that if we choose λ and Λ as described above, then we have Pr_{a,b∼X}(d(a, b) ∈ [λ, Λ] | a ≠ b) ≥ 1 - O(1/n), meaning that if n is large, the pairwise distance of any two i.i.d. samples from X is almost certainly in the range [λ, Λ]. Therefore, (λ, Λ)-LPDD is a reasonable measure of the pairwise distance distribution of X. In this paper, we always use P to denote the (λ, Λ)-LPDD of the real data distribution X.\n\nNumber of subsamples. With the choices of λ and Λ, we have the following theorem to guarantee the soundness of our subsampling step described in Section 3.2.\nTheorem 4. Let Y = {y_i}_{i=1}^m be a multiset of m = log^{O(1)}(Λ/λ) · log(1/δ) i.i.d. samples drawn from X, and let P′ be the LPDD of the uniform distribution on Y. For any δ ∈ (0, 1), with probability at least 1 - δ, we have W(P, P′) ≤ O(1).\nProved in Appendix G.3, this theorem states that we only need m (on the order of log^{O(1)}(Λ/λ)) subsamples to form a multiset Y that well captures the mode structure in the real data.\n\nFigure 3: LPDD of uniform distribution ℱ (orange) and of samples from Gaussian mixture (blue). (Axes: log pairwise distance, frequency.)\n\nDiscrete latent space. Next, we lay a theoretical foundation for our metric embedding step described in Section 3.3. Recall that F is the multiset of vectors resulted from embedding data items from Y to the ℓ2 space (i.e., F = {f(y_i)}_{i=1}^m). As proved in Appendix G.4, we have:\nTheorem 5. Let ℱ be the uniform distribution on the multiset F. Then with probability at least 0.99, we have W(P, P̂) ≤ O(log log log(Λ/λ)), where P̂ is the LPDD of ℱ.\nHere the triple-log function (log log log(Λ/λ)) indicates that the Wasserstein distance bound can be very tight. Although this theorem states about the uniform distribution on F, not precisely the Gaussian mixture we constructed, it is about the case when σ of the Gaussian mixture approaches zero. We also empirically verified the consistency of LPDD from Gaussian mixture samples (Figure 3).\n\nGAN objective. Next, we theoretically justify the objective function (i.e., Eq. (3) in Section 3.4). Let X̃ be the distribution of generated samples G(z) for z ∼ Z and P̃ be the (λ, Λ)-LPDD of X̃. Goodfellow et al. [1] showed that the global optimum of the GAN objective (1) is reached if and only if X̃ = X. Then, when this optimum is achieved, we must also have W(P, P̃) = 0 and W(P̃, P̂) ≤ O(log log log(Λ/λ)). The latter is because W(P, P̂) ≤ O(log log log(Λ/λ)) from Theorem 5.\nAs a result, the GAN\u2019s minmax problem (1) is equivalent to the constrained minmax problem, min_G max_D L_gan(G, D), subject to W(P̃, P̂) ≤ β, where β is on the order of O(log log log(Λ/λ)). Apparently, this constraint renders the minmax problem harder. We therefore consider the minmax problem, min_G max_D L_gan(G, D), subjected to slightly strengthened constraints,\n\n∀z1 ≠ z2 ∈ supp(Z), d(G(z1), G(z2)) ∈ [λ, Λ], and   (4)\n[log(d(G(z1), G(z2))) - log ‖z1 - z2‖_2]² ≤ β².   (5)\n\nAs proved in Appendix E, if the above constraints are satisfied, then W(P̃, P̂) ≤ β is automatically satisfied. In our training process, we assume that the constraint (4) is automatically satisfied, supported by Theorem 3. Lastly, instead of using Eq. (5) as a hard constraint, we treat it as a soft constraint showing up in the objective function (3). From this perspective, the second term in our proposed objective (2) can be interpreted as a Lagrange multiplier of the constraint.\n\nLPDD of the generated samples. Now, if the generator network is trained to satisfy the constraint (5), we have W(P̃, P̂) ≤ O(log log log(Λ/λ)). Note that this satisfaction does not imply that the global optimum of the GAN in Eq. (1) has to be reached \u2013 such a global optimum is hard to achieve in practice. Finally, using the triangle inequality of the Wasserstein-1 distance and Theorem 5, we reach the conclusion that\n\nW(P̃, P) ≤ W(P̃, P̂) + W(P, P̂) ≤ O(log log log(Λ/λ)).   (6)\n\nThis means that the LPDD of generated samples closely resembles that of the data distribution. To put the bound in a concrete context, in Example 9 of Appendix D, we analyze a toy case in a thought experiment to show, if the mode collapse occurs (even partially), how large W(P̃, P) would be in comparison to our theoretical bound here.\n\nFigure 4: Synthetic data tests (datasets: 2D Ring, 2D Grid, 2D Circle; results shown for GAN, Unrolled GAN, VEEGAN, PacGAN, BourGAN, and Target). In all three tests, our method clearly captures all the modes presented in the targets, while producing no unwanted samples located between the regions of modes.\n\n5 Experiments\n\nThis section presents the empirical evaluations of our method. There has not been a consensus on how to evaluate GANs in the machine learning community [41, 42], and quantitative measure of mode collapse is also not straightforward. We therefore evaluate our method using both synthetic and real datasets, most of which have been used by recent GAN variants. We refer the reader to Appendix F for detailed experiment setups and complete results, while highlighting our main findings here.\n\nOverview. We start with an overview of our experiments. i) On synthetic datasets, we quantitatively compare our method with four types of GANs, including the original GAN [1] and more recent VEEGAN [10], Unrolled GANs [9], and PacGAN [11], following the evaluation metrics used by those methods (Appendix F.2). ii) We also examine in each mode how well the distribution of generated samples matches the data distribution (Appendix F.2) \u2013 a new test not presented previously. iii) We compare the training convergence rate of our method with existing GANs (Appendix F.2), examining to what extent the Gaussian mixture sampling is beneficial. iv) We challenge our method with the difficult stacked MNIST dataset (Appendix F.3), testing how many modes it can cover. v) Most notably, we examine if there are \u201cfalse positive\u201d samples generated by our method and others (Figure 4). Those are unwanted samples not located in any modes. In all these comparisons, we find that BourGAN clearly produces higher-quality samples. In addition, we show that vi) our method is able to incorporate different distance metrics, ones that lead to different mode interpretations (Appendix F.3); and vii) our pre-training step (described in Appendix C) further accelerates the training convergence in real datasets (Appendix F.2). Lastly, viii) we present some qualitative results (Appendix F.4).\n\nTable 1: Statistics of Experiments on Synthetic Datasets\nMethod | 2D Ring #modes (max 8) | 2D Ring W1 | 2D Ring low quality | 2D Grid #modes (max 25) | 2D Grid W1 | 2D Grid low quality | 2D Circle center captured | 2D Circle W1 | 2D Circle low quality\nGAN | 1.0 | 38.60 | 0.06% | 17.7 | 1.617 | 17.70% | No | 32.59 | 0.14%\nUnrolled | 7.6 | 4.678 | 12.03% | 14.9 | 2.231 | 95.11% | No | 0.360 | 0.50%\nVEEGAN | 8.0 | 4.904 | 13.23% | 24.4 | 0.836 | 22.84% | Yes | 0.466 | 10.72%\nPacGAN | 7.8 | 1.412 | 1.79% | 24.3 | 0.898 | 20.54% | Yes | 0.263 | 1.38%\nBourGAN | 8.0 | 0.687 | 0.12% | 25.0 | 0.248 | 4.09% | Yes | 0.081 | 0.35%\n\nFigure 5: Qualitative results on CelebA dataset using DCGAN (Left) and BourGAN (Right) under\nℓ2 metric. It appears that DCGAN generates some samples that are visually more implausible (e.g.,\nred boxes) in comparison to BourGAN. Results are fairly sampled at random, not cherry-picked.\n\nQuantitative evaluation. We compare BourGAN with other methods on three synthetic datasets:\neight 2D Gaussian distributions arranged in a ring (2D Ring), twenty-five 2D Gaussian distributions\narranged in a grid (2D Grid), and a circle surrounding a Gaussian placed in the center (2D Circle). The\nfirst two were used in previous methods [9, 10, 11], and the last is proposed by us. 
The quantitative performance of these methods is summarized in Table 1, where the column \u201c# of modes\u201d indicates the average number of modes captured by these methods, and \u201clow quality\u201d indicates the percentage of samples that are more than three standard deviations away from the mode centers. Both metrics are used in previous methods [10, 11]. For the 2D Circle case, we also check whether the central mode is captured by the methods. Notice that all these metrics measure how many modes are captured, but not how well the data distribution is captured. To understand the latter, we also compute the Wasserstein-1 distances between the distribution of generated samples and the data distribution (reported in Table 1). It is evident that our method performs the best on all these metrics (see Appendix F.2 for more details).\n\nAvoiding unwanted samples. A notable advantage offered by our method is the ability to avoid unwanted samples, ones that are located between the modes. We find that all four existing GANs suffer from this problem (see Figure 4), because they use a single Gaussian to draw latent vectors (recall Figure 1). In contrast, our method generates no unwanted samples in all three test cases. We refer the reader to Appendix F.3 for a detailed discussion of this feature and several other quantitative comparisons.\n\nQualitative results. We further test our algorithm on real image datasets. Figure 5 illustrates a qualitative comparison between DCGAN and our method, both using the same generator and discriminator architectures and default hyperparameters. Appendix F.4 includes more experiments and details.\n\n6 Conclusion\n\nThis paper introduces BourGAN, a new GAN variant aiming to address mode collapse in generator networks. In contrast to previous approaches, we draw latent space vectors using a Gaussian mixture, which is constructed through metric embeddings. 
Supported by theoretical analysis and experiments, our method enables a well-posed mapping between latent space and multi-modal data distributions. In the future, our embedding and Gaussian mixture sampling can also be readily combined with other GAN variants and even other generative models to leverage their advantages.\n\nAcknowledgements\n\nWe thank Daniel Hsu, Carl Vondrick and Henrique Maia for the helpful feedback. Chang Xiao and Changxi Zheng are supported in part by the National Science Foundation (CAREER-1453101, 1717178 and 1816041) and generous donations from SoftBank and Adobe. Peilin Zhong is supported in part by the National Science Foundation (CCF-1703925, CCF-1421161, CCF-1714818, CCF-1617955 and CCF-1740833), the Simons Foundation (#491119 to Alexandr Andoni) and a Google Research Award.\n\nReferences\n\n[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[2] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271\u2013279, 2016.\n\n[3] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.\n\n[4] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.\n\n[5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.\n\n[6] Jeff Donahue, Philipp Kr\u00e4henb\u00fchl, and Trevor Darrell. Adversarial feature learning. 
arXiv preprint arXiv:1605.09782, 2016.\n\n[7] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.\n\n[8] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[9] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.\n\n[10] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3310\u20133320, 2017.\n\n[11] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.\n\n[12] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.\n\n[13] Jean Bourgain. On lipschitz embedding of finite metric spaces in hilbert space. Israel Journal of Mathematics, 52(1-2):46\u201352, 1985.\n\n[14] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172\u20132180, 2016.\n\n[15] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.\n\n[16] Ashish Bora, Eric Price, and Alexandros G Dimakis. Ambientgan: Generative models from lossy measurements. In International Conference on Learning Representations (ICLR), 2018.\n\n[17] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. 
Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.\n\n[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.\n\n[19] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.\n\n[20] Christian Ledig, Lucas Theis, Ferenc Husz\u00e1r, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.\n\n[21] Jun-Yan Zhu, Philipp Kr\u00e4henb\u00fchl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597\u2013613. Springer, 2016.\n\n[22] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613\u2013621, 2016.\n\n[23] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82\u201390, 2016.\n\n[24] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813\u20132821. IEEE, 2017.\n\n[25] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769\u20135779, 2017.\n\n[26] Junbo Zhao, Michael Mathieu, and Yann LeCun. 
Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.\n\n[27] Yunus Saatci and Andrew G Wilson. Bayesian gan. In Advances in Neural Information Processing Systems, pages 3622\u20133631, 2017.\n\n[28] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.\n\n[29] Anders Boesen Lindbo Larsen, S\u00f8ren Kaae S\u00f8nderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.\n\n[30] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[31] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.\n\n[32] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776, 2017.\n\n[33] Ji\u0159\u00ed Matou\u0161ek. Embedding finite metric spaces into normed spaces. In Lectures on Discrete Geometry, pages 355\u2013400. Springer, 2002.\n\n[34] Nicolas Courty, R\u00e9mi Flamary, and M\u00e9lanie Ducoffe. Learning wasserstein embeddings. arXiv preprint arXiv:1710.07457, 2017.\n\n[35] Piotr Indyk and Ji\u0159\u00ed Matou\u0161ek. Low-distortion embeddings of finite metric spaces. Handbook of discrete and computational geometry, 37:46, 2004.\n\n[36] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579\u20132605, 2008.\n\n[37] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. 
In Advances in Neural Information Processing Systems, pages 5591\u20135600, 2017.\n\n[38] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.\n\n[39] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215\u2013245, 1995.\n\n[40] Tom Leighton and Satish Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In Foundations of Computer Science, 1988., 29th Annual Symposium on, pages 422\u2013431. IEEE, 1988.\n\n[41] Lucas Theis, A\u00e4ron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.\n\n[42] Ali Borji. Pros and cons of gan evaluation measures. arXiv preprint arXiv:1802.03446, 2018.\n\n[43] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnab\u00e1s P\u00f3czos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200\u20132210, 2017.\n\n[44] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Sch\u00f6lkopf. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5430\u20135439, 2017.\n\n[45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.\n\n[46] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[47] Sergey Ioffe and Christian Szegedy. 
Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[48] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[49] Friedrich Pukelsheim. The three sigma rule. The American Statistician, 48(2):88\u201391, 1994.\n\n[50] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.\n\n[51] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.", "award": [], "sourceid": 1146, "authors": [{"given_name": "Chang", "family_name": "Xiao", "institution": "Columbia University"}, {"given_name": "Peilin", "family_name": "Zhong", "institution": "Columbia University"}, {"given_name": "Changxi", "family_name": "Zheng", "institution": "Columbia University"}]}