{"title": "On GANs and GMMs", "book": "Advances in Neural Information Processing Systems", "page_first": 5847, "page_last": 5858, "abstract": "A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high resolution sampled images. At the same time, many authors have pointed out that GANs may fail to model the full distribution (\"mode collapse\") and that using the learned models for anything other than generating samples may be very difficult.\n\nIn this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. 
Finally, we discuss how GMMs can be used to generate sharp images.", "full_text": "On GANs and GMMs\n\nEitan Richardson\n\nYair Weiss\n\nSchool of Computer Science and Engineering\n\nSchool of Computer Science and Engineering\n\nThe Hebrew University of Jerusalem\n\nThe Hebrew University of Jerusalem\n\nJerusalem, Israel\n\neitanrich@cs.huji.ac.il\n\nJerusalem, Israel\n\nyweiss@cs.huji.ac.il\n\nAbstract\n\nA longstanding problem in machine learning is to \ufb01nd unsupervised methods that\ncan learn the statistical structure of high dimensional signals. In recent years,\nGANs have gained much attention as a possible solution to the problem, and in\nparticular have shown the ability to generate remarkably realistic high resolution\nsampled images. At the same time, many authors have pointed out that GANs\nmay fail to model the full distribution (\"mode collapse\") and that using the learned\nmodels for anything other than generating samples may be very dif\ufb01cult.\nIn this paper, we examine the utility of GANs in learning statistical models of\nimages by comparing them to perhaps the simplest statistical model, the Gaussian\nMixture Model. First, we present a simple method to evaluate generative models\nbased on relative proportions of samples that fall into predetermined bins. Unlike\nprevious automatic methods for evaluating models, our method does not rely\non an additional neural network nor does it require approximating intractable\ncomputations. Second, we compare the performance of GANs to GMMs trained\non the same datasets. While GMMs have previously been shown to be successful\nin modeling small patches of images, we show how to train them on full sized\nimages despite the high dimensionality. Our results show that GMMs can generate\nrealistic samples (although less sharp than those of GANs) but also capture the\nfull distribution, which GANs fail to do. 
Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how GMMs can be used to generate sharp images.¹

1 Introduction

Natural images take up only a tiny fraction of the space of possible images. Finding a way to explicitly model the statistical structure of such images is a longstanding problem with applications to engineering and to computational neuroscience. Given the abundance of training data, this would also seem a natural problem for unsupervised learning methods, and indeed many papers apply unsupervised learning to small patches of images [42, 4, 32]. Recent advances in deep learning have also enabled unsupervised learning of full sized images using various models: Variational Auto-Encoders [21, 17], PixelCNN [40, 39, 23, 38], Normalizing Flow [9, 8] and Flow GAN [14].²
Perhaps the most dramatic success in modeling full images has been achieved by Generative Adversarial Networks (GANs) [13], which can learn to generate remarkably realistic samples at high resolution [34, 26] (Fig. 1). 
At the same time, a recurring criticism of GANs is that while they are excellent at generating pretty pictures, they often fail to model the entire data distribution, a phenomenon usually referred to as mode collapse: "Because of the mode collapse problem, applications of GANs are often limited to problems where it is acceptable for the model to produce a small number of distinct outputs" [12] (see also [35, 29, 34, 26]). Another criticism is the lack of a robust and consistent evaluation method for GANs [18, 10, 28].

¹Code is available at https://github.com/eitanrich/gans-n-gmms
²Flow GAN discusses a full-image GMM, but does not actually learn a meaningful model: the authors use a "GMM consisting of m isotropic Gaussians with equal weights centered at each of the m training points".

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Samples from three datasets (first two rows) and samples generated by GANs (last two rows): CelebA - WGAN-GP, MNIST - DCGAN, SVHN - WGAN

Two evaluation methods that are widely accepted [28, 1] are Inception Score (IS) [34] and Fréchet Inception Distance (FID) [16]. Both methods rely on a deep network, pre-trained for classification, to provide a low-dimensional representation of the original and generated samples that can be compared statistically. There are two significant drawbacks to this approach: the deep representation is insensitive to image properties and artifacts that the underlying classification network is trained to be invariant to [28, 18], and when the evaluated domain (e.g. faces, digits) is very different from the dataset used to train the deep representation (e.g. ImageNet), the validity of the test is questionable [10, 28].

Another family of methods is designed with the specific goal of evaluating the diversity of the generated samples, regardless of the data distribution. 
Two examples are applying a perceptual multi-scale similarity metric (MS-SSIM) on random patches [31] and, based on the Birthday Paradox (BP), looking for the most similar pair of images in a batch [3]. While able to detect severe cases of mode collapse, these methods do not manage (or aim) to measure how well the generator captures the true data distribution [20].

Many unsupervised learning methods are evaluated using log likelihood on held out data [42], but applying this to GANs is problematic. First, since GANs by definition only output samples on a manifold within the high dimensional space, converting them into full probability models requires an arbitrary noise model [2]. Second, calculating the log likelihood for a GAN requires integrating out the latent variable, and this is intractable in high dimensions (although encouraging results have been obtained for smaller image sizes [41]). As an alternative to log likelihood, one could calculate the Wasserstein distance between generated samples and the training data, but this is again intractable in high dimensions, so approximations must be used [20].

Overall, the current situation is that while many authors criticize GANs for "mode collapse" and decry the lack of an objective evaluation measure, the focus of much of the current research is on improved learning procedures for GANs that will enable generating high quality images of increasing resolution, and papers often include sentences of the type "we feel the quality of the generated images is at least comparable to the best published results so far" [20].

The focus on the quality of the generated images has perhaps decreased the focus on the original question: to what extent are GANs learning useful statistical models of the data? In this paper, we try to address this question more directly by comparing GANs to perhaps the simplest statistical model, the Gaussian Mixture Model. 
First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss two methods in which sharp and realistic images can be generated with GMMs.

Figure 2: Our proposed evaluation method on a toy example in R^2. Top-left: The training data (blue) and binning result - Voronoi cells (numbered by bin size). Bottom-left: Samples (red) drawn from a GAN trained on the data. Right: Comparison of bin proportions between the training data and the GAN samples. Black lines = standard error (SE) values.

2 A New Evaluation Method

Our proposed evaluation method is based on a very simple observation: if we have two sets of samples and they both represent the same distribution, then the number of samples that fall into a given bin should be the same up to sampling noise. More formally, we define I_B(s) as an indicator function for bin B: I_B(s) = 1 if the sample s falls into the bin B and zero otherwise. Let {s^p_i} be N_p samples from distribution p and {s^q_j} be N_q samples from distribution q; then if p = q, we expect

(1/N_p) Σ_i I_B(s^p_i) ≈ (1/N_q) Σ_j I_B(s^q_j).

The decision whether the numbers of samples in a given bin are statistically different is a classic two-sample problem for Bernoulli variables [7]. We calculate the pooled sample proportion P (the proportion that falls into B in the joined sets) and its standard error: SE = sqrt( P(1 − P)[1/N_p + 1/N_q] ). The test statistic is the z-score: z = (P_p − P_q)/SE, where P_p and P_q are the proportions from each sample that fall into bin B. If the probability of the observed test statistic is smaller than a threshold (determined by the significance level), then the two proportions are statistically different. There is still the question of which bin to use to compare the two distributions. In high dimensions, a randomly chosen bin in a uniform grid is almost always going to be empty. We propose to use Voronoi cells. This guarantees that each bin is expected to contain some samples.

Our binning-based evaluation method is demonstrated in Fig. 2, using a toy example where the data is in R^2. We have a set of N_p training samples from the reference distribution p and a set of N_q samples with distribution q, generated by the model we wish to evaluate. To define the Voronoi cells, we perform K-means clustering of the N_p training samples into some arbitrary number of clusters K (K ≪ N_p, N_q). Each training sample s^p_i is assigned to one of the K cells (bins). We then assign each generated sample s^q_j to the nearest (in L2 distance) of the K centroids. We perform the two-sample test on each cell separately and report the number of statistically-different bins (NDB). 
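The procedure above can be condensed into a short NumPy sketch. This is a simplified illustration with our own function and variable names, not the authors' released implementation (available at the URL in footnote 1); the hard-coded threshold 1.96 assumes a two-sided normal-approximation test at significance level 0.05:

```python
import numpy as np

def ndb_score(train, samples, k=10, alpha_z=1.959964, iters=50, seed=0):
    """Number of statistically Different Bins (NDB), sketched:
    K-means bins on the training set, then a two-proportion z-test per bin.
    alpha_z is the two-sided z threshold (1.96 for significance 0.05)."""
    rng = np.random.default_rng(seed)
    # K-means on the training samples defines the Voronoi cells (bins).
    centroids = train[rng.choice(len(train), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((train[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = train[labels == j].mean(axis=0)

    def proportions(x):
        # Assign each sample to the nearest (L2) centroid; return bin shares.
        lab = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        return np.bincount(lab, minlength=k) / len(x)

    pp, pq = proportions(train), proportions(samples)
    n_p, n_q = len(train), len(samples)
    pooled = (pp * n_p + pq * n_q) / (n_p + n_q)      # pooled proportion P
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_p + 1 / n_q))
    z = np.abs(pp - pq) / np.maximum(se, 1e-12)
    return int(np.sum(z > alpha_z))                   # number of different bins
```

On a toy problem, two sample sets drawn from the same Gaussian give an NDB near the expected false-positive count, while a shifted Gaussian makes most bins statistically different.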
According to the classical theory of hypothesis testing, if the two samples do come from the same distribution, then the NDB score divided by K should, in expectation, equal the significance level (0.05 in our experiments).

Note that unlike the popular IS and FID, our NDB method is applied directly on the image pixels and does not rely on a representation learned for other tasks. This makes our metric domain agnostic and sensitive to image properties to which the deep representation is insensitive. Compared to MS-SSIM and BP, our method has the advantage of providing a metric between the data and generated distributions, and not just measuring the general diversity of the generated samples.

A possible concern about using Voronoi cells as bins is that this essentially treats images as vectors in pixel space, where L2 distance may not be meaningful. In the supplementary material we show that for the datasets we used, the bins are usually semantically meaningful. Even in cases where the bins do not correspond to semantic categories, we still expect a good generative model to preserve the statistics of the training set. Fig. 3 demonstrates the validity of NDB on a dataset with a more complex image structure, such as CIFAR-10, by comparing it to IS.

Figure 3: NDB (divided by K) vs Inception Score during training iterations of WGAN-GP on CIFAR-10 [24]. The two metrics correlate, except towards the end of the training, possibly indicating sensitivity to different image attributes.

3 Full Image Gaussian Mixture Model

In order to provide context on the utility of GANs in learning statistical models of images, we compare it to perhaps the simplest possible statistical model: the Gaussian Mixture Model (GMM) trained on the same datasets.
There are two possible concerns with training GMMs on full images. 
The first is the dimensionality. If we work with 64 × 64 color images then a single covariance matrix will have 7.5 × 10^7 free parameters, and during training we will need to store and invert matrices of this size. The second concern is the complexity of the distribution. While a GMM can approximate many densities with a sufficiently large number of Gaussians, it is easy to construct densities for which the number of Gaussians required grows exponentially with the dimension.

In order to address the computational concern, we use a GMM training algorithm whose memory and complexity grow linearly with dimension (not quadratically as in the standard GMM). Specifically, we use the Mixture of Factor Analyzers [11], as described in the next paragraph. Regarding the second concern, our experiments (section 4) show that for the tested datasets, a relatively modest number of components is sufficient to approximate the data distribution, despite the high dimensionality. Of course, this may not necessarily be true for every dataset.

Probabilistic PCA [37, 36] and Factor Analyzers [22, 11] both use a rectangular scale matrix A_{d×l} multiplying the latent vector z of dimension l ≪ d, which is sampled from a standard normal distribution. Both methods model a normal distribution on a low-dimensional subspace embedded in the full data space. For stability, isotropic (PPCA) or diagonal-covariance (Factor Analyzers) noise is added. We chose to use the more general setting of Factor Analyzers, which allows us to model higher noise variance in specific pixels (for example, pixels containing mostly background).

The model for a single Factor Analyzers component is:

x = Az + µ + ε ,  z ~ N(0, I) ,  ε ~ N(0, D) ,   (1)

where µ is the mean and ε is the added noise with a diagonal covariance D. This results in the Gaussian distribution x ~ N(µ, AA^T + D). 
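Equation (1) translates directly into a sampling routine. The following is a minimal NumPy sketch of drawing from a full MFA (function and variable names are ours, not the authors' code); the empirical covariance of the samples should approach AA^T + D:

```python
import numpy as np

def sample_mfa(pi, mus, As, Ds, n, seed=0):
    """Draw n samples from a Mixture of Factor Analyzers.
    pi: (K,) mixing weights; mus: (K, d) means; As: (K, d, l) scale
    matrices; Ds: (K, d) diagonal noise variances. Implements
    x = A z + mu + eps, z ~ N(0, I), eps ~ N(0, D), per equation (1)."""
    rng = np.random.default_rng(seed)
    K, d, l = As.shape
    comps = rng.choice(K, size=n, p=pi)              # component per sample
    z = rng.standard_normal((n, l))                  # latent factors
    eps = rng.standard_normal((n, d)) * np.sqrt(Ds[comps])  # diagonal noise
    return np.einsum('ndl,nl->nd', As[comps], z) + mus[comps] + eps
```

With a single component, the sample covariance converges to AA^T + D, which is a convenient sanity check for an implementation.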
The number of free parameters in a single Factor Analyzers component is d(l + 2), and K[d(l + 2) + 1] in a Mixture of Factor Analyzers (MFA) model with K components, where d and l are the data and latent dimensions.

3.1 Avoiding Inversion of Large Matrices

The log-likelihood of a set of N data points in a Mixture of K Factor Analyzers is:

L = Σ_{n=1}^{N} log Σ_{i=1}^{K} π_i P(x_n | µ_i, Σ_i) = Σ_{n=1}^{N} log Σ_{i=1}^{K} e^{log(π_i) + log P(x_n | µ_i, Σ_i)},   (2)

where π_i are the mixing coefficients. Because of the high dimensionality, we calculate the log of the normal probability, and the last expression is evaluated using a log-sum-exp operation over the K components.

The log-probability of a data point x given the component is evaluated as follows:

log P(x | µ, Σ) = −(1/2) [ d log(2π) + log det(Σ) + (x − µ)^T Σ^{−1} (x − µ) ]   (3)

Using the Woodbury matrix inversion lemma:

Σ^{−1} = (AA^T + D)^{−1} = D^{−1} − D^{−1} A (I + A^T D^{−1} A)^{−1} A^T D^{−1} = D^{−1} − D^{−1} A L^{−1}_{l×l} A^T D^{−1}   (4)

To avoid storing the d × d matrix Σ^{−1} and performing large matrix multiplications, we evaluate the Mahalanobis distance as follows (denoting x̂ = (x − µ)):

x̂^T Σ^{−1} x̂ = x̂^T [D^{−1} − D^{−1} A L^{−1} A^T D^{−1}] x̂ = x̂^T [D^{−1} x̂ − D^{−1} A L^{−1} (A^T D^{−1} x̂)]   (5)

The log-determinant is calculated using the matrix determinant lemma:

log det(AA^T + D) = log det(I + A^T D^{−1} A) + log det(D) = log det(L_{l×l}) + Σ_{j=1}^{d} log d_j   (6)

Using equations 4 - 6, the complexity of the log-likelihood computation is linear in 
the image dimension d, allowing us to train the MFA model efficiently on full-image datasets.

Rather than using EM [22, 11] (which is problematic with large datasets), we decided to optimize the log-likelihood (equation 2) using Stochastic Gradient Descent and utilize available differentiable programming frameworks [1] that perform the optimization on GPU. The model is initialized by K-means clustering of the data, followed by estimating the Factor Analyzers parameters for each component separately. The supplementary material provides additional details about the training process.

4 Experiments

We conduct our experiments on three popular datasets of natural images: CelebA [27] (aligned, cropped and resized to 64×64), SVHN [30] and MNIST [25]. On these three datasets we compare the MFA model to the following generative models: GANs (DCGAN [33], BEGAN [5] and WGAN [2]; on the more challenging CelebA dataset we also compared to WGAN-GP [15]) and Variational Auto-encoders (VAE [21], VAE-DFC [17]). We compare the GMM model to the GAN models along three attributes: (1) visual quality of samples, (2) our quantitative NDB score, and (3) ability to capture the statistical structure and perform efficient inference.

Random samples from our MFA models trained on the three datasets are shown in Fig. 4. Although the results are not as sharp as the GAN samples, the images look realistic and diverse. As discussed earlier, one of the concerns about GMMs is the number of components required. In the supplementary material, we show the log-likelihood of the test set and the quality of a reconstructed random test image as a function of the number of components. As can be seen, they both converge with a relatively small number of components.

We now turn to comparing the models using our proposed new evaluation metric. We trained all models, generated 20,000 new samples and evaluated them using our evaluation method (section 2). 
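As an aside, the linear-complexity likelihood evaluation of Section 3.1 (equations 3-6), which makes this training feasible, fits in a few lines of NumPy. The sketch below is a simplified single-component illustration with our own function names, checked against a naive dense computation; it never forms the d × d covariance:

```python
import numpy as np

def fa_logpdf(X, mu, A, Dvec):
    """Per-sample log N(x; mu, A A^T + D) in O(d l^2) time and memory,
    via the Woodbury identity and the matrix determinant lemma
    (equations 3-6). Dvec holds the diagonal of D."""
    d, l = A.shape
    Xc = X - mu                                  # (n, d), the x-hat of eq. (5)
    Dinv = 1.0 / Dvec                            # D^{-1} as a vector
    L = np.eye(l) + (A.T * Dinv) @ A             # L = I + A^T D^{-1} A, (l, l)
    u = Xc * Dinv                                # D^{-1} x-hat, (n, d)
    v = np.linalg.solve(L, (u @ A).T).T          # L^{-1} (A^T D^{-1} x-hat)
    # Mahalanobis term of eq. (5): x^T D^{-1} x - (A^T D^{-1} x)^T L^{-1} (A^T D^{-1} x)
    maha = (Xc * u).sum(-1) - ((u @ A) * v).sum(-1)
    # Eq. (6): log det(A A^T + D) = log det L + sum_j log d_j
    logdet = np.linalg.slogdet(L)[1] + np.log(Dvec).sum()
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
```

Batching this over K components and combining with a log-sum-exp gives the mixture log-likelihood of equation 2.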
Tables 1 - 3 present the evaluation scores for 20,000 samples from each model. We also included, for reference, the score of 20,000 samples from the training and test sets. The simple MFA model has the best (lowest) score for all values of K. Note that neither the bins nor the number of bins is known to any of the generative models. The evaluation result is consistent over multiple runs and is insensitive to the specific NDB clustering mechanism (e.g. replacing K-means with agglomerative clustering). In addition, initializing MFA differently (e.g. with k-subspaces or random models) makes the NDB scores slightly worse, but still better than most GANs.

The results show clear evidence of mode collapse (large distortion from the train bin-proportions) in BEGAN and DCGAN, and some distortion in WGAN. The improved training in WGAN-GP seems to reduce the distortion.

Figure 4: Random samples generated by our MFA model trained on CelebA, MNIST and SVHN

Figure 5: Examples of mode collapse in BEGAN trained on CelebA, showing three over-allocated bins and three under-allocated ones. The first image in each bin is the cell centroid (marked in red).

Table 1: Bin-proportions NDB/K scores for different models trained on CelebA, using 20,000 samples from each model or set, for different numbers of bins (K). 
The listed values are NDB – the number of statistically different bins, with significance level 0.05, divided by the number of bins K (lower is better).

MODEL              K=100  K=200  K=300
TRAIN               0.01   0.03   0.03
TEST                0.12   0.07   0.08
MFA                 0.21   0.12   0.16
MFA+pix2pix         0.34   0.34   0.33
ADVERSARIAL MFA     0.33   0.30   0.22
VAE                 0.78   0.73   0.72
VAE-DFC             0.77   0.65   0.62
DCGAN               0.68   0.69   0.65
BEGAN               0.94   0.85   0.82
WGAN                0.76   0.66   0.62
WGAN-GP             0.42   0.32   0.27

Table 2: NDB/K scores for MNIST

MODEL    K=100  K=200  K=300
TRAIN     0.06   0.04   0.05
MFA       0.14   0.13   0.14
DCGAN     0.41   0.38   0.46
WGAN      0.16   0.20   0.21

Table 3: NDB/K scores for SVHN

MODEL    K=100  K=200  K=300
TRAIN     0.03   0.03   0.03
MFA       0.32   0.23   0.24
DCGAN     0.78   0.74   0.76
WGAN      0.87   0.83   0.82

Figure 6: (a) Examples of learned MFA components trained on CelebA and MNIST: the mean image (µ) and noise variance (D) are shown on top. Each row represents a column-vector of the rectangular scale matrix A – the learned changes from the mean (showing vectors 1-5 of 10). The three images shown in row i are: µ + A(i), 0.5 + A(i), µ − A(i). (b) Combinations of two column-vectors (A(i), A(j)): z_i changes with the horizontal axis and z_j with the vertical axis, controlling the combination. Both variables are zero in the central image, showing the component mean.

Our evaluation method can provide visual insight into the mode collapse problem. Fig. 5 shows random samples generated by BEGAN that were assigned to over-allocated and under-allocated bins. As can be seen, each bin represents some prototype, and the GAN failed to generate samples belonging to some of them. 
Note that the simple binning process (in the original image space) captures both semantic properties such as sunglasses and hats, and physical properties such as colors and pose. Interestingly, our metric also reveals that the VAE suffers from "mode collapse" on this dataset.

Finally, we compare the models in terms of disentangling the manifold and the ability to perform inference. It has often been reported that the latent representation z in most GANs does not correspond to meaningful directions on the statistical manifold [6] (see supplementary material for a demonstration in 2D). Fig. 6(a) shows that, in contrast, in the learned MFA model both the components and the directions are meaningful. For CelebA, two of 1000 learned components are shown, each having a latent dimension l of 10. Each component represents some prototype, and the learned column-vectors of the rectangular scale matrix A represent changes from the mean image, which span the component on a 10-dimensional subspace in the full image dimension of 64 × 64 × 3 = 12,288. As can be seen, the learned vectors affect different aspects of the represented faces, such as facial hair, glasses, illumination direction, and hair color and style. For MNIST, we learned 256 components with a latent dimension of 4. Each component typically learns a digit, and the vectors affect different style properties, such as the angle and the horizontal stroke in the digit 7. Very different styles of the same digit will be represented by different components.

The latent variable z controls the combination of column-vectors added to the mean image. As shown in Fig. 6(a), adding a column-vector to the mean with either a positive or a negative sign results in a realistic image. 
In fact, since the latent variable z is sampled from a standard-normal (iid) distribution, any linear combination of column vectors from the component should result in a realistic image, as guaranteed by the log-likelihood training objective. This property is demonstrated in Fig. 6(b). Even though the manifold of face images is very nonlinear, the GMM successfully models it as a combination of local linear manifolds. Additional examples appear in the supplementary material.

As discussed earlier, one of the main advantages of an explicit model is the ability to calculate the likelihood and perform different inference tasks. Fig. 7(a) shows images from CelebA that have low likelihood according to the MFA. Our model managed to detect outliers. Fig. 7(b) demonstrates the task of image reconstruction from partially observed data (in-painting). For both tasks, the MFA model provides a closed-form expression – no optimization or re-training is needed. Both in-painting and calculation of log likelihood using the GAN models are difficult and require special-purpose approximations.

Figure 7: Inference using the explicit MFA model: (a) Samples from the 100 images in CelebA with the lowest likelihood given our MFA model (outliers). (b) Image reconstruction – in-painting: In each row, the original image is shown first, and then pairs of partially-visible image and reconstruction of the missing (black) part conditioned on the observed part.

Figure 8: Pairs of random samples from our MFA model, resized to 128x128 pixels, and the matching samples generated by the conditional pix2pix model (more detailed)

5 Generating Sharp Images with GMMs

Summarizing our previous results, GANs are better than GMMs at generating sharp images, while GMMs are better at actually capturing the statistical structure and enabling efficient inference. Can GMMs produce sharp images? 
In this section we discuss two different approaches that achieve that. In addition to evaluating the sharpness subjectively, we use a simple sharpness measure: the relative energy of high-pass filtered versions of a set of images (more details in the supplementary material). The sharpness value (higher is sharper) for the original CelebA images is −3.4, for WGAN-GP samples −3.9, and for MFA samples −5.4 (indicating that GMM samples are indeed much less sharp). A trivial way of increasing the sharpness of the GMM samples is to increase the number of components: by increasing this number by a factor of 20 we obtain samples of sharpness similar to that of GANs (−4.0), but this clearly overfits to the training data. Can a GMM obtain similar sharpness values without overfitting?

5.1 Pairing GMM with a Conditional GAN

We experiment with the idea of combining the benefits of a GMM with the fine details of a GAN in order to generate sharp images while remaining faithful to the data distribution. A pix2pix conditional GAN [19] is trained to take samples from our MFA as input and make them more realistic (sharpen, add details) without modifying the global structure.

We first train our MFA model and then generate for each training sample a matching image from our model: for each real image x, we find the most likely component c and a latent variable z that maximizes the posterior probability P(z|x, µ_c, Σ_c). We then generate x̂ = A_c z + µ_c. This is equivalent to projecting the training image onto the component subspace and bringing it closer to the mean. We then train a pix2pix model on pairs {x, x̂} for the task of converting x̂ to x. x̂ can be resized to any arbitrary size. 
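For a Factor Analyzers component this projection has a closed form: the posterior P(z|x, µ, Σ) is Gaussian with mean (I + A^T D^{-1} A)^{-1} A^T D^{-1}(x − µ). A minimal single-component sketch (the function name is ours, not from the authors' code):

```python
import numpy as np

def fa_project(x, mu, A, Dvec):
    """MAP latent for one Factor Analyzers component and its reconstruction:
    z* = (I + A^T D^{-1} A)^{-1} A^T D^{-1} (x - mu),  x_hat = A z* + mu.
    This is the projection used to build {x, x_hat} pairs for pix2pix.
    Dvec holds the diagonal of D."""
    Dinv = 1.0 / Dvec
    L = np.eye(A.shape[1]) + (A.T * Dinv) @ A        # I + A^T D^{-1} A
    z_star = np.linalg.solve(L, A.T @ (Dinv * (x - mu)))
    return A @ z_star + mu, z_star
```

When x lies exactly on the component subspace and the noise variance D is small, the projection recovers x almost perfectly; with larger D, z* is shrunk toward zero, which is the "bringing it closer to the mean" effect described above.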
At run time, the learned deterministic pix2pix transformation is applied to new images sampled from the GMM model to generate matching fine-detailed images. Higher-resolution samples generated by our MFA+pix2pix models are shown in Fig. 8 and in the supplementary material. As can be seen in Fig. 8, pix2pix adds fine details without affecting the global structure dictated by the MFA model sample. The measured sharpness of MFA+pix2pix samples is −3.5 – similar to the sharpness level of the original dataset images. At the same time, the NDB scores become worse (Table 1).

5.2 Adversarial GMM Training

GANs and GMMs differ both in the generative model and in the way it is learned. The GAN Generator is a deep non-linear transformation from latent to image space. In contrast, each GMM component is a simple linear transformation (Az + µ). GANs are trained in an adversarial manner in which the Discriminator neural network provides the loss, while GMMs are trained by explicitly maximizing the likelihood. Which of these two differences explains the difference in generated image sharpness? We try to answer this question by training a GMM in an adversarial manner.

To train a GMM adversarially, we replaced the WGAN-GP Generator network with a GMM Generator: x = Σ_{i=1}^{K} c_i (A_i z_1 + µ_i + D_i z_2), where A_i, µ_i and D_i are the component scale matrix, mean and noise variance. z_1 and z_2 are two noise inputs, and c_i is a one-hot random variable drawn from a multinomial distribution controlled by the mixing coefficients π. All component outputs are generated in parallel and are then multiplied by the one-hot vector, ensuring that the output of only one component reaches the Generator output. The Discriminator block and the training procedure are unchanged.

Figure 9: Samples generated by adversarially-trained MFA (500 components)

As can be seen in Fig. 
9, samples produced by the adversarial GMM are as sharp and realistic as GAN samples. The sharpness value of these samples is −3.8 (slightly better than WGAN-GP). Unfortunately, NDB evaluation shows that, like GANs, the adversarial GMM suffers from mode collapse, and furthermore the log likelihood this MFA model assigns to the data is far worse than with traditional maximum-likelihood training. Interestingly, early in the adversarial training process, the GMM Generator decreases the noise variance parameters D_i, effectively "turning off" the added noise.

6 Conclusion

The abundance of training data along with advances in deep learning have enabled learning generative models of full images. GANs have proven to be tremendously popular due to their ability to generate high quality images, despite repeated reports of "mode collapse" and despite the difficulty of performing explicit inference with them. In this paper we investigated the utility of GANs for learning statistical models of images by comparing them to the humble Gaussian Mixture Model. We showed that it is possible to efficiently train GMMs on the same datasets that are usually used with GANs. We showed that the GMM also generates realistic samples (although not as sharp as the GAN samples), but unlike GANs it does an excellent job of capturing the underlying distribution and provides explicit representation of the statistical structure.

We do not mean to suggest that Gaussian Mixture Models are the ultimate solution to the problem of learning models of full images. 
Nevertheless, the success of such a simple model motivates the search for more elaborate statistical models that still allow efficient inference and accurate representation of statistical structure, even at the expense of not generating the prettiest pictures.

Acknowledgments

Supported by the Israeli Science Foundation and the Gatsby Foundation.
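To make the GMM Generator of Section 5.2 concrete, the following is a minimal NumPy sketch of one forward pass, x = Σ_{i=1}^{K} ci(Aiz1 + µi + Diz2). It is illustrative only: the function name, shapes, and the choice to model each Di as a per-pixel diagonal scale are assumptions made for this sketch, not details of our TensorFlow implementation.

```python
import numpy as np

def gmm_generator_sample(A, mu, D, pi, rng):
    """Draw one sample from the GMM Generator x = sum_i c_i (A_i z1 + mu_i + D_i z2).

    A:  (K, d, l) component scale matrices
    mu: (K, d)    component means
    D:  (K, d)    per-pixel noise scales (diagonal, an assumption of this sketch)
    pi: (K,)      mixing coefficients
    """
    K, d, l = A.shape
    z1 = rng.standard_normal(l)      # latent noise input
    z2 = rng.standard_normal(d)      # additive noise input
    c = rng.multinomial(1, pi)       # one-hot component selector, shape (K,)
    # All component outputs are computed in parallel, shape (K, d) ...
    outputs = A @ z1 + mu + D * z2
    # ... then multiplied by the one-hot vector so only one component survives.
    return c @ outputs

# Toy usage: K=3 components, d=4 output dimensions, l=2 latent dimensions.
rng = np.random.default_rng(0)
K, d, l = 3, 4, 2
A = rng.standard_normal((K, d, l))
mu = rng.standard_normal((K, d))
D = 0.1 * np.ones((K, d))
pi = np.array([0.5, 0.3, 0.2])
x = gmm_generator_sample(A, mu, D, pi, rng)
print(x.shape)  # (4,)
```

In the adversarial setting described above, this sampling function simply replaces the WGAN-GP Generator network, with A, mu, D and pi as the trainable parameters; the Discriminator and training loop are untouched.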