{"title": "Adversarial Symmetric Variational Autoencoder", "book": "Advances in Neural Information Processing Systems", "page_first": 4330, "page_last": 4339, "abstract": "A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data. Lower bounds are learned for marginal log-likelihood fits observed data and latent codes. When learning with the variational bound, one seeks to minimize the symmetric Kullback-Leibler divergence of joint density functions from (i) and (ii), while simultaneously seeking to maximize the two marginal log-likelihoods. To facilitate learning, a new form of adversarial training is developed. An extensive set of experiments is performed, in which we demonstrate state-of-the-art data reconstruction and generation on several image benchmarks datasets.", "full_text": "Adversarial Symmetric Variational Autoencoder\n\nYunchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li\n\nand Lawrence Carin\n\nDepartment of Electrical and Computer Engineering, Duke University\n\n{yp42, ww109, r.henao, lc267, zg27,cl319, lcarin}@duke.edu\n\nAbstract\n\nA new form of variational autoencoder (VAE) is developed, in which the joint\ndistribution of data and codes is considered in two (symmetric) forms: (i) from\nobserved data fed through the encoder to yield codes, and (ii) from latent codes\ndrawn from a simple prior and propagated through the decoder to manifest data.\nLower bounds are learned for marginal log-likelihood \ufb01ts observed data and latent\ncodes. When learning with the variational bound, one seeks to minimize the\nsymmetric Kullback-Leibler divergence of joint density functions from (i) and (ii),\nwhile simultaneously seeking to maximize the two marginal log-likelihoods. 
To\nfacilitate learning, a new form of adversarial training is developed. An extensive\nset of experiments is performed, in which we demonstrate state-of-the-art data\nreconstruction and generation on several image benchmark datasets.\n\n1\n\nIntroduction\n\nRecently there has been increasing interest in developing generative models of data, offering the\npromise of learning based on the often vast quantity of unlabeled data. With such learning, one\ntypically seeks to build rich, hierarchical probabilistic models that are able to \ufb01t to the distribution of\ncomplex real data, and are also capable of realistic data synthesis.\nGenerative models are often characterized by latent variables (codes), and the variability in the codes\nencompasses the variation in the data [1, 2]. The generative adversarial network (GAN) [3] employs\na generative model in which the code is drawn from a simple distribution (e.g., isotropic Gaussian),\nand then the code is fed through a sophisticated deep neural network (decoder) to manifest the data.\nIn the context of data synthesis, GANs have shown tremendous capabilities in generating realistic,\nsharp images from models that learn to mimic the structure of real data [3, 4, 5, 6, 7, 8]. The quality\nof GAN-generated images has been evaluated by somewhat ad hoc metrics like inception score [9].\nHowever, the original GAN formulation does not allow inference of the underlying code, given\nobserved data. This makes it dif\ufb01cult to quantify the quality of the generative model, as it is not\npossible to compute the quality of model \ufb01t to data. To provide a principled quantitative analysis of\nmodel \ufb01t, not only should the generative model synthesize realistic-looking data, one also desires the\nability to infer the latent code given data (using an encoder). 
Recent GAN extensions [10, 11] have sought to address this limitation by learning an inverse mapping (encoder) to project data into the latent space, achieving encouraging results on semi-supervised learning. However, these methods still fail to obtain faithful reproductions of the input data, partly due to model underfitting when learning from a fully adversarial objective [10, 11].\n\nVariational autoencoders (VAEs) are designed to learn both an encoder and decoder, leading to excellent data reconstruction and the ability to quantify a bound on the log-likelihood fit of the model to data [12, 13, 14, 15, 16, 17, 18, 19]. In addition, the inferred latent codes can be utilized in downstream applications, including classification [20] and image captioning [21]. However, new images synthesized by VAEs tend to be unspecific and/or blurry, with relatively low resolution. These limitations of VAEs are becoming increasingly understood. Specifically, the traditional VAE seeks to maximize a lower bound on the log-likelihood of the generative model, and therefore VAEs inherit the limitations of maximum-likelihood (ML) learning [22]. In particular, in ML-based learning one optimizes the (one-way) Kullback-Leibler (KL) divergence between the distribution of the underlying data and the distribution of the model; such learning does not penalize a model that is capable of generating data that are different from that used for training.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nBased on the above observations, it is desirable to build a generative-model learning framework with which one can compute and assess the log-likelihood fit to real (observed) data, while also being capable of generating synthetic samples of high realism. Since GANs and VAEs have complementary strengths, their integration appears desirable, with this a principal contribution of this paper. 
While integration seems natural, we make important changes to both the VAE and GAN setups, to leverage the best of both. Specifically, we develop a new form of the variational lower bound, manifested jointly for the expected log-likelihood of the observed data and for the latent codes. Optimizing this variational bound involves maximizing the expected log-likelihood of the data and codes, while simultaneously minimizing a symmetric KL divergence involving the joint distribution of data and codes. To compute parts of this variational lower bound, a new form of adversarial learning is invoked. The proposed framework is termed Adversarial Symmetric VAE (AS-VAE), since within the model (i) the data and codes are treated in a symmetric manner, (ii) a symmetric form of KL divergence is minimized when learning, and (iii) adversarial training is utilized. To illustrate the utility of AS-VAE, we perform an extensive set of experiments, demonstrating state-of-the-art data reconstruction and generation on several benchmark datasets.\n\n2 Background and Foundations\n\nConsider an observed data sample x, modeled as being drawn from p\u03b8(x|z), with model parameters \u03b8 and latent code z. The prior distribution on the code is denoted p(z), typically a distribution that is easy to draw from, such as an isotropic Gaussian. The posterior distribution on the code given data x is p\u03b8(z|x), and since this is typically intractable, it is approximated as q\u03c6(z|x), parameterized by learned parameters \u03c6. Conditional distributions q\u03c6(z|x) and p\u03b8(x|z) are typically designed such that they are easily sampled and, for flexibility, modeled in terms of neural networks [12]. Since z is a latent code for x, q\u03c6(z|x) is also termed a stochastic encoder, with p\u03b8(x|z) a corresponding stochastic decoder. 
The observed data are assumed drawn from q(x), for which we do not have an explicit form, but from which we have samples, i.e., the ensemble {xi}i=1,N used for learning.\n\nOur goal is to learn the model p\u03b8(x) = \u222b p\u03b8(x|z)p(z)dz such that it synthesizes samples that are well matched to those drawn from q(x). We simultaneously seek to learn a corresponding encoder q\u03c6(z|x) that is both accurate and efficient to implement. Samples x are synthesized via x \u223c p\u03b8(x|z) with z \u223c p(z); z \u223c q\u03c6(z|x) provides an efficient coding of observed x, which may be used for other purposes (e.g., classification or caption generation when x is an image [21]).\n\n2.1 Traditional Variational Autoencoders and Their Limitations\n\nMaximum likelihood (ML) learning of \u03b8 based on direct evaluation of p\u03b8(x) is typically intractable. The VAE [12, 13] seeks to bound p\u03b8(x) by maximizing the variational expression LVAE(\u03b8, \u03c6) with respect to parameters {\u03b8, \u03c6}, where\n\nLVAE(\u03b8, \u03c6) = Eq\u03c6(x,z) log[p\u03b8(x, z)/q\u03c6(z|x)] = Eq(x)[log p\u03b8(x) \u2212 KL(q\u03c6(z|x)\u2016p\u03b8(z|x))]   (1)\n= \u2212KL(q\u03c6(x, z)\u2016p\u03b8(x, z)) + const ,   (2)\n\nwith expectations Eq\u03c6(x,z) and Eq(x) performed approximately via sampling. Specifically, to evaluate Eq\u03c6(x,z) we draw a finite set of samples zi \u223c q\u03c6(zi|xi), with xi \u223c q(x) denoting the observed data, and for Eq(x) we directly use observed data xi \u223c q(x). When learning {\u03b8, \u03c6}, the expectation using samples from zi \u223c q\u03c6(zi|xi) is implemented via the \u201creparametrization trick\u201d [12].\n\nMaximizing LVAE(\u03b8, \u03c6) wrt {\u03b8, \u03c6} provides a lower bound on (1/N) \u2211_{i=1}^{N} log p\u03b8(xi), hence the VAE setup is an approximation to ML learning of \u03b8. 
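As a concrete illustration of the bound in (1)-(2) and the reparametrization trick, the following sketch estimates LVAE for a toy one-dimensional linear-Gaussian model; the closed-form densities and all parameter values here are illustrative assumptions, not the neural-network parameterizations used in the paper:

```python
import math
import random

def log_normal(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def elbo_estimate(x, theta, enc_mean, enc_std, n_samples=10000, seed=0):
    """Monte Carlo estimate of L_VAE = E_{q(z|x)}[log p(x|z) + log p(z) - log q(z|x)],
    with z drawn via the reparametrization z = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = enc_mean + enc_std * eps                  # reparametrization trick
        log_px_z = log_normal(x, theta * z, 1.0)      # decoder p_theta(x|z) = N(theta*z, 1)
        log_pz = log_normal(z, 0.0, 1.0)              # prior p(z) = N(0, 1)
        log_qz_x = log_normal(z, enc_mean, enc_std)   # encoder q_phi(z|x)
        total += log_px_z + log_pz - log_qz_x
    return total / n_samples

# For this linear-Gaussian model the exact marginal is p_theta(x) = N(0, theta^2 + 1),
# so the bound can be compared against the true log-likelihood it approximates.
x_obs, theta = 1.0, 2.0
bound = elbo_estimate(x_obs, theta, enc_mean=0.4, enc_std=0.45)
log_px = log_normal(x_obs, 0.0, math.sqrt(theta ** 2 + 1.0))
print(round(bound, 3), round(log_px, 3))  # the bound closely approximates, and cannot exceed, log p(x)
```

Here the encoder parameters were chosen near the true posterior (mean 0.4, std 1/sqrt(5)), so the bound is nearly tight; a poorly chosen encoder would leave a visible gap.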
Learning \u03b8 based on (1/N) \u2211_{i=1}^{N} log p\u03b8(xi) is equivalent to learning \u03b8 based on minimizing KL(q(x)\u2016p\u03b8(x)), again implemented in terms of the N observed samples of q(x). As discussed in [22], such learning does not penalize \u03b8 severely for yielding x of relatively high probability in p\u03b8(x) while being simultaneously of low probability in q(x). This means that \u03b8 seeks to match p\u03b8(x) to the properties of the observed data samples, but p\u03b8(x) may also have high probability of generating samples that do not look like data drawn from q(x). This is a fundamental limitation of ML-based learning [22], inherited by the traditional VAE in (1).\n\nOne reason for the failing of ML-based learning of \u03b8 is that the cumulative posterior on latent codes \u222b p\u03b8(z|x)q(x)dx \u2248 \u222b q\u03c6(z|x)q(x)dx = q\u03c6(z) is typically different from p(z), which implies that x \u223c p\u03b8(x|z), with z \u223c p(z), may yield samples x that are different from those generated from q(x). Hence, when learning {\u03b8, \u03c6} one may seek to match p\u03b8(x) to samples of q(x), as done in (1), while simultaneously matching q\u03c6(z) to samples of p(z). The expression in (1) provides a variational bound for matching p\u03b8(x) to samples of q(x), thus one may naively think to simultaneously set a similar variational expression for q\u03c6(z), with these two variational expressions optimized jointly. However, to compute this additional variational expression we require an analytic expression for q\u03c6(x, z) = q\u03c6(z|x)q(x), which also means we need an analytic expression for q(x), which we do not have.\n\nExamining (2), we also note that LVAE(\u03b8, \u03c6) approximates \u2212KL(q\u03c6(x, z)\u2016p\u03b8(x, z)), which has limitations aligned with those discussed above for ML-based learning of \u03b8. 
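The asymmetry just discussed can be made concrete with a small numerical example; the discrete distributions below are hypothetical, chosen only to illustrate how mildly the ML direction of the KL penalizes a model that spreads mass where the data have none:

```python
import math

def kl(q, p):
    """KL(q || p) = sum_i q_i log(q_i / p_i) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# "Data" distribution: essentially all mass on the first two outcomes
# (a tiny epsilon of mass keeps every logarithm finite).
q_data = [0.5 - 1e-12, 0.5 - 1e-12, 1e-12, 1e-12]
# Model that wastes half of its probability on outcomes the data never exhibit.
p_model = [0.25, 0.25, 0.25, 0.25]

forward = kl(q_data, p_model)   # the direction minimized by ML / the VAE bound
reverse = kl(p_model, q_data)   # the direction that punishes "impossible" samples

print(round(forward, 3))  # ~0.693: ML barely objects to the wasted model mass
print(round(reverse, 3))  # ~12.8: the reverse KL objects strongly
```

The two directions disagree by more than an order of magnitude here, which is the motivation for also considering the reverse KL below.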
Analogous to the above discussion, we would also like to consider \u2212KL(p\u03b8(x, z)\u2016q\u03c6(x, z)). So motivated, in Section 3 we develop a new form of variational lower bound, applicable to maximizing (1/N) \u2211_{i=1}^{N} log p\u03b8(xi) and (1/M) \u2211_{j=1}^{M} log q\u03c6(zj), where zj \u223c p(z) is the j-th of M samples from p(z). We demonstrate that this new framework leverages both KL(p\u03b8(x, z)\u2016q\u03c6(x, z)) and KL(q\u03c6(x, z)\u2016p\u03b8(x, z)), by extending ideas from adversarial networks.\n\n2.2 Adversarial Learning\n\nThe original idea of the GAN [3] was to build an effective generative model p\u03b8(x|z), with z \u223c p(z), as discussed above. There was no desire to simultaneously design an inference network q\u03c6(z|x). More recently, authors [10, 11, 23] have devised adversarial networks that seek both p\u03b8(x|z) and q\u03c6(z|x). As an important example, Adversarially Learned Inference (ALI) [10] considers the following objective function:\n\nmin_{\u03b8,\u03c6} max_{\u03c8}  LALI(\u03b8, \u03c6, \u03c8) = Eq\u03c6(x,z)[log \u03c3(f\u03c8(x, z))] + Ep\u03b8(x,z)[log(1 \u2212 \u03c3(f\u03c8(x, z)))] ,   (3)\n\nwhere the expectations are approximated with samples, as in (1). The function f\u03c8(x, z), termed a discriminator, is typically implemented using a neural network with parameters \u03c8 [10, 11]. Note that in (3) we need only sample from p\u03b8(x, z) = p\u03b8(x|z)p(z) and q\u03c6(x, z) = q\u03c6(z|x)q(x), avoiding the need for an explicit form for q(x).\n\nThe framework in (3) can, in theory, match p\u03b8(x, z) and q\u03c6(x, z), by finding a Nash equilibrium of their respective non-convex objectives [3, 9]. However, training of such adversarial networks is typically based on stochastic gradient descent, which is designed to find a local mode of a cost function, rather than locating an equilibrium [9]. 
This objective mismatch may lead to the well-known instability issues associated with GAN training [9, 22].\n\nTo alleviate this problem, some researchers add a regularization term, such as a reconstruction loss [24, 25, 26] or mutual information [4], to the GAN objective, to restrict the space of suitable mapping functions, thus avoiding some of the failure modes of GANs, i.e., mode collapsing. Below we will formally match the joint distributions as in (3), and reconstruction-based regularization will be manifested by generalizing the VAE setup via adversarial learning. Toward this goal we consider the following lemma, which is analogous to Proposition 1 in [3, 23].\n\nLemma 1 Consider random variables (RVs) x and z with joint distributions p(x, z) and q(x, z). The optimal discriminator D\u2217(x, z) = \u03c3(f\u2217(x, z)) for the objective\n\nmax_{f}  Ep(x,z)[log \u03c3(f(x, z))] + Eq(x,z)[log(1 \u2212 \u03c3(f(x, z)))] ,   (4)\n\nis f\u2217(x, z) = log p(x, z) \u2212 log q(x, z).\n\nUnder Lemma 1, we are able to estimate log q\u03c6(x, z) \u2212 log p\u03b8(x)p(z) and log p\u03b8(x, z) \u2212 log q(x)q\u03c6(z) using the following corollary.\n\nCorollary 1.1 For RVs x and z with encoder joint distribution q\u03c6(x, z) = q(x)q\u03c6(z|x) and decoder joint distribution p\u03b8(x, z) = p(z)p\u03b8(x|z), consider the following objectives:\n\nmax_{\u03c81}  LA1(\u03c81) = Ex\u223cq(x),z\u223cq\u03c6(z|x)[log \u03c3(f\u03c81(x, z))] + Ex\u223cp\u03b8(x|z\u2032),z\u2032\u223cp(z),z\u223cp(z)[log(1 \u2212 \u03c3(f\u03c81(x, z)))] ,   (5)\n\nmax_{\u03c82}  LA2(\u03c82) = Ez\u223cp(z),x\u223cp\u03b8(x|z)[log \u03c3(f\u03c82(x, z))] + Ez\u223cq\u03c6(z|x\u2032),x\u2032\u223cq(x),x\u223cq(x)[log(1 \u2212 \u03c3(f\u03c82(x, z)))] .   (6)\n\nIf the parameters \u03c6 and \u03b8 are fixed, with f\u03c81\u2217 the optimal discriminator for (5) and f\u03c82\u2217 the optimal discriminator for (6), then\n\nf\u03c81\u2217(x, z) = log q\u03c6(x, z) \u2212 log p\u03b8(x)p(z) ,   f\u03c82\u2217(x, z) = log p\u03b8(x, z) \u2212 log q\u03c6(z)q(x) .   (7)\n\nThe proof is provided in Appendix A. We also assume in Corollary 1.1 that f\u03c81(x, z) and f\u03c82(x, z) are sufficiently flexible such that there are parameters \u03c81\u2217 and \u03c82\u2217 capable of achieving the equalities in (7). Toward that end, f\u03c81 and f\u03c82 are implemented as \u03c81- and \u03c82-parameterized neural networks (details below), to encourage universal approximation [27].\n\n3 Adversarial Symmetric Variational Auto-Encoder (AS-VAE)\n\nConsider the variational expressions\n\nLVAEx(\u03b8, \u03c6) = Eq(x) log p\u03b8(x) \u2212 KL(q\u03c6(x, z)\u2016p\u03b8(x, z)) ,   (8)\nLVAEz(\u03b8, \u03c6) = Ep(z) log q\u03c6(z) \u2212 KL(p\u03b8(x, z)\u2016q\u03c6(x, z)) ,   (9)\n\nwhere all expectations are again performed approximately using samples from q(x) and p(z). Recall that Eq(x) log p\u03b8(x) = \u2212KL(q(x)\u2016p\u03b8(x)) + const, and Ep(z) log q\u03c6(z) = \u2212KL(p(z)\u2016q\u03c6(z)) + const; thus (8) is maximized when q(x) = p\u03b8(x) and q\u03c6(x, z) = p\u03b8(x, z). Similarly, (9) is maximized when p(z) = q\u03c6(z) and q\u03c6(x, z) = p\u03b8(x, z). Hence, (8) and (9) impose desired constraints on both the marginal and joint distributions. Note that the log-likelihood terms in (8) and (9) are analogous to the data-fit regularizers discussed above in the context of ALI, but here implemented in a generalized form of the VAE. Direct evaluation of (8) and (9) is not possible, as it requires an explicit form for q(x) to evaluate q\u03c6(x, z) = q\u03c6(z|x)q(x).\n\nOne may readily demonstrate that\n\nLVAEx(\u03b8, \u03c6) = Eq\u03c6(x,z)[log p\u03b8(x)p(z) \u2212 log q\u03c6(x, z) + log p\u03b8(x|z)] = Eq\u03c6(x,z)[log p\u03b8(x|z) \u2212 f\u03c81\u2217(x, z)] .\n\nA similar expression holds for LVAEz(\u03b8, \u03c6), in terms of f\u03c82\u2217(x, z). 
This naturally suggests the cumulative variational expression\n\nLVAExz(\u03b8, \u03c6, \u03c81, \u03c82) = LVAEx(\u03b8, \u03c6) + LVAEz(\u03b8, \u03c6) = Eq\u03c6(x,z)[log p\u03b8(x|z) \u2212 f\u03c81(x, z)] + Ep\u03b8(x,z)[log q\u03c6(z|x) \u2212 f\u03c82(x, z)] ,   (10)\n\nwhere \u03c81 and \u03c82 are updated using the adversarial objectives in (5) and (6), respectively.\n\nNote that to evaluate (10) we must be able to sample from q\u03c6(x, z) = q(x)q\u03c6(z|x) and p\u03b8(x, z) = p(z)p\u03b8(x|z), both of which are readily available, as discussed above. Further, we require explicit expressions for q\u03c6(z|x) and p\u03b8(x|z), which we have. For (5) and (6) we similarly must be able to sample from the distributions involved, and we must be able to evaluate f\u03c81(x, z) and f\u03c82(x, z), each of which is implemented via a neural network. Note as well that the bound in (1) for Eq(x) log p\u03b8(x) is in terms of the KL distance between conditional distributions q\u03c6(z|x) and p\u03b8(z|x), while (8) utilizes the KL distance between joint distributions q\u03c6(x, z) and p\u03b8(x, z) (use of joint distributions is related to ALI). By combining (8) and (9), the complete variational bound LVAExz employs the symmetric KL between these two joint distributions. By contrast, from (2), the original variational lower bound only addresses a one-way KL distance between q\u03c6(x, z) and p\u03b8(x, z). 
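The log-density-ratio form of the optimal discriminators in (7) can be verified numerically in one dimension; the sketch below is a toy setting with known Gaussian densities standing in for the joint distributions (an illustrative assumption), checking that f*(x) = log p(x) − log q(x) scores at least as high as nearby alternatives under an objective of the form (4):

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def objective(f, p, q, lo=-8.0, hi=9.0, n=3000):
    """Numerical quadrature of E_p[log sigma(f)] + E_q[log(1 - sigma(f))], as in (4)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h   # midpoint rule
        total += h * (p(x) * math.log(sigmoid(f(x)))
                      + q(x) * math.log(1.0 - sigmoid(f(x))))
    return total

p = lambda x: normal_pdf(x, 1.0, 1.0)   # stands in for p_theta(x, z)
q = lambda x: normal_pdf(x, 0.0, 1.0)   # stands in for q_phi(x, z)

# Claimed optimum: f*(x) = log p(x) - log q(x), which here simplifies to x - 0.5.
f_star = lambda x: x - 0.5
best = objective(f_star, p, q)

# Perturbations of f* should never score higher.
for f_alt in (lambda x: x, lambda x: 0.9 * x - 0.5, lambda x: x - 0.2):
    assert objective(f_alt, p, q) < best
print(round(best, 4))
```

In the model, x and z are high-dimensional and the densities are unknown, which is precisely why fψ1 and fψ2 are learned adversarially rather than computed in closed form as above.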
While [23] had a similar idea of employing adversarial methods in the context of variational learning, it was only done within the context of the original form in (1), the limitations of which were discussed in Section 2.1.\n\nIn the original VAE, in which (1) was optimized, the reparametrization trick [12] was invoked wrt q\u03c6(z|x), with samples z\u03c6(x, \u03f5) and \u03f5 \u223c N(0, I), as the expectation was performed wrt this distribution; this reparametrization is convenient for computing gradients wrt \u03c6. In the AS-VAE in (10), expectations are also needed wrt p\u03b8(x|z). Hence, to implement gradients wrt \u03b8, we also constitute a reparametrization of p\u03b8(x|z). Specifically, we consider samples x\u03b8(z, \u03be) with \u03be \u223c N(0, I). LVAExz(\u03b8, \u03c6, \u03c81, \u03c82) in (10) is re-expressed as\n\nLVAExz(\u03b8, \u03c6, \u03c81, \u03c82) = Ex\u223cq(x),\u03f5\u223cN(0,I)[f\u03c81(x, z\u03c6(x, \u03f5)) \u2212 log p\u03b8(x|z\u03c6(x, \u03f5))] + Ez\u223cp(z),\u03be\u223cN(0,I)[f\u03c82(x\u03b8(z, \u03be), z) \u2212 log q\u03c6(z|x\u03b8(z, \u03be))] .   (11)\n\nThe expectations in (11) are approximated via samples drawn from q(x) and p(z), as well as samples of \u03f5 and \u03be. 
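A minimal sketch of the decoder-side reparametrization x\u03b8(z, \u03be), assuming an illustrative linear-Gaussian decoder p\u03b8(x|z) = N(\u03b8z, 1) rather than the paper's neural-network decoder:

```python
import random

rng = random.Random(1)
theta = 2.0  # illustrative decoder parameter

def x_reparam(z, xi):
    """Decoder sample x_theta(z, xi) = theta * z + xi, i.e. p_theta(x|z) = N(theta*z, 1).
    Writing the draw as a deterministic function of (z, xi) exposes gradients wrt theta."""
    return theta * z + xi

n = 100000
xs = [x_reparam(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

# Under this toy model the marginal is p_theta(x) = N(0, theta^2 + 1) = N(0, 5),
# so the reparameterized draws should have mean ~0 and variance ~5.
print(abs(mean) < 0.05, abs(var - (theta ** 2 + 1.0)) < 0.15)
```

The same deterministic-function-of-noise construction is what makes the expectations in (11) differentiable wrt both \u03b8 and \u03c6.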
x\u03b8(z, \u03be) and z\u03c6(x, \u03f5) can be implemented with a Gaussian assumption [12] or via density transformation [14, 16], detailed when presenting experiments in Section 5.\n\nThe complete objective of the proposed Adversarial Symmetric VAE (AS-VAE) requires the cumulative variational expression in (11), which we maximize wrt \u03c81 and \u03c82 as in (5) and (6), using the results in (7). Hence, we write\n\nmin_{\u03b8,\u03c6} max_{\u03c81,\u03c82}  \u2212LVAExz(\u03b8, \u03c6, \u03c81, \u03c82) .   (12)\n\nThe following proposition characterizes the solutions of (12) in terms of the joint distributions of x and z.\n\nProposition 1 The equilibrium for the min-max objective in (12) is achieved by specification {\u03b8\u2217, \u03c6\u2217, \u03c81\u2217, \u03c82\u2217} if and only if (7) holds, and p\u03b8\u2217(x, z) = q\u03c6\u2217(x, z).\n\nThe proof is provided in Appendix A. This theoretical result implies that (i) \u03b8\u2217 is an estimator that yields good reconstruction, and (ii) \u03c6\u2217 matches the aggregated posterior q\u03c6(z) to the prior distribution p(z).\n\n4 Related Work\n\nVAEs [12, 13] represent one of the most successful deep generative models developed recently. Aided by the reparameterization trick, VAEs can be trained with stochastic gradient descent. The original VAEs implement a Gaussian assumption for the encoder. More recently, there has been a desire to remove this Gaussian assumption. Normalizing flow [14] employs a sequence of invertible transformations to make the distribution of the latent codes arbitrarily flexible. This work was followed by inverse autoregressive flow [16], which uses recurrent neural networks to make the latent codes more expressive. More recently, SteinVAE [28] applies Stein variational gradient descent [29] to infer the distribution of latent codes, discarding the assumption of a parametric form of posterior distribution for the latent code. 
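The density-transformation option mentioned above can be sketched with a one-dimensional planar flow (the flow parameters are arbitrary illustrative values); each invertible step updates both the sample and its log-density via the change-of-variables formula:

```python
import math
import random

def planar_flow(z, w, u, b):
    """One 1-D planar-flow step f(z) = z + u * tanh(w * z + b), returning the new
    sample and log|det df/dz| for the change-of-variables formula."""
    a = math.tanh(w * z + b)
    z_new = z + u * a
    log_det = math.log(abs(1.0 + u * w * (1.0 - a * a)))  # df/dz = 1 + u*w*(1 - tanh^2)
    return z_new, log_det

def log_normal(x, mean, std):
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

# Push a base Gaussian sample through two flow steps while tracking its log-density:
# log q_K(z_K) = log q_0(z_0) - sum_k log|det J_k|.
rng = random.Random(0)
z = z0 = rng.gauss(0.0, 1.0)
log_q = log_normal(z0, 0.0, 1.0)
for (w, u, b) in [(1.0, 0.5, 0.0), (-0.7, 0.3, 0.2)]:  # arbitrary illustrative parameters
    z, log_det = planar_flow(z, w, u, b)
    log_q -= log_det
print(round(z, 4), round(log_q, 4))
```

With the chosen parameters the map is strictly monotone (its derivative stays positive), so the flow is invertible and the tracked log-density is exact.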
However, these methods are not able to address the fundamental limitation of ML-based models, as they are all based on the variational formulation in (1).\n\nGANs [3] constitute another recent framework for learning a generative model. Recent extensions of the GAN have focused on boosting the performance of image generation by improving the generator [5], the discriminator [30] or the training algorithm [9, 22, 31]. More recently, some researchers [10, 11, 33] have employed a bidirectional network structure within the adversarial learning framework, which in theory guarantees the matching of joint distributions over two domains. However, non-identifiability issues are raised in [32]. For example, these methods have difficulties in providing good reconstruction in latent variable models, or discovering the correct pairing relationship in domain transformation tasks. It was shown that these problems are alleviated in DiscoGAN [24], CycleGAN [26] and ALICE [32] via additional \u21131, \u21132 or adversarial losses. However, these methods lack explicit probabilistic modeling of observations, and thus cannot directly evaluate the likelihood of given data samples.\n\nA key component of the proposed framework concerns integrating a new VAE formulation with adversarial learning. There are several recent approaches that have tried to combine VAEs and GANs [34, 35]; Adversarial Variational Bayes (AVB) [23] is the one most closely related to our work. AVB employs adversarial learning to estimate the posterior of the latent codes, which makes the encoder arbitrarily flexible. However, AVB seeks to optimize the original VAE formulation in (1), and hence it inherits the limitations of ML-based learning of \u03b8. 
Unlike AVB, the proposed use of adversarial learning is based on a new VAE setup that seeks to minimize the symmetric KL distance between p\u03b8(x, z) and q\u03c6(x, z), while simultaneously seeking to maximize the marginal expected likelihoods Eq(x)[log p\u03b8(x)] and Ep(z)[log q\u03c6(z)].\n\n5 Experiments\n\nWe evaluate our model on three datasets: MNIST, CIFAR-10 and ImageNet. To balance performance and computational cost, p\u03b8(x|z) and q\u03c6(z|x) are approximated with a normalizing flow [14] of length 80 for the MNIST dataset, and a Gaussian approximation for CIFAR-10 and ImageNet data. All network architectures are provided in Appendix B. All parameters were initialized with Xavier [36], and optimized via Adam [37] with learning rate 0.0001. We do not perform any dataset-specific tuning or regularization other than dropout [38]. Early stopping is employed based on the average reconstruction loss of x and z on validation sets.\n\nWe show three types of results, using part of or all of our model to illustrate each component. i) AS-VAE-r: This model is trained with the first half of the objective in (11), corresponding to LVAEx(\u03b8, \u03c6) in (8); it is an ML-based method which focuses on reconstruction. ii) AS-VAE-g: This model is trained with the second half of the objective in (11), corresponding to LVAEz(\u03b8, \u03c6) in (9); it can be considered as maximizing the likelihood of q\u03c6(z), and is designed for generation. iii) AS-VAE: This is our proposed model, developed in Section 3.\n\n5.1 Evaluation\n\nWe evaluate our model on both reconstruction and generation. The performance of the former is evaluated using the negative log-likelihood (NLL) estimated via the variational lower bound defined in (1). Images are modeled as continuous. To do this, we add [0, 1]-uniform noise to natural images (one color channel at a time), then divide by 256 to map 8-bit images (256 levels) to the unit interval. 
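The dequantization just described, and the discretized likelihood p(x = i|z) used during testing (described in the surrounding text), can be sketched as follows, with an illustrative Gaussian standing in for the model's conditional density:

```python
import math
import random

def dequantize(pixels, rng):
    """Map 8-bit values {0,...,255} into (0, 1): add uniform [0, 1) noise, divide by 256."""
    return [(p + rng.random()) / 256.0 for p in pixels]

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def discrete_log_prob(i, mean, std):
    """log p(x = i|z) = log P(x in [i/256, (i+1)/256)) under a continuous model N(mean, std)."""
    lo, hi = i / 256.0, (i + 1) / 256.0
    return math.log(normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std))

rng = random.Random(0)
continuous = dequantize([0, 128, 255], rng)
assert all(0.0 < v < 1.0 for v in continuous)

# Summing the 256 per-level probabilities telescopes to the mass the continuous
# model places on [0, 1]; for N(0.5, 0.25) this is Phi(2) - Phi(-2).
total = sum(math.exp(discrete_log_prob(i, 0.5, 0.25)) for i in range(256))
print(round(total, 4))  # ~0.9545
```

Evaluating the discrete likelihood as a difference of CDFs over each 1/256-wide bin is what makes the comparison to models that assume quantized pixels fair.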
This technique is widely used in applications involving natural images [12, 14, 16, 39, 40], since it can be proved that, in terms of log-likelihood, modeling in the discrete space is equivalent to modeling in the continuous space (with added noise) [39, 41]. During testing, the likelihood is computed as p(x = i|z) = p\u03b8(x \u2208 [i/256, (i + 1)/256]|z), where i = 0, . . . , 255. This is done to guarantee a fair comparison with prior work (that assumed quantization). For the MNIST dataset, we treat the [0, 1]-mapped continuous input as the probability of a binary pixel value (on or off) [12]. The inception score (IS), defined as exp(Eq(x)KL(p(y|x)\u2016p(y))), is employed to quantitatively evaluate the quality of generated natural images, where p(y) is the empirical distribution of labels (we do not leverage any label information during training) and p(y|x) is the output of the Inception model [42] on each generated image.\n\nTo the authors\u2019 knowledge, we are the first to report both inception score (IS) and NLL for natural images from a single model. For comparison, we implemented DCGAN [5] and PixelCNN++ [40] as baselines. The implementation of DCGAN is based on a network architecture similar to that of our model. Note that for NLL a lower value is better, whereas for IS a higher value is better.\n\n5.2 MNIST\n\nWe first evaluate our model on the MNIST dataset. The log-likelihood results are summarized in Table 1. Our AS-VAE achieves a negative log-likelihood of 82.51 nats, outperforming normalizing flow (85.1 nats) with a similar architecture. The performance of AS-VAE-r (81.14 nats) is competitive with the state-of-the-art (79.2 nats). The generated samples are shown in Figure 1. AS-VAE-g and AS-VAE both generate good samples, while the results of AS-VAE-r are slightly more blurry, partly due to the fact that AS-VAE-r is an ML-based model.\n\n5.3 CIFAR\n\nNext we evaluate our models on the CIFAR-10 dataset. 
The quantitative results are listed in Table 2. AS-VAE-r and AS-VAE-g achieve encouraging results on reconstruction and generation, respectively, while our AS-VAE model (leveraging the full objective) achieves a good balance between these two tasks, which demonstrates the benefit of optimizing a symmetric objective. Compared with state-of-the-art ML-based models [39, 40], we achieve competitive results on reconstruction but provide much better performance on generation, also outperforming other adversarially-trained models. Note that our negative ELBO (evidence lower bound) is an upper bound on the NLL as reported in [39, 40]. We also achieve a smaller root-mean-square error (RMSE). Generated samples are shown in Figure 2. Additional results are provided in Appendix C.\n\nTable 1: NLL on MNIST.\n\nMethod | NLL (nats)\nNF (k=80) [14] | 85.1\nIAF [16] | 80.9\nAVB [23] | 79.5\nPixelRNN [39] | 79.2\nAS-VAE-r | 81.14\nAS-VAE-g | 146.32\nAS-VAE | 82.51\n\nTable 2: Quantitative Results on CIFAR-10; \u2020 2.96 is based on our implementation and 2.92 is reported in [40].\n\nMethod | NLL (bits) | RMSE | IS\nWGAN [43] | - | - | 3.82\nMIX+WassersteinGAN [43] | - | - | 4.05\nDCGAN [5] | - | - | 4.89\nALI [10] | - | 14.53 | 4.79\nPixelRNN [39] | 3.06 | - | -\nPixelCNN++ [40] | 2.96 (2.92)\u2020 | 3.289 | 5.51\nAS-VAE-r | 3.09 | 3.17 | 2.91\nAS-VAE-g | 93.12 | 13.12 | 6.89\nAS-VAE | 3.32 | 3.36 | 6.34\n\nALI [10], which also seeks to match the joint encoder and decoder distribution, is also implemented as a baseline. Since the decoder in ALI is a deterministic network, the NLL of ALI is impractical to compute. Alternatively, we report the RMSE of reconstruction as a reference. Figure 3 qualitatively compares the reconstruction performance of our model, ALI and the VAE. As can be seen, the reconstruction of ALI is related to, but not a faithful reproduction of, the input data, which evidences the limitation in reconstruction ability of adversarial learning. This is also consistent in terms of RMSE.\n\n5.4 ImageNet\n\nImageNet 2012 is used to evaluate the scalability of our model to large datasets. The images are resized to 64\u00d764. The quantitative results are shown in Table 3. Our model significantly improves the performance on generation compared with DCGAN and PixelCNN++, while achieving competitive results on reconstruction compared with PixelRNN and PixelCNN++.\n\nTable 3: Quantitative Results on ImageNet.\n\nMethod | NLL | IS\nDCGAN [5] | - | 5.965\nPixelRNN [39] | 3.63 | -\nPixelCNN++ [40] | 3.27 | 7.65\nAS-VAE | 3.71 | 11.14\n\nNote that PixelCNN++ takes more than two weeks (44 hours per epoch) for training and 52.0 seconds/image for generating samples, while our model requires less than 2 days (4 hours per epoch) for training and 0.01 seconds/image for generation on a single TITAN X GPU. As a reference, the true validation set of ImageNet 2012 achieves 53.24% accuracy. This is because ImageNet has a much greater variety of images than CIFAR-10. Figure 4 shows samples generated after training on ImageNet, compared with DCGAN and PixelCNN++. Our model is able to produce sharp images without label information, while capturing more local spatial dependencies than PixelCNN++, and without suffering from mode collapse as DCGAN does. Additional results are provided in Appendix C.\n\n6 Conclusions\n\nWe presented Adversarial Symmetric Variational Autoencoders (AS-VAE), a novel deep generative model for unsupervised learning. 
The learning objective is to minimize a symmetric KL divergence between the joint distributions of data and latent codes from the encoder and decoder, while simultaneously maximizing the expected marginal likelihood of data and codes. An extensive set of results demonstrated excellent performance on both reconstruction and generation, while scaling to large datasets. A possible direction for future work is to apply AS-VAE to semi-supervised learning tasks.\n\nAcknowledgements\n\nThis research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.\n\nFigure 1: Generated samples trained on MNIST. (Left) AS-VAE-r; (Middle) AS-VAE-g; (Right) AS-VAE.\n\nFigure 2: Samples generated by AS-VAE when trained on CIFAR-10.\n\nFigure 3: Comparison of reconstruction with ALI [10]. In each block: column one for ground truth, column two for ALI and column three for AS-VAE.\n\nFigure 4: Generated samples trained on ImageNet. (Top) AS-VAE; (Middle) DCGAN [5]; (Bottom) PixelCNN++ [40].\n\nReferences\n\n[1] Y. Pu, X. Yuan, A. Stevens, C. Li, and L. Carin. A deep generative deconvolutional image model. In Artificial Intelligence and Statistics (AISTATS), 2016.\n\n[2] Y. Pu, X. Yuan, and L. Carin. Generative deep deconvolutional learning. In ICLR workshop, 2015.\n\n[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.\n\n[5] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.\n\n[6] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.\n\n[7] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. 
Henao, D. Shen, and L. Carin. Adversarial feature matching for text generation. In ICML, 2017.

[8] Y. Zhang, Z. Gan, and L. Carin. Generating text with adversarial training. In NIPS workshop, 2016.

[9] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

[10] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.

[11] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.

[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[13] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[14] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.

[15] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.

[16] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational inference with inverse autoregressive flow. In NIPS, 2016.

[17] Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, and L. Carin. Deconvolutional paragraph representation learning. In NIPS, 2017.

[18] L. Chen, S. Dai, Y. Pu, C. Li, Q. Su, and L. Carin. Symmetric variational autoencoder and connections to adversarial learning. In arXiv, 2017.

[19] D. Shen, Y. Zhang, R. Henao, Q. Su, and L. Carin. Deconvolutional latent-variable model for text sequence matching. In arXiv, 2017.

[20] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.

[21] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.

[22] M. Arjovsky and L. Bottou.
Towards principled methods for training generative adversarial networks. In ICLR, 2017.

[23] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In arXiv, 2016.

[24] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In arXiv, 2017.

[25] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. In arXiv, 2017.

[26] J.-Y. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In arXiv, 2017.

[27] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.

[28] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In NIPS, 2017.

[29] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.

[30] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.

[31] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In arXiv, 2017.

[32] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017.

[33] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NIPS, 2017.

[34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In arXiv, 2015.

[35] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[36] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[37] D.
Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

[39] A. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.

[40] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.

[41] L. Theis, A. Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.

[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[43] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets. In arXiv, 2017.