{"title": "Amortized Inference Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 4393, "page_last": 4402, "abstract": "The variational autoencoder (VAE) is a popular model for density estimation and representation learning. Canonically, the variational principle suggests to prefer an expressive inference model so that the variational approximation is accurate. However, it is often overlooked that an overly-expressive inference model can be detrimental to the test set performance of both the amortized posterior approximator and, more importantly, the generative density estimator. In this paper, we leverage the fact that VAEs rely on amortized inference and propose techniques for amortized inference regularization (AIR) that control the smoothness of the inference model. We demonstrate that, by applying AIR, it is possible to improve VAE generalization on both inference and generative performance. Our paper challenges the belief that amortized inference is simply a mechanism for approximating maximum likelihood training and illustrates that regularization of the amortization family provides a new direction for understanding and improving generalization in VAEs.", "full_text": "Amortized Inference Regularization\n\nRui Shu\n\nStanford University\n\nruishu@stanford.edu\n\nHung H. Bui\nDeepMind\n\nbuih@google.com\n\nShengjia Zhao\n\nStanford University\n\nsjzhao@stanford.edu\n\nMykel J. Kochenderfer\n\nStanford University\n\nmykel@stanford.edu\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nThe variational autoencoder (VAE) is a popular model for density estimation and\nrepresentation learning. 
Canonically, the variational principle suggests to prefer\nan expressive inference model so that the variational approximation is accurate.\nHowever, it is often overlooked that an overly-expressive inference model can be\ndetrimental to the test set performance of both the amortized posterior approximator\nand, more importantly, the generative density estimator. In this paper, we leverage\nthe fact that VAEs rely on amortized inference and propose techniques for amortized\ninference regularization (AIR) that control the smoothness of the inference model.\nWe demonstrate that, by applying AIR, it is possible to improve VAE generalization\non both inference and generative performance. Our paper challenges the belief that\namortized inference is simply a mechanism for approximating maximum likelihood\ntraining and illustrates that regularization of the amortization family provides a\nnew direction for understanding and improving generalization in VAEs.\n\n1\n\nIntroduction\n\nVariational autoencoders are a class of generative models with widespread applications in density\nestimation, semi-supervised learning, and representation learning [1, 2, 3, 4]. A popular approach for\nthe training of such models is to maximize the log-likelihood of the training data. However, maximum\nlikelihood is often intractable due to the presence of latent variables. Variational Bayes resolves this\nissue by constructing a tractable lower bound of the log-likelihood and maximizing the lower bound\ninstead. Classically, Variational Bayes introduces per-sample approximate proposal distributions that\nneed to be optimized using a process called variational inference. However, per-sample optimization\nincurs a high computational cost. A key contribution of the variational autoencoding framework is the\nobservation that the cost of variational inference can be amortized by using an amortized inference\nmodel that learns an ef\ufb01cient mapping from samples to proposal distributions. 
This perspective\nportrays amortized inference as a tool for ef\ufb01ciently approximating maximum likelihood training.\nMany techniques have since been proposed to expand the expressivity of the amortized inference\nmodel in order to better approximate maximum likelihood training [5, 6, 7, 8].\nIn this paper, we challenge the conventional role that amortized inference plays in variational\nautoencoders. For datasets where the generative model is prone to over\ufb01tting, we show that having\nan amortized inference model actually provides a new and effective way to regularize maximum\nlikelihood training. Rather than making the amortized inference model more expressive, we propose\ninstead to restrict the capacity of the amortization family. Through amortized inference regularization\n(AIR), we show that it is possible to reduce the inference gap and increase the log-likelihood\nperformance on the test set. We propose several techniques for AIR and provide extensive theoretical\nand empirical analyses of our proposed techniques when applied to the variational autoencoder and the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fimportance-weighted autoencoder. By rethinking the role of the amortized inference model, amortized\ninference regularization provides a new direction for studying and improving the generalization\nperformance of latent variable models.\n\n2 Background and Notation\n\n2.1 Variational Inference and the Evidence Lower Bound\nConsider a joint distribution p\u2713(x, z) parameterized by \u2713, where x 2X is observed and z 2Z\nis latent. 
Given a uniform distribution \u02c6p(x) over the dataset D = {x(i)}, maximum likelihood\nestimation performs model selection using the objective\n\nE\u02c6p(x) lnZz\n\nmax\n\n\u2713\n\nE\u02c6p(x) ln p\u2713(x) = max\n\n\u2713\n\np\u2713(x, z)dz.\n\n(1)\n\nHowever, marginalization of the latent variable is often intractable; to address this issue, it is common\nto employ the variational principle to maximize the following lower bound\n\nE\u02c6p(x)\uf8ffln p\u2713(x)  min\n\nq2Q\n\nD(q(z) k p\u2713(z | x)) = max\n\n\u2713\n\nE\u02c6p(x)\uf8ffmax\n\nq2Q\n\nmax\n\n\u2713\n\nEq(z) ln\n\np\u2713(x, z)\n\nq(z)  ,\n\n(2)\n\nwhere D is the Kullback-Leibler divergence and Q is a variational family. This lower bound,\ncommonly called the evidence lower bound (ELBO), converts log-likelihood estimation into a\ntractable optimization problem. Since the lower bound holds for any q, the variational family Q can\nbe chosen to ensure that q(z) is easily computable, and the lower bound is optimized to select the\nbest proposal distribution q\u21e4x(z) for each x 2D .\n2.2 Amortization and Variational Autoencoders\n[1, 9] proposed to construct p(x | z) using a parametric function g\u2713 2G (P) : Z!P , where P\nis some family of distributions over x, and G is a family of functions indexed by parameters \u2713. To\nexpedite training, they observed that it is possible to amortize the computational cost of variational\ninference by framing the per-sample optimization process as a regression problem; rather than solving\nfor the optimal proposal q\u21e4x(z) directly, they instead use a recognition model f 2F (Q) : X!Q to\npredict q\u21e4x(z). 
The functions (f, g\u2713) can be concisely represented as conditional distributions, where\n(3)\n(4)\nThe use of amortized inference yields the variational autoencoder, which is trained to maximize the\nvariational autoencoder objective\n\np\u2713(x | z) = g\u2713(z)(x)\nq(z | x) = f(x)(z).\n\nmax\n\u2713,\n\nE\u02c6p(x)\uf8ffEq(z|x) ln\n\np(z)p\u2713(x | z)\n\nq(z | x)  =\n\nmax\n\nf2F(Q),g2G(P)\n\nE\u02c6p(x)\uf8ffEz\u21e0f (x) ln\n\np(z)g(z)(x)\n\nf (x)(z)  .\n\n(5)\n\nWe omit the dependency of (p(z), g) on \u2713 and f on  for notational simplicity. In addition to the\ntypical presentation of the variational autoencoder objective (LHS), we also show an alternative\nformulation (RHS) that reveals the in\ufb02uence of the model capacities F,G and distribution family\ncapacities Q,P on the objective function. In this paper, we use (q, f ) interchangeably, depending on\nthe choice of emphasis. To highlight the relationship between the ELBO in Eq. (2) and the standard\nvariational autoencoder objective in Eq. (5), we shall also refer to the latter as the amortized ELBO.\n\n2.3 Amortized Inference Suboptimality\nFor a \ufb01xed generative model, the optimal unamortized and amortized inference models are\n\nq\u21e4x = arg max\n\nq2Q\n\nf\u21e4 = arg max\n\nf2F\n\np\u2713(x, z)\n\nEq(z)\uf8ffln\nE\u02c6p(x)\uf8ffEz\u21e0f (x) ln\n\nq(z)  , for each x 2D\n\np\u2713(x, z)\n\nf (x)(z) .\n\n(6)\n\n(7)\n\n2\n\n\finf(\u02c6p) = E\u02c6p(x)D(f\u21e4(x) k p\u2713(z | x))\nap(\u02c6p) = E\u02c6p(x)D(q\u21e4x(z) k p\u2713(z | x))\nam(\u02c6p) = inf(\u02c6p)  ap(\u02c6p),\n\nA notable consequence of using an amortization family to approximate variational inference is that\nEq. (5) is a lower bound of Eq. (2). This naturally raises the question of whether the learned inference\nmodel can accurately approximate the mapping x 7! q\u21e4x(z). 
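The size of this approximation error has a clean characterization: for any proposal q, the defect of the ELBO below ln p_θ(x) is exactly the KL divergence D(q(z) ‖ p_θ(z | x)). This identity can be checked end-to-end in a conjugate toy model (a minimal illustrative sketch; the linear-Gaussian model z ~ N(0,1), x | z ~ N(z,1) is our assumption, not a model from the paper):

```python
import math

# Illustrative toy model (our assumption): z ~ N(0,1), x|z ~ N(z,1).
# Closed forms: p(x) = N(x; 0, 2) and p(z|x) = N(z; x/2, 1/2).

def elbo(x, m, s):
    """ELBO E_q[ln p(x,z) - ln q(z)] for a Gaussian proposal q = N(m, s^2)."""
    e_ln_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    e_ln_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)
    return e_ln_prior + e_ln_lik + entropy

def log_marginal(x):
    """Exact ln p(x) = ln N(x; 0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

def kl_to_posterior(x, m, s):
    """KL(N(m, s^2) || N(x/2, 1/2)), the exact posterior KL."""
    vp = 0.5
    return math.log(math.sqrt(vp) / s) + (s ** 2 + (m - x / 2.0) ** 2) / (2 * vp) - 0.5

x, m, s = 1.0, 0.3, 0.8
gap = log_marginal(x) - elbo(x, m, s)
print(gap, kl_to_posterior(x, m, s))  # identical: the ELBO's defect is exactly a KL
# With q set to the exact posterior, the bound is tight (gap = 0):
print(log_marginal(x) - elbo(x, 0.5, math.sqrt(0.5)))
```

For non-conjugate models the posterior KL is intractable, which is precisely why the gaps below must be estimated rather than computed.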
To address this question, [10] defined the inference, approximation, and amortization gaps as

\Delta_{\text{inf}}(\hat{p}) = E_{\hat{p}(x)}\,D(f^*(x) \,\|\, p_\theta(z \mid x)), \quad (8)

\Delta_{\text{ap}}(\hat{p}) = E_{\hat{p}(x)}\,D(q^*_x(z) \,\|\, p_\theta(z \mid x)), \quad (9)

\Delta_{\text{am}}(\hat{p}) = \Delta_{\text{inf}}(\hat{p}) - \Delta_{\text{ap}}(\hat{p}). \quad (10)

Studies have found that the inference gap is non-negligible [11] and primarily attributable to the presence of a large amortization gap [10].
The amortization gap raises two critical considerations. On the one hand, we wish to reduce the training amortization gap \Delta_{\text{am}}(\hat{p}_{\text{train}}). If the family F is too low in capacity, then it is unable to approximate x \mapsto q^*_x and will thus increase the amortization gap. Motivated by this perspective, [5, 12] proposed to reduce the training amortization gap by performing stochastic variational inference on top of amortized inference. In this paper, we take the opposing perspective that an over-expressive F hurts generalization (see Appendix A) and that restricting the capacity of F is a form of regularization that can prevent both the inference and generative models from overfitting to the training set.

3 Amortized Inference Regularization in Variational Autoencoders

Many methods have been proposed to expand the variational and amortization families in order to better approximate maximum likelihood training [5, 6, 7, 8, 13, 14]. We argue, however, that achieving a better approximation to maximum likelihood training is not necessarily the best training objective, even if the end goal is test set density estimation. In general, it may be beneficial to regularize the maximum likelihood training objective.
Importantly, we observe that the evidence lower bound in Eq. 
(2) admits a natural interpretation as implicitly regularizing maximum likelihood training:

\max_\theta \; \underbrace{E_{\hat{p}(x)}[\ln p_\theta(x)]}_{\text{log-likelihood}} \;-\; \underbrace{E_{\hat{p}(x)} \min_{q \in Q} D(q(z) \,\|\, p_\theta(z \mid x))}_{\text{regularizer } R(\theta;\, Q)}. \quad (11)

This formulation exposes the ELBO as a data-dependent regularized maximum likelihood objective. For infinite-capacity Q, R(\theta; Q) is zero for all \theta \in \Theta, and the objective reduces to maximum likelihood. When Q is the set of Gaussian distributions (as is the case in the standard VAE), then R(\theta; Q) is zero only if p_\theta(z \mid x) is Gaussian for all x \in D. In other words, a Gaussian variational family regularizes the true posterior p_\theta(z \mid x) toward being Gaussian [10]. Careful selection of the variational family to encourage p_\theta(z \mid x) to adopt certain properties (e.g., unimodality, a fully-factorized posterior) can thus be considered a special case of posterior regularization [15, 16].
Unlike traditional variational techniques, the variational autoencoder introduces an amortized inference model f \in F and thus a new source of posterior regularization:

\max_\theta \; \underbrace{E_{\hat{p}(x)}[\ln p_\theta(x)]}_{\text{log-likelihood}} \;-\; \underbrace{\min_{f \in F(Q)} E_{\hat{p}(x)}[D(f(x) \,\|\, p_\theta(z \mid x))]}_{\text{regularizer } R(\theta;\, Q, F)}. \quad (12)

In contrast to unamortized variational inference, the introduction of the amortization family F forces the inference model to consider the global structure of how X maps to Q. We thus define amortized inference regularization as the strategy of restricting the inference model capacity F to satisfy certain desiderata. In this paper, we explore a special case of AIR where a candidate model f \in F is penalized if it is not sufficiently smooth. 
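To see that the regularizer R(θ; Q) is genuinely active, note that it stays bounded away from zero whenever the true posterior falls outside Q, e.g., a Gaussian family against a bimodal posterior. A small numerical check (an illustrative construction with a hypothetical mixture posterior, not a model from the paper):

```python
import math

def mix_pdf(z):
    """Hypothetical bimodal 'posterior': 0.5*N(-2, 0.5^2) + 0.5*N(2, 0.5^2)."""
    def npdf(z, m, s):
        return math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return 0.5 * npdf(z, -2.0, 0.5) + 0.5 * npdf(z, 2.0, 0.5)

def kl_gauss_to_mix(m, s, lo=-6.0, hi=6.0, n=1200):
    """KL(N(m, s^2) || mixture) via trapezoidal quadrature."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        q = math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        val = q * (math.log(q) - math.log(mix_pdf(z))) if q > 1e-300 else 0.0
        total += (0.5 if i in (0, n) else 1.0) * val
    return total * h

# Crude grid search over Gaussian proposals: the reverse KL is mode-seeking,
# so the best q collapses onto one mode and the KL stays well above zero.
best = min(
    (kl_gauss_to_mix(m / 10.0, s / 10.0), m / 10.0, s / 10.0)
    for m in range(-30, 31, 2)
    for s in range(3, 16, 2)
)
print(best)
```

The minimized KL sits near ln 2, reflecting the halved probability mass of the single mode that q can cover; this residual is exactly the posterior regularization pressure described above.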
We propose two models that encourage inference model\nsmoothness and demonstrate that they can reduce the inference gap and increase log-likelihood on\nthe test set.\n\n3.1 Denoising Variational Autoencoder\nIn this section, we propose using random perturbation training for amortized inference regularization.\nThe resulting model\u2014the denoising variational autoencoder (DVAE)\u2014modi\ufb01es the variational\n\n3\n\n\fautoencoder objective by injecting \" noise into the inference model\n\n(13)\n\nmax\n\n\u2713\n\nE\u02c6p(x) [ln p\u2713(x)]  min\n\nf2F(Q)\n\nE\u02c6p(x)E\" [D(f (x + \") k p\u2713(z | x))].\n\nNote that the noise term only appears in the regularizer term. We consider the case of zero-mean\nisotropic Gaussian noise \" \u21e0N (0, I) and denote the denoising regularizer as R(\u2713 ; ). At this\npoint, we note that the DVAE was \ufb01rst described in [17]. However, our treatment of DVAE differs\nfrom [17]\u2019s in both theoretical analysis and underlying motivation. We found that [17] incorrectly\nstated the tightness of the DVAE variational lower bound (see Appendix B). In contrast, our analysis\ndemonstrates that the denoising objective smooths the inference model and necessarily lower bounds\nthe original variational autoencoder objective (see Theorem 1 and Proposition 1).\nWe now show that 1) the optimal DVAE amortized inference model is a kernel regression model and\nthat 2) the variance of the noise \" controls the smoothness of the optimal inference model.\nLemma 1. For \ufb01xed (\u2713, , Q) and in\ufb01nite capacity F, the inference model that optimizes the DVAE\nobjective in Eq. 
(13) is the kernel regression model\n\nf\u21e4(x) = arg min\n\nq2Q\n\nnXi=1\n\nw(x, x(i)) \u00b7 D(q(z) k p\u2713(z | x(i))),\n\n(14)\n\nwhere w(x, x(i)) = K(x,x(i))\n\nPj K(x,x(j)) and K(x, y) = exp\u21e3kxyk\n\n22 \u2318 is the RBF kernel.\n\nLemma 1 shows that the optimal denoising inference model f\u21e4 is dependent on the noise level .\nThe output of f\u21e4(x) is the proposal distribution that minimizes the weighted Kullback-Leibler (KL)\ndivergence from f\u21e4(x) to each p\u2713(z | x(i)), where the weighting w(x, x(i)) depends on the distance\nkx  x(i)k and the bandwidth . When > 0, the amortized inference model forces neighboring\npoints (x(i), x(j)) to have similar proposal distributions. Note that as  increases, w(x, x(i)) ! 1\nn,\nwhere n is the number of training samples. Controlling  thus modulates the smoothness of f\u21e4 (we\nsay that f\u21e4 is smooth if it maps similar inputs to similar outputs under some suitable measure of\nsimilarity). Intuitively, the denoising regularizer R(\u2713 ; ) approximates the true posteriors with a\n\u201c-smoothed\u201d inference model and penalizes generative models whose posteriors cannot easily be\napproximated by such an inference model. This intuition is formalized in Theorem 1.\nTheorem 1. Let Q be a minimal exponential family with corresponding natural parameter space \u2326.\nWith a slight abuse of notation, consider f 2F : X! \u2326. Under the simplifying assumption that\np\u2713(z | x(i)) is contained within Q and parameterized by \u2318(i) 2 \u2326, and that F has in\ufb01nite capacity,\nthen the optimal inference model in Lemma 1 returns f\u21e4(x) = \u2318 2 \u2326, where\n\n\u2318 =\n\nw(x, x(i)) \u00b7 \u2318(i)\n\n(15)\n\nnXi=1\n\nand Lipschitz constant of f\u21e4 is bounded by O(1/2).\nWe wish to address Theorem 1\u2019s assumption that the true posteriors lie in the variational family.\nNote that for suf\ufb01ciently large exponential families, this assumption is likely to hold. 
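The bandwidth's effect on the weights w(x, x^{(i)}) in Lemma 1 can be verified directly: as σ grows, the normalized RBF weights flatten toward the uniform weighting 1/n (a minimal sketch; the 1-D training points are illustrative assumptions):

```python
import math

def rbf_weights(x, xs, sigma):
    """Normalized RBF kernel weights w(x, x_i) = K(x, x_i) / sum_j K(x, x_j),
    with K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), as in Lemma 1."""
    ks = [math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2)) for xi in xs]
    z = sum(ks)
    return [k / z for k in ks]

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]  # hypothetical training inputs
x = 0.2

for sigma in (0.1, 1.0, 100.0):
    w = rbf_weights(x, xs, sigma)
    print(sigma, [round(wi, 3) for wi in w])
# sigma -> 0:   weight concentrates on the nearest x_i (no smoothing)
# sigma -> inf: weights approach 1/n, so every x shares one proposal
```

At large σ the optimal proposal for every x is the same weighted combination of per-sample posteriors, which is the strongest form of the smoothing described above.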
But even in\nthe case where the variational family is Gaussian (a relatively small exponential family), the small\napproximation gap observed in [10] suggests that it is plausible that posterior regularization would\nencourage the true posteriors to be approximately Gaussian.\nGiven that  modulates the smoothness of the inference model, it is natural to suspect that a larger\nchoice of  results in a stronger regularization. To formalize this notion of regularization strength,\nwe introduce a way to partially order a set of regularizers {Ri(\u2713)}.\nDe\ufb01nition 1. Suppose two regularizers R1(\u2713) and R2(\u2713) share the same minimum min\u2713 R1(\u2713) =\nmin\u2713 R2(\u2713). We say that R1 is a stronger regularizer than R2 if R1(\u2713)  R2(\u2713) for all \u2713 2 \u21e5.\nNote that any two regularizers can be modi\ufb01ed via scalar addition to share the same minimum.\nFurthermore, if R1 is stronger than R2, then R1 and R2 share at least one minimizer. We now apply\nDe\ufb01nition 1 to characterize the regularization strength of R(\u2713 ; ) as  increases.\nDe\ufb01nition 2. We say that F is closed under input translation if f 2F =) fa 2F for all a 2X ,\nwhere fa(x) = f (x + a).\n\n4\n\n\fProposition 1. Consider the denoising regularizer R(\u2713 ; ). Suppose F is closed under input\ntranslation and that, for any \u2713 2 \u21e5, there exists f 2F such that f (x) maps to the prior p\u2713(z)\nall x 2X . Furthermore, assume that there exists \u2713 2 \u21e5 such that p\u2713(x, z) = p\u2713(z)p\u2713(x). Then\nR(\u2713 ; 1) is stronger R(\u2713 ; 2) when 1  2; i.e., min\u2713 R(\u2713 ; 1) = min\u2713 R(\u2713 ; 2) = 0 and\nR(\u2713 ; 1)  R(\u2713 ; 2) for all \u2713 2 \u21e5.\nLemma 1 and Proposition 1 show that as we increase , the optimal inference model is forced to\nbecome smoother and the regularization strength increases. 
Figure 1 is consistent with this analysis,\nshowing the progression from under-regularized to over-regularized models as we increase .\nIt is worth noting that, in addition to adjusting the denoising regularizer strength via , it is also\npossible to adjust the strength by taking a convex combination of the VAE and DVAE objectives. In\nparticular, we can de\ufb01ne the partially denoising regularizer R(\u2713 ; , \u21b5) as\n\nE\u02c6p(x)\u2713\u21b5 \u00b7 E\" [D(f (x + \") k p\u2713(z | x))] + (1  \u21b5) \u00b7 D(f (x) k p\u2713(z | x))\u25c6\n\nmin\n\nf2F(Q)\n\n(16)\n\nImportantly, we note that R(\u2713 ; , \u21b5) is still strictly non-negative and, when combined with the\nlog-likelihood term, still yields a tractable variational lower bound.\n\n3.2 Weight-Normalized Amortized Inference\nIn addition to DVAE, we propose an alternative method that directly restricts F to the set of smooth\nfunctions. To do so, we consider the case where the inference model is a neural network encoder\nparameterized by weight matrices {Wi} and leverage [18]\u2019s weight normalization technique, which\nproposes to reparameterize the columns wi of each weight matrix W as\n\n(17)\n\n(18)\n\nwhere vi 2 Rd, si 2 R are trainable parameters. Since it is possible to modulate the smoothness of\nthe encoder by capping the magnitude of si, we introduce a new parameter ui 2 R and de\ufb01ne\n\nvi\n\nwi =\n\nkvik \u00b7 si,\n1 + exp(ui)\u25c6 .\n\nH\n\nsi = min\u21e2kvik,\u2713\n\nThe norm kwik is thus bounded by the hyperparameter H. We denote the weight-normalized\nregularizer as R(\u2713 ; FH), where FH is the amortization family induced by a H-weight-normalized\nencoder. 
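A minimal sketch of the capped weight-normalization reparameterization of Eqs. (17)–(18) for one weight matrix. The extraction of Eq. (18) is ambiguous about the sign inside the exponential; we assume the cap is the sigmoid form H·σ(u_i) ∈ (0, H), which guarantees ||w_i|| ≤ H by construction:

```python
import math

def capped_weight_norm(V, us, H):
    """Reparameterize columns w_i = (v_i / ||v_i||) * s_i with
    s_i = min(||v_i||, H * sigmoid(u_i)).  The sigmoid cap is our reading
    of the garbled Eq. (18) and should be treated as an assumption.
    V is a list of columns v_i; us are the trainable scalars u_i."""
    W = []
    for v, u in zip(V, us):
        norm = math.sqrt(sum(c * c for c in v))
        cap = H / (1.0 + math.exp(-u))  # H * sigmoid(u), strictly below H
        s = min(norm, cap)
        W.append([c / norm * s for c in v])
    return W

V = [[3.0, 4.0], [0.1, 0.0]]  # hypothetical encoder weight columns
W = capped_weight_norm(V, us=[10.0, 10.0], H=2.0)
for w in W:
    print(w, math.sqrt(sum(c * c for c in w)))  # every column norm <= H
```

Columns already shorter than the cap (like the second one) pass through unchanged, so the constraint only bites where the encoder would otherwise become too steep.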
Under similar assumptions as Proposition 1, it is easy to see that min\u2713 R(\u2713 ; FH) = 0 for\nany H  0 and that R(\u2713 ; FH1)  R(\u2713 ; FH2) for all \u2713 2 \u21e5 when H1 \uf8ff H2 (since FH1 \u2713F H2).\nWe refer to the resulting model as the weight-normalized inference VAE (WNI-VAE) and show in\nTable 1 that weight-normalized amortized inference can achieve similar performance as DVAE.\n\n3.3 Experiments\nWe conducted experiments on statically binarized MNIST, statically binarized OMNIGLOT, and the\nCaltech 101 Silhouettes datasets. These datasets have a relatively small amount of training data and\nare thus susceptible to model over\ufb01tting. For each dataset, we used the same decoder architecture\nacross all four models (VAE, DVAE (\u21b5 = 0.5), DVAE (\u21b5 = 1.0), WNI-VAE) and only modi\ufb01ed the\nencoder, and trained all models using Adam [19] (see Appendix E for more details). To approximate\nthe log-likelihood, we proposed to use importance-weighted stochastic variational inference (IW-SVI),\nan extension of SVI [20] which we describe in detail in Appendix C. Hyperparameter tuning of\nDVAE\u2019s  and WNI-VAE\u2019s FH is described in Table 7.\nTable 1 shows the performance of VAE, DVAE, and WNI-VAE. Regularizing the inference model\nconsistently improved the test set log-likelihood performance. On the MNIST and Caltech 101\nSilhouettes datasets, the results also show a consistent reduction of the test set inference gap when\nthe inference model is regularized. We observed differences in the performance of DVAE versus\nWNI-VAE on the Caltech 101 Silhouettes dataset, suggesting a difference in how denoising and\nweight normalization regularizes the inference model; an interesting consideration would thus be to\ncombine DVAE and WNI. 
As a whole, Table 1 demonstrates that AIR bene\ufb01ts the generative model.\nThe denoising and weight normalization regularizers have respective hyperparameters  and H that\ncontrol the regularization strength. In Figure 1, we performed an ablation analysis of how adjusting\n\n5\n\n\fTable 1: Test set evaluation of VAE, DVAE, and WNI-VAE. The performance metrics are log-\nlikelihood ln p\u2713(x), the amortized ELBO L(x), and the inference gap inf = ln p\u2713(x) L (x). All\nthree proposed models out-perform VAE across most metrics.\n\nVAE\nDVAE (\u21b5 = 0.5)\nDVAE (\u21b5 = 1.0)\nWNI-VAE\n\nL(x)\nL(x)\n ln p\u2713 (x)\nL(x)\n122.35 \u00b10.33\n138.05 \u00b10.15\n86.93 \u00b10.04\n95.48 \u00b10.07\n121.87 \u00b10.37 108.64 \u00b10.19 23.40 \u00b10.19 132.04 \u00b10.37\n86.46 \u00b10.02 6.34 \u00b10.05 92.80 \u00b10.07\n132.60 \u00b10.15\n93.35 \u00b10.06\n86.51 \u00b10.02\n122.56 \u00b10.34\n93.10 \u00b10.02 109.16 \u00b10.12 11.39 \u00b10.10 120.55 \u00b10.20\n86.42 \u00b10.01\n137.82 \u00b10.25\n\n ln p\u2713 (x)\n110.32 \u00b10.16\n109.31 \u00b10.19\n110.12 \u00b10.18\n\n ln p\u2713 (x)\n109.14 \u00b10.28\n\n12.03 \u00b10.25\n12.56 \u00b10.18\n12.44 \u00b10.16\n\n108.66 \u00b10.23\n108.94 \u00b10.31\n\n23.94 \u00b10.15\n28.88 \u00b10.29\n\n28.90 \u00b10.42\n\ninf\n\ninf\n\nMNIST\ninf\n\n8.54 \u00b10.14\n\n6.83 \u00b10.04\n6.68 \u00b10.01\n\nOMNIGLOT\n\nCALTECH\n\nFigure 1: Evaluation of the log-likelihood performance of all three proposed models as we vary\nthe regularization parameter value. The regularization parameter is de\ufb01ned in Table 7. When the\nparameter value is too small, the model over\ufb01ts and the test set performance degrades. When the\nparameter value is too high, the model under\ufb01ts.\n\nthe regularization strength impacts the test set log-likelihood. In almost all cases, we see a transition\nfrom over\ufb01tting to under\ufb01tting as we adjust the strength of AIR. 
For well-chosen regularization\nstrength, however, it is possible to increase the test set log-likelihood performance by 0.5 \u21e0 1.0\nnats\u2014a non-trivial improvement.\n\n3.4 How Does Amortized Inference Regularization Affect the Generator?\nTable 1 shows that regularizing the inference model empirically bene\ufb01ts the generative model. We\nnow provide some initial theoretical characterization of how a smoothed amortized inference model\naffects the generative model. Our analysis rests on the following proposition.\nProposition 2. Let P be an exponential family with corresponding mean parameter space M and\nsuf\ufb01cient statistic function T (\u00b7). With a slight abuse of notation, consider g 2G : Z!M . De\ufb01ne\nq(x, z) = \u02c6p(x)q(z | x), where q(z | x) is a \ufb01xed inference model. Supposing G has in\ufb01nite capacity,\nthen the optimal generative model in Eq. (5) returns g\u21e4(z) = \u00b5 2M , where\n\n\u00b5 =\n\nq(x(i) | z) \u00b7 T (x(i)) =\n\nnXi=1\n\nnXi=1 q(z | x(i))\nPj q(z | x(j)) \u00b7 T (x(i))! .\n\nProposition 2 generalizes the analysis in [21] which determined the optimal generative model when P\nis Gaussian. The key observation is that the optimal generative model outputs a convex combination\nof {(x(i))}, weighted by q(x(i) | z). Furthermore, the weights q(x(i) | z) are simply density ratios\nof the proposal distributions {q(z | x(i))}. As we increase the smoothness of the amortized inference\nn for all z 2Z . This suggests that a smoothed\nmodel, the weight q(x(i) | z) should tend toward 1\ninference model provides a natural way to smooth (and thus regularize) the generative model.\n\n(19)\n\n(20)\n\n4 Amortized Inference Regularization in Importance-Weighted\n\nAutoencoders\n\nIn this section, we extend AIR to importance-weighted autoencoders (IWAE-k). Although the\napplication is straightforward, we demonstrate a noteworthy relationship between the number of\nimportance samples k and the effect of AIR. 
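As a concrete reference point for the role of k, the importance-weighted bound L_k = E[ln (1/k) Σ_i p_θ(x, z_i)/q(z_i | x)] tightens toward ln p_θ(x) as k grows, even for a deliberately misspecified proposal. A Monte Carlo sketch (the linear-Gaussian model z ~ N(0,1), x | z ~ N(z,1) and the off-center proposal are illustrative assumptions, not models from the paper):

```python
import math
import random

random.seed(0)

def log_joint(x, z):
    """ln p(x,z) for the toy model z ~ N(0,1), x|z ~ N(z,1) (an assumption)."""
    return (-0.5 * math.log(2 * math.pi) - 0.5 * z * z
            - 0.5 * math.log(2 * math.pi) - 0.5 * (x - z) ** 2)

def log_q(z, m, s):
    return -0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((z - m) / s) ** 2

def iwae_bound(x, m, s, k, reps=20000):
    """Monte Carlo estimate of L_k = E[ln (1/k) sum_i w_i], w_i = p(x,z_i)/q(z_i|x)."""
    total = 0.0
    for _ in range(reps):
        lws = []
        for _ in range(k):
            z = random.gauss(m, s)
            lws.append(log_joint(x, z) - log_q(z, m, s))
        mx = max(lws)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(lw - mx) for lw in lws)) - math.log(k)
    return total / reps

x, m, s = 1.0, 0.2, 1.0  # deliberately off-center proposal
log_px = -0.5 * math.log(4 * math.pi) - x * x / 4.0  # exact ln p(x) = ln N(1; 0, 2)
l1, l16 = iwae_bound(x, m, s, 1), iwae_bound(x, m, s, 16)
print(l1, l16, log_px)  # L_1 < L_16 < ln p(x): more samples tighten the bound
```

Because the importance weights absorb much of the proposal's error, a smoothed (or otherwise restricted) inference model costs the bound less as k grows, which is the attenuation effect analyzed in this section.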
To begin our analysis, we consider the IWAE-k objective\n\nEz1...zk\u21e0q(z|x)\"ln\n\nmax\n\u2713,\n\np\u2713(x, zi)\n\nq(zi | x)# ,\n\n1\nk\n\nkXi=1\n\n6\n\n\fwhere {z1 . . . zk} are k samples from the proposal distribution q(z | x) to be used as importance-\nsamples. Analysis by [22] allows us to rewrite it as a regularized maximum likelihood objective\n\nmax\n\n\u2713\n\nE\u02c6p(x) [ln p\u2713(x)] \n\nRk(\u2713)\n\nz\n\nmin\n\nf2F(Q)\n\n{\nE\u02c6p(x)Ez2...zk\u21e0f (x) \u02dcD( \u02dcfk(x, z1 . . . zk) k p\u2713(z | x)),\n\n}|\n\nwhere \u02dcfk (or equivalently \u02dcqk) is the unnormalized distribution\n\n(21)\n\n(22)\n\n\u02dcfk(x, z2 . . . zk)(z1) =\n\n= \u02dcqk(z1 | x, z2 . . . zk)\n\np\u2713(x, z1)\n\np\u2713(x,zi)\nf (x)(zi)\n\n1\n\nkPi\n\nand \u02dcD(q k p) = R q(z) [ln q(z)  ln p(z)] dz is the Kullback-Leibler divergence extended to un-\n\nnormalized distributions. For notational simplicity, we omit the dependency of \u02dcfk on (z2 . . . zk).\nImportantly, [22] showed that the IWAE with k importance samples drawn from the amortized\ninference model f is, on expectation, equivalent to a VAE with 1 importance sample drawn from the\nmore expressive inference model \u02dcfk.\n\n4.1\n\nImportance Sampling Attenuates Amortized Inference Regularization\n\nWe now consider the interaction between importance sampling and AIR. We introduce the regularizer\nRk(\u2713 ; ,FH) as follows\n\nRk(\u2713 ; ,FH) = min\n\nf2FH (Q)\n\nE\u02c6p(x)E\"Ez2...zk\u21e0f (x+\") \u02dcD( \u02dcfk(x + \") k p\u2713(z | x)),\n\n(23)\n\nwhich corresponds to a regularizer where weight normalization, denoising, and importance sampling\nare simultaneously applied. By adapting Theorem 1 from [8], we can show that\nProposition 3. Consider the regularizer Rk(\u2713 ; ,FH). 
Under similar assumptions as Proposition 1,\nthen Rk1 is stronger than Rk2 when k1 \uf8ff k2; i.e., min\u2713 Rk1(\u2713 ; ,FH) = min\u2713 Rk2(\u2713 ; ,FH) = 0\nand Rk1(\u2713 ; ,FH) \uf8ff Rk2(\u2713 ; ,FH) for all \u2713 2 \u21e5.\nA notable consequence of Proposition 3 is that as k increases, AIR exhibits a weaker regularizing\neffect on the posterior distributions {p\u2713(z | x(i))}. Intuitively, this arises from the phenomenon\nthat although AIR is applied to f, the subsequent importance-weighting procedure can still create\na \ufb02exible \u02dcfk. Our analysis thus predicts that AIR is less likely to cause under\ufb01tting of IWAE-k\u2019s\ngenerative model as k increases, which we demonstrate in Figure 2. In the limit of in\ufb01nite importance\nsamples, we also predict AIR to have zero regularizing effect since \u02dcf1 (under some assumptions) can\nalways approximate any posterior. However, for practically feasible values of k, we show in Tables 2\nand 3 that AIR is a highly effective regularizer.\n\n4.2 Experiments\n\nTable 2: Test set evaluation of the four models when trained with 8 importance samples. L8(x)\ndenotes the amortized ELBO using 8 importance samples. 
inf = ln p\u2713(x) L 8(x).\n\nOMNIGLOT\n\nCALTECH\n\n ln p\u2713 (x)\nIWAE\n86.21 \u00b10.01\nDIWAE (\u21b5 = 0.5) 85.78 \u00b10.02\nDIWAE (\u21b5 = 1.0) 85.78 \u00b10.03 4.21 \u00b10.03 90.00 \u00b10.06\nWNI-IWAE\n90.14 \u00b10.04\n\n6.13 \u00b10.03\n4.47 \u00b10.02\n\n85.81 \u00b10.01\n\n4.33 \u00b10.03\n\nL8(x)  ln p\u2713 (x)\n92.34 \u00b10.02\n21.52 \u00b10.13\n90.25 \u00b10.03 107.01 \u00b10.11 8.64 \u00b10.07 115.66 \u00b10.17 107.34 \u00b10.17 17.61 \u00b10.18\n\n ln p\u2713 (x)\n108.65 \u00b10.11\n\nL8(x)\n116.87 \u00b10.16\n\n108.18 \u00b10.24\n\n8.69 \u00b10.39\n\ninf\n\ninf\n\nL8(x)\n130.17 \u00b10.09\n124.96 \u00b10.14\n107.54 \u00b10.11 17.06 \u00b10.35 124.60 \u00b10.29\n107.98 \u00b10.19\n130.16 \u00b10.14\n\n22.18 \u00b10.33\n\n107.47 \u00b10.06 8.57 \u00b10.14 116.04 \u00b10.18\n107.15 \u00b10.08\n115.93 \u00b10.10\n\n8.78 \u00b10.17\n\nMNIST\ninf\n\nTable 3: Test set evaluation of the four models when trained with 64 importance samples. inf =\nln p\u2713(x) L 64(x).\n\nMNIST\ninf\n\n ln p\u2713 (x)\n86.06 \u00b10.03\n\nL64(x)  ln p\u2713 (x)\n90.48 \u00b10.07\n\nOMNIGLOT\n\ninf\n\nIWAE\nDIWAE (\u21b5 = 0.5) 85.55 \u00b10.02 3.01 \u00b10.01 88.56 \u00b10.02 106.02 \u00b10.01 6.98 \u00b10.06\nDIWAE (\u21b5 = 1.0) 85.55 \u00b10.02\nWNI-IWAE\n85.64 \u00b10.03\n\n106.15 \u00b10.03\n106.17 \u00b10.07\n\n88.70 \u00b10.04\n88.74 \u00b10.03\n\n3.15 \u00b10.02\n3.10 \u00b10.01\n\n4.41 \u00b10.10\n\n6.70 \u00b10.05 112.85 \u00b10.07\n7.11 \u00b10.07\n113.28 \u00b10.13\n\n107.31 \u00b10.14 6.66 \u00b10.22 113.97 \u00b10.10\n\nCALTECH\n\ninf\n\nL64(x)  ln p\u2713 (x)\n\nL64(x)\n125.40 \u00b10.25\n113.00 \u00b10.07 106.94 \u00b10.11 12.28 \u00b10.14 119.22 \u00b10.11\n119.87 \u00b10.16\n122.57 \u00b10.10\n\n106.96 \u00b10.11\n108.15 \u00b10.11\n\n12.94 \u00b10.22\n14.42 \u00b10.20\n\n108.89 \u00b10.35\n\n16.51 \u00b10.32\n\n7\n\n\fTables 2 and 3 extends the model evaluation to IWAE-8 and IWAE-64. 
We see that the denoising\nIWAE (DIWAE) and weight-normalized inference IWAE (WNI-IWAE) consistently out-perform the\nstandard IWAE on test set log-likelihood evaluations. Furthermore, the regularized models frequently\nreduced the inference gap as well. Our results demonstrate that AIR is a highly effective regularizer\neven when a large number of importance samples are used.\nOur main experimental contribution in this section is the veri\ufb01cation that increasing the number of\nimportance samples results in less under\ufb01tting when the inference model is over-regularized. In\ncontrast to k = 1, where aggressively increasing the regularization strength can cause considerable\nunder\ufb01tting, Figure 2 shows that increasing the number of importance samples to k = 8 and k = 64\nmakes the models much more robust to mis-speci\ufb01ed choices of regularization strength. Interestingly,\nwe also observed that the optimal regularization strength (determined using the validation set)\nincreases with k (see Table 7 for details). The robustness of importance sampling when paired with\namortized inference regularization makes AIR an effective and practical way to regularize IWAE.\n\nFigure 2: Evaluation of the log-likelihood performance of all three proposed models as we vary\nthe regularization parameter (see Table 7 for de\ufb01nition) and number of importance samples k. To\ncompare across different k\u2019s, the performance without regularization (IWAE-k baseline) is subtracted.\nWe see that IWAE-64 is the least likely to under\ufb01t when the regularization parameter value is high.\n\n4.3 Are High Signal-to-Noise Ratio Gradients Necessarily Better?\n\nWe note the existence of a related work [23] that also concluded that approximating maximum\nlikelihood training is not necessarily better. 
However, [23] focused on increasing the signal-to-noise\nratio of the gradient updates and analyzed the trade-off between importance sampling and Monte\nCarlo sampling under budgetary constraints. An in-depth discussion of these two works within the\ncontext of generalization is provided in Appendix D.\n\n5 Conclusion\n\nIn this paper, we challenged the conventional role that amortized inference plays in training deep\ngenerative models. In addition to expediting variational inference, amortized inference introduces new\nways to regularize maximum likelihood training. We considered a special case of amortized inference\nregularization (AIR) where the inference model must learn a smoothed mapping from X!Q\nand showed that the denoising variational autoencoder (DVAE) and weight-normalized inference\n(WNI) are effective instantiations of AIR. Promising directions for future work include replacing\ndenoising with adversarial training [24] and weight normalization with spectral normalization [25].\nFurthermore, we demonstrated that AIR plays a crucial role in the regularization of IWAE, and that\nhigher levels of regularization may be necessary due to the attenuating effects of importance sampling\non AIR. We believe that variational family expansion by Monte Carlo methods [26] may exhibit the\nsame attenuating effect on AIR and recommend this as an additional research direction.\n\n8\n\n\fAcknowledgements\n\nThis research was supported by TRI, NSF (#1651565, #1522054, #1733686 ), ONR, Sony, and FLI.\nToyota Research Institute provided funds to assist the authors with their research but this article solely\nre\ufb02ects the opinions and conclusions of its authors and not TRI or any other Toyota entity.\n\nReferences\n[1] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[2] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 
Semi-Supervised Learning With Deep Generative Models. In Advances In Neural Information Processing Systems, pages 3581–3589, 2014.

[3] Hyunjik Kim and Andriy Mnih. Disentangling By Factorising. arXiv preprint arXiv:1802.05983, 2018.

[4] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating Sources Of Disentanglement In Variational Autoencoders. arXiv preprint arXiv:1802.04942, 2018.

[5] Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-Amortized Variational Autoencoders. arXiv preprint arXiv:1802.02550, 2018.

[6] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference With Inverse Autoregressive Flow. In Advances In Neural Information Processing Systems, pages 4743–4751, 2016.

[7] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder Variational Autoencoders. In Advances In Neural Information Processing Systems, pages 3738–3746, 2016.

[8] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[9] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation And Approximate Inference In Deep Generative Models. arXiv preprint arXiv:1401.4082, 2014.

[10] Chris Cremer, Xuechen Li, and David Duvenaud. Inference Suboptimality In Variational Autoencoders. arXiv preprint arXiv:1801.03558, 2018.

[11] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On The Quantitative Analysis Of Decoder-Based Generative Models. arXiv preprint arXiv:1611.04273, 2016.

[12] Rahul G Krishnan, Dawen Liang, and Matthew Hoffman. On The Challenges Of Learning With Inference Networks On Sparse, High-Dimensional Data.
arXiv preprint arXiv:1710.06085, 2017.

[13] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary Deep Generative Models. arXiv preprint arXiv:1602.05473, 2016.

[14] Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical Variational Models. In International Conference On Machine Learning, pages 324–333, 2016.

[15] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior Regularization For Structured Latent Variable Models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.

[16] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian Inference With Posterior Regularization And Applications To Infinite Latent SVMs. The Journal of Machine Learning Research, 15(1):1799–1847, 2014.

[17] Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, Yoshua Bengio, et al. Denoising Criterion For Variational Auto-Encoding Framework. In AAAI, pages 2059–2065, 2017.

[18] Tim Salimans and Diederik P Kingma. Weight Normalization: A Simple Reparameterization To Accelerate Training Of Deep Neural Networks. In Advances In Neural Information Processing Systems, pages 901–909, 2016.

[19] Diederik P Kingma and Jimmy Ba. Adam: A Method For Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[21] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From Optimal Transport To Generative Modeling: The VEGAN Cookbook. arXiv preprint arXiv:1705.07642, 2017.

[22] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting Importance-Weighted Autoencoders. arXiv preprint arXiv:1704.02916, 2017.

[23] Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh.
Tighter Variational Bounds Are Not Necessarily Better. arXiv preprint arXiv:1802.04537, 2018.

[24] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining And Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572, 2014.

[25] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization For Generative Adversarial Networks. arXiv preprint arXiv:1802.05957, 2018.

[26] Matthew D Hoffman. Learning Deep Latent Gaussian Models With Markov Chain Monte Carlo. In International Conference On Machine Learning, pages 1510–1519, 2017.

[27] Yingzhen Li and Richard E Turner. Rényi Divergence Variational Inference. In Advances In Neural Information Processing Systems, pages 1073–1081, 2016.

[28] Jakub M Tomczak and Max Welling. VAE With A VampPrior. arXiv preprint arXiv:1705.07120, 2017.

[29] Samuel L. Smith and Quoc V. Le. A Bayesian Perspective On Generalization And Stochastic Gradient Descent. In International Conference On Learning Representations, 2018.

[30] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp Minima Can Generalize For Deep Nets. arXiv preprint arXiv:1703.04933, 2017.

[31] Dominic Masters and Carlo Luschi. Revisiting Small Batch Training For Deep Neural Networks. arXiv preprint arXiv:1804.07612, 2018.

[32] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding Deep Learning Requires Rethinking Generalization. arXiv preprint arXiv:1611.03530, 2016.

[33] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering With Bregman Divergences.
Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.