{"title": "Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 6925, "page_last": 6934, "abstract": "We propose a simple and general variant of the standard reparameterized gradient estimator for the variational evidence lower bound. Specifically, we remove a part of the total derivative with respect to the variational parameters that corresponds to the score function. Removing this term produces an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior. We analyze the behavior of this gradient estimator theoretically and empirically, and generalize it to more complex variational distributions such as mixtures and importance-weighted posteriors.", "full_text": "Sticking the Landing: Simple, Lower-Variance\nGradient Estimators for Variational Inference\n\nGeoffrey Roeder\nUniversity of Toronto\n\nroeder@cs.toronto.edu\n\nYuhuai Wu\n\nUniversity of Toronto\n\nywu@cs.toronto.edu\n\nDavid Duvenaud\n\nUniversity of Toronto\n\nduvenaud@cs.toronto.edu\n\nAbstract\n\nWe propose a simple and general variant of the standard reparameterized gradient\nestimator for the variational evidence lower bound. Speci\ufb01cally, we remove a part\nof the total derivative with respect to the variational parameters that corresponds to\nthe score function. Removing this term produces an unbiased gradient estimator\nwhose variance approaches zero as the approximate posterior approaches the exact\nposterior. 
We analyze the behavior of this gradient estimator theoretically and\nempirically, and generalize it to more complex variational distributions such as\nmixtures and importance-weighted posteriors.\n\n1 Introduction\n\n[Figure 1, y-axis label: KL(\u03c6init || \u03c6true)]\n\nRecent advances in variational inference have begun to\nmake approximate inference practical in large-scale latent\nvariable models. One of the main recent advances has\nbeen the development of variational autoencoders along\nwith the reparameterization trick [Kingma and Welling,\n2013, Rezende et al., 2014]. The reparameterization\ntrick is applicable to most continuous latent-variable models,\nand usually provides lower-variance gradient estimates\nthan the more general REINFORCE gradient estimator\n[Williams, 1992].\nIntuitively, the reparameterization trick provides more informative\ngradients by exposing the dependence of sampled\nlatent variables z on variational parameters \u03c6. In\ncontrast, the REINFORCE gradient estimate only depends\non the relationship between the density function\nlog q\u03c6(z|x, \u03c6) and its parameters.\nSurprisingly, even the reparameterized gradient estimate\ncontains the score function, a special case of the REINFORCE\ngradient estimator. We show that this term can\neasily be removed, and that doing so gives even lower-variance gradient estimates in many circumstances. 
In particular, as the variational posterior approaches the true posterior, this gradient estimator\napproaches zero variance faster, making stochastic gradient-based optimization converge and \"stick\"\nto the true variational parameters, as seen in figure 1.\n\n[Figure 1 plot: x-axis: Iterations; legend: Optimization using Path Derivative vs. Total Derivative]\nFigure 1: Fitting a 100-dimensional variational\nposterior to another Gaussian, using\nstandard gradient versus our proposed path\nderivative gradient estimator.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 Contributions\n\n\u2022 We present a novel unbiased estimator for the variational evidence lower bound (ELBO)\nthat has zero variance when the variational approximation is exact.\n\u2022 We provide a simple and general implementation of this trick in terms of a single change to\nthe computation graph operated on by standard automatic differentiation packages.\n\u2022 We generalize our gradient estimator to mixture and importance-weighted lower bounds,\nand discuss extensions to flow-based approximate posteriors. This change takes a single\nfunction call using automatic differentiation packages.\n\u2022 We demonstrate the efficacy of this trick through experimental results on MNIST and\nOmniglot datasets using variational and importance-weighted autoencoders.\n\n1.2 Background\n\nMaking predictions or computing expectations using latent variable models requires approximating\nthe posterior distribution p(z|x). Calculating these quantities in turn amounts to using Bayes\u2019 rule:\np(z|x) = p(x|z)p(z)/p(x).\nVariational inference approximates p(z|x) with a tractable distribution q\u03c6(z|x) parameterized by \u03c6\nthat is close in KL-divergence to the exact posterior. 
Minimizing the KL-divergence is equivalent to\nmaximizing the evidence lower bound (ELBO):\n\nL(\u03c6) = Ez\u223cq[log p(x, z) \u2212 log q\u03c6(z|x)]    (ELBO)\n\nAn unbiased approximation of the gradient of the ELBO allows stochastic gradient descent to scalably\nlearn parametric models. Stochastic gradients of the ELBO can be formed from the REINFORCE-style\ngradient, which applies to any continuous or discrete model, or a reparameterized gradient,\nwhich requires the latent variables to be modeled as continuous. Our variance reduction trick applies\nto the reparameterized gradient of the evidence lower bound.\n\n2 Estimators of the variational lower bound\n\nIn this section, we analyze the gradient of the ELBO with respect to the variational parameters to\nshow a source of variance that depends on the complexity of the approximate distribution.\nWhen the joint distribution p(x, z) can be evaluated by p(x|z) and p(z) separately, the ELBO can be\nwritten in the following three equivalent forms:\n\nL(\u03c6) = Ez\u223cq[log p(x|z) + log p(z) \u2212 log q\u03c6(z|x)]    (1)\n     = Ez\u223cq[log p(x|z) + log p(z)] + H[q\u03c6]    (2)\n     = Ez\u223cq[log p(x|z)] \u2212 KL(q\u03c6(z|x)||p(z))    (3)\n\nWhich ELBO estimator is best? When p(z) and q\u03c6(z|x) are multivariate Gaussians, using equation\n(3) is appealing because it analytically integrates out terms that would otherwise have to be estimated\nby Monte Carlo. Intuitively, we might expect that using exact integrals wherever possible will give\nlower-variance estimators by reducing the number of terms to be estimated by Monte Carlo methods.\nSurprisingly, even when analytic forms of the entropy or KL divergence are available, sometimes it is\nbetter to use (1) because it will have lower variance. Specifically, this occurs when q\u03c6(z|x) = p(z|x),\ni.e. the variational approximation is exact. Then, the variance of the full Monte Carlo estimator \u02c6LMC\nis exactly zero. 
Its value is a constant, independent of the samples z \u223c q\u03c6(z|x) drawn i.i.d. This follows from the assumption\nq\u03c6(z|x) = p(z|x):\n\n\u02c6LMC(\u03c6) = log p(x, z) \u2212 log q\u03c6(z|x) = log p(z|x) + log p(x) \u2212 log p(z|x) = log p(x).    (4)\n\nThis suggests that using equation (1) should be preferred when we believe that q\u03c6(z|x) \u2248 p(z|x).\nAnother reason to prefer the ELBO estimator given by equation (1) is that it is the most generally\napplicable, requiring a closed form only for q\u03c6(z|x). This makes it suitable for highly flexible\napproximate distributions such as normalizing flows [Jimenez Rezende and Mohamed, 2015], Real\nNVP [Dinh et al., 2016], or Inverse Autoregressive Flows [Kingma et al., 2016].\n\nEstimators of the lower bound gradient What about estimating the gradient of the evidence\nlower bound? Perhaps surprisingly, the variance of the gradient of the fully Monte Carlo estimator (1)\nwith respect to the variational parameters is not zero, even when the variational parameters exactly\ncapture the true posterior, i.e., q\u03c6(z|x) = p(z|x).\nThis phenomenon can be understood by decomposing the gradient of the evidence lower bound.\nUsing the reparameterization trick, we can express a sample z from a parametric distribution q\u03c6(z)\nas a deterministic function of a random variable \u03b5 with some fixed distribution and the parameters \u03c6\nof q\u03c6, i.e., z = t(\u03b5, \u03c6). For example, if q\u03c6 is a diagonal Gaussian, then for \u03b5 \u223c N (0, I), z = \u00b5 + \u03c3\u03b5\nis a sample from q\u03c6.\nUnder such a parameterization of z, we can decompose the total derivative (TD) of the integrand of\nestimator (1) w.r.t. 
the trainable parameters \u03c6 as\n\n\u02c6\u2207TD(\u03b5, \u03c6) = \u2207\u03c6 [log p(x|z) + log p(z) \u2212 log q\u03c6(z|x)]    (5)\n= \u2207\u03c6 [log p(z|x) + log p(x) \u2212 log q\u03c6(z|x)]    (6)\n= \u2207z [log p(z|x) \u2212 log q\u03c6(z|x)] \u2207\u03c6 t(\u03b5, \u03c6) \u2212 \u2207\u03c6 log q\u03c6(z|x),    (7)\n\nwhere the first term of (7), \u2207z [log p(z|x) \u2212 log q\u03c6(z|x)] \u2207\u03c6 t(\u03b5, \u03c6), is labeled the path derivative and the second term, \u2207\u03c6 log q\u03c6(z|x), is labeled the score function.\n\nThe reparameterized gradient estimator w.r.t. \u03c6\ndecomposes into two parts. We call these the\npath derivative and score function components.\nThe path derivative measures dependence on\n\u03c6 only through the sample z. The score function\nmeasures the dependence on log q\u03c6 directly,\nwithout considering how the sample z changes\nas a function of \u03c6.\nWhen q\u03c6(z|x) = p(z|x) for all z, the path\nderivative component of equation (7) is identically\nzero for all z. However, the score function\ncomponent is not necessarily zero for any\nz in some finite sample, meaning that the total\nderivative gradient estimator (7) will have non-zero\nvariance even when q matches the exact\nposterior everywhere.\nThis variance is induced by the Monte Carlo\nsampling procedure itself. Figure 3 depicts\nthis phenomenon through the loss surface of\nlog p(x, z) \u2212 log q\u03c6(z|x) for a Mixture of Gaussians\napproximate and true posterior.\n\nFigure 2: The evidence lower bound is a function of the\nsampled latent variables z and the variational parameters\n\u03c6. As the variational distribution approaches the true\nposterior, the gradient with respect to the sampled z\n(blue) vanishes.\n\nPath derivative of the ELBO Could we remove the high-variance score function term from the\ngradient estimate? For stochastic gradient descent to converge, we require that our gradient estimate\nis unbiased. By construction, the gradient estimate given by equation (7) is unbiased. 
Fortunately, the\nproblematic score function term has expectation zero, since Ez\u223cq[\u2207\u03c6 log q\u03c6(z|x)] = \u2207\u03c6 \u222b q\u03c6(z|x) dz = 0. If we simply remove that term, we maintain an\nunbiased estimator of the true gradient:\n\n\u02c6\u2207PD(\u03b5, \u03c6) = \u2207z [log p(z|x) \u2212 log q\u03c6(z|x)] \u2207\u03c6 t(\u03b5, \u03c6).    (8)\n\nThis estimator, which we call the path derivative gradient estimator due to its dependence on the\ngradient flow only through the path variables z to update \u03c6, is equivalent to the standard gradient\nestimate with the score function term removed. The path derivative estimator has the desirable\nproperty that as q\u03c6(z|x) approaches p(z|x), the variance of this estimator goes to zero.\n\nWhen to prefer the path derivative estimator Does eliminating the score function term from the\ngradient yield lower variance in all cases? It might seem that its removal can only have a variance\nreduction effect on the gradient estimator. Interestingly, the variance of the path derivative gradient\nestimator may actually be higher in some cases. This will be true when the score function is positively\ncorrelated with the remaining terms in the total derivative estimator. In this case, the score function\nacts as a control variate: a zero-expectation term added to an estimator in order to reduce variance.\n\n[Figure 2 plot labels: Variational Parameters (\u03c6); Latent Variable (z); log p(x, z) \u2212 log q\u03c6(z|x); q\u03c6(z|x) = p(z|x); Surface; Along Trajectory through True \u03c6; ELBO]\n\nAlg. 
1 Standard ELBO Gradient\nInput: Variational parameters \u03c6t, Data x\n\u03b5t \u223c p(\u03b5)\ndef \u02c6Lt(\u03c6):\n    zt \u2190 sample_q(\u03c6, \u03b5t)\n    return log p(x, zt) - log q(zt|x, \u03c6)\nreturn \u2207\u03c6 \u02c6Lt(\u03c6t)\n\nAlg. 2 Path Derivative ELBO Gradient\nInput: Variational parameters \u03c6t, Data x\n\u03b5t \u223c p(\u03b5)\ndef \u02c6Lt(\u03c6):\n    zt \u2190 sample_q(\u03c6, \u03b5t)\n    \u03c6\u2032 \u2190 stop_gradient(\u03c6)\n    return log p(x, zt) - log q(zt|x, \u03c6\u2032)\nreturn \u2207\u03c6 \u02c6Lt(\u03c6t)\n\nControl variates are usually scaled by an adaptive constant c\u2217, which modifies the magnitude and\ndirection of the control variate to optimally reduce variance, as in Ranganath et al. [2014]. In the\npreceding discussion, we have shown that \u02c6c\u2217 = 1 is optimal when the variational approximation is\nexact, since that choice yields analytically zero variance. When the variational approximation is not\nexact, an estimate of c\u2217 based on the current minibatch will change sign and magnitude depending on\nthe positive or negative correlation of the score function with the path derivative.\nOptimal scale estimation procedures are particularly important when the variance of an estimator is\nso large that convergence is unlikely. However, in the present case of reparameterized gradients,\nwhere the variance is already low, estimating a scaling constant introduces another source of variance.\nIndeed, we can only recover the true optimal scale when the variational approximation is exact in the\nregime of infinite samples during Monte Carlo integration.\nMoreover, the score function must be independently estimated in order to scale it. 
Estimating the\ngradient of the score function independent of automatic reverse-mode differentiation can be a challenging\nengineering task for many flexible approximate posterior distributions such as Normalizing\nFlows [Jimenez Rezende and Mohamed, 2015], Real NVP [Dinh et al., 2016], or IAF [Kingma et al.,\n2016].\nBy contrast, in section 6 we show improved performance on the MNIST and Omniglot density\nestimation benchmarks by approximating the optimal scale with 1 throughout optimization. This\ntechnique is easy to implement using existing automatic differentiation software packages. However,\nif estimating the score function independently is computationally feasible, and a practitioner has\nevidence that the variance induced by Monte Carlo integration will reduce the overall variance away\nfrom the optimum point, we recommend establishing an annealing schedule for the optimal scaling\nconstant that converges to 1.\n\n3 Implementation Details\n\nIn this section, we introduce algorithms 1 and 2 in relation to reverse-mode automatic differentiation,\nand discuss how to implement the new gradient estimator in Theano, Autograd, Torch or TensorFlow\n[Bergstra et al., 2010, Maclaurin et al., 2015, Collobert et al., 2002, Abadi et al., 2015].\nAlgorithm 1 shows the standard reparameterized gradient for the ELBO. We require three function\ndefinitions: sample_q to generate a reparameterized sample from the variational approximation,\nand functions that implement log p(x, z) and log q(z|x, \u03c6). Once the loss \u02c6Lt is defined, we can\nleverage automatic differentiation to return the standard gradient evaluated at \u03c6t. This yields equation\n(7).\nAlgorithm 2 shows the path derivative gradient for the ELBO. The only difference from algorithm 1\nis the application of the stop_gradient function to the variational parameters\ninside \u02c6Lt. 
Table 1 indicates the names of stop_gradient in popular software packages.\n\nTheano: T.gradient.disconnected_grad\nAutograd: autograd.core.getval\nTensorFlow: tf.stop_gradient\nTorch: torch-autograd.util.get_value\nTable 1: Functions that implement stop_gradient\n\nAlg. 3 Path Derivative Mixture ELBO Gradient\nInput: Params \u03c0_t = {\u03c0_t^j}_{j=1}^K, \u03c6_t = {\u03c6_t^i}_{i=1}^K, Data x\n\u03b5_t \u223c p(\u03b5)\n\u03c6\u2032_t, \u03c0\u2032_t \u2190 stop_gradient(\u03c6_t, \u03c0_t)\ndef \u02c6L_t^c(\u03c6):\n    z_t^c \u2190 sample_q(\u03c6, \u03b5_t)\n    return log p(x, z_t^c) - log \u2211_{k=1}^K \u03c0\u2032_t^k q(z_t^c|x, \u03c6\u2032_t^k)\nreturn \u2207\u03c6,\u03c0 ( \u2211_{c=1}^K \u03c0_t^c \u02c6L_t^c(\u03c6_t^c) )\n\nAlg. 4 IWAE ELBO Gradient\nInput: Params \u03c6_t, Data x\n\u03b5_1, \u03b5_2, ..., \u03b5_K \u223c p(\u03b5)\n\u03c6\u2032_t \u2190 stop_gradient(\u03c6_t)\ndef w_i(\u03c6, \u03b5_i):\n    z_i \u2190 sample_q(\u03c6, \u03b5_i)\n    return p(x, z_i) / q(z_i|x, \u03c6\u2032_t)\nreturn \u2207\u03c6 log( (1/K) \u2211_{i=1}^K w_i(\u03c6, \u03b5_i) )\n\nThis simple modification to algorithm 1 generates a copy of the parameter variable that is treated as a\nconstant with respect to the computation graph generated for automatic differentiation. The copied\nvariational parameters are used to evaluate the variational density log q\u03c6 at z.\nRecall that the variational parameters \u03c6 are used both to generate z through some deterministic\nfunction of an independent random variable \u03b5, and to evaluate the density of z through log q\u03c6. By\nblocking the gradient through variational parameters in the density function, we eliminate the score\nfunction term that appears in equation (7). 
Per-iteration updates to the variational parameters \u03c6 rely\non the z channel only, i.e., the path derivative component of the gradient of the loss function \u02c6Lt.\nThis yields the gradient estimator corresponding to equation (8).\n\n4 Extensions to Richer Variational Families\n\nMixture Distributions In this section, we discuss\nextensions of the path derivative gradient\nestimator to richer variational approximations to\nthe true posterior.\nUsing a mixture distribution as an approximate\nposterior in an otherwise differentiable estimator\nintroduces a problematic, non-differentiable\nrandom variable \u03c0 \u223c Cat(\u03b1). We solve this by\nintegrating out the discrete mixture choice from\nboth the ELBO and the mixture distribution. In\nthis section, we show that such a gradient estimator\nis unbiased, and introduce an extended\nalgorithm to handle mixture variational families.\nFor any mixture of K base distributions q\u03c6(z|x),\na mixture variational family can be defined by\nq\u03c6M(z|x) = \u2211_{c=1}^K \u03c0c q\u03c6c(z|x), where \u03c6M =\n{\u03c01, ..., \u03c0K, \u03c61, ..., \u03c6K} are variational parameters,\ni.e., the weights and distributional parameters\nfor each component. Then, the mixture\nELBO LM is given by:\n\nLM = \u2211_{c=1}^K \u03c0c Ezc\u223cq\u03c6c [ log p(x, zc) \u2212 log ( \u2211_{k=1}^K \u03c0k q\u03c6k(zc|x) ) ],\n\nwhere the outer sum integrates over the choice of mixture component for each sample from q\u03c6M,\nand the inner sum evaluates the density. Applying the new gradient estimator to the mixture ELBO\ninvolves applying it to each q\u03c6k(zc|x) in the inner marginalization.\nAlgorithm 3 implements the gradient estimator of (8) in the context of a continuous mixture distribution.\nLike algorithm 2, the new gradient estimator of algorithm 3 differs from the vanilla gradient estimator only\nin the application of stop_gradient to the variational parameters. This eliminates the gradient\nof the score function from the gradient of any mixture distribution.\n\n[Figure 3 plot: y-axis: Trace Norm of Covariance Matrix; x-axis: Variational Parameters \u03c6init \u2192 \u03c6true; legend: Total Derivative Estimator, Path Derivative Estimator; inset legend: True Posterior, Variational Approximation]\nFigure 3: Fitting a mixture of 5 Gaussians as a variational\napproximation to a posterior that is also a mixture\nof 5 Gaussians. Path derivative and score function gradient\ncomponents were measured 1000 times. The path\nderivative goes to 0 as the variational approximation\nbecomes exact, along an arbitrarily chosen path.\n\nImportance-Weighted Autoencoder We also explore the effect of our new gradient estimator on\nthe IWAE bound [Burda et al., 2015], defined as\n\n\u02c6LK = Ez1,...,zK\u223cq(z|x) [ log ( (1/K) \u2211_{i=1}^K p(x, zi)/q(zi|x) ) ]    (9)\n\nwith gradient\n\n\u2207\u03c6 \u02c6LK = Ez1,...,zK\u223cq(z|x) [ \u2211_{i=1}^K \u02dcwi \u2207\u03c6 log wi ],    (10)\n\nwhere wi := p(x, zi)/q(zi|x) and \u02dcwi := wi / \u2211_{j=1}^K wj. Since \u2207\u03c6 log wi is the same gradient as\nthe Monte Carlo estimator of the ELBO (equation (7)), we can again apply our trick to get a new\nestimator.\nHowever, it is not obvious whether this new gradient estimator is unbiased. 
In the unmodified IWAE\nbound, when q = p, the gradient with respect to the variational parameters reduces to:\n\nEz1,...,zK\u223cq(z|x) [ \u2212 \u2211_{i=1}^K \u02dcwi \u2207\u03c6 log q\u03c6(zi|x) ].    (11)\n\nEach sample zi is used to evaluate both \u02dcwi and the partial derivative term. Hence, we cannot\nsimply appeal to the linearity of expectation to show that this gradient is 0. Nevertheless, a natural\nextension of the variance reduction technique in equation (8) is to apply our variance reduction to\neach importance-weighted gradient sample. See algorithm 4 for how to implement the path derivative\nestimator in this form.\nWe present empirical validation of the idea in our experimental results section, which shows markedly\nimproved results using our gradient estimator. We observe a strong improvement in many cases,\nsupporting our conjecture that the gradient estimator is unbiased as in the mixture and multi-sample\nELBO cases.\n\nFlow Distributions Flow-based approximate posteriors such as Kingma et al. [2016], Dinh et al.\n[2016], Jimenez Rezende and Mohamed [2015] are a powerful and flexible framework for fitting\napproximate posterior distributions in variational inference. Flow-based variational inference samples\nan initial z0 from a simple base distribution with known density, then learns a chain of invertible,\nparameterized maps fk(zk\u22121) that warp z0 into zK = fK \u25e6 fK\u22121 \u25e6 ... \u25e6 f1(z0). The endpoint\nzK represents a sample from a more flexible distribution with density log qK(zK) = log q0(z0) \u2212\n\u2211_{k=1}^K log |det \u2202fk/\u2202zk\u22121|.\n\nWe expect our gradient estimator to improve the performance of flow-based stochastic variational\ninference. However, due to the chain composition used to learn zK, we cannot straightforwardly apply\nour trick as described in algorithm 2. 
This is because each intermediate zj, 1 \u2264 j \u2264 K contributes\nto the path derivative component in equation (8). The log-Jacobian terms used in the evaluation of\nlog q(zk), however, require this gradient information to calculate the correct estimator. By applying\nstop_gradient to the variational parameters used to generate each intermediate zi and passing\nonly the endpoint zK to a log density function, we would lose necessary gradient information at\neach intermediate step needed for the gradient estimator to be correct. At time of writing, the\nrequisite software engineering to track and expose intermediate steps during backpropagation is not\nimplemented in the packages listed in Table 1, and so we leave this to future work.\n\n5 Related Work\n\nOur modi\ufb01cation of the standard reparameterized gradient estimate can be interpreted as adding a\ncontrol variate, and in fact Ranganath et al. [2014] investigated the use of the score function as a\ncontrol variate in the context of non-reparameterized variational inference. The variance-reduction\neffect we use to motivate our general gradient estimator has been noted in the special cases of\nGaussian distributions with sparse precision matrices and Gaussian copula inference in Tan and\nNott [2017] and Han et al. [2016] respectively. 
In particular, Tan and Nott [2017] observe that by\neliminating certain terms from a gradient estimator for Gaussian families parameterized by sparse\nprecision matrices, multiple lower-variance unbiased gradient estimators may be derived.\n\nstochastic layers | k | MNIST VAE Total | MNIST VAE Path | MNIST IWAE Total | MNIST IWAE Path | Omniglot VAE Total | Omniglot VAE Path | Omniglot IWAE Total | Omniglot IWAE Path\n1 | 1 | 86.76 | 86.40 | 86.76 | 86.40 | 108.11 | 107.39 | 108.11 | 107.39\n1 | 5 | 86.47 | 86.33 | 85.54 | 85.20 | 107.62 | 107.40 | 106.12 | 105.42\n1 | 50 | 86.35 | 86.48 | 84.78 | 84.45 | 107.80 | 107.42 | 104.67 | 104.16\n2 | 1 | 85.33 | 84.77 | 85.33 | 84.77 | 107.58 | 105.22 | 107.56 | 105.22\n2 | 5 | 85.01 | 84.68 | 83.89 | 83.57 | 106.31 | 104.87 | 104.79 | 103.59\n2 | 50 | 84.78 | 84.33 | 82.90 | 83.16 | 106.30 | 105.70 | 103.38 | 102.86\nTable 2: Results on variational (VAE) and importance-weighted (IWAE) autoencoders using the total\nderivative estimator, equation (7), versus the path derivative estimator, equation (8) (ours).\n\nOur work is a generalization to any continuous variational family. This provides a framework for\neasily implementing the technique in existing software packages that provide automatic differentiation.\nBy expressing the general technique in terms of automatic differentiation, we eliminate the need\nfor case-by-case analysis of the gradient of the variational lower bound as in Tan and Nott [2017]\nand Han et al. [2016].\nAn innovation by Ruiz et al. [2016] introduces the generalized reparameterization gradient (GRG)\nwhich unifies the REINFORCE-style and reparameterization gradients. GRG employs a weaker\nform of reparameterization that requires only the first moment to have no dependence on the latent\nvariables, as opposed to complete independence as in Kingma and Welling [2013]. GRG improves on\nthe variance of the score-function gradient estimator in BBVI without the use of Rao-Blackwellization\nas in Ranganath et al. [2014]. 
A term in their estimator also behaves like a control variate.\nThe present study, in contrast, develops a simple drop-in variance reduction technique through an\nanalysis of the functional form of the reparameterized evidence lower bound gradient. Our technique\nis developed outside of the framework of GRG but can strongly improve the performance of existing\nalgorithms, as demonstrated in section 6. Our technique can be applied alongside GRG.\nIn the Python toolkit Edward [Tran et al., 2016], efforts are ongoing to develop algorithms that\nimplement stochastic variational inference in general as a black-box method. In cases where an\nanalytic form of the entropy or KL-divergence is known, the score function term can be avoided\nusing Edward. This is equivalent to using equations (2) or (3) respectively to estimate the ELBO. As\nof release 1.2.4 of Edward, the total derivative gradient estimator corresponding to (7) is used for\nreparameterized stochastic variational inference.\n\n6 Experiments\n\nExperimental Setup Because we follow the experimental setup of Burda et al. [2015], we review\nit briefly here. Both benchmark datasets are composed of 28 \u00d7 28 binarized images. The MNIST\ndataset was split into 60,000 training and 10,000 test examples. The Omniglot dataset was split\ninto 24,345 training and 8,070 test examples. Each model used Xavier initialization [Glorot and\nBengio, 2010] and was trained using Adam with parameters \u03b21 = 0.9, \u03b22 = 0.999, and \u03b5 = 1e\u22124 with\n20 observations per minibatch [Kingma and Ba, 2015]. We compared against both architectures\nreported in Burda et al. [2015]. The first has one stochastic layer with 50 hidden units, encoded using\ntwo fully-connected layers of 200 neurons each, using a tanh nonlinearity throughout. 
The second\narchitecture has two stochastic layers: the first stochastic layer encodes the observations, with two\nfully-connected layers of 200 hidden units each, into 100 dimensional outputs. The output is used as\nthe parameters of a diagonal Gaussian. The second layer takes samples from this Gaussian and passes\nthem through two fully-connected layers of 100 hidden units each into 50 dimensions.\nSee table 2 for NLL scores estimated as the mean of equation (9) with k=5000 on the test set. We can\nsee that the path derivative gradient estimator improves over the original gradient estimator in all but\ntwo cases.\n\nBenchmark Datasets We evaluate our path derivative estimator using two benchmark datasets:\nMNIST, a dataset of handwritten digits [LeCun et al., 1998], and Omniglot, a dataset of handwritten\ncharacters from many different alphabets [Lake, 2014]. To underscore both the easy implementation\nof this technique and the improvement it offers over existing approaches, we have empirically\nevaluated our new gradient estimator by a simple modification of existing code (footnote 1) [Burda et al., 2015].\n\nOmniglot Results A two-stochastic-layer VAE using the multi-sample ELBO with gradient\ncorresponding to equation (8) improves over the results in Burda et al. [2015] by 2.36, 1.44, and 0.6\nnats for k={1, 5, 50} respectively. For a one-stochastic-layer VAE, the improvements are more modest:\n0.72, 0.22, and 0.38 nats lower for k={1, 5, 50} respectively. A VAE with a deep recognition network\nappears to benefit more from our path derivative estimator than one with a shallow recognition\nnetwork. For comparison, a VAE using the path derivative estimator with k=5 samples performs only\n0.08 nats worse than an IWAE using the total derivative gradient estimator (7) and 5 samples. 
By\ncontrast, using the total derivative (vanilla) estimator for both models, IWAE otherwise outperforms\nVAE for k=5 samples by 1.52 nats.\nBy increasing the accuracy of the ELBO gradient estimator, we may also increase the risk of\nover\ufb01tting. Burda et al. [2015] report that they didn\u2019t notice any signi\ufb01cant problems with over\ufb01tting,\nas the training log likelihood was usually 2 nats lower than the test log likelihood. With our gradient\nestimator, we observe only 0.77 nats worse performance for a VAE with k=50 compared to k=5 in\nthe two-layer experiments. IWAE using equation (8) markedly outperforms IWAE using equation\n(7) on Omniglot. For a 2-layer IWAE, we observe an improvement of 2.34, 1.2, and 0.52 nats\nfor k={1, 5, 50} respectively. For a 1-layer IWAE, the improvements are 0.72, 0.7, and 0.51 for\nk={1, 5, 50} respectively. Just as in the VAE Omniglot results, a deeper recognition network for an\nIWAE model bene\ufb01ts more from the improved gradient estimator than a shallow recognition network.\n\nMNIST Results For all but one experiment, a VAE with our path derivative estimator outperforms a\nvanilla VAE on MNIST data. For k=50 with one stochastic layer, our gradient estimator underperforms\na vanilla VAE by 0.13 nats. Interestingly, the training NLL for this run is 86.11, only 0.37 nats\ndifferent than the test NLL. The similar magnitude of the two numbers suggests that training for\nlonger than Burda et al. [2015] would improve the performance of our gradient estimator. We\nhypothesize that the worse performance using the path derivative estimator is a consequence of\n\ufb01ne-tuning towards the characteristics of the total derivative estimator.\nFor a two-stochastic-layer VAE on MNIST, the improvements are 0.56, 0.33 and 0.45 for k={1, 5, 50}\nrespectively. In a one-stochastic-layer VAE on MNIST, the improvements are 0.36 and 0.14 for\nk={1, 5} respectively.\nThe improvements on IWAE are of a similar magnitude. 
For k=50 in a two-layer path-derivative\nIWAE, we perform 0.26 nats worse than with a vanilla IWAE. The training loss for the k=50 run is\n82.74, only 0.42 nats different. As in the other failure case, this suggests we have room to improve\nthese results by \ufb01ne-tuning over our method. For a two stochastic layer IWAE, the improvements are\n0.66 and 0.22 for k=1 and 5 respectively. In a one stochastic layer IWAE, the improvements are 0.36,\n0.34, and 0.33 for k={1, 5, 50} respectively.\n\n7 Conclusions and Future Work\n\nWe demonstrated that even when the reparameterization trick is applicable, further reductions in\ngradient variance are possible. We presented our variance reduction method in a general way by\nexpressing it as a modi\ufb01cation of the computation graph used for automatic differentiation. The\ngain from using our method grows with the complexity of the approximate posterior, making it\ncomplementary to the development of non-Gaussian posterior families.\nAlthough the proposed method is speci\ufb01c to variational inference, we suspect that similar unbiased\nbut high-variance terms might exist in other stochastic optimization settings, such as in reinforcement\nlearning, or gradient-based Markov Chain Monte Carlo.\n\n1See https://github.com/geoffroeder/iwae\n\n8\n\n\fReferences\nMart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.\nCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew\nHarp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath\nKudlur, Josh Levenberg, Dan Man\u00e9, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike\nSchuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent\nVanhoucke, Vijay Vasudevan, Fernanda Vi\u00e9gas, Oriol Vinyals, Pete Warden, Martin Wattenberg,\nMartin Wicke, Yuan Yu, and Xiaoqiang Zheng. 
TensorFlow: Large-scale machine learning on\nheterogeneous systems, 2015. URL http://tensorflow.org/. Software available from\ntensorflow.org.\n\nJames Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume\nDesjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A cpu and gpu math\ncompiler in python. In Proc. 9th Python in Science Conf, pages 1\u20137, 2010.\n\nYuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv\npreprint arXiv:1509.00519, 2015.\n\nRonan Collobert, Samy Bengio, and Johnny Mari\u00e9thoz. Torch: a modular machine learning software\nlibrary. Technical report, Idiap, 2002.\n\nLaurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv\npreprint arXiv:1605.08803, 2016.\n\nXavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural\nnetworks. In Aistats, volume 9, pages 249\u2013256, 2010.\n\nShaobo Han, Xuejun Liao, David B Dunson, and Lawrence Carin. Variational gaussian copula\ninference. In Proceedings of the 19th International Conference on Artificial Intelligence and\nStatistics, volume 51, pages 829\u2013838, 2016.\n\nDanilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In The\n32nd International Conference on Machine Learning, 2015.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the\n3rd international conference on learning representations, 2015.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\narXiv:1312.6114, 2013.\n\nDiederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse\nautoregressive flow. Advances in Neural Information Processing Systems 29, 2016.\n\nBrenden M Lake. 
Towards more human-like concept learning in machines: Compositionality,\ncausality, and learning-to-learn. PhD thesis, Massachusetts Institute of Technology, 2014.\n\nYann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist dataset of handwritten digits.\nURL http://yann.lecun.com/exdb/mnist, 1998.\n\nDougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode\ndifferentiation of native Python, 2015. URL http://github.com/HIPS/autograd.\n\nRajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS,\npages 814\u2013822, 2014.\n\nDanilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate\ninference in deep generative models. In Proceedings of the 31st International Conference on\nMachine Learning (ICML-14), pages 1278\u20131286, 2014.\n\nFrancisco JR Ruiz, Michalis K Titsias, and David M Blei. The generalized reparameterization\ngradient. arXiv preprint arXiv:1610.02287, 2016.\n\nLinda SL Tan and David J Nott. Gaussian variational approximation with sparse precision matrices.\nStatistics and Computing, pages 1\u201317, 2017.\n\nDustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M.\nBlei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint\narXiv:1610.09787, 2016.\n\nRonald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\nlearning. Machine learning, 8(3-4):229\u2013256, 1992.\n", "award": [], "sourceid": 3474, "authors": [{"given_name": "Geoffrey", "family_name": "Roeder", "institution": "University of Toronto"}, {"given_name": "Yuhuai", "family_name": "Wu", "institution": "University of Toronto"}, {"given_name": "David", "family_name": "Duvenaud", "institution": "University of Toronto"}]}