{"title": "Implicit Reparameterization Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 452, "abstract": "By providing a simple and efficient way of computing low-variance gradients of continuous random variables, the reparameterization trick has become the technique of choice for training a variety of latent variable models. However, it is not applicable to a number of important continuous distributions.  We introduce an alternative approach to computing reparameterization gradients based on implicit differentiation and demonstrate its broader applicability by applying it to Gamma, Beta, Dirichlet, and von Mises distributions, which cannot be used with the classic reparameterization trick. Our experiments show that the proposed approach is faster and more accurate than the existing gradient estimators for these distributions.", "full_text": "Implicit Reparameterization Gradients\n\nMichael Figurnov\n\nShakir Mohamed Andriy Mnih\n\nDeepMind, London, UK\n\n{mfigurnov,shakir,amnih}@google.com\n\nAbstract\n\nBy providing a simple and ef\ufb01cient way of computing low-variance gradients of\ncontinuous random variables, the reparameterization trick has become the technique\nof choice for training a variety of latent variable models. However, it is not\napplicable to a number of important continuous distributions. We introduce an\nalternative approach to computing reparameterization gradients based on implicit\ndifferentiation and demonstrate its broader applicability by applying it to Gamma,\nBeta, Dirichlet, and von Mises distributions, which cannot be used with the classic\nreparameterization trick. Our experiments show that the proposed approach is faster\nand more accurate than the existing gradient estimators for these distributions.\n\n1\n\nIntroduction\n\nPathwise gradient estimators are a core tool for stochastic estimation in machine learning and\nstatistics [12, 15, 26, 42, 51]. 
In machine learning, we now commonly introduce these estimators\nusing the \u201creparameterization trick\u201d, in which we replace a probability distribution with an equivalent\nparameterization of it, using a deterministic and differentiable transformation of some fixed base\ndistribution. This reparameterization is a powerful tool for learning because it makes backpropagation\npossible in computation graphs with certain types of continuous random variables, e.g. with Normal,\nLogistic, or Concrete distributions [23, 30]. Many of the recent advances in machine learning were\nmade possible by this ability to backpropagate through stochastic nodes. They include variational\nautoencoders (VAEs), automatic variational inference [26, 28, 42], Bayesian learning in neural\nnetworks [7, 14], and principled regularization in deep networks [13, 34].\nThe reparameterization trick is easily used with distributions that have location-scale parameterizations\nor tractable inverse cumulative distribution functions (CDFs), or are expressible as deterministic\ntransformations of such distributions. These seemingly modest requirements are still fairly restrictive\nas they preclude a number of standard distributions, such as truncated, mixture, Gamma, Beta,\nDirichlet, or von Mises, from being used with reparameterization gradients. This paper provides a\ngeneral tool for reparameterization in these important cases.\nThe limited applicability of reparameterization has often been addressed by using a different class\nof gradient estimators, the score-function estimators [12, 16, 53]. While being more general, they\ntypically result in high-variance gradients which require problem-specific variance reduction techniques\nto be practical. Generalized reparameterizations involve combining the reparameterization\nand score-function estimators [36, 44]. 
Another approach is to approximate the intractable derivative\nof the inverse CDF [27].\nFollowing Graves [17], we use implicit differentiation to differentiate the CDF rather than its inverse.\nWhile the method of Graves [17] is only practical for distributions with analytically tractable CDFs\nand has been used solely with mixture distributions, we leverage automatic differentiation to handle\ndistributions with numerically tractable CDFs, such as Gamma and von Mises. We review the\nstandard reparameterization trick in Section 2 and then make the following contributions:\n\n\u2022 We develop implicit reparameterization gradients that provide unbiased estimators for continuous\ndistributions with numerically tractable CDFs. This allows many other important distributions to\nbe used as easily as the Normal distribution in stochastic computation graphs.\n\u2022 We show that the proposed gradients are both faster and more accurate than alternative approaches.\n\u2022 We demonstrate that our method can outperform existing stochastic variational methods at training\nthe Latent Dirichlet Allocation topic model in a black-box fashion using amortized inference.\n\u2022 We use implicit reparameterization gradients to train VAEs with Gamma, Beta, and von Mises\nlatent variables instead of the usual Normal variables, leading to latent spaces with interesting\nalternative topologies.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Background\n\n2.1 Explicit reparameterization gradients\n\nWe start with a review of the original formulation of reparameterization gradients [26, 42, 51], which\nwe will refer to as explicit reparameterization. Suppose we would like to optimize an expectation\n$\\mathbb{E}_{q_\\phi(z)}[f(z)]$ of some continuously differentiable function $f(z)$ w.r.t. the parameters $\\phi$ of the\ndistribution. 
We assume that we can find a standardization function $S_\\phi(z)$ that when applied to a\nsample from $q_\\phi(z)$ removes its dependence on the parameters of the distribution. The standardization\nfunction should be continuously differentiable w.r.t. its argument and parameters, and invertible:\n\n$$S_\\phi(z) = \\epsilon \\sim q(\\epsilon), \\qquad z = S_\\phi^{-1}(\\epsilon). \\tag{1}$$\n\nFor example, for a Gaussian distribution $\\mathcal{N}(\\mu, \\sigma)$ we can use $S_{\\mu,\\sigma}(z) = (z - \\mu)/\\sigma \\sim \\mathcal{N}(0, 1)$. We\ncan then express the objective as an expectation w.r.t. $\\epsilon$, transferring the dependence on $\\phi$ into $f$:\n\n$$\\mathbb{E}_{q_\\phi(z)}[f(z)] = \\mathbb{E}_{q(\\epsilon)}\\left[f(S_\\phi^{-1}(\\epsilon))\\right]. \\tag{2}$$\n\nThis allows us to compute the gradient of the expectation as the expectation of the gradients:\n\n$$\\nabla_\\phi \\mathbb{E}_{q_\\phi(z)}[f(z)] = \\mathbb{E}_{q(\\epsilon)}\\left[\\nabla_\\phi f(S_\\phi^{-1}(\\epsilon))\\right] = \\mathbb{E}_{q(\\epsilon)}\\left[\\nabla_z f(S_\\phi^{-1}(\\epsilon)) \\nabla_\\phi S_\\phi^{-1}(\\epsilon)\\right]. \\tag{3}$$\n\nA standardization function $S_\\phi(z)$ satisfying the requirements exists for a wide range of continuous\ndistributions, but it is not always practical to take advantage of this. For instance, the CDF $F(z|\\phi)$\nof a univariate distribution provides such a function, mapping samples from it to samples from the\nuniform distribution over $[0, 1]$. However, inverting the CDF is often complicated and expensive, and\ncomputing its derivative is even harder.\n\n2.2 Stochastic variational inference\n\nStochastic variational inference [19] for latent variable models is perhaps the most popular use\ncase for reparameterization gradients. Consider a model $p_\\theta(x) = \\int p_\\theta(x|z) p(z) \\, dz$, where $x$ is\nan observation, $z \\in \\mathbb{R}^D$ is a vector-valued latent variable, $p_\\theta(x|z)$ is the likelihood function with\nparameters $\\theta$, and $p(z)$ is the prior distribution. Except for a few special cases, maximum likelihood\nlearning in such models is intractable because of the difficulty of the integrals involved. 
Variational\ninference [22] provides a tractable alternative by introducing a variational posterior distribution\n$q_\\phi(z|x)$ and maximizing a lower bound on the marginal log-likelihood:\n\n$$\\mathcal{L}(x, \\theta, \\phi) = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - \\mathrm{KL}(q_\\phi(z|x) \\,\\|\\, p(z)) \\le \\log p_\\theta(x). \\tag{4}$$\n\nTraining models with modern stochastic variational inference [26, 39] involves gradient-based\noptimization of the bound w.r.t. the model parameters $\\theta$ and the variational posterior parameters $\\phi$.\nWhile the KL-divergence term and its gradients can often be computed analytically, the remaining\nterm and its gradients are typically intractable and are approximated using samples from the variational\nposterior. The most general form of this approach involves score-function gradient estimators [33, 39,\n41] that handle both discrete and continuous latent variables but have relatively high variance. The\nreparameterization trick usually provides a lower variance gradient estimator and is easier to use, but\ndue to the limitations discussed above, is not applicable to many important continuous distributions.\n\n3 Implicit reparameterization gradients\n\nWe propose an alternative way of computing the reparameterization gradient that avoids the inversion\nof the standardization function. We start from Eqn. (3) and perform a change of variable $z = S_\\phi^{-1}(\\epsilon)$:\n\n$$\\nabla_\\phi \\mathbb{E}_{q_\\phi(z)}[f(z)] = \\mathbb{E}_{q_\\phi(z)}[\\nabla_z f(z) \\nabla_\\phi z]; \\qquad \\nabla_\\phi z = \\nabla_\\phi S_\\phi^{-1}(\\epsilon)\\big|_{\\epsilon = S_\\phi(z)}. \\tag{5}$$\n\nOur key insight is that we can compute $\\nabla_\\phi z$ by implicit differentiation. We apply the total gradient\n$\\nabla^{\\mathrm{TD}}_\\phi$ to the equality $S_\\phi(z) = \\epsilon$. Then, we use the chain rule to expand the total gradient in terms of\nthe partial gradients. The standardization function $S_\\phi(z)$ depends on the parameters $\\phi$ directly via\nthe subscript parameters and indirectly via the argument $z$, while the noise $\\epsilon$ is independent of $\\phi$ by\nthe definition of a standardization function. Thus, we have $\\nabla_z S_\\phi(z) \\nabla_\\phi z + \\nabla_\\phi S_\\phi(z) = 0$, where\nall the gradients are partial. Solving this equation for $\\nabla_\\phi z$ yields\n\n$$\\nabla_\\phi z = -(\\nabla_z S_\\phi(z))^{-1} \\nabla_\\phi S_\\phi(z). \\tag{6}$$\n\nThis expression for the gradient only requires differentiating the standardization function and not\ninverting it. Note that its value does not change under any invertible transformation $T(\\epsilon)$ of the\nstandardization function, since the corresponding Jacobian $\\nabla_\\epsilon T(\\epsilon)$ cancels out with the inverse.\nExample: univariate Normal distribution $\\mathcal{N}(\\mu, \\sigma^2)$. We illustrate that explicit and implicit reparameterizations\ngive identical results. A standardization function is given by $S_{\\mu,\\sigma}(z) = (z - \\mu)/\\sigma = \\epsilon \\sim \\mathcal{N}(0, 1)$.\nExplicit reparameterization inverts this function: $z = S_{\\mu,\\sigma}^{-1}(\\epsilon) = \\mu + \\sigma\\epsilon$, $\\frac{\\partial z}{\\partial \\mu} = 1$, $\\frac{\\partial z}{\\partial \\sigma} = \\epsilon$.\nThe implicit reparameterization, Eqn. (6), gives:\n\n$$\\frac{\\partial z}{\\partial \\mu} = -\\frac{\\partial S_{\\mu,\\sigma}(z) / \\partial \\mu}{\\partial S_{\\mu,\\sigma}(z) / \\partial z} = -\\frac{-1/\\sigma}{1/\\sigma} = 1, \\qquad \\frac{\\partial z}{\\partial \\sigma} = -\\frac{\\partial S_{\\mu,\\sigma}(z) / \\partial \\sigma}{\\partial S_{\\mu,\\sigma}(z) / \\partial z} = -\\frac{-(z - \\mu)/\\sigma^2}{1/\\sigma} = \\frac{z - \\mu}{\\sigma}. \\tag{7}$$\n\nThe expressions are equivalent, but the implicit version avoids inverting $S_{\\mu,\\sigma}(z)$.\nUniversal standardization function. For univariate distributions, a standardization function is given\nby the CDF: $S_\\phi(z) = F(z|\\phi) \\sim \\mathrm{Uniform}(0, 1)$. Assuming that the CDF is strictly monotonic and\ncontinuously differentiable w.r.t. $z$ and $\\phi$, it satisfies the requirements for a standardization function.\nPlugging this function into (6), we have\n\n$$\\nabla_\\phi z = -\\frac{\\nabla_\\phi F(z|\\phi)}{q_\\phi(z)}. \\tag{8}$$\n\nTherefore, computing the implicit gradient requires only differentiating the CDF. In the multivariate\ncase, we can perform the multivariate distributional transform [45]:\n\n$$S_\\phi(z) = \\left(F(z_1|\\phi), F(z_2|z_1, \\phi), \\ldots, F(z_D|z_1, \\ldots, z_{D-1}, \\phi)\\right) = \\epsilon, \\tag{9}$$\n\nwhere $q(\\epsilon) = \\prod_{d=1}^{D} \\mathrm{Uniform}(\\epsilon_d|0, 1)$. Eqn. (6) requires computing the gradient of the (conditional)\nCDFs and solving a linear system with matrix $\\nabla_z S_\\phi(z)$. If the distribution is factorized, the matrix\nis diagonal and the system can be solved in $O(D)$. 
Otherwise, the matrix is triangular because each\nCDF depends only on the preceding elements, and the system is solvable in $O(D^2)$.\nAlgorithm. We present the comparison between the standard explicit and the proposed implicit\nreparameterization in Table 1. Samples of $z$ in implicit reparameterization can be obtained with\nany suitable method, such as rejection sampling [10]. The required gradients of the standardization\nfunction can be computed either analytically or using automatic differentiation.\n\nTable 1: Comparison of the two reparameterization types. While they provide the same result, the\nimplicit version is easier to implement for distributions such as Gamma because it does not require\ninverting the standardization function $S_\\phi(z)$.\n\n| | Explicit reparameterization | Implicit reparameterization (proposed) |\n|---|---|---|\n| Forward pass | Sample $\\epsilon \\sim q(\\epsilon)$; set $z \\leftarrow S_\\phi^{-1}(\\epsilon)$ | Sample $z \\sim q_\\phi(z)$ |\n| Backward pass | Set $\\nabla_\\phi z \\leftarrow \\nabla_\\phi S_\\phi^{-1}(\\epsilon)$; set $\\nabla_\\phi f(z) \\leftarrow \\nabla_z f(z) \\nabla_\\phi z$ | Set $\\nabla_\\phi z \\leftarrow -(\\nabla_z S_\\phi(z))^{-1} \\nabla_\\phi S_\\phi(z)$; set $\\nabla_\\phi f(z) \\leftarrow \\nabla_z f(z) \\nabla_\\phi z$ |\n\n4 Applications of implicit reparameterization gradients\n\nWe now demonstrate how implicit reparameterization can be applied to a variety of distributions.\nOur strategy is to provide a computation method for a standardization function, such as CDF or\nmultivariate distributional transform, and its gradients.\nTruncated univariate distribution. A truncated distribution is obtained by restricting a distribution\u2019s\ndomain to some range $[a, b]$. Its CDF can be computed from the CDF of the original distribution:\n$\\hat{F}(z|\\phi, a, b) = \\frac{F(z|\\phi) - F(a|\\phi)}{F(b|\\phi) - F(a|\\phi)}$, $z \\in [a, b]$. Assuming that the gradient $\\nabla_\\phi F(z|\\phi)$ is available, we\ncan easily compute the implicit gradient for the truncated distribution.\nMixture distribution $q_\\phi(z) = \\sum_{i=1}^{K} w_i q_{\\phi_i}(z)$, where $\\phi = (\\phi_1, \\ldots, \\phi_K, w_1, \\ldots, w_K)$. In the\nunivariate case, the CDF of the mixture is simply $\\sum_{i=1}^{K} w_i F(z|\\phi_i)$. In the multivariate case, the distributional\ntransform is given by $F(z_d|z_1, \\ldots, z_{d-1}, \\phi) = \\sum_{i=1}^{K} w_i^d F(z_d|z_1, \\ldots, z_{d-1}, \\phi_i)$, where\n$w_i^d = \\frac{w_i q_{\\phi_i}(z_1, \\ldots, z_{d-1})}{\\sum_{j=1}^{K} w_j q_{\\phi_j}(z_1, \\ldots, z_{d-1})}$ is the posterior weight for the mixture component after observing the\nfirst $d - 1$ dimensions of the sample. The required gradient can be obtained via automatic differentiation.\nWhen the mixture components are fully factorized, we obtain the same result as [17], but in a\nsimpler form, due to automatic differentiation and the explicitly specified linear system.\nGamma distribution $\\mathrm{Gamma}(\\alpha, \\beta)$ with shape $\\alpha > 0$ and rate $\\beta > 0$. The rate can be standardized\nusing the scaling property: if $z \\sim \\mathrm{Gamma}(\\alpha, 1)$, then $z/\\beta \\sim \\mathrm{Gamma}(\\alpha, \\beta)$. For the shape\nparameter, the CDF of the Gamma distribution with shape $\\alpha$ and unit rate is the regularized incomplete\nGamma function $\\gamma(z, \\alpha)$ that does not have an analytic expression. Following Moore [35], we propose\nto apply forward-mode automatic differentiation [2] to a numerical method [3] that computes its\nvalue. This provides the derivative $\\frac{\\partial \\gamma(z, \\alpha)}{\\partial \\alpha}$ for roughly twice the cost of computing the CDF.\nStudent\u2019s t-distribution samples can be derived from samples of Gamma. Indeed, if  \u223c Gamma($\\nu/2$, $\\nu/2$), then z \u223c N (0, 2) is t-distributed with $\\nu$ degrees of freedom.\nBeta and Dirichlet distribution samples can also be obtained from samples of Gamma. If\n$z_1 \\sim \\mathrm{Gamma}(\\alpha, 1)$ and $z_2 \\sim \\mathrm{Gamma}(\\beta, 1)$, then $\\frac{z_1}{z_1 + z_2} \\sim \\mathrm{Beta}(\\alpha, \\beta)$. Similarly, if $z_i \\sim \\mathrm{Gamma}(\\alpha_i, 1)$, then\n$\\left(\\frac{z_1}{\\sum_{j=1}^{D} z_j}, \\ldots, \\frac{z_D}{\\sum_{j=1}^{D} z_j}\\right) \\sim \\mathrm{Dirichlet}(\\alpha_1, \\ldots, \\alpha_D)$.\nVon Mises distribution [31, 32] is a maximum entropy distribution on a circle with the density\nfunction $\\mathrm{vonMises}(z|\\mu, \\kappa) = \\frac{\\exp(\\kappa \\cos(z - \\mu))}{2\\pi I_0(\\kappa)}$, where $\\mu$ is the location parameter, $\\kappa > 0$ is the\nconcentration, and $I_0(\\kappa)$ is the modified Bessel function of the first kind. The location parameter $\\mu$\ncan be standardized by noting that if $z \\sim \\mathrm{vonMises}(0, \\kappa)$, then $z + \\mu \\sim \\mathrm{vonMises}(\\mu, \\kappa)$. For the\nconcentration parameter $\\kappa$, we propose to use implicit reparameterization by performing forward-mode\nautomatic differentiation of an efficient numerical method [18] for computation of the CDF.\n\n4.1 Accuracy and speed of reparameterization gradient estimators\n\nImplicit reparameterization requires differentiating the CDF w.r.t. its parameters. When this operation\nis analytically intractable, e.g. for Gamma and von Mises distributions, we estimate it via forward-mode\ndifferentiation of the code that numerically evaluates the CDF. We implement this approach by\nmanually performing the required modifications of the C++ code (see Appendix B). An alternative is\nto use a central finite difference approximation of the derivative: $\\frac{\\partial F(z|\\phi)}{\\partial \\phi} \\approx \\frac{F(z|\\phi(1+\\delta)) - F(z|\\phi(1-\\delta))}{2\\delta\\phi}$,\nwhere $0 < \\delta \\ll 1$ is the relative step size that we choose via grid search. For the Gamma distribution,\nwe also compare with two alternatives: (1) the estimator of Knowles [27] that performs explicit\nreparameterization by approximately computing the derivative of the inverse CDF; (2) the concurrently\ndeveloped method of Jankowiak and Obermeyer [25] that computes implicit reparameterization using\na closed-form approximation of the CDF derivative. We use the reference PyTorch [40]\nimplementation of the method of Jankowiak and Obermeyer [25]. The ground truth value of the CDF\nderivative is computed in a computationally expensive but accurate way (see Appendix C). The results\nin Table 2 suggest that the automatic differentiation approach provides the highest accuracy and speed.\nThe finite difference method can be easier to implement if a CDF computation method is available,\nbut requires computation in float64 to obtain the float32 precision. This can be problematic for\ndevices such as GPUs and other accelerators that do not support fast high-precision computation. The\napproach of Knowles is slower and significantly less accurate due to the approximations of the inverse\nCDF derivative computation method. The method of Jankowiak and Obermeyer is 4.5$\\times$ slower\nand 3$\\times$ less accurate than the automatic differentiation approach, which reflects the complexity of\nobtaining fast and accurate closed-form approximations to the CDF derivative. In the remaining\nexperiments we use automatic differentiation and float32 precision.\n\nTable 2: Average error and time (measured in seconds per element) of the reparameterization gradient\ncomputation methods. Automatic differentiation achieves the lowest error and the highest speed.\n\n| Method | Precision | Gamma: mean abs. error | Gamma: time (s) | Von Mises: mean abs. error | Von Mises: time (s) |\n|---|---|---|---|---|---|\n| Automatic differentiation | float32 | $2.3 \\times 10^{-6}$ | $1.9 \\times 10^{-8}$ | $1.9 \\times 10^{-7}$ | $3.1 \\times 10^{-8}$ |\n| Finite difference | float32 | $1.9 \\times 10^{-3}$ | $3.8 \\times 10^{-8}$ | $9.6 \\times 10^{-5}$ | $3.8 \\times 10^{-8}$ |\n| Jankowiak and Obermeyer [25] | float32 | $4.1 \\times 10^{-5}$ | $9.0 \\times 10^{-8}$ | \u2013 | \u2013 |\n| Automatic differentiation | float64 | $5.4 \\times 10^{-13}$ | $3.2 \\times 10^{-8}$ | $1.3 \\times 10^{-13}$ | $3.7 \\times 10^{-8}$ |\n| Finite difference | float64 | $3.2 \\times 10^{-9}$ | $7.1 \\times 10^{-8}$ | $1.1 \\times 10^{-10}$ | $5.9 \\times 10^{-8}$ |\n| Knowles [27] | float64 | $6.5 \\times 10^{-3}$ | $1.2 \\times 10^{-6}$ | \u2013 | \u2013 |\n\n5 Related work\n\nSurrogate distributions. When explicit reparameterization is not feasible, it is often possible to\nmodify the model to use alternative distributions that are reparameterizable. This is a popular approach\ndue to its simplicity. Kucukelbir et al. [28] approximate posterior distributions by a deterministic\ntransformation of Normal samples; Nalisnick et al. [37] and Nalisnick and Smyth [38] replace Beta\ndistributions with Kumaraswamy distributions in the Dirichlet Process stick-breaking construction;\nZhang et al. 
[54] substitute the Gamma distribution for a Weibull distribution; Srivastava and Sutton\n[47, 48] replace the Dirichlet distribution with a Logistic Normal. Surrogate distributions however do\nnot always have all the desirable properties of the distributions they replace. For example, as noted\nby Ruiz et al. [44], such surrogate distributions struggle to capture sparsity, which is achievable with\nGamma and Dirichlet distributions.\nIntegrating out the nuisance variables. In some cases it is possible to trade computation for sim-\nplicity of reparameterization. Roeder et al. [43] consider a mixture of reparameterizable distributions\nand analytically sum out the discrete mixture component id variable. For a mixture with K compo-\nnents, this results in a K-fold increase of computation, compared to direct reparameterization of the\nmixture. This approach becomes prohibitively expensive for a chain of mixture distributions, where\nthe amount of computation grows exponentially with the length of the chain. On the other hand, we\ncan always estimate the gradients with just one sample by directly reparameterizing the mixture.\nImplicit reparameterization gradients. Reparameterization gradients have been known in the\noperations research community since the late 1980s under the name of pathwise, or stochastic,\ngradients [12, 49]. There the \u201cexplicit\u201d and \u201cimplicit\u201d versions were usually introduced side-by-\nside, but they were applied only to univariate distributions and simple computational graphs that do\nnot require backpropagation. In the machine learning community, the implicit reparameterization\ngradients for univariate distributions were introduced by Salimans and Knowles [46]. That work,\nas well as Hoffman and Blei [21], used the implicit gradients to perform backpropagation through\nthe Gamma distribution using a \ufb01nite difference approximation of the CDF derivative. 
Graves [17]\nindependently introduced the implicit reparameterization gradients for multivariate distributions\nwith analytically tractable CDFs, such as mixtures. We add to this rich literature by generalizing\nthe technique to handle arbitrary standardization functions, deriving a simpler expression than that\nof Graves [17] for the multivariate case, showing the connection to explicit reparameterization\ngradients, and providing an efficient automatic differentiation method to compute the intractable CDF\nderivatives.\nReparameterization gradients as differential equation solutions. The concurrent works [24, 25]\nprovide a complementary view of the reparameterization gradients as solutions of a differential\nequation called the transport equation. For univariate distributions, the unique solution is Eqn. (8).\nHowever, for the non-factorial multivariate distributions, there are multiple solutions. By choosing an\nappropriate one, the variance of the gradient estimator may be reduced. Unfortunately, there does not\nseem to be a general way to obtain these solutions, so distribution-specific derivations are required.\nWe hypothesize that the transport equation solutions correspond to the implicit reparameterization\ngradients for different standardization functions.\nGeneralized reparameterizations. The limitations of standard reparameterization were recently\ntackled by several other works. Ruiz et al. [44] introduced generalized reparameterization gradients\n(GRG) that expand the applicability of the reparameterization trick by using a standardization\nfunction that allows the underlying base distribution to depend weakly on the parameter vector\n(e.g. only through the higher moments). 
The resulting gradient estimator, which in addition to the\nreparameterized gradients term includes a score-function gradient term that takes into account\nthe dependence of the base distribution on the parameter vector, was applied to the Gamma, Beta,\nand log-Normal distributions. The challenge of using this approach lies in finding an effective\napproximate standardization function, which is nontrivial yet essential for obtaining low-variance\ngradients.\nRejection sampling variational inference (RSVI) [36] is a closely related approach that combines\nthe reparameterization gradients from the proposal distribution of a rejection sampler with a score-function\ngradient term that takes into account the effect of the accept/reject step. When applied to the\nGamma distribution, the RSVI gradients can have lower variance than those computed using\nGRG [36]. Davidson et al. [9] have recently demonstrated the use of RSVI with the von Mises-Fisher\ndistribution.\n\n6 Experiments\n\nWe apply implicit reparameterization for two distributions with analytically intractable CDFs (Gamma\nand von Mises) to three problems: a toy setting of stochastic cross-entropy estimation, training a\nLatent Dirichlet Allocation [6] (LDA) topic model, and training VAEs [26, 42] with non-Normal\nlatent distributions. We use the RSVI gradient estimator [36] as our main baseline. For Gamma\ndistributions, RSVI provides a shape augmentation parameter $B$ that decreases the magnitude of\nthe score-function correction term by using additional $B$ samples from a uniform distribution. As\n$B \\to \\infty$, the term vanishes and the RSVI gradient becomes equivalent to ours, but with a higher\ncomputational cost. The von Mises distribution does not have such an augmentation parameter. For\nLDA, we also compare with a surrogate distribution approach [47] and a classic stochastic variational\ninference method [19]. The experimental details are given in Appendix D. 
We use TensorFlow [1] for\nour experiments. Implicit reparameterization for Gamma, Student\u2019s t, Beta, Dirichlet and von Mises\ndistributions is available in TensorFlow Probability [11]. This library also contains an implementation\nof the LDA model from section 6.2.\n\n6.1 Gradient of the cross-entropy\n\nWe compare the variance of the implicit and RSVI gradient estimators on a toy problem of stochastic\nestimation of the cross-entropy gradient, $\\frac{\\partial}{\\partial \\phi} \\mathbb{E}_{q_\\phi(z)}[-\\log p(z)]$. It was introduced by Naesseth\net al. [36] as minimization of the KL-divergence; however, since they analytically compute the\nentropy, the only source of variance is the cross-entropy term. We use their setting for the Dirichlet\ndistribution: $p(z) = \\mathrm{Dirichlet}(z|\\alpha_1, \\alpha_2, \\ldots, \\alpha_{100})$, $q_\\phi(z) = \\mathrm{Dirichlet}(z|\\phi, \\alpha_2, \\ldots, \\alpha_{100})$, where\n$\\alpha$ are the posterior parameters for a Dirichlet with a uniform prior after observing 100 samples\nfrom a Categorical distribution. The Dirichlet samples are obtained by transforming samples from\nGamma. Additionally, we construct a similar problem with the von Mises distribution:\n$p(z) = \\prod_{d=1}^{10} \\mathrm{vonMises}(z_d|0, 2)$ and $q_\\phi(z) = \\mathrm{vonMises}(z_1|0, \\phi) \\prod_{d=2}^{10} \\mathrm{vonMises}(z_d|0, 2)$.\nThe results presented in Fig. 1 show that the implicit gradient is faster and has lower variance than\nRSVI. For the Dirichlet distribution, increasing the shape augmentation parameter $B$ allows RSVI to\nasymptotically approach the variance of the implicit gradient. However, this comes at an additional\ncomputational cost and requires tuning this parameter. Furthermore, such a parameter is not available\nfor other distributions, including von Mises.\n\nFigure 1: Variance of the gradient and computation time for the cross-entropy optimization problem.\nThe vertical line denotes the optimal value for the parameter. Implicit gradient is faster and has lower\nvariance than RSVI [36]. Panels: (a) Dirichlet distribution; (b) Von Mises distribution; (c) Computation time (in seconds per\nsample of Gamma/von Mises).\nPanel (c) values, Dirichlet: Implicit $5.8 \\times 10^{-8}$; RSVI $1.4 \\times 10^{-7}$ (B = 0), $1.6 \\times 10^{-7}$ (B = 1), $1.8 \\times 10^{-7}$ (B = 5), $2.0 \\times 10^{-7}$ (B = 10).\nPanel (c) values, Von Mises: Implicit $2.0 \\times 10^{-7}$; RSVI $3.0 \\times 10^{-7}$.\n\n6.2 Latent Dirichlet Allocation\n\nLDA [6] is a popular topic model that represents each document as a bag-of-words and finds a set\nof topics so that each document is well-described by a few topics. It has been extended in various\nways, e.g. [4, 5], and often serves as a testbed for approximate inference methods [19, 20, 50]. LDA\nis a latent variable model with a likelihood $p_\\beta(w|z) = \\prod_{i=1}^{K} \\mathrm{Categorical}(w_i|\\beta z)$, and the prior\n$p_\\alpha(z) = \\mathrm{Dirichlet}(z|\\alpha)$, where $w$ is the observed document represented as a vector of word counts,\n$z$ is a distribution of topics, $\\beta \\in \\mathbb{R}^{\\#\\mathrm{words} \\times \\#\\mathrm{topics}}$ is a matrix that specifies the categorical distribution of\nwords in each topic, and $\\alpha$ parameterizes the prior distribution over the topics. We perform amortized\nvariational inference by using a neural network to parameterize the Dirichlet variational posterior\nover the topics $z$ as a function of the observation.\nWe use the 20 Newsgroups (11,200 documents, 2,000-word vocabulary) and RCV1 [29] (800,000\ndocuments, 10,000-word vocabulary) datasets with the same preprocessing as in [47]. 
We report the\ntest perplexity of the models, $\\exp\\left(-\\frac{1}{N} \\sum_{n=1}^{N} \\frac{1}{L_n} \\log p(w_n)\\right)$, where $L_n$ is the number of words in\nthe document and the marginal log-likelihood is approximated with a single-sample estimate of the\nevidence lower bound. Following [52], we optimize the prior parameters $\\alpha$ during training.\nWe compare amortized variational inference in LDA using implicit reparameterization to several\nalternatives: (i) training the LDA model with the RSVI gradients; (ii) stochastic variational inference\n(SVI) [19] training method for LDA; (iii) the method of Srivastava and Sutton [47], which we refer to\nas LN-LDA, that uses a Logistic Normal approximation in place of the Dirichlet prior and performs\namortized variational inference using a Logistic Normal variational posterior.\nThe results in Table 3 and Fig. 3(a-b) show that RSVI matches the implicit gradient results only at\n$B = 20$, as opposed to $B = 10$ for the previous problem. Lower gradient variance leads to faster\ntraining objective convergence. Interestingly, amortized inference can achieve better perplexity than\nSVI. Finally, we see that LDA trained with implicit gradients performs as well as or better than LN-LDA.\nThe learned topics and the prior weights shown in Fig. 2 demonstrate that LDA automatically\ndetermines the number of topics in the corpus by setting some of the prior weights to 0; this does\nnot occur in the LN-LDA model. Additionally, LN-LDA is prone to representing the same topic several\ntimes, perhaps due to a non-sparse variational posterior distribution.\nThe obtained results suggest that the advantage of implicit gradients compared to RSVI increases\nwith the complexity of the problem. When the original distributions are replaced by surrogates, some\ndesirable properties of the solution, such as sparsity, might be lost.\n\n6.3 Variational Autoencoders\n\nVAE [26, 42] is a generative latent variable model trained using amortized variational inference. 
Both\nthe variational posterior and the generative distributions (also known as the encoder and decoder) are\nparameterized using neural networks. VAEs typically use the standard Normal distribution as the\nprior and a factorized Normal as the variational posterior. The form of the likelihood depends on\nthe data, with factorized Bernoulli or Normal distributions being popular choices for images.\n\nTable 3: Test perplexity (lower is better) for the topic modeling task. Mean \u00b1 standard deviation over\n5 runs. LN-LDA uses Logistic Normal distributions instead of Dirichlet.\n\n| Model | Training method | 20 Newsgroups | RCV1 |\n|---|---|---|---|\n| LDA [6] | Implicit reparameterization | 876 \u00b1 7 | 896 \u00b1 6 |\n| LDA [6] | RSVI B = 1 | 1066 \u00b1 7 | 1505 \u00b1 33 |\n| LDA [6] | RSVI B = 5 | 968 \u00b1 18 | 1075 \u00b1 15 |\n| LDA [6] | RSVI B = 10 | 887 \u00b1 10 | 953 \u00b1 16 |\n| LDA [6] | RSVI B = 20 | 865 \u00b1 11 | 907 \u00b1 13 |\n| LDA [6] | SVI | 964 \u00b1 4 | 1330 \u00b1 4 |\n| LN-LDA [47] | Explicit reparameterization | 875 \u00b1 6 | 951 \u00b1 10 |\n\nFigure 2: Left: topics with the highest weight for the 20 Newsgroups dataset; Right: prior topic\nweights $\\alpha$. LDA learns sparse prior weights, while LN-LDA does not. Panels: (a) LN-LDA topics; (b) LDA topics (implicit); (c) 20 Newsgroups weights; (d) RCV1 weights.\n\n(a) LN-LDA topics\n$\\alpha = 1.15$ write article get think go\n$\\alpha = 1.07$ write get think like article\n$\\alpha = 1.07$ write article get think like\n$\\alpha = 1.07$ write article get like know\n$\\alpha = 1.06$ write article think get like\n$\\alpha = 1.04$ write article get know think\n$\\alpha = 1.04$ write article get know like\n$\\alpha = 1.02$ write article think get like\n\n(b) LDA topics (implicit)\n$\\alpha = 0.47$ write article get like one\n$\\alpha = 0.31$ write one people say think\n$\\alpha = 0.25$ please thanks post send know\n$\\alpha = 0.11$ use drive card problem system\n$\\alpha = 0.10$ go say people know get\n$\\alpha = 0.08$ use file key program system\n$\\alpha = 0.08$ gun government law state use\n$\\alpha = 0.08$ god christian jesus say people\n\n
In this\nsection, we experiment with using distributions other than Normal for the prior and the variational\nposterior. The use of alternative distributions allows incorporating different prior assumptions about\nthe latent factors of the data, such as bounded support or periodicity.\nWe use fully factorized priors and variational posteriors. For the variational posterior we explore\nGamma, Beta, and von Mises distributions. For Gamma, we use a sparse Gamma(0.3, 0.3) prior and\na bell-shaped prior Gamma(10, 10). For Beta and von Mises, instead of a sparse prior we choose a\nuniform prior over the corresponding domain.\nWe train the models on the dynamically binarized MNIST dataset [8] using the fully-connected\nencoder and decoder architectures from [9], so that our results are comparable with theirs. The results in Table 4 show\nthat combining a uniform prior with the cyclic latent space of the von Mises distribution is advantageous for low-dimensional latent\nspaces, consistent with the findings of [9]. For a uniform prior, the factorized von Mises distribution\noutperforms the multivariate von Mises-Fisher distribution in low dimensions, perhaps due to the\nmore flexible concentration parameterization (von Mises-Fisher uses shared concentration across\ndimensions). The results obtained with bell-shaped priors are similar to the Normal prior/posterior\npair, as expected. The latent spaces learned by models with 2 latents shown in Fig. 4 demonstrate the\ndifferences in topology.\n\n(a) LDA on 20 Newsgroups; (b) LDA on RCV1; (c) VAE with von Mises posterior\n\nFigure 3: The training objective (top) and the variance of the gradient (bottom) during training. The\nsharp drop in perplexity on the RCV1 dataset occurs at the end of the \u03b1 burn-in period.\n\nTable 4: Test negative log-likelihood (lower is better) for VAE on MNIST. Mean \u00b1 standard deviation\nover 5 runs. The von Mises-Fisher results are from [9].\n\nPrior | Variational posterior | D = 2 | D = 5 | D = 10 | D = 20 | D = 40\nN(0, 1) | N(\u03bc, \u03c3\u00b2) | 131.1 \u00b1 0.6 | 107.9 \u00b1 0.4 | 92.5 \u00b1 0.2 | 88.1 \u00b1 0.2 | 88.1 \u00b1 0.0\nGamma(0.3, 0.3) | Gamma(\u03b1, \u03b2) | 132.4 \u00b1 0.3 | 108.0 \u00b1 0.3 | 94.0 \u00b1 0.3 | 90.3 \u00b1 0.2 | 90.6 \u00b1 0.2\nGamma(10, 10) | Gamma(\u03b1, \u03b2) | 135.0 \u00b1 0.2 | 107.0 \u00b1 0.2 | 92.3 \u00b1 0.2 | 88.3 \u00b1 0.2 | 88.3 \u00b1 0.1\nUniform(0, 1) | Beta(\u03b1, \u03b2) | 128.3 \u00b1 0.2 | 107.4 \u00b1 0.2 | 94.1 \u00b1 0.1 | 88.9 \u00b1 0.1 | 88.6 \u00b1 0.1\nBeta(10, 10) | Beta(\u03b1, \u03b2) | 131.1 \u00b1 0.4 | 106.7 \u00b1 0.1 | 92.1 \u00b1 0.2 | 87.8 \u00b1 0.1 | 87.7 \u00b1 0.1\nUniform(\u2212\u03c0, \u03c0) | vonMises(\u03bc, \u03ba) | 127.6 \u00b1 0.4 | 107.5 \u00b1 0.4 | 94.4 \u00b1 0.5 | 90.9 \u00b1 0.1 | 91.5 \u00b1 0.4\nvonMises(0, 10) | vonMises(\u03bc, \u03ba) | 130.7 \u00b1 0.8 | 107.5 \u00b1 0.5 | 92.3 \u00b1 0.2 | 87.8 \u00b1 0.2 | 87.9 \u00b1 0.3\nUniform(S^D) | vonMisesFisher(\u03bc, \u03ba) | 132.5 \u00b1 0.7 | 108.4 \u00b1 0.1 | 93.2 \u00b1 0.1 | 89.0 \u00b1 0.3 | 90.9 \u00b1 0.3\n\n(a) Normal posterior and prior, [\u22123, 3] \u00d7 [\u22123, 3]; (b) Beta, uniform prior, [0, 1] \u00d7 [0, 1]; (c) Von Mises, uniform prior, [\u2212\u03c0, \u03c0] \u00d7 [\u2212\u03c0, \u03c0]\n\nFigure 4: 2D latent spaces learned by a VAE on the MNIST dataset. Normal distribution exhibits a\nstrong pull to the center, while Beta and Von Mises latents are tiling the entire available space.\n\nWe provide a detailed comparison between implicit gradients and RSVI in Table 7 of the supplementary\nmaterial. For Gamma and Beta distributions, RSVI with B = 20 performs similarly to implicit\ngradients. However, for the von Mises distribution implicit gradients usually perform better than\nRSVI. 
For example, for a uniform prior and D = 40, implicit gradients yield a 1.3 nat advantage in\nthe test log-likelihood due to lower gradient variance (Fig. 3c).\n\n7 Conclusion\n\nReparameterization gradients have become established as a central tool underlying many of the\nrecent advances in machine learning. In this paper, we strengthened this tool by extending its\napplicability to distributions, such as truncated, Gamma, and von Mises, that are often encountered\nin probabilistic modelling. The proposed implicit reparameterization gradients offer a simple and\npractical approach to stochastic gradient estimation which has the properties we expect from such\na new type of estimator: it is faster than the existing methods and simultaneously provides lower\ngradient variance. These new estimators allow us to move away from making model choices for\nreasons of computational convenience. Applying these estimators requires a numerically tractable\nCDF or some other standardization function. When one is not available, it should be possible to use\nan approximate standardization function to augment implicit reparameterization with a score function\ncorrection term, along the lines of generalized reparameterization. We intend to explore this direction\nin future work.\n\nAcknowledgments\nWe would like to thank Chris Maddison, Hyunjik Kim, J\u00f6rg Bornschein, Alex Graves, Hussein Fawzi,\nChris Burgess, Matt Hoffman, and Charles Sutton for helpful discussions. We also thank Akash\nSrivastava for providing the preprocessed document datasets.\n\nReferences\n\n[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,\nM. Isard, et al. \u201cTensorFlow: A System for Large-Scale Machine Learning.\u201d In: USENIX\nSymposium on Operating Systems Design and Implementation. Vol. 16. 2016, pp. 265\u2013283.\n[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. 
\u201cAutomatic differentiation in machine learning: a survey\u201d. In: arXiv preprint arXiv:1502.05767 (2015).\n[3] G. P. Bhattacharjee. \u201cAlgorithm AS 32: The Incomplete Gamma Integral\u201d. In: Journal of the Royal Statistical Society. Series C (Applied Statistics) 19.3 (1970), pp. 285\u2013287. ISSN: 00359254, 14679876.\n[4] D. M. Blei and J. D. Lafferty. \u201cCorrelated topic models\u201d. In: Advances in Neural Information Processing Systems (2005), pp. 147\u2013154.\n[5] D. M. Blei and J. D. Lafferty. \u201cDynamic topic models\u201d. In: International Conference on Machine Learning (2006), pp. 113\u2013120.\n[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. \u201cLatent dirichlet allocation\u201d. In: Journal of Machine Learning Research 3.Jan (2003), pp. 993\u20131022.\n[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. \u201cWeight uncertainty in neural networks\u201d. In: International Conference on Machine Learning (2015).\n[8] Y. Burda, R. Grosse, and R. Salakhutdinov. \u201cImportance weighted autoencoders\u201d. In: International Conference on Learning Representations (2016).\n[9] T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak. \u201cHyperspherical Variational Auto-Encoders\u201d. In: Conference on Uncertainty in Artificial Intelligence (2018).\n[10] L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.\n[11] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. \u201cTensorFlow Distributions\u201d. In: arXiv (2017).\n[12] M. C. Fu. \u201cGradient estimation\u201d. In: Handbooks in operations research and management science 13 (2006), pp. 575\u2013616.\n[13] Y. Gal and Z. Ghahramani. \u201cA theoretically grounded application of dropout in recurrent neural networks\u201d. In: Advances in Neural Information Processing Systems. 2016, pp. 1019\u20131027.\n[14] Y. Gal and Z. 
Ghahramani. \u201cDropout as a Bayesian approximation: Representing model uncertainty in deep learning\u201d. In: International Conference on Machine Learning (2016), pp. 1050\u20131059.\n[15] P. Glasserman. Monte Carlo methods in financial engineering. Vol. 53. Springer Science & Business Media, 2013.\n[16] P. W. Glynn. \u201cLikelihood ratio gradient estimation for stochastic systems\u201d. In: Communications of the ACM 33.10 (1990), pp. 75\u201384.\n[17] A. Graves. \u201cStochastic backpropagation through mixture density distributions\u201d. In: arXiv preprint arXiv:1607.05690 (2016).\n[18] G. W. Hill. \u201cAlgorithm 518: Incomplete Bessel Function I0. The Von Mises Distribution\u201d. In: ACM Transactions on Mathematical Software (TOMS) 3.3 (1977), pp. 279\u2013284.\n[19] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. \u201cStochastic variational inference\u201d. In: Journal of Machine Learning Research 14.1 (2013), pp. 1303\u20131347.\n[20] M. Hoffman, F. R. Bach, and D. M. Blei. \u201cOnline learning for latent dirichlet allocation\u201d. In: Advances in Neural Information Processing Systems (2010), pp. 856\u2013864.\n[21] M. Hoffman and D. Blei. \u201cStochastic structured variational inference\u201d. In: International Conference on Artificial Intelligence and Statistics (2015), pp. 361\u2013369.\n[22] T. S. Jaakkola and M. I. Jordan. \u201cBayesian parameter estimation via variational methods\u201d. In: Statistics and Computing 10.1 (2000), pp. 25\u201337.\n[23] E. Jang, S. Gu, and B. Poole. \u201cCategorical reparameterization with gumbel-softmax\u201d. In: International Conference on Learning Representations (2017).\n[24] M. Jankowiak and T. Karaletsos. \u201cPathwise Derivatives for Multivariate Distributions\u201d. In: arXiv preprint arXiv:1806.01856 (2018).\n[25] M. Jankowiak and F. Obermeyer. 
\u201cPathwise Derivatives Beyond the Reparameterization Trick\u201d. In: International Conference on Machine Learning (2018).\n[26] D. P. Kingma and M. Welling. \u201cAuto-encoding variational bayes\u201d. In: International Conference on Learning Representations (2014).\n[27] D. A. Knowles. \u201cStochastic gradient variational Bayes for Gamma approximating distributions\u201d. In: arXiv preprint arXiv:1509.01631 (2015).\n[28] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. \u201cAutomatic differentiation variational inference\u201d. In: Journal of Machine Learning Research 18.1 (2017), pp. 430\u2013474.\n[29] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. \u201cRcv1: A new benchmark collection for text categorization research\u201d. In: Journal of Machine Learning Research 5.Apr (2004), pp. 361\u2013397.\n[30] C. J. Maddison, A. Mnih, and Y. W. Teh. \u201cThe concrete distribution: A continuous relaxation of discrete random variables\u201d. In: International Conference on Learning Representations (2017).\n[31] K. V. Mardia and P. E. Jupp. Directional statistics. John Wiley & Sons, 2009, p. 494.\n[32] R. von Mises. \u201c\u00dcber die \u201cGanzzahligkeit\u201d der Atomgewicht und verwandte Fragen.\u201d In: Physikalische Z. 19 (1918), pp. 490\u2013500.\n[33] A. Mnih and K. Gregor. \u201cNeural variational inference and learning in belief networks\u201d. In: International Conference on Machine Learning (2014).\n[34] D. Molchanov, A. Ashukha, and D. Vetrov. \u201cVariational dropout sparsifies deep neural networks\u201d. In: International Conference on Machine Learning (2017).\n[35] R. Moore. \u201cAlgorithm AS 187: Derivatives of the incomplete gamma integral\u201d. In: Journal of the Royal Statistical Society. Series C (Applied Statistics) 31.3 (1982), pp. 330\u2013335.\n[36] C. Naesseth, F. Ruiz, S. Linderman, and D. Blei. 
\u201cReparameterization gradients through acceptance-rejection sampling algorithms\u201d. In: International Conference on Artificial Intelligence and Statistics (2017), pp. 489\u2013498.\n[37] E. Nalisnick, L. Hertel, and P. Smyth. \u201cApproximate inference for deep latent gaussian mixtures\u201d. In: Advances in Neural Information Processing Systems Workshop on Bayesian Deep Learning. Vol. 2. 2016.\n[38] E. Nalisnick and P. Smyth. \u201cStick-breaking variational autoencoders\u201d. In: International Conference on Learning Representations (2017).\n[39] J. Paisley, D. Blei, and M. Jordan. \u201cVariational Bayesian inference with stochastic search\u201d. In: International Conference on Machine Learning (2012).\n[40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. \u201cAutomatic differentiation in PyTorch\u201d. In: Advances in Neural Information Processing Systems Workshop (2017).\n[41] R. Ranganath, S. Gerrish, and D. Blei. \u201cBlack box variational inference\u201d. In: International Conference on Artificial Intelligence and Statistics (2014), pp. 814\u2013822.\n[42] D. J. Rezende, S. Mohamed, and D. Wierstra. \u201cStochastic backpropagation and approximate inference in deep generative models\u201d. In: International Conference on Machine Learning (2014).\n[43] G. Roeder, Y. Wu, and D. Duvenaud. \u201cSticking the landing: An asymptotically zero-variance gradient estimator for variational inference\u201d. In: Advances in Neural Information Processing Systems (2017).\n[44] F. R. Ruiz, M. Titsias, and D. Blei. \u201cThe Generalized Reparameterization Gradient\u201d. In: Advances in Neural Information Processing Systems (2016).\n[45] L. R\u00fcschendorf. \u201cCopulas, Sklar\u2019s theorem, and distributional transform\u201d. In: Mathematical Risk Analysis. Springer, 2013, pp. 3\u201334.\n[46] T. Salimans and D. A. Knowles. 
\u201cFixed-form variational posterior approximation through stochastic linear regression\u201d. In: Bayesian Analysis 8.4 (2013), pp. 837\u2013882.\n[47] A. Srivastava and C. Sutton. \u201cAutoencoding variational inference for topic models\u201d. In: International Conference on Learning Representations (2017).\n[48] A. Srivastava and C. Sutton. \u201cVariational Inference In Pachinko Allocation Machines\u201d. In: arXiv preprint arXiv:1804.07944 (2018).\n[49] R. Suri and M. A. Zazanis. \u201cPerturbation analysis gives strongly consistent sensitivity estimates for the M/G/1 queue\u201d. In: Management Science 34.1 (1988), pp. 39\u201364.\n[50] Y. W. Teh, D. Newman, and M. Welling. \u201cA collapsed variational Bayesian inference algorithm for latent Dirichlet allocation\u201d. In: Advances in Neural Information Processing Systems (2007), pp. 1353\u20131360.\n[51] M. Titsias and M. L\u00e1zaro-Gredilla. \u201cDoubly stochastic variational Bayes for non-conjugate inference\u201d. In: International Conference on Machine Learning. 2014, pp. 1971\u20131979.\n[52] H. M. Wallach, D. M. Mimno, and A. McCallum. \u201cRethinking LDA: Why priors matter\u201d. In: Advances in Neural Information Processing Systems. 2009, pp. 1973\u20131981.\n[53] R. J. Williams. \u201cSimple statistical gradient-following algorithms for connectionist reinforcement learning\u201d. In: Reinforcement Learning (1992), pp. 5\u201332.\n[54] H. Zhang, B. Chen, D. Guo, and M. Zhou. \u201cWHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling\u201d. In: International Conference on Learning Representations (2018).\n", "award": [], "sourceid": 280, "authors": [{"given_name": "Mikhail", "family_name": "Figurnov", "institution": "DeepMind"}, {"given_name": "Shakir", "family_name": "Mohamed", "institution": "DeepMind"}, {"given_name": "Andriy", "family_name": "Mnih", "institution": "DeepMind"}]}