{"title": "Importance Weighting and Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 4470, "page_last": 4479, "abstract": "Recent work used importance sampling ideas for better variational bounds on likelihoods. We clarify the applicability of these ideas to pure probabilistic inference, by showing the resulting Importance Weighted Variational Inference (IWVI) technique is an instance of augmented variational inference, thus identifying the looseness in previous work. Experiments confirm IWVI's practicality for probabilistic inference. As a second contribution, we investigate inference with elliptical distributions, which improves accuracy in low dimensions, and convergence in high dimensions.", "full_text": "Importance Weighting and Variational Inference\n\nJustin Domke1 and Daniel Sheldon1,2\n\n1 College of Information and Computer Sciences, University of Massachusetts Amherst\n\n2 Department of Computer Science, Mount Holyoke College\n\nAbstract\n\nRecent work used importance sampling ideas for better variational bounds on likeli-\nhoods. We clarify the applicability of these ideas to pure probabilistic inference, by\nshowing the resulting Importance Weighted Variational Inference (IWVI) technique\nis an instance of augmented variational inference, thus identifying the looseness in\nprevious work. Experiments con\ufb01rm IWVI\u2019s practicality for probabilistic inference.\nAs a second contribution, we investigate inference with elliptical distributions,\nwhich improves accuracy in low dimensions, and convergence in high dimensions.\n\n1\n\nIntroduction\n\nProbabilistic modeling is used to reason about the world by formulating a joint model p(z, x) for\nunobserved variables z and observed variables x, and then querying the posterior distribution p(z | x)\nto learn about hidden quantities given evidence x. Common tasks are to draw samples from p(z | x)\nor compute posterior expectations. 
However, it is often intractable to perform these tasks directly, so considerable research has been devoted to methods for approximate probabilistic inference.\nVariational inference (VI) is a leading approach for approximate inference. In VI, p(z | x) is approximated by a distribution q(z) in a simpler family for which inference is tractable. The process to select q is based on the following decomposition [22, Eqs. 11-12]:\n\nlog p(x) = E_{q(z)} log [ p(z, x) / q(z) ] + KL[q(z) || p(z|x)],    (1)\n\nwhere the first term is ELBO[q(z) || p(z, x)] and the second is the divergence. The first term is a lower bound of log p(x) known as the \"evidence lower bound\" (ELBO). Selecting q to make the ELBO as big as possible simultaneously obtains a lower bound of log p(x) that is as tight as possible and drives q close to p in KL-divergence.\nThe ELBO is closely related to importance sampling. For fixed q, let R = p(z, x)/q(z) where z ~ q. This random variable satisfies p(x) = E R, which is the foundation of importance sampling. Similarly, we can write by Jensen's inequality that log p(x) >= E log R = ELBO[q || p], which is the foundation of modern \"black-box\" versions of VI (BBVI) [19] in which Monte Carlo samples are used to estimate E log R, in the same way that IS estimates E R.\nCritically, the only property VI uses to obtain a lower bound is p(x) = E R. Further, it is straightforward to see that Jensen's inequality yields a tighter bound when R is more concentrated about its mean p(x). So, it is natural to consider different random variables with the same mean that are more concentrated, for example the sample average R_M = (1/M) Σ_{m=1}^{M} R_m. Then, by identical reasoning, log p(x) >= E log R_M.
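As a minimal numerical sketch of why averaging importance weights tightens Jensen's bound (our own toy conjugate example, not code from the paper: z ~ N(0,1), x|z ~ N(z,1), with q set to the prior so that log R reduces to the log-likelihood), the gap log p(x) - E log R_M can be estimated by Monte Carlo for several M:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.0  # observed data in the toy model z ~ N(0,1), x|z ~ N(z,1)
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # exact log p(x)

def iw_elbo(M, n=200_000):
    """Monte Carlo estimate of E log R_M with q(z) equal to the prior N(0,1),
    so log R = log p(z,x) - log q(z) = log N(x; z, 1)."""
    z = rng.normal(size=(n, M))               # n batches of M draws from q
    lw = norm.logpdf(x, loc=z, scale=1.0)     # log importance weights
    # log (1/M) sum_m exp(lw_m), computed stably per batch
    return (np.logaddexp.reduce(lw, axis=1) - np.log(M)).mean()

for M in [1, 5, 50]:
    print(f"M={M:2d}  gap log p(x) - IW-ELBO = {log_px - iw_elbo(M):.4f}")
```

The printed gap shrinks toward zero as M grows, consistent with R_M being increasingly concentrated about its mean p(x).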
The last quantity is the objective of importance-weighted auto-encoders [5]; we call it the importance weighted ELBO (IW-ELBO), and the process of selecting q to maximize it importance-weighted VI (IWVI).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nHowever, at this point we should pause. The decomposition in Eq. 1 makes it clear exactly in what sense standard VI, when optimizing the ELBO, makes q close to p. By switching to the one-dimensional random variable R_M, we derived the IW-ELBO, which gives a tighter bound on log p(x). For learning applications, this may be all we want. But for probabilistic inference, we are left uncertain exactly in what sense q \"is close to\" p, and how we should use q to approximate p, say, for computing posterior expectations.\nOur first contribution is to provide a new perspective on IWVI by highlighting a precise connection between IWVI and self-normalized importance sampling (NIS) [17], which instructs us how to use IWVI for \"pure inference\" applications. Specifically, IWVI is an instance of augmented VI. Maximizing the IW-ELBO corresponds exactly to minimizing the KL divergence between joint distributions q_M and p_M, where q_M is derived from NIS over a batch of M samples from q, and p_M is the joint distribution obtained by drawing one sample from p and M - 1 \"dummy\" samples from q. This has strong implications for probabilistic inference (as opposed to learning), which is our primary focus. After optimizing q, one should compute posterior expectations using NIS. We show that not only does IWVI significantly tighten bounds on log p(x), but, by using q this way at test time, it significantly reduces estimation error for posterior expectations.\nPrevious work has connected IWVI and NIS by showing that the importance weighted ELBO is a lower bound of the ELBO applied to the NIS distribution [6, 16, 2].
Our work makes this relationship precise as an instance of augmented VI, and exactly quantifies the gap between the IW-ELBO and the conventional ELBO applied to the NIS distribution, which is a conditional KL divergence.\nOur second contribution is to further explore the connection between variational inference and importance sampling by adapting ideas of \"defensive sampling\" [17] to VI. Defensive importance sampling uses a widely dispersed q distribution to reduce variance by avoiding situations where q places essentially no mass in an area where p has density. This idea is incompatible with regular VI due to its \"mode seeking\" behavior, but it is quite compatible with IWVI. We show how to use elliptical distributions and reparameterization to achieve a form of defensive sampling with almost no additional overhead over black-box VI (BBVI). \"Elliptical VI\" provides small improvements over Gaussian BBVI in terms of ELBO and posterior expectations. In higher dimensions, these improvements diminish, but elliptical VI provides a significant improvement in convergence reliability and speed. This is consistent with the notion that a \"defensive\" q distribution is advisable when q is not well matched to p (e.g., before optimization has completed).\n\n2 Variational Inference\n\nConsider again the \"ELBO decomposition\" in Eq. 1. Variational inference maximizes the \"evidence lower bound\" (ELBO) over q. Since the divergence is non-negative, this tightens a lower bound on log p(x). But, of course, since the divergence and the ELBO sum to the constant log p(x), maximizing the ELBO is equivalent to minimizing the divergence.
Thus, variational inference can be thought of as simultaneously solving two problems:\n• \"probabilistic inference\": finding a distribution q(z) that is close to p(z|x) in KL-divergence.\n• \"bounding the marginal likelihood\": finding a lower bound on log p(x).\nThe first problem is typically used with Bayesian inference: a user specifies a model p(z, x), observes some data x, and is interested in the posterior p(z|x) over the latent variables. While Markov chain Monte Carlo is most commonly used for these problems [9, 23], the high computational expense motivates VI [11, 3]. While a user might be interested in any aspect of the posterior, for concreteness, we focus on \"posterior expectations\", where the user specifies some arbitrary t(z) and wants to approximate E_{p(z|x)} t(z).\nThe second problem is typically used to support maximum likelihood learning. Suppose that p_θ(z, x) is some distribution over observed data x and hidden variables z. In principle, one would like to set θ to maximize the marginal likelihood over the observed data. When the integral p_θ(x) = ∫ p_θ(z, x) dz is intractable, one can optimize the lower bound E_{q(z)} log (p_θ(z, x)/q(z)) instead [22], over both θ and the parameters of q. This idea has been used to great success recently with variational auto-encoders (VAEs) [10].\n\n3 Importance Weighting\n\nRecently, ideas from importance sampling have been applied to obtain tighter ELBOs for learning in VAEs [5]. We review the idea and then draw novel connections to augmented VI that make it clear how to apply these ideas to probabilistic inference.\nTake any random variable R such that E R = p(x), which we will think of as an \"estimator\" of p(x).
Then it's easy to see via Jensen's inequality that\n\nlog p(x) = E log R + E log ( p(x) / R ),    (2)\n\nwhere the first term is a lower bound on log p(x), and the second (non-negative) term is the looseness. The bound will be tight if R is highly concentrated.\n\nFigure 1: How the density of R_M changes with M. (Distribution and setting as in Fig. 2.)\n\nWhile Eq. 2 looks quite trivial, it is a generalization of the \"ELBO\" decomposition in Eq. 1. To see that, use the random variable\n\nR = ω(z) = p(z, x) / q(z),  z ~ q,    (3)\n\nwhich clearly obeys E R = p(x), and for which Eq. 2 becomes Eq. 1.\nThe advantage of Eq. 2 over Eq. 1 is increased flexibility: alternative estimators R can give a tighter bound on log p(x). One natural idea is to draw multiple i.i.d. samples from q and average the estimates as in importance sampling (IS). This gives the estimator\n\nR_M = (1/M) Σ_{m=1}^{M} p(z_m, x) / q(z_m),  z_m ~ q.    (4)\n\nIt's always true that E R_M = p(x), but the distribution of R_M places less mass near zero for larger M, which leads to a tighter bound (Fig. 1).\nThis leads to a tighter \"importance weighted ELBO\" (IW-ELBO) lower bound on log p(x), namely\n\nIW-ELBO_M[q(z) || p(z, x)] := E_{q(z_{1:M})} log (1/M) Σ_{m=1}^{M} p(z_m, x) / q(z_m),    (5)\n\nwhere z_{1:M} is a shorthand for (z_1, ..., z_M) and q(z_{1:M}) = q(z_1) ··· q(z_M). This bound was first introduced by Burda et al. [5] in the context of supporting maximum likelihood learning of a variational auto-encoder.\n\n3.1 A generative process for the importance weighted ELBO\n\nWhile Eq. 2 makes clear that optimizing the IW-ELBO tightens a bound on log p(x), it isn't obvious what connection this has to probabilistic inference.
Is there some divergence that is being minimized? Theorem 1 shows this can be understood by constructing \"augmented\" distributions p_M(z_{1:M}, x) and q_M(z_{1:M}) and then applying the ELBO decomposition in Eq. 1 to the joint distributions.\nTheorem 1 (IWVI). Let q_M(z_{1:M}) be the density of the generative process described by Alg. 1, which is based on self-normalized importance sampling over a batch of M samples from q. Let p_M(z_{1:M}, x) = p(z_1, x) q(z_{2:M}) be the density obtained by drawing z_1 and x from p and drawing the \"dummy\" samples z_{2:M} from q. Then\n\nq_M(z_{1:M}) = p_M(z_{1:M}, x) / [ (1/M) Σ_{m=1}^{M} ω(z_m) ].    (6)\n\nFurther, the ELBO decomposition in Eq. 1 applied to q_M and p_M is\n\nlog p(x) = IW-ELBO_M[q(z) || p(z, x)] + KL[q_M(z_{1:M}) || p_M(z_{1:M}|x)].    (7)\n\nAlgorithm 1 A generative process for q_M(z_{1:M})\n1. Draw ẑ_1, ẑ_2, ..., ẑ_M independently from q(z).\n2. Choose m ∈ {1, ..., M} with probability ω(ẑ_m) / Σ_{m'=1}^{M} ω(ẑ_{m'}).\n3. Set z_1 = ẑ_m, let z_{2:M} be the unselected variables ẑ_{-m}, and return z_{1:M}.\n\nWe will call the process of maximizing the IW-ELBO \"Importance Weighted Variational Inference\" (IWVI). (Burda et al. used \"Importance Weighted Auto-encoder\" for optimizing Eq. 5 as a bound on the likelihood of a variational auto-encoder, but this terminology ties the idea to a particular model, and is not suggestive of the probabilistic inference setting.)\nThe generative process for q_M in Alg. 1 is very similar to self-normalized importance sampling. The usual NIS distribution draws a batch of size M, and then \"selects\" a single variable with probability in proportion to its importance weight.
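Algorithm 1 is straightforward to implement. The sketch below is our own illustrative code (the helper arguments logp, logq, and sample_q for the unnormalized log-densities and the sampler of q are hypothetical names, shown for a one-dimensional z); it draws one realization of z_{1:M} from q_M:

```python
import numpy as np

def sample_qM(logp, logq, sample_q, M, rng):
    """One draw z_{1:M} ~ q_M via Alg. 1: draw a batch from q, promote one
    element to position 1 with probability proportional to w(z) = p(z,x)/q(z),
    and keep the unselected draws (relabeled) as z_{2:M}."""
    zhat = sample_q(M, rng)                      # step 1: zhat_1..zhat_M ~ q
    lw = logp(zhat) - logq(zhat)                 # log importance weights
    w = np.exp(lw - lw.max())                    # stable, unnormalized
    m = rng.choice(M, p=w / w.sum())             # step 2: select index m
    rest = np.delete(np.arange(M), m)
    return zhat[np.concatenate(([m], rest))]     # step 3: z_1 = zhat_m, rest follow
```

The marginal distribution of the first returned coordinate is the self-normalized importance sampling distribution q_M(z_1).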
NIS is exactly equivalent to the marginal distribution q_M(z_1). The generative process for q_M(z_{1:M}) additionally keeps the unselected variables and relabels them as z_{2:M}.\nPrevious work [6, 2, 16, 12] investigated a similar connection between NIS and the importance-weighted ELBO. In our notation, they showed that\n\nlog p(x) >= ELBO[q_M(z_1) || p(z_1, x)] >= IW-ELBO_M[q(z) || p(z, x)].    (8)\n\nThat is, they showed that the IW-ELBO lower bounds the ELBO between the NIS distribution and p, without quantifying the gap in the second inequality. Our result makes it clear exactly what KL-divergence is being minimized by maximizing the IW-ELBO and in what sense doing this makes q \"close to\" p. As a corollary, we also quantify the gap in the inequality above (see Thm. 2 below).\nA recent decomposition [12, Claim 1] is related to Thm. 1, but based on different augmented distributions q_M^IS and p_M^IS. This result is fundamentally different in that it holds q_M^IS \"fixed\" to be an independent sample of size M from q, and modifies p_M^IS so its marginals approach q. This does not inform inference. Contrast this with our result, where q_M(z_1) gets closer and closer to p(z_1 | x), and can be used for probabilistic inference. See the appendix (Section A.3.2) for details.\nIdentifying the precise generative process is useful if IWVI will be used for general probabilistic queries, which is a focus of our work, and, to our knowledge, has not been investigated before. For example, the expected value of t(z) can be approximated as\n\nE_{p(z|x)} t(z) = E_{p_M(z_1|x)} t(z_1) ≈ E_{q_M(z_1)} t(z_1) = E_{q(z_{1:M})} [ Σ_{m=1}^{M} ω(z_m) t(z_m) / Σ_{m=1}^{M} ω(z_m) ].    (9)\n\nThe final equality is established by Lemma 4 in the Appendix. Here, the inner approximation is justified since IWVI minimizes the joint divergence between q_M(z_{1:M}) and p_M(z_{1:M}|x).
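In practice the right-hand side of Eq. 9 is estimated with one or a few batches of M samples from q. A minimal sketch (our own illustrative code with hypothetical helper names, one-dimensional z; logp and logq may be unnormalized since constants cancel in the self-normalization):

```python
import numpy as np

def snis_expectation(t, logp, logq, sample_q, M, rng):
    """Self-normalized importance sampling estimate of E_{p(z|x)} t(z):
    a single-batch Monte Carlo estimate of the right-hand side of Eq. 9."""
    z = sample_q(M, rng)                  # z_1..z_M ~ q
    lw = logp(z) - logq(z)                # log w(z_m)
    w = np.exp(lw - lw.max())             # stable, unnormalized weights
    return np.sum(w * t(z)) / np.sum(w)
```

For a conjugate Gaussian model the estimate can be checked against the closed-form posterior mean, since the bias of self-normalized importance sampling vanishes as M grows.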
However, this is not equivalent to minimizing the divergence between q_M(z_1) and p_M(z_1|x), as the following result shows.\nTheorem 2. The marginal and joint divergences relevant to IWVI are related by\n\nKL[q_M(z_{1:M}) || p_M(z_{1:M}|x)] = KL[q_M(z_1) || p(z_1|x)] + KL[q_M(z_{2:M}|z_1) || q(z_{2:M})].\n\nAs a consequence, the gap in the first inequality of Eq. 8 is exactly KL[q_M(z_1) || p(z_1|x)] and the gap in the second inequality is exactly KL[q_M(z_{2:M}|z_1) || q(z_{2:M})].\nThe first term is the divergence between the marginal of q_M, i.e., the \"standard\" NIS distribution, and the posterior. In principle, this is exactly the divergence we would like to minimize to justify Eq. 9. However, the second term is not zero, since the selection phase in Alg. 1 leaves z_{2:M} distributed differently under q_M than under q. Since this term is irrelevant to the quality of the approximation in Eq. 9, IWVI truly minimizes an upper bound. Thus, IWVI can be seen as an instance of auxiliary variational inference [1] where a joint divergence upper-bounds the divergence of interest.\n\nFigure 2: Two Gaussian (N) and two Student-T (T) variational distributions, all with constant variance and one of two means (A or B). (a) The target p and four candidate variational distributions. (b) Reweighted densities q_M(z_1) for each distribution. (c) The IW-ELBO. (Higher is better.) (d) Moment error ||E_{q_M} t(z_1) - E_p t(z)||_2^2 for t(z) = (z, z²). (Lower is better.) For M = 1 it is better to use a mean closer to one mode of p. For large M, a mean in the center is superior, and the heavy tails of the Student T lead to better approximation of p and better performance both in terms of IW-ELBO and moment error.\n\n4 Importance Sampling Variance\n\nThis section considers the choice of family for the variational distribution.
For small M, the mode-seeking behavior of VI will favor weak tails, while for large M, the variance reduction provided by importance weighting will favor wider tails.\nThe most common variational distribution is the Gaussian. One explanation for this is the Bayesian central limit theorem, which, in many cases, guarantees that the posterior is asymptotically Gaussian. Another is that it's \"safest\" to have weak tails: since the objective is E log R, small values of R are most harmful. So, VI wants to avoid cases where q(z) >> p(z, x), which is difficult if q is heavy-tailed. (This is the \"mode-seeking\" behavior of the KL-divergence [24].)\nWith IWVI, the situation changes. Asymptotically in M, R_M in Eq. 4 concentrates around p(x), and so it is the variance of R_M that matters, as formalized in the following result.\nTheorem 3. For large M, the looseness of the IW-ELBO is given by the variance of R. Formally, if there exists some α > 0 such that E|R - p(x)|^{2+α} < ∞ and lim sup_{M→∞} E[1/R_M] < ∞, then\n\nlim_{M→∞} M ( log p(x) - IW-ELBO_M[q(z) || p(z, x)] ) = V[R] / (2 p(x)²).\n\nMaddison et al. [13] give a related result. Their Proposition 1 applied to R_M gives the same conclusion (after an argument based on the Marcinkiewicz-Zygmund inequality; see appendix) but requires the sixth central moment to exist, whereas we require only existence of E|R - p(x)|^{2+α} for any α > 0. The lim sup assumption on E[1/R_M] is implied by assuming that E[1/R_M] < ∞ for any finite M (or for R itself). Rainforth et al. [18, Theorem 1 in Appendix] provide a related asymptotic for errors in gradient variance, assuming at least the third moment exists.\nDirectly minimizing the variance of R is equivalent to minimizing the χ² divergence between q(z) and p(z|x), as explored by Dieng et al. [7].
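The 1/M rate in Theorem 3 is easy to probe numerically. The sketch below is our own toy check (not from the paper's experiments): z ~ N(0,1), x|z ~ N(z,1), x = 1, with q equal to the prior, so R = N(x; z, 1); it compares M times the bound gap against the predicted limit V[R] / (2 p(x)²):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.0
log_px = norm.logpdf(x, scale=np.sqrt(2.0))        # exact log p(x) for the toy model

def gap(M, n=100_000):
    """Monte Carlo estimate of log p(x) - IW-ELBO_M with q equal to the prior."""
    lw = norm.logpdf(x, loc=rng.normal(size=(n, M)))   # log R for each draw
    return log_px - (np.logaddexp.reduce(lw, axis=1) - np.log(M)).mean()

R = np.exp(norm.logpdf(x, loc=rng.normal(size=1_000_000)))
limit = R.var() / (2.0 * np.exp(log_px) ** 2)      # V[R] / (2 p(x)^2)
for M in [4, 16, 64]:
    print(f"M={M:3d}  M * gap = {M * gap(M):.3f}  (limit {limit:.3f})")
```

For this model the weights are bounded, so M times the gap approaches the limit quickly.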
Overdispersed VI [21] reduces the variance of score-function estimators using heavy-tailed distributions.\nThe quantity inside the parentheses on the left-hand side is exactly the KL-divergence between q_M and p_M in Eq. 7, and accordingly, even for constant q, this divergence asymptotically decreases at a 1/M rate.\nThe variance of R is a well-explored topic in traditional importance sampling. Here the situation is reversed from traditional VI: since R is non-negative, it is very large values of R that can cause high variance, which occurs when q(z) << p(z, x). The typical recommendation is \"defensive sampling\", or using a widely-dispersed proposal [17]. For these reasons, we believe that the best form for q will vary depending on the value of M. Figure 1 explores a simple example of this in 1-D.\n\n5 Elliptical Distributions\n\nElliptical distributions are a generalization of Gaussians that includes the Student-T, Cauchy, scale-mixtures of Gaussians, and many others. The following short review assumes a density function exists, enabling a simpler presentation than the typical one based on characteristic functions [8].\nWe first describe the special case of spherical distributions. Take some density ρ(r) for a non-negative r with ∫_0^∞ ρ(r) dr = 1. Define the spherical random variable ε corresponding to ρ as\n\nε = r u,  r ~ ρ,  u ~ S,    (10)\n\nwhere S represents the uniform distribution over the unit sphere in d dimensions. The density of ε can be found using two observations. First, it is constant for all ε with a fixed radius ||ε||. Second, if q_ε(ε) is integrated over {ε : ||ε|| = r}, the result must be ρ(r).
Using these, it is not hard to show that the density must be\n\nq_ε(ε) = g(||ε||_2^2),  g(a) = ρ(√a) / ( S_{d-1} a^{(d-1)/2} ),    (11)\n\nwhere S_{d-1} is the surface area of the unit sphere in d dimensions (and so S_{d-1} a^{(d-1)/2} is the surface area of the sphere with radius √a) and g is the density generator.\nGeneralizing this, take some positive definite matrix Σ and some vector μ. Define the elliptical random variable z corresponding to ρ, Σ, and μ by\n\nz = r Aᵀu + μ,  r ~ ρ,  u ~ S,    (12)\n\nwhere A is some matrix such that AᵀA = Σ. Since z is an affine transformation of ε, it is not hard to show by the \"Jacobian determinant\" formula for changes of variables that the density of z is\n\nq(z | μ, Σ) = |Σ|^{-1/2} g( (z - μ)ᵀ Σ^{-1} (z - μ) ),    (13)\n\nwhere g is again as in Eq. 11. The mean and covariance are E[z] = μ and C[z] = (E[r²]/d) Σ.\nFor some distributions, ρ(r) can be found from observing that r has the same distribution as ||ε||. For example, with a Gaussian, r² = ||ε||² is a sum of d i.i.d. squared Gaussian variables, and so, by definition, r ~ χ_d.\n\n6 Reparameterization and Elliptical Distributions\n\nSuppose the variational family q(z|w) has parameters w to optimize during inference. The reparameterization trick is based on finding some density q_ε(ε) independent of w and a \"reparameterization function\" T(ε; w) such that T(ε; w) is distributed as q(z|w). Then, the ELBO can be re-written as\n\nELBO[q(z|w) || p(z, x)] = E_{q_ε(ε)} log [ p(T(ε; w), x) / q(T(ε; w)|w) ].\n\nThe advantage of this formulation is that the expectation is independent of w.
Thus, computing the gradient of the term inside the expectation for a random ε gives an unbiased estimate of the gradient.\nBy far the most common case is the multivariate Gaussian distribution, in which case the base density q_ε(ε) is just a standard Gaussian and, for some A_w such that A_wᵀA_w = Σ_w,\n\nT(ε; w) = A_wᵀ ε + μ_w.    (14)\n\n6.1 Elliptical Reparameterization\n\nTo understand Gaussian reparameterization from the perspective of elliptical distributions, note the similarity of Eq. 14 to Eq. 12. Essentially, the reparameterization in Eq. 14 combines r and u into ε = ru. This same idea can be applied more broadly: for any elliptical distribution, provided the density generator g is independent of w, the reparameterization in Eq. 14 will be valid, provided that ε comes from the corresponding spherical distribution.\nWhile this independence is true for Gaussians, it is not the case for other elliptical distributions. If ρ_w itself is a function of w, Eq. 14 must be generalized. In that case, think of the generative process (for v sampled uniformly from [0, 1])\n\nT(u, v; w) = F_w^{-1}(v) A_wᵀ u + μ_w,    (15)\n\nwhere F_w^{-1}(v) is the inverse CDF corresponding to the distribution ρ_w(r). Here, we should think of the vector (u, v) as playing the role of ε above, and the base density q_{u,v}(u, v) as being a spherical density for u and a uniform density for v.\nTo calculate derivatives with respect to w, backpropagation through A_w and μ_w is simple using any modern autodiff system. So, if the inverse CDF F_w^{-1} has a closed form, autodiff can be directly applied to Eq. 15.
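As a sanity check of Eq. 15, the sketch below (our own numpy code, no autodiff) uses the χ_d inverse CDF as F^{-1}; by the discussion in Section 5, the resulting samples must be Gaussian with mean μ and covariance Σ = AᵀA:

```python
import numpy as np
from scipy.stats import chi

rng = np.random.default_rng(0)
d, n = 3, 200_000
mu = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.3, 0.0],
              [0.0, 1.0, 0.2],
              [0.0, 0.0, 0.5]])                # Sigma = A.T @ A

u = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # u ~ uniform on the unit sphere
v = rng.uniform(size=(n, 1))
r = chi.ppf(v, df=d)                           # r = F^{-1}(v) with a chi_d radial law

z = r * (u @ A) + mu                           # Eq. 15; row i of (u @ A) is A.T @ u_i
print(np.round(z.mean(axis=0), 2))             # should be close to mu
print(np.round(np.cov(z, rowvar=False), 2))    # should be close to A.T @ A
```

Swapping chi.ppf for the inverse CDF of any other radial law ρ_w gives the corresponding elliptical family with no other change.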
If the inverse CDF does not have a simple closed form, the following section shows that only the CDF is actually needed, provided that one can at least sample from ρ(r).\n\n6.2 Dealing with CDFs without closed-form inverses\n\nFor many distributions ρ, the inverse CDF may not have a simple closed form, yet highly efficient samplers still exist (most commonly custom rejection samplers with very high acceptance rates). In such cases, one can still achieve the effect of Eq. 15 on a random v using only the CDF (not the inverse). The idea is to first directly generate r ~ ρ_w using the specialized sampler, and only then find the corresponding v = F_w(r) using the closed-form CDF. To understand this, observe that if r ~ ρ_w and v ~ Uniform[0, 1], then the pairs (r, F_w(r)) and (F_w^{-1}(v), v) are identically distributed. Then, via the implicit function theorem, ∇_w F_w^{-1}(v) = -∇_w F_w(r) / ∇_r F_w(r). All gradients can then be computed by \"pretending\" that one had started with v and computed r using the inverse CDF.\n\n6.3 Student T distributions\n\nThe following experiments will consider Student T distributions. The spherical T distribution can be defined as ε = √ν ζ / s, where ζ ~ N(0, I) and s ~ χ_ν [8]. Equivalently, write r = ||ε|| = √ν t / s with t ~ χ_d. This shows that r is the ratio of two independent variables, and thus determined by an F-distribution, the CDF of which could be used directly in Eq. 15. We found a slightly \"bespoke\" simplification helpful. As there is no need for gradients with respect to d (which is fixed), we represent ε as ε = (√ν t / s) u, leading to reparameterizing the elliptical T distribution as\n\nT(u, t, v; w) = ( √ν t / F_ν^{-1}(v) ) A_wᵀ u + μ_w,\n\nwhere F_ν is the CDF for the χ_ν distribution.
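A sketch of this Student-T reparameterization (our own code; for simplicity t is also drawn through the χ_d inverse CDF and gradients are omitted). With A = I and μ = 0, each output coordinate should be marginally Student-T with ν degrees of freedom:

```python
import numpy as np
from scipy.stats import chi, t as student_t

rng = np.random.default_rng(0)
d, nu, n = 2, 5.0, 300_000
mu, A = np.zeros(d), np.eye(d)                   # spherical case, easy to verify

u = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)    # u ~ uniform on the unit sphere
t_rad = chi.ppf(rng.uniform(size=(n, 1)), df=d)  # t ~ chi_d (radius of a N(0, I_d))
v = rng.uniform(size=(n, 1))
s = chi.ppf(v, df=nu)                            # s = F_nu^{-1}(v), a chi_nu draw

z = (np.sqrt(nu) * t_rad / s) * (u @ A) + mu     # elliptical Student-T samples

# each coordinate is marginally Student-T with nu degrees of freedom
print(np.quantile(z[:, 0], [0.1, 0.5, 0.9]))
print(student_t.ppf([0.1, 0.5, 0.9], df=nu))
```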
This is convenient since the CDF of the χ distribution is more widely available than that of the F distribution.\n\n7 Experiments\n\nAll the following experiments compare \"E-IWVI\" using Student T distributions to \"IWVI\" using Gaussians. Regular \"VI\" is equivalent to IWVI with M = 1.\nWe consider experiments on three distributions. In the first two, a computable log p(x) enables estimation of the KL-divergence, and the computable true mean and variance of the posterior enable a precise evaluation of test-integral estimation. On these, we used a fixed set of 10,000 × M random inputs to T and optimized using batch L-BFGS, avoiding heuristic tuning of a learning-rate sequence.\nA first experiment considered random Dirichlet distributions p(θ|α) over the probability simplex in K dimensions, θ ∈ Δ_K. Each parameter α_k is drawn i.i.d. from a Gamma distribution with a shape parameter of 10. Since this density is defined only over the probability simplex, we borrow from Stan the strategy of transforming to an unconstrained z ∈ R^{K-1} space via a stick-breaking process [23]. To compute test integrals over variational distributions, the reverse transformation is used. Results are shown in Fig. 3.\n\nFigure 3: Random Dirichlets, averaged over 20 repetitions. Top left shows an example posterior for K = 3. The test-integral error is ||C[θ] - Ĉ[θ]||_F, where Ĉ is the empirical covariance of samples drawn from q_M(z_1) and then transformed to Δ_K. In all cases, IWVI is able to reduce the error of VI to negligible levels. E-IWVI provides an accuracy benefit in low dimensions but little when K = 20.\n\nA second experiment uses Minka's \"clutter\" model [15]: z ∈ R^d is a hidden object location, and x = (x_1, ..., x_n) is a set of n noisy observations, with p(z) = N(z; 0, 100 I) and p(x_i|z) = 0.25 N(x_i; z, I) + 0.75 N(x_i; 0, 10 I). The posterior p(z | x) is a mixture of 2^n Gaussians, for which we can do exact inference for moderate n. Results are shown in Fig. 4.\n\nFigure 4: Clutter distributions, averaged over 50 repetitions. The error shown is the error in the estimated second moment E[z zᵀ]. IWVI reduces the errors of VI by orders of magnitude. E-IWVI provides a diminishing benefit in higher dimensions.\n\nFigure 5: Logistic regression comparing IWVI (red) and E-IWVI (blue) with various M (M = 1, 5, 20, 100) and step sizes. The IW-ELBO is shown after 2,000 (dashed lines) and 10,000 (solid) iterations. A larger M consistently improves both methods. E-IWVI converges more reliably, particularly on higher-dimensional data. From top: Madelon (d = 500), Sonar (d = 60), Mushrooms (d = 112).\n\nFinally, we considered a (non-conjugate) logistic regression model with a Cauchy prior with a scale of 10, using stochastic gradient descent with various step sizes. On these higher-dimensional problems, we found that when the step size was perfectly tuned and optimization had many iterations, both methods performed similarly in terms of the IW-ELBO. E-IWVI never performed worse, and sometimes performed very slightly better. E-IWVI exhibited superior convergence behavior and was easier to tune, as illustrated in Fig. 5, where E-IWVI converges at least as well as IWVI for all step sizes.
We suspect this is because when w is far from optimal, both the IW-ELBO and the gradient variance are better with E-IWVI.\n\nAcknowledgements\n\nWe thank Tom Rainforth for insightful comments regarding asymptotics and Theorem 3 and Linda Siew Li Tan for comments regarding Lemma 7. This material is based upon work supported by the National Science Foundation under Grant No. 1617533.\n\nReferences\n\n[1] Felix V. Agakov and David Barber. An auxiliary variational method. In Neural Information Processing, Lecture Notes in Computer Science, pages 561-566. Springer, Berlin, Heidelberg, 2004.\n[2] Philip Bachman and Doina Precup. Training deep generative models: Variations on a theme. In NIPS Workshop: Advances in Approximate Bayesian Inference, 2015.\n[3] Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In NIPS, 2017.\n[4] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Volume I. CRC Press, 2015.\n[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. 2015.\n[6] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. 2017.\n[7] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In NIPS, pages 2729-2738, 2017.\n[8] Kai-Tai Fang, Samuel Kotz, and Kai Wang Ng. Symmetric Multivariate and Related Distributions. Number 36 in Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.\n[9] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. 43(1):169-177, 1994.\n[10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n[11] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. 18(14):1-45, 2017.\n[12] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-encoding sequential Monte Carlo. In ICLR, 2018.\n[13] Chris J. Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. In NIPS, pages 6576-6586, 2017.\n[14] Józef Marcinkiewicz and Antoni Zygmund. Quelques théorèmes sur les fonctions indépendantes. Fund. Math., 29:60-90, 1937.\n[15] Thomas Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.\n[16] Christian A. Naesseth, Scott W. Linderman, Rajesh Ranganath, and David M. Blei. Variational sequential Monte Carlo. In AISTATS, 2018.\n[17] Art Owen. Monte Carlo Theory, Methods and Examples. 2013.\n[18] Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In ICML, 2018.\n[19] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In AISTATS, 2014.\n[20] Gabriel Romon. Bounds on moments of sample mean. https://math.stackexchange.com/questions/2901196/bounds-on-moments-of-sample-mean, 2018.\n[21] Francisco J. R. Ruiz, Michalis K. Titsias, and David M. Blei. Overdispersed black-box variational inference. In UAI, 2016.\n[22] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.\n[23] Stan Development Team. Modeling language user's guide and reference manual, version 2.17.0, 2017.\n[24] Thomas Minka. Divergence measures and message passing.
2005.\n", "award": [], "sourceid": 2196, "authors": [{"given_name": "Justin", "family_name": "Domke", "institution": "University of Massachusetts, Amherst"}, {"given_name": "Daniel", "family_name": "Sheldon", "institution": "University of Massachusetts Amherst"}]}