{"title": "Using Large Ensembles of Control Variates for Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 9960, "page_last": 9970, "abstract": "Variational inference is increasingly being addressed with stochastic optimization. In this setting, the gradient's variance plays a crucial role in the optimization procedure, since high variance gradients lead to poor convergence. A popular approach used to reduce gradient's variance involves the use of control variates. Despite the good results obtained, control variates developed for variational inference are typically looked at in isolation. In this paper we clarify the large number of control variates that are available by giving a systematic view of how they are derived. We also present a Bayesian risk minimization framework in which the quality of a procedure for combining control variates is quantified by its effect on optimization convergence rates, which leads to a very simple combination rule. Results show that combining a large number of control variates this way significantly improves the convergence of inference over using the typical gradient estimators or a reduced number of control variates.", "full_text": "Using Large Ensembles of Control Variates for\n\nVariational Inference\n\nCollege of Information and Computer Science\n\nCollege of Information and Computer Science\n\nJustin Domke\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\ndomke@cs.umass.edu\n\nTomas Geffner\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\ntgeffner@cs.umass.edu\n\nAbstract\n\nVariational inference is increasingly being addressed with stochastic optimization.\nIn this setting, the gradient\u2019s variance plays a crucial role in the optimization proce-\ndure, since high variance gradients lead to poor convergence. A popular approach\nused to reduce gradient\u2019s variance involves the use of control variates. 
Despite the good results obtained, control variates developed for variational inference are typically looked at in isolation. In this paper we clarify the large number of control variates that are available by giving a systematic view of how they are derived. We also present a Bayesian risk minimization framework in which the quality of a procedure for combining control variates is quantified by its effect on optimization convergence rates, which leads to a very simple combination rule. Results show that combining a large number of control variates this way significantly improves the convergence of inference over using the typical gradient estimators or a reduced number of control variates.

1 Introduction

Variational Inference (VI) [29, 2, 11] is a framework for approximate probabilistic inference. It has been successfully applied in several areas including topic modeling [3, 21], generative models [13, 5, 22], reinforcement learning [6], and parsing [15], among others. Recently, VI has been able to address a wider range of problems by adopting a "black box" [25] view based on only evaluating the value or gradient of the target distribution. The target can then be optimized via stochastic gradient descent. It is desirable to reduce the variance of the gradient estimate, since this governs convergence. Control variates (CVs), a classical technique from statistics, are often used to accomplish this.
This paper investigates how to use many CVs in concert. We present a systematic view of existing CVs, which starts by splitting the exact gradient into four terms (Eq. 2). Then, a CV is obtained by application of a generic "recipe": pick a term, possibly approximate it, and take the difference of two estimators (Fig. 2). This suggests many possible CVs, including some seemingly not used before.
With many possible CVs, one can naturally ask how to use many together. In principle, the optimal combination is well known (Eq. 6). 
However, this requires unknown (intractable) expectations. We address this using decision theory. The goal is a "decision rule" that takes a minibatch of evaluations together with the set of CVs to be used, and returns a gradient estimate. We adopt a Bayesian risk measuring how gradient variance impacts convergence rates of stochastic optimization, with a simple prior over gradients and sets of CVs. A simple optimal decision rule emerges, in which the intractable expectations are replaced with "regularized" empirical estimates (Thm. 4.1). To share information across iterations, we suggest combining this Bayesian approach with exponential averaging by using an "effective" minibatch size.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We demonstrate practicality on logistic regression problems, where careful combination of many CVs improves performance. For all learning rates, convergence is improved over any single CV.

1.1 Contributions

The contribution of this work is twofold. First, in Section 3, we propose a systematic view of how to generate many existing control variates. Second, we propose an algorithm to use multiple control variates simultaneously, described in Section 4. 
As shown in Section 5, combining these two ideas results in gradients with low variance that allow the use of larger learning rates, while retaining convergence.

2 Preliminaries

Figure 1: An example of how combining control variates reduces gradient variance for the same sequence of weights (australian dataset).

Variational Inference (VI) works by transforming an inference problem into an optimization, by decomposing the marginal likelihood of the observed data x given latent variables z as:

log p(x) = E_{Z∼qw}[ log ( p(Z, x) / qw(Z) ) ] + KL( qw(Z) ‖ p(Z|x) ),

where the first term is the ELBO(w) and the second is the KL-divergence. Here, the variational distribution qw(z) is used to approximate the true posterior distribution p(z|x). VI's goal is to find the parameters w that minimize the KL-divergence between qw(z) and the true posterior p(z|x). Since log p(x) does not depend on w, minimizing the KL-divergence is equivalent to maximizing the ELBO (Evidence Lower BOund).
Historically, models and variational families for which expectations were simple enough to allow closed-form updates of w were used [2, 3, 32]. However, for more complex models, closed-form expressions are usually not available, which has led to widespread use of stochastic optimization methods [8, 18, 19, 20, 26]. These require approximating the target's gradient

g(w) = ∇w ELBO(w) = ∇w E_{Z∼qw}[ log p(Z, x) − log qw(Z) ].   (1)

Good gradient estimates play an important role, since high variance negatively impacts convergence and optimization speed. 
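To make Eq. 1 concrete, the following is a minimal numpy sketch of a single-sample reparameterization estimate of the ELBO gradient for a diagonal-Gaussian q_w. It is illustrative only and not from the paper: `log_p_grad` is a hypothetical stand-in for a real model's gradient, and the entropy gradient of the Gaussian is used in closed form.

```python
import numpy as np

def log_p_grad(z):
    # Hypothetical stand-in model: grad_z log p(z, x) for p = N(0, I),
    # i.e. -z. A real model would supply this gradient instead.
    return -z

def elbo_grad_estimate(mu, log_sigma, rng):
    """Single-sample reparameterization estimate of the ELBO gradient for
    q_w = N(mu, diag(exp(log_sigma)^2)), with the entropy term in closed form."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I): fixed base distribution
    z = mu + sigma * eps                 # z = T_w(eps)
    gz = log_p_grad(z)                   # grad_z log p(z, x) at the sample
    g_mu = gz                            # chain rule: dz/dmu = I
    # dz/dlog_sigma = sigma * eps; the Gaussian entropy contributes +1 per dim.
    g_log_sigma = gz * sigma * eps + 1.0
    return g_mu, g_log_sigma

rng = np.random.default_rng(0)
g_mu, g_log_sigma = elbo_grad_estimate(np.zeros(2), np.zeros(2), rng)
```

At the optimum of this toy setup (mu = 0, sigma = 1) the estimates average to zero, but individual draws do not; that residual variance is exactly what control variates target.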
Several methods have been developed to improve gradient estimates, including Rao-Blackwellization [20], control variates [7, 17, 18, 19, 20, 28, 30, 33], closed-form solutions for certain expectations [27], discarding terms [23], and different estimators.

2.1 Control variates

A control variate (CV) is a random variable with expectation zero that is added to another random variable in the hope of reducing variance. Let X be a random variable with unknown mean, and let C be a random variable with mean zero. Then for any scalar a, Y = X + aC has the same expectation as X but (usually) different variance. A standard result from statistics is that the value of a that minimizes the variance of Y is a = −Cov(X, C)/Var(C), for which Var(Y) = Var(X)(1 − Corr(X, C)²). Thus, a good control variate for X is a random variable C that is highly correlated with X.

3 Systematic generation of control variates

This section gives a generic recipe for creating control variates (Fig. 2) and reviews how existing control variates are instances of it (see also Sec. 6.4 in the appendix). 
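The variance reduction above can be checked numerically. The following sketch is a standard textbook example, not from the paper: it estimates E[e^U] for U ∼ Uniform(0, 1) using C = U − 1/2 as a control variate, with the variance-minimizing coefficient a = −Cov(X, C)/Var(C) for Y = X + aC.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
x = np.exp(u)      # X = e^U; E[X] = e - 1
c = u - 0.5        # control variate: E[C] = 0 by construction

# Variance-minimizing coefficient for Y = X + a*C, estimated from the samples.
a = -np.cov(x, c)[0, 1] / np.var(c)
y = x + a * c      # same expectation as X, much lower variance

# Var(Y) shrinks by the factor 1 - Corr(X, C)^2, which here is close to zero
# because e^U is nearly linear in U on (0, 1).
```

Since Corr(X, C) is very high here, the variance of Y is a small fraction of the variance of X while the estimate of E[X] is unchanged.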
We begin by splitting the ELBO gradient into four terms as

g(w) = ∇w E_{qw}[ log p(x|Z) ] + ∇w E_{qw}[ log p(Z) ] − ∇w E_{qw}[ log qv(Z) ]|_{v=w} − ∇w E_{qv}[ log qw(Z) ]|_{v=w},   (2)

where the four terms are, respectively, g1(w) (data term), g2(w) (prior term), g3(w) (variational term), and g4(w) (score term).

Figure 2: Generic control variate recipe: pick a term t(w) (part of g1, g2, g3); optionally approximate it by some t̃; form two estimates T and T′ of t̃ (SF, RP, CF, etc.); take the difference T − T′. (SF: score function, RP: reparameterization, CF: closed form.) Sec. 6.4 (appendix) casts several existing ideas [23, 19, 17, 30, 28, 7] as instances of this recipe.

The first three terms all correspond to the influence of w on the expectation of some function independent of w. Control variates for these terms, and for any combination of them, are discussed in Sec. 3.1-3.2. The score term, discussed in Sec. 3.3, is different, since the function inside the expectation depends on w. (Roeder et al. 
[23] give a related decomposition, albeit specifically for reparameterization estimators.)

3.1 Control Variates from Pairs of estimators

The basic technique for deriving CVs is to take the difference between a pair of unbiased estimators of a general term t(w) (any of g1, g2, g3 or a combination of them), which must therefore have expectation zero. The terms g1, g2 and g3 are all the expectation (over qw) of some function f (independent of w).¹ Thus, t(w) can be written as

t(w) = ∇w E_{qw(Z)}[ f(Z) ]   or   t(w) = ( ∇w E_{qw(Z)}[ fv(Z) ] )|_{v=w}.

Many methods exist to estimate gradients of this type. Mathematically, we think of these as random variables (with a corresponding generation algorithm). A few estimators are summarized in Eq. 3 (dropping the dependence of f on v). If we write T^a for an estimator of t(w) using method a, then

T^SF  = f(Z) ∇w log qw(Z),                 Z ∼ qw                (Score function)
T^RP1 = ∇w f(T¹_w(ε)),                     ε ∼ q̄                 (Reparameterization)
T^RP2 = ∇w f(T²_w(ε)),                     ε ∼ q̄                 (Other reparam.)
T^GR  = f(Z) ∇w log qw(Z) + ∇w f(Tw(ε)),   ε ∼ q̄w, Z = Tw(ε)     (Gen. reparam.)
T^CF  = ∇w E_{qw}[ f(Z) ]                                        (Closed form)
(3)

Score function (SF) estimation, or REINFORCE [31], uses the equality ∇w qw(z) = qw(z) ∇w log qw(z) [18, 20]. This gives t(w) = E_{qw} T^SF, with T^SF as in Eq. 3. Unbiased estimates of the gradient can be obtained using Monte Carlo sampling, with samples from qw(z).
Reparameterization (RP) estimators [13, 17, 26] are based on splitting the procedure to sample from qw into sampling and transformation steps. 
First, sample ε ∼ q̄(ε); second, transform z = Tw(ε). Here, q̄ is a fixed distribution (indep. of w) and Tw is a deterministic transformation. When sampling is done this way, it follows that E_{qw} f(Z) = E_{q̄} f(Tw(ε)), rendering the expectation independent of w. The general term can therefore be written as t(w) = E_{q̄} T^RP, with T^RP = ∇w f(Tw(ε)). The multivariate Gaussian distribution N(µw, Σw) illustrates this: a sample can be generated by drawing ε ∼ N(0, I) and setting Tw(ε) = Mw ε + µw, where Mw is a matrix such that Mw Mwᵀ = Σw.
Multiple reparameterizations are typically possible. For example, the above estimator for the multivariate Gaussian is valid with any Mw such that Mw Mwᵀ = Σw. For instance, Mw could be a lower triangular matrix obtained via the Cholesky factorization of Σw [4, 26]. (Often, entries of w directly specify entries in the Cholesky factorization, obviating the need to explicitly compute it.) Another option is the matrix square root of Σw [14]. All valid reparameterizations give unbiased gradients, but with different statistical properties.

¹For g1, g2, and g3, use f(z) = log p(x|z), f(z) = log p(z), and fv(z) = log qv(z), respectively.

Generalized reparameterization (GR) is intended for distributions where reparameterization is not applicable, e.g. the gamma or beta [24]. Take a transformation Tw and a base distribution q̄w(ε) (both dependent on w) such that Tw(ε) is distributed identically to qw(Z). Then, E_{qw} f(Z) = E_{q̄w} f(Tw(ε)). The dependence of this expectation on w is mediated partially through w's influence on q̄w and partially through w's influence on Tw. This leads to a representation of a general term as t(w) = E_{q̄w(ε)} T^GR, where T^GR is as in Eq. 3. 
This essentially has a score function-like term and a reparameterization-like term, corresponding to w's influence on q̄ and Tw, respectively.
Closed form (CF) expressions are sometimes available for general terms involving g2 and g3, but rarely for g1. This is because a closed-form expression requires q and f to be simple enough, which is rarely the case for the data term g1; that term is usually estimated with one of the methods described above [17, 19, 20, 24]. However, there are some cases for which g1 can be computed exactly [4].
Data subsampling is often applied to the data term g1 [12]. If the likelihood treats x as i.i.d., then f(z) = log p(x|z) can be approximated without bias from a minibatch of data. If fD(z) is that estimate, an equivalent representation of the data term is g1(w) = E_D ∇w E_{qw(Z)} fD(Z), where D is uniform over subsets of data. Thus, one can define an unbiased estimator by using one of the techniques above (to cope with E_{qw(Z)}) on a random minibatch D (to cope with E_D). With large datasets this can be much faster, but sampling D acts as an additional source of variance.

3.2 Control Variates from approximations

The previous section used the fact that the difference of two unbiased estimators of a term has expectation zero, and so is a control variate. Another class of control variates uses the insight that if a general term t(w) is replaced with an approximation, the difference between two estimators of the (approximate) general term still produces a valid control variate. The motivation is that approximations might allow the use of high-quality estimators (e.g. a closed form) not otherwise available.
Fundamentally, the randomness in the above estimators is due to two types of sampling. 
First, expectations over qw are approximated by sampling, introducing "distributional sampling error". Second, with large data, the data term can be approximated by drawing a minibatch, introducing "data subsampling error". Approximations to terms have been devised so that expectations (either over qw or the full dataset) can be efficiently computed.
Correcting for distributional sampling: Here, the goal is to approximate f with some function f̃ so as to make E[f̃(Z)] easier to estimate – typically so that it admits a closed-form solution. Paisley et al. [19] approximate the data term with either a Taylor approximation in z or a bound, and then define a control variate as the difference between E[f̃(Z)] computed exactly and its estimator using the score function method; this greatly reduces the variance of their score function gradient estimate. Miller et al. [17] also use a Taylor approximation of the data term, but use the difference between E[f̃(Z)] computed exactly and its estimator using reparameterization. They use this control variate together with a base gradient estimate obtained via reparameterization.
Correcting for data subsampling: As discussed in Sec. 3.1, it is common with large datasets to define estimators for the data term that only evaluate the likelihood on random subsets of data. To reduce the variance introduced by this subsampling, Wang et al. [30] propose to approximate fD(z) with a Taylor expansion in x, leading to an approximate data term g̃1(z) = ∇w E_{qw} E_D f̃D(z). For some models the inner expectation (over D) can be computed efficiently by caching the 1st and 2nd order empirical moments of the data. Since the outer expectation (over qw) usually remains intractable, a final control variate is obtained by applying one of the estimation methods described in Sec. 
3.1 (SF, RP, etc.) to both fD(z) and E_D fD(z) and taking the difference.
Both correction mechanisms described above represent particular scenarios that are included in the proposed framework shown in Fig. 2, which also includes other control variates based on approximations. First, it imposes no restrictions on other approximations, such as ones based on approximating the distribution qw instead of f. And second, it includes control variates based on the difference of two estimates of an approximate general term, despite neither being CF. These two ideas are used in the control variate introduced by Tucker et al. [28], who use a continuous relaxation [9, 16] to approximate the distribution qw (discrete in this case), and construct a control variate by taking the difference between the SF and RP estimates of the resulting term based on the relaxation.

Following a similar idea, Grathwohl et al. [7] use a neural network as a surrogate for f, and use as a control variate the difference between the SF and RP estimates of the term involving the surrogate.

3.3 Control variate from the score term (g4)

It is easy to show that the score term is always zero, i.e. g4(w) = 0 (proof in appendix). Thus, it does not need to be estimated. However, since it has expectation zero, one can use the naive control variate T4 = ∇w log qw(Z), Z ∼ qw [20, 23].

4 Combining multiple control variates

In order to use control variates we need to define a base gradient estimator h(w) ∈ R^D and a set of control variates, {c1, ..., cL}, ci ∈ R^D, that we want to use to reduce the base gradient's variance. 
We multiply each control variate ci by a scalar weight ai to get the estimator

ĝ(w) = h(w) + Σ_{i=1}^{L} ai ci(w).   (4)

Defining a ∈ R^L as the vector of weights and C ∈ R^{D×L} as the matrix with ci as the i-th column, ĝ can be equivalently expressed as

ĝ(w) = h(w) + C(w) a.   (5)

The goal is to find a such that the final gradient has low variance. This follows from theoretical results on stochastic optimization with a first-order unbiased gradient oracle, which indicate that convergence is governed by the expected squared norm E‖ĝ‖² of the gradient oracle [1]; this is equivalent (up to a constant) to the trace of the variance. In particular, in the case in which the CVs are all differences between unbiased estimators for different terms, finding the optimal a is equivalent to finding the best affine combination of the estimators.²
Lemma 4.1. Let h(w) ∈ R^D be a random variable and C(w) ∈ R^{D×L} a matrix of random variables such that each element has mean zero. For a ∈ R^L, define ĝ(w) = h(w) + C(w)a. The value of a that minimizes E‖ĝ(w)‖² for a given w is

a*(w) = − E_{p(C,h|w)}[ CᵀC ]⁻¹ E[ Cᵀh ].   (6)

Variants of this result are known [30]. Of course, this requires the expectations E[CᵀC] and E[Cᵀh], which are usually not available in closed form. One solution is, given some observed gradients h1, ..., hM and control variates C1, ..., CM, to estimate a* using empirical expectations in place of the true ones. However, this approach does not account for how errors in the estimates of these expectations affect a, and therefore the final variance of ĝ.

4.1 Bayesian regularization

We deal with this problem from a "risk minimization" perspective. We imagine that the joint distribution over C and h is governed by some (unknown) parameter vector θ. 
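The empirical plug-in version of Eq. 6 discussed above can be sketched as follows. This is an illustrative sketch, not the paper's code; shapes and names are assumptions.

```python
import numpy as np

def plugin_weights(C_samples, h_samples):
    """Plug-in estimate of Eq. 6: a = -E[C^T C]^{-1} E[C^T h], with the
    expectations replaced by empirical averages over M evaluations.

    C_samples: shape (M, D, L) evaluations of the CV matrix C(w).
    h_samples: shape (M, D) evaluations of the base gradient h(w).
    """
    CtC = np.mean([C.T @ C for C in C_samples], axis=0)                     # (L, L)
    Cth = np.mean([C.T @ h for C, h in zip(C_samples, h_samples)], axis=0)  # (L,)
    return -np.linalg.solve(CtC, Cth)
```

With a single CV that exactly equals the zero-mean noise in h, this rule recovers a ≈ −1 and cancels the noise. With few samples, however, the empirical moments are themselves noisy, which is precisely the problem the Bayesian regularization of Sec. 4.1 addresses.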
Then, we can define the loss for selecting the vector of weights a when the true parameter vector is θ as

L(a, θ) = E_{C,h|θ} ‖h + Ca‖².

We seek a "decision rule"

α(C1, h1, ..., CM, hM)

that takes as input a "minibatch" of M evaluations of h and C and returns a weight vector a. Then, for a pre-specified probabilistic model p(C, h, θ), we can define the Bayesian regret as

BayesRegret(α) = E_θ E_{C1,h1,...,CM,hM | θ}[ L(α(C1, h1, ..., CM, hM), θ) ].

²Intuitively, given two estimators, if one is used as the base estimator and the difference as a CV, then finding the best weight for that CV is equivalent to finding the best mixture of the estimators.

The following theorem shows that if we model p(C, h|θ) jointly as a Gaussian with canonical parameters θ = (η, Λ), and use a Normal-Wishart prior for p(θ), then the decision rule α minimizing the Bayesian risk ends up being similar to Eq. 6, with two modifications. First, the unknown expectations are replaced with empirical expectations. Second, the empirical expectation of CᵀC is "regularized" by a term determined by the prior. For simplicity, the following result is stated assuming that the Normal-Wishart prior uses V0 equal to a constant times the identity. However, in the appendix we state (and prove) a more general result where V0 is arbitrary; this can also be implemented efficiently, although the result is more clumsy to state.
Theorem 4.1. 
If p(C, h|θ) is a Gaussian parameterized as

p(C, h | θ = (η, Λ)) = Gaussian( [vec(C), h] | µ = Λ⁻¹η, Σ = Λ⁻¹ ),

and the prior is a Normal-Wishart, parameterized as p(θ = (η, Λ)) ∝ exp( t0ᵀη − trace(V0ᵀΛ) − n0 A(η, Λ) ), then the decision rule that minimizes the Bayesian regret for V0 = v0 I is

α*(C1, h1, ..., CM, hM) = − ( (d v0 / M) I + ⟨CᵀC⟩ )⁻¹ ⟨Cᵀh⟩,   (7)

where h ∈ R^d, ⟨CᵀC⟩ = (1/M) Σ_{m=1}^{M} CmᵀCm, and ⟨Cᵀh⟩ = (1/M) Σ_{m=1}^{M} Cmᵀhm.

The proof idea is as follows: since the loss is the expected squared norm, the optimal decision rule can be reduced to a form similar to Eq. 6 but with the expectations replaced by posterior expectations conditioned on the observations C1, ..., CM and h1, ..., hM. For exponential families with conjugate priors (e.g. the Gaussian with a Normal-Wishart prior), the posterior expectation of sufficient statistics given observations has a simple closed-form solution [10]. The sufficient statistics for the Gaussian are the first and second joint moments of [vec(C), h], from which the expectations needed for the optimal decision rule can be extracted.
The rule in Eq. 7 is surprisingly simple: just compute the empirical averages and add a diagonal regularizer before solving the linear system. Using a large M provides better estimates for the expectations and thus reduces the amount of "regularization" applied, while using a small M provides worse estimates, which are regularized more heavily.

4.2 Empirical Averages

The probabilistic model described above does not explicitly mention the parameters w. One way to use it would be to apply it separately in each iteration. It is desirable, however, to exploit the fact that the parameters change slowly during learning. 
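A sketch of the decision rule of Thm. 4.1 (Eq. 7); variable names are illustrative, with each C_m of shape (D, L) and h_m of shape (D,).

```python
import numpy as np

def regularized_weights(C_batch, h_batch, v0=1e-3):
    """Bayesian-regularized combination rule of Thm. 4.1 (Eq. 7):
    empirical moments plus a diagonal regularizer (d * v0 / M) * I."""
    M = len(C_batch)
    d = len(h_batch[0])
    CtC = sum(C.T @ C for C in C_batch) / M
    Cth = sum(C.T @ h for C, h in zip(C_batch, h_batch)) / M
    L = CtC.shape[0]
    return -np.linalg.solve((d * v0 / M) * np.eye(L) + CtC, Cth)
```

A larger v0 (more concentrated prior) or a smaller M shrinks the weights toward zero; as M grows the regularizer vanishes and the rule approaches the plug-in estimate of Eq. 6.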
Algorithmically, the procedure above requires as input only empirical expectations for CᵀC and Cᵀh. Instead of using samples from a single step alone, we propose using an exponential average: at every step we compute a weighted average of the previous empirical expectation and the current one. This results in the update rule E_t = (1 − γ)E_{t−1} + γ Ê_t, with γ ∈ [0, 1], where E represents either CᵀC or Cᵀh, and Ê_t is the empirical average obtained using the samples drawn at step t. To combine this with the Bayesian regularization procedure, we use an "effective M", M_eff = B Σ_{t=1}^{T} (1 − γ)^t, which indicates how many samples are effectively included in the empirical averages, where B is the minibatch size. M_eff is used instead of M in Eq. 7. Technically, the regularization procedure assumes that the samples for the empirical expectations are independent of those actually used for the final gradient estimate ĝ. To reflect this, we compute α at step t using the empirical average from step t − 1, E_{t−1}.

5 Experiments and Results

We tried several control variates and the combination algorithm on a Bayesian binary logistic regression model with a standard Gaussian prior, using three well-known datasets: ionosphere, australian, and sonar. We use simple SGD with momentum (β = 0.9) as our optimization algorithm,

Figure 3: For each dataset, optimization results for different gradients with different learning rates. Legends indicate which control variates are used together with the base gradient. The right column shows results with the best learning rate retrospectively selected for each iteration. 
For clarity we limit the y-axis of the plots, which leaves some of the results (the worst ones) outside the plotted range.

minibatches of size 10, a decay factor of γ = 0.02 for the exponentially decayed empirical averages, and v0 = 10⁻³, a value chosen based on the sensitivity analysis carried out in Sec. 5.1. We chose a full covariance Gaussian as the variational distribution qw(z), parameterized using the mean and a Cholesky factorization of the covariance. Since both the prior and the variational distribution are Gaussian, the prior and variational terms can be computed in closed form.
As the base gradient we use what seems to be the most common estimator: reparameterization (RP1) to estimate the data term g1 (with the local reparameterization trick [12]) and the prior term g2, and a closed-form expression for the variational/entropy term g3. Here, RP1 is the reparameterization estimator using T(ε; w) = Cholesky(Σw) ε + µw, while RP2 uses T(ε; w) = √Σw ε + µw [14] with the matrix square root. 
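The two reparameterizations can be compared directly. In this sketch (values illustrative, not from the paper's experiments), both factors satisfy M Mᵀ = Σw, so both transforms produce exact N(µw, Σw) samples; the difference of gradient estimates built from them is therefore a valid control variate, in the spirit of c3 and c4 below.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# RP1 factor: lower-triangular Cholesky factor of Sigma.
M1 = np.linalg.cholesky(Sigma)
# RP2 factor: symmetric matrix square root, via the eigendecomposition.
lam, Q = np.linalg.eigh(Sigma)
M2 = Q @ np.diag(np.sqrt(lam)) @ Q.T

# Same eps, two transforms T(eps; w) = M eps + mu; both sample N(mu, Sigma).
eps = rng.standard_normal((100_000, 2))
z1 = eps @ M1.T + mu
z2 = eps @ M2.T + mu
```

Although z1 and z2 share the same distribution, for a fixed ε they are different points, so estimates built from them differ draw by draw; that draw-by-draw difference has expectation zero but nonzero variance, which is what makes it usable as a control variate.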
For CVs, we chose to use the following seven, which provide reasonable coverage of the different methods described in Section 3:

• c1: The difference between the RP1 and closed-form estimates of the variational term.
• c2: The difference between the RP1 and closed-form estimates of the prior term.
• c3: The difference between the RP1 and RP2 estimates of the prior term.
• c4: The difference between the RP1 and RP2 estimates of the data term.
• c5: Taylor expansion of the RP1 estimate of the data term, correcting for data subsampling [30].
• c6: Taylor expansion of the RP2 estimate of the data term, correcting for data subsampling [30].
• c7: Taylor expansion of the RP1 estimate of the data term, correcting for sampling from qw(z). This control variate is based on the work of Miller et al. [17], but adapted to a full covariance (rather than diagonal) Gaussian (see appendix).

We compare the optimization results obtained using the base gradient alone and the base gradient combined with different subsets of CVs, which were chosen following a simple approach: We tried 
each
CV in isolation, and chose the four worst performing ones as one subset, the five worst performing ones as another subset, and so on.

Figure 4: ELBO after 500 iterations for each gradient vs. learning rate (legends as in Fig. 3). 
The final subsets of CVs obtained this way are S4 = {c2, c1, c3, c4}, S5 = {c2, c1, c3, c4, c6}, S6 = {c2, c1, c3, c4, c6, c5}, and S7 = {c2, c1, c3, c4, c6, c5, c7}. We also show results for the two best control variates, c5 and c7, used in isolation. All the results shown in this section, figures and tables, were obtained by averaging the results from 50 runs.

Best learning rate. Table 1 shows the ELBO value achieved after 500 iterations, with the largest learning rate3 for which optimization converged with at least one estimator. It can be seen that increasing the number of CVs often leads to higher final values for the ELBO and that, in all cases, the higher ELBOs (better) were achieved by using all CVs together.

Table 1: Average ELBO achieved after 500 iterations for each dataset using the base gradient with different subsets of control variates and particular learning rates (lr).

Dataset (lr)      -       S4      S5      S6      S7      c5      c7
Ion. (0.4)     −157.3  −112.5   −85.3   −85.3   −72   −110.1   −75.6
Aus. (0.4)     −378.2  −357.2  −255.1  −255    −251.8  −259.4  −254.4
Sonar (0.2)    −442.4  −270.2  −149.1  −148.3  −117.1  −200.6  −120.2

Comparing across learning rates. Now we compare the performance achieved using each gradient estimator with different learning rates. To do so we present two sets of images. First, the two leftmost columns of Fig. 3 show, for each dataset, the ELBO vs. iterations for two different learning rates, while the third column shows, for each gradient estimator and iteration, the ELBO for the best learning rate (vs. iteration). As in Table 1, it can be seen that for a given learning rate (or when choosing the best at each iteration) the gradients that combine more control variates are better suited for optimization and display a strictly dominant performance.

Finally, Fig.
4 shows, for several gradients, the final ELBO (after 500 iterations) vs. the learning rate used, providing a systematic comparison of how the gradient estimates perform with different learning rates. Again, estimates employing more CVs display a dominant performance, with larger improvements at larger learning rates. Furthermore, the "best" learning rate increases with better estimators.

5.1 Sensitivity analysis

It is natural to ask how the variance of the gradient estimate is related to the choice of the prior parameter v0 and the minibatch size M. Recall from Thm. 4.1 that a larger value of v0 corresponds to a more concentrated prior, and is thus a more conservative choice – essentially it results in more "regularization" of the empirical moments. To answer this we carried out a simple experiment, where we fix w and estimate E||ĝ(w)||² with a variety of v0 and M. To choose w, we applied SGD with a low-variance gradient (computed with many samples), a learning rate of 0.08, and the same initialization as in the previous section, and selected the parameters found after 25 iterations. This is intended to be "typical", in that it is neither at the start nor the end of optimization.

Estimating ĝ(w) is a three-step process: (1) Use one set of evaluations of C and h to estimate E[CᵀC] and E[Cᵀh]. (2) Apply the prior to compute a from those estimates (Eq. 7). Recall from

3The loss is normalized by the number of samples in the dataset.
If it were not, the equivalent learning rates would be smaller.

Figure 5: Expected squared norm of the gradient estimate vs v0, for different minibatch sizes. The two leftmost images were obtained with estimates of E[CᵀC] and E[Cᵀh] using the current weights w, while for the image on the right moments were estimated using gradients from an older iteration. The sonar and australian datasets give results similar to those of ionosphere.

Thm. 4.1 that a larger v0 corresponds to a more concentrated prior, essentially "regularizing" more. (3) Use a second set of evaluations of C and h to compute ĝ(w), using weights a (Eq. 4 / 5).

In a first experiment, we tested exactly that procedure, drawing two independent evaluations of C and h using the current weights w. Results are shown in Figure 5. We found a small artificial dataset illustrative, with samples x ∈ R². For this "2D" dataset, with small minibatches, a fairly large value of v0 provided the best results. However, with ionosphere, even a very small v0 tended to perform well.

For efficiency, our logistic regression experiments used exponential averaging over previous iterations to estimate E[CᵀC] and E[Cᵀh], rather than drawing two evaluations at each iteration. So, even with large values of M these are not fully reliable. To roughly simulate this, we performed a second "lagged" experiment estimating E[CᵀC] and E[Cᵀh] from evaluations of C and h at the weights from 10 iterations earlier during SGD. (This was chosen considering the "average age" of gradients when using exponential averaging, and that 0.08 is a relatively small learning rate.) The results of this are shown on the right of Fig. 5. Lagged evaluations result in stochastic gradients with more variance, with a different dependence on v0.
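The three-step estimation procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the array shapes, function names, and the assumption that the prior of Eq. 7 acts as a ridge-style regularizer of strength v0 on the empirical moments are ours.

```python
import numpy as np

def cv_weights(C, h, v0):
    """Steps (1)-(2): estimate moments from one set of evaluations and
    apply the prior (assumed here to act as ridge regularization).

    C: (M, d, k) evaluations of k control variates in dimension d.
    h: (M, d) evaluations of the base gradient estimator.
    v0: prior concentration; larger v0 regularizes the moments more.
    """
    A = np.einsum('ndi,ndj->ij', C, C) / len(C)  # estimate of E[C^T C]
    b = np.einsum('ndi,nd->i', C, h) / len(C)    # estimate of E[C^T h]
    return np.linalg.solve(A + v0 * np.eye(A.shape[0]), b)

def cv_gradient(C, h, a):
    """Step (3): low-variance gradient estimate from a second, independent
    set of evaluations, subtracting the weighted control variates."""
    return np.mean(h - np.einsum('ndi,i->nd', C, a), axis=0)
```

With v0 = 0 this reduces to the plug-in optimal combination weights, while larger v0 shrinks the weights toward zero, matching the "more regularization" behaviour discussed above; reusing stale moment estimates (as the exponential-averaging variant does) would correspond to computing `cv_weights` from older evaluations of C and h.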
(Note, however, that the gradient remains unbiased, lagging is cheaper, and that all estimators have a variance decreasing with M.)
We emphasize that several somewhat arbitrary decisions were made for these experiments, such as the learning rate, the choice of iteration, and the amount of "lag". However, we believe that the results illustrate an important phenomenon related to the use of regularization: when using past gradient information (as exponential averaging does), larger values of v0 are beneficial and result in gradients with lower variance. While intuitively plausible, note that this benefit of regularization for countering errors introduced by the use of old gradients is not really captured by our theoretical analysis in Section 4, which is entirely based on "single-iteration" reasoning.

6 Conclusion

This work focuses on how to obtain low-variance gradients given a fixed set of control variates. We first present a unified view that attempts to explain how most control variates used for variational inference are derived, which sheds light on the large number of CVs available. We then propose a combination algorithm to use multiple control variates in concert. We show experimentally that, given a set of control variates, the combination algorithm provides a simple and effective combination rule that leads to gradients with less variance than those obtained using a reduced number of CVs (or no CVs at all). The algorithm assumes that a fixed set of control variates to be used is given, and minimizes the final gradient's variance using them, without analyzing how favorable using all the CVs actually is. A "smarter" algorithm could, for instance, decide whether to use all the CVs given or just a subset.
We leave the development of such an algorithm for future work.

References

[1] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Information Theory, 58(5):3235–3249, 2012.

[2] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[4] Edward Challis and David Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 199–207, 2011.

[5] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.

[6] Thomas Furmston and David Barber.
Variational methods for reinforcement learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 241–248, 2010.

[7] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.

[8] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[9] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[10] Michael I. Jordan. The exponential family: Conjugate priors. https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf.

[11] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[12] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[14] Steven Cheng-Xian Li and Benjamin M. Marlin. A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification. In Advances in Neural Information Processing Systems, pages 1804–1812, 2016.

[15] Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

[16] Chris J Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[17] Andrew C Miller, Nicholas J Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. arXiv preprint arXiv:1705.07880, 2017.

[18] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[19] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.

[20] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.

[21] Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771, 2015.

[22] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[23] Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.

[24] Francisco Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.

[25] Francisco JR Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. arXiv preprint arXiv:1603.01140, 2016.

[26] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.

[27] Michalis Titsias and Miguel Lázaro-Gredilla.
Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.

[28] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.

[29] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

[30] Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.

[31] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[32] John Winn and Christopher M Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694, 2005.

[33] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.