{"title": "Stein Variational Gradient Descent as Gradient Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 3115, "page_last": 3123, "abstract": "Stein variational gradient descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate given distributions, based on a gradient-based update constructed to optimally decrease the KL divergence within a function space. This paper develops the first theoretical analysis on SVGD. We establish that the empirical measures of the SVGD samples weakly converge to the target distribution, and show that the asymptotic behavior of SVGD is characterized by a nonlinear Fokker-Planck equation known as Vlasov equation in physics. We develop a geometric perspective that views SVGD as a gradient flow of the KL divergence functional under a new metric structure on the space of distributions induced by Stein operator.", "full_text": "Stein Variational Gradient Descent as Gradient Flow

Qiang Liu
Department of Computer Science
Dartmouth College
Hanover, NH 03755
qiang.liu@dartmouth.edu

Abstract

Stein variational gradient descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate given distributions, based on a gradient-based update that is guaranteed to optimally decrease the KL divergence within a function space. This paper develops the first theoretical analysis of SVGD. We establish that the empirical measures of the SVGD samples weakly converge to the target distribution, and show that the asymptotic behavior of SVGD is characterized by a nonlinear Fokker-Planck equation known in physics as the Vlasov equation.
We develop a geometric perspective that views SVGD as a gradient flow of the KL divergence functional under a new metric structure on the space of distributions induced by the Stein operator.

1 Introduction

Stein variational gradient descent (SVGD) [1] is a particle-based algorithm for approximating complex distributions. Unlike typical Monte Carlo algorithms that rely on randomness for approximation, SVGD constructs a set of points (or particles) by iteratively applying deterministic updates that are constructed to optimally decrease the KL divergence to the target distribution at each iteration. SVGD has a simple form that efficiently leverages the gradient information of the distribution, and can be readily applied to complex models with massive datasets for which typical gradient descent has been found efficient. A nice property of SVGD is that it strictly reduces to typical gradient ascent for maximum a posteriori (MAP) estimation when using only a single particle ($n = 1$), while it turns into a full sampling method with more particles. Because MAP often provides reasonably good results in practice, SVGD is found to be more particle-efficient than typical Monte Carlo methods, which require much larger numbers of particles to achieve good results.

SVGD can be viewed as a variational inference algorithm [e.g., 2], but it differs significantly from typical parametric variational inference algorithms, which use parametric families to approximate given distributions and have the disadvantages of introducing deterministic bias and (often) requiring non-convex optimization. The non-parametric nature of SVGD allows it to provide consistent estimation for generic distributions, as Monte Carlo does.
There are also particle algorithms based on optimization, or variational principles, with theoretical guarantees [e.g., 3-5], but they often do not use the gradient information effectively and do not scale well in high dimensions.

However, SVGD is difficult to analyze theoretically because it involves a system of particles that interact with each other in a complex way. In this work, we take an initial step towards analyzing SVGD. We characterize the SVGD dynamics using an evolutionary process of the empirical measures of the particles that is known as a Vlasov process in physics, and establish that the empirical measures of the particles weakly converge to the given target distribution. We develop a geometric interpretation of SVGD that views SVGD as a gradient flow of the KL divergence, defined on a new Riemannian-like metric structure imposed on the space of density functions.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Stein Variational Gradient Descent (SVGD)

We start with a brief overview of SVGD [1]. Let $\nu_p$ be a probability measure of interest with a positive, (weakly) differentiable density $p(x)$ on an open set $X \subseteq \mathbb{R}^d$.
We want to approximate $\nu_p$ with a set of particles $\{x^i\}_{i=1}^n$ whose empirical measure $\hat\mu_n(dx) = \sum_{i=1}^n \delta(x - x^i)/n \, dx$ weakly converges to $\nu_p$ as $n \to \infty$ (denoted by $\hat\mu_n \Rightarrow \nu_p$), in the sense that we have $\mathbb{E}_{\hat\mu_n}[h] \to \mathbb{E}_{\nu_p}[h]$ as $n \to \infty$ for all bounded, continuous test functions $h$.

To achieve this, we initialize the particles with some simple distribution $\mu$, and update them via the map

$T(x) = x + \epsilon \phi(x),$

where $\epsilon$ is a small step size, and $\phi(x)$ is a perturbation direction, or velocity field, which should be chosen to maximally decrease the KL divergence between the particle distribution and the target distribution; this is framed by [1] as solving the following functional optimization:

$\max_{\phi \in \mathcal{H}} \Big\{ -\tfrac{d}{d\epsilon} \mathrm{KL}(T\mu \,\|\, \nu_p)\big|_{\epsilon=0} \quad \text{s.t.} \quad \|\phi\|_{\mathcal{H}} \le 1 \Big\}, \qquad (1)$

where $\mu$ denotes the (empirical) measure of the current particles, $T\mu$ is the measure of the updated particles $x' = T(x)$ with $x \sim \mu$, i.e., the pushforward measure of $\mu$ through the map $T$, and $\mathcal{H}$ is a normed function space chosen to optimize over.

A key observation is that the objective in (1) is a linear functional of $\phi$, which draws connections to ideas in Stein's method [6], used for proving limit theorems and probabilistic bounds in theoretical statistics.
Liu and Wang [1] showed that

$-\tfrac{d}{d\epsilon} \mathrm{KL}(T\mu \,\|\, \nu_p)\big|_{\epsilon=0} = \mathbb{E}_\mu[\mathcal{S}_p\phi], \quad \text{with} \quad \mathcal{S}_p\phi(x) := \nabla \log p(x)^\top \phi(x) + \nabla \cdot \phi(x), \qquad (2)$

where $\nabla\cdot\phi := \sum_{k=1}^d \partial_{x_k}\phi_k(x)$, and $\mathcal{S}_p$ is a linear operator that maps a vector-valued function $\phi$ to a scalar-valued function $\mathcal{S}_p\phi$; $\mathcal{S}_p$ is called the Stein operator, in connection with the so-called Stein's identity, which shows that the RHS of (2) equals zero if $\mu = \nu_p$:

$\mathbb{E}_p[\mathcal{S}_p\phi] = \mathbb{E}_p[\nabla \log p^\top \phi + \nabla\cdot\phi] = \int \nabla\cdot(p\phi)\, dx = 0; \qquad (3)$

this is the result of integration by parts, assuming proper zero boundary conditions. Therefore, the optimization (1) reduces to

$\mathbb{D}(\mu \,\|\, \nu_p) := \max_{\phi\in\mathcal{H}} \big\{ \mathbb{E}_\mu[\mathcal{S}_p\phi] \quad \text{s.t.} \quad \|\phi\|_{\mathcal{H}} \le 1 \big\}, \qquad (4)$

where $\mathbb{D}(\mu\,\|\,\nu_p)$ is called the Stein discrepancy; it provides a discrepancy measure between $\mu$ and $\nu_p$, since $\mathbb{D}(\mu\,\|\,\nu_p) = 0$ if $\mu = \nu_p$ and $\mathbb{D}(\mu\,\|\,\nu_p) > 0$ if $\mu \neq \nu_p$, provided $\mathcal{H}$ is sufficiently large. Because (4) is an infinite-dimensional functional optimization, it is critical to select a space $\mathcal{H}$ that is both sufficiently rich and ensures computational tractability in practice. Kernelized Stein discrepancy (KSD) provides one way to achieve this by taking $\mathcal{H}$ to be a reproducing kernel Hilbert space (RKHS), for which the optimization yields a closed-form solution [7-10].

To be specific, let $\mathcal{H}_0$ be an RKHS of scalar-valued functions with a positive definite kernel $k(x,x')$, and $\mathcal{H} = \mathcal{H}_0 \times \cdots \times \mathcal{H}_0$ the corresponding $d \times 1$ vector-valued RKHS.
Then it can be shown that the optimal solution of (4) is

$\phi^*_{\mu,p}(\cdot) \propto \mathbb{E}_{x\sim\mu}[\mathcal{S}_p \otimes k(x,\cdot)], \quad \text{with} \quad \mathcal{S}_p \otimes k(x,\cdot) := \nabla \log p(x) k(x,\cdot) + \nabla_x k(x,\cdot), \qquad (5)$

where $\mathcal{S}_p\otimes$ is an outer-product variant of the Stein operator which maps a scalar-valued function to a vector-valued one. Further, it has been shown in [e.g., 7] that

$\mathbb{D}(\mu\,\|\,\nu_p) = \|\phi^*_{\mu,p}\|_{\mathcal{H}} = \sqrt{\mathbb{E}_{x,x'\sim\mu}[\kappa_p(x,x')]}, \quad \text{with} \quad \kappa_p(x,x') := \mathcal{S}^x_p \mathcal{S}^{x'}_p \otimes k(x,x'), \qquad (6)$

where $\kappa_p(x,x')$ is a "Steinalized" positive definite kernel obtained by applying the Stein operator twice; $\mathcal{S}^x_p$ and $\mathcal{S}^{x'}_p$ are the Stein operators w.r.t. variables $x$ and $x'$, respectively. The key advantage of KSD is its computational tractability: it can be empirically evaluated with samples drawn from $\mu$ and the gradient $\nabla \log p$, which is independent of the normalization constant of $p$ [see 7, 8].

Algorithm 1 Stein Variational Gradient Descent [1]
Input: the score function $\nabla_x \log p(x)$.
Goal: a set of particles $\{x^i\}_{i=1}^n$ that approximates $p(x)$.
Initialize a set of particles $\{x^i_0\}_{i=1}^n$; pick a positive definite kernel $k(x,x')$ and step sizes $\{\epsilon_\ell\}$.
For iteration $\ell$ do:
$\qquad x^i_{\ell+1} \leftarrow x^i_\ell + \epsilon_\ell\, \phi^*_{\hat\mu^n_\ell,p}(x^i_\ell), \quad \forall i = 1,\ldots,n,$ where
$\qquad \phi^*_{\hat\mu^n_\ell,p}(x) = \tfrac{1}{n}\sum_{j=1}^n \big[\nabla \log p(x^j_\ell)\, k(x^j_\ell, x) + \nabla_{x^j_\ell} k(x^j_\ell, x)\big]. \qquad (8)$

An important theoretical issue related to KSD is to characterize when $\mathcal{H}$ is rich enough to ensure $\mathbb{D}(\mu\,\|\,\nu_p) = 0$ iff $\mu = \nu_p$; this has been studied by Liu et al. [7], Chwialkowski et al. [8], and Oates et al. [11].
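To make the update (8) concrete, the following is a minimal NumPy sketch of one SVGD iteration with an RBF kernel; the function names (`rbf_kernel`, `svgd_step`), the fixed bandwidth, and the Gaussian example below are our illustrative choices, not part of [1]:

```python
import numpy as np

def rbf_kernel(X, h):
    # Pairwise RBF kernel K[j, i] = k(x_j, x_i) = exp(-||x_j - x_i||^2 / (2 h^2))
    # together with its gradient w.r.t. the first argument x_j.
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d); diff[j, i] = x_j - x_i
    sq = np.sum(diff ** 2, axis=-1)               # (n, n)
    K = np.exp(-sq / (2 * h ** 2))
    gradK = -diff / h ** 2 * K[:, :, None]        # gradK[j, i] = grad_{x_j} k(x_j, x_i)
    return K, gradK

def svgd_step(X, score, h, eps):
    # One iteration of update (8):
    # phi(x_i) = (1/n) sum_j [ score(x_j) k(x_j, x_i) + grad_{x_j} k(x_j, x_i) ]
    n = X.shape[0]
    K, gradK = rbf_kernel(X, h)
    S = score(X)                                  # (n, d); row j is grad log p(x_j)
    phi = (K.T @ S + gradK.sum(axis=0)) / n       # (n, d)
    return X + eps * phi
```

For instance, with the score of $\mathcal{N}(2,1)$, `score = lambda Y: -(Y - 2.0)`, repeatedly applying `svgd_step` transports an arbitrary initial particle set toward the target, with the `gradK` term supplying the repulsive force that keeps the particles spread out.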
More recently, Gorham and Mackey [10] (Theorem 8) established a stronger result that Stein discrepancy implies weak convergence on $X = \mathbb{R}^d$: let $\{\mu_\ell\}_{\ell=1}^\infty$ be a sequence of probability measures; then

$\mathbb{D}(\mu_\ell \,\|\, \nu_p) \to 0 \iff \mu_\ell \Rightarrow \nu_p \quad \text{as } \ell \to \infty, \qquad (7)$

for $\nu_p$ that are distantly dissipative (Definition 4 of Gorham and Mackey [10]) and a class of inverse multiquadric kernels. Since the focus of this work is on SVGD, we will assume (7) holds without further examination.

In the SVGD algorithm, we iteratively update a set of particles using the optimal transform just derived, starting from a certain initialization. Let $\{x^i_\ell\}_{i=1}^n$ be the particles at the $\ell$-th iteration. In this case, the exact distributions of $\{x^i_\ell\}_{i=1}^n$ are unknown or difficult to keep track of, but can be best approximated by their empirical measure $\hat\mu^n_\ell(dx) = \sum_i \delta(x - x^i_\ell)\,dx/n$. Therefore, it is natural to think that $\phi^*_{\hat\mu^n_\ell,p}$, with $\mu$ in (5) replaced by $\hat\mu^n_\ell$, provides the best update direction for moving the particles (and equivalently $\hat\mu^n_\ell$) "closer to" $\nu_p$. Implementing this update (8) iteratively, we get the main SVGD procedure in Algorithm 1.

Intuitively, the update in (8) pushes the particles towards the high-probability regions of the target via the gradient term $\nabla \log p$, while maintaining a degree of diversity via the second term $\nabla k(x, x^i)$.
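The closed form (6) also makes the Stein discrepancy directly computable from samples. The sketch below (the function name and the V-statistic averaging over all pairs, diagonal included, are our illustrative choices) evaluates $\mathbb{D}(\mu\,\|\,\nu_p)^2$ for the RBF kernel $k(x,x') = \exp(-\|x-x'\|^2/(2h^2))$, for which the four terms of $\kappa_p$ have explicit expressions:

```python
import numpy as np

def ksd_squared(X, score, h):
    # V-statistic estimate of D(mu || nu_p)^2 = E_{x,x'~mu}[kappa_p(x, x')],
    # where (writing s = grad log p):
    # kappa_p(x, x') = s(x)^T s(x') k + s(x)^T grad_{x'} k
    #                + s(x')^T grad_x k + trace(grad_x grad_{x'} k).
    n, d = X.shape
    S = score(X)                                   # (n, d); rows s(x_i)
    diff = X[:, None, :] - X[None, :, :]           # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2))
    t1 = (S @ S.T) * K
    t2 = np.einsum('id,ijd->ij', S, diff) / h ** 2 * K    # s(x_i)^T (x_i - x_j)/h^2 k
    t3 = -np.einsum('jd,ijd->ij', S, diff) / h ** 2 * K   # s(x_j)^T (x_j - x_i)/h^2 k
    t4 = (d / h ** 2 - sq / h ** 4) * K                   # trace term for the RBF kernel
    return float(np.mean(t1 + t2 + t3 + t4))
```

On exact samples from $\nu_p$ the estimate is close to zero (up to the $O(1/n)$ bias of the V-statistic), while samples from a shifted distribution yield a clearly positive value, matching the role of $\mathbb{D}$ as a discrepancy measure.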
In addition, (8) reduces to typical gradient descent for maximizing $\log p$ if we use only a single particle ($n = 1$) and the kernel satisfies $\nabla k(x, x') = 0$ for $x = x'$; this allows SVGD to provide a spectrum of approximations that smoothly interpolates between maximum a posteriori (MAP) optimization and a full sampling approximation as the number of particles varies, enabling an efficient trade-off between accuracy and computational cost.

Despite the similarity to gradient descent, we should point out that the SVGD update in (8) does not correspond to minimizing any objective function $F(\{x^i_\ell\})$ of the particle locations $\{x^i_\ell\}$: if it did, the mixed derivatives would have to commute, whereas here one finds $\partial_{x^i}\partial_{x^j}F \neq \partial_{x^j}\partial_{x^i}F$. Instead, it is best to view SVGD as a type of (particle-based) numerical approximation of an evolutionary partial differential equation (PDE) of densities or measures, which corresponds to a special type of gradient flow of the KL divergence functional whose equilibrium state equals the given target distribution $\nu_p$, as we discuss in the sequel.

3 Density Evolution of SVGD Dynamics

This section collects our main results. We characterize the evolutionary process of the empirical measures $\hat\mu^n_\ell$ of the SVGD particles and their large-sample limit as $n \to \infty$ (Section 3.1) and large-time limit as $\ell \to \infty$ (Section 3.2), which together establish the weak convergence of $\hat\mu^n_\ell$ to the target measure $\nu_p$. Further, we show that the large-sample limit of the SVGD dynamics is characterized by a Vlasov process, which monotonically decreases the KL divergence to the target distribution with a decreasing rate that equals the square of the Stein discrepancy (Sections 3.2-3.3).
We also establish a geometric intuition that interprets SVGD as a gradient flow of the KL divergence under a new Riemannian metric structure induced by the Stein operator (Section 3.4). Section 3.5 provides a brief discussion of the connection to Langevin dynamics.

3.1 Large Sample Asymptotics of SVGD

Consider the optimal transform $T_{\mu,p}(x) = x + \epsilon \phi^*_{\mu,p}(x)$ with $\phi^*_{\mu,p}$ defined in (5). We define its related map $\Phi_p : \mu \mapsto T_{\mu,p}\mu$, where $T_{\mu,p}\mu$ denotes the pushforward measure of $\mu$ through the transform $T_{\mu,p}$. This map fully characterizes the SVGD dynamics in the sense that the empirical measure $\hat\mu^n_\ell$ can be obtained by recursively applying $\Phi_p$ starting from the initial measure $\hat\mu^n_0$:

$\hat\mu^n_{\ell+1} = \Phi_p(\hat\mu^n_\ell), \quad \forall \ell \in \mathbb{N}. \qquad (9)$

Note that $\Phi_p$ is a nonlinear map because the transform $T_{\mu,p}$ depends on the input measure $\mu$. If $\mu$ has a density $q$ and $\epsilon$ is small enough that $T_{\mu,p}$ is invertible, the density $q'$ of $\mu' = \Phi_p(\mu)$ is given by the change-of-variables formula:

$q'(z) = q(T^{-1}_{\mu,p}(z)) \cdot |\det(\nabla T^{-1}_{\mu,p}(z))|. \qquad (10)$

When $\mu$ is an empirical measure and $q$ is a Dirac delta function, this equation still holds formally in the sense of distributions (generalized functions).

Critically, $\Phi_p$ also fully characterizes the large-sample limit of SVGD. Assume the initial empirical measure $\hat\mu^n_0$ at the $0$-th iteration weakly converges to a measure $\mu^\infty_0$ as $n \to \infty$, which can be achieved, for example, by drawing $\{x^i_0\}$ i.i.d. from $\mu^\infty_0$, or by using MCMC or quasi-Monte Carlo methods.
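As a one-dimensional sanity check of the change-of-variables formula (10), the sketch below uses the illustrative choice $\phi(x) = -x$, for which $T(x) = (1-\epsilon)x$ is linear and invertible, so the pushforward of a standard normal through $T$ is again normal and (10) can be compared against the known answer:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2).
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Transform T(x) = x + eps * phi(x) with phi(x) = -x, i.e. T(x) = (1 - eps) x.
eps = 0.1
T_inv = lambda z: z / (1 - eps)       # inverse map
dT_inv = 1.0 / (1 - eps)              # Jacobian of the inverse map (scalar here)

# Pushforward of q = N(0, 1) through T via formula (10):
z = np.linspace(-3, 3, 7)
q_prime = normal_pdf(T_inv(z), 0.0, 1.0) * abs(dT_inv)

# Since T rescales a standard normal by (1 - eps), the pushforward is N(0, (1 - eps)^2),
# so q_prime should coincide with that density.
expected = normal_pdf(z, 0.0, 1.0 - eps)
```

This is only a sketch under the stated linear choice of $\phi$; for the SVGD transform $\phi^*_{\mu,p}$ the same formula applies but $T^{-1}_{\mu,p}$ is generally not available in closed form.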
Starting from the limit initial measure $\mu^\infty_0$ and applying $\Phi_p$ recursively, we get

$\mu^\infty_{\ell+1} = \Phi_p(\mu^\infty_\ell), \quad \forall \ell \in \mathbb{N}. \qquad (11)$

Assuming $\hat\mu^n_0 \Rightarrow \mu^\infty_0$ by initialization, we may expect that $\hat\mu^n_\ell \Rightarrow \mu^\infty_\ell$ for all finite iterations $\ell$ if $\Phi_p$ satisfies a certain Lipschitz condition. This is naturally captured by the bounded Lipschitz (BL) metric. For two measures $\mu$ and $\nu$, their BL metric is defined to be their difference of means on the set of bounded, Lipschitz test functions:

$\mathrm{BL}(\mu, \nu) = \sup_f \big\{ \mathbb{E}_\mu f - \mathbb{E}_\nu f \quad \text{s.t.} \quad \|f\|_{\mathrm{BL}} \le 1 \big\}, \quad \text{where} \quad \|f\|_{\mathrm{BL}} = \max\{\|f\|_\infty, \|f\|_{\mathrm{Lip}}\},$

with $\|f\|_\infty = \sup_x |f(x)|$ and $\|f\|_{\mathrm{Lip}} = \sup_{x\neq y} \frac{|f(x)-f(y)|}{\|x-y\|_2}$. For a vector-valued bounded Lipschitz function $f = [f_1,\ldots,f_d]^\top$, we define its norm by $\|f\|^2_{\mathrm{BL}} = \sum_{i=1}^d \|f_i\|^2_{\mathrm{BL}}$. It is known that the BL metric metricizes weak convergence, that is, $\mathrm{BL}(\mu_n, \nu) \to 0$ if and only if $\mu_n \Rightarrow \nu$.

Lemma 3.1. Assume $g(x,y) := \mathcal{S}^x_p \otimes k(x,y)$ is bounded Lipschitz jointly in $(x,y)$ with norm $\|g\|_{\mathrm{BL}} < \infty$. Then for any two probability measures $\mu$ and $\mu'$, we have

$\mathrm{BL}(\Phi_p(\mu), \Phi_p(\mu')) \le (1 + 2\epsilon\|g\|_{\mathrm{BL}})\, \mathrm{BL}(\mu, \mu').$

Theorem 3.2. Let $\hat\mu^n_\ell$ be the empirical measure of $\{x^i_\ell\}_{i=1}^n$ at the $\ell$-th iteration of SVGD. Assuming $\lim_{n\to\infty} \mathrm{BL}(\hat\mu^n_0, \mu^\infty_0) = 0$, then for $\mu^\infty_\ell$ defined in (11), at any finite iteration $\ell$, we have

$\lim_{n\to\infty} \mathrm{BL}(\hat\mu^n_\ell, \mu^\infty_\ell) = 0.$

Proof.
It is a direct result of Lemma 3.1.

Since $\mathrm{BL}(\mu,\nu)$ metricizes weak convergence, our result shows that $\hat\mu^n_\ell \Rightarrow \mu^\infty_\ell$ for all $\ell$ if $\hat\mu^n_0 \Rightarrow \mu^\infty_0$ by initialization. The bound on the BL metric in Lemma 3.1 increases by a factor of $(1 + 2\epsilon\|g\|_{\mathrm{BL}})$ at each iteration; we can prevent the explosion of the BL bound by decaying the step size sufficiently fast. It may be possible to obtain tighter bounds; however, it is fundamentally impossible to get a factor smaller than one without further assumptions: suppose we could get $\mathrm{BL}(\Phi_p(\mu), \Phi_p(\mu')) \le \alpha\, \mathrm{BL}(\mu, \mu')$ for some constant $\alpha \in [0,1)$; then, starting from any initial $\hat\mu^n_0$ with any fixed particle size $n$ (e.g., $n = 1$), we would have $\mathrm{BL}(\hat\mu^n_\ell, \nu_p) = O(\alpha^\ell) \to 0$ as $\ell \to \infty$, which is impossible because we cannot get an arbitrarily accurate approximation of $\nu_p$ with finite $n$. It turns out that we need to look at the KL divergence in order to establish convergence towards $\nu_p$ as $\ell \to \infty$, as we discuss in Sections 3.2-3.3.

Remark. Note that $g(x,y) = \nabla_x \log p(x) k(x,y) + \nabla_x k(x,y)$, and $\nabla_x \log p(x)$ is often unbounded when the domain $X$ is unbounded. Therefore, the condition in Lemma 3.1 that $g(x,y)$ be bounded suggests that the lemma can only be used when $X$ is compact.
It is an open question to establish results that work for more general domains $X$.

3.2 Large Time Asymptotics of SVGD

Theorem 3.2 ensures that we only need to consider the update (11) starting from the limit initial measure $\mu^\infty_0$, which we can assume to have a nice density function and finite KL divergence with the target $\nu_p$. We show that update (11) monotonically decreases the KL divergence between $\mu^\infty_\ell$ and $\nu_p$, and hence allows us to establish the convergence $\mu^\infty_\ell \Rightarrow \nu_p$.

Theorem 3.3. 1. Assuming $p$ is a density that satisfies Stein's identity (3) for all $\phi \in \mathcal{H}$, the measure $\nu_p$ of $p$ is a fixed point of the map $\Phi_p$ in (11).

2. Assume $R := \sup_x \{\tfrac{1}{2}\|\nabla \log p\|_{\mathrm{Lip}}\, k(x,x) + 2\nabla_{xx'}k(x,x)\} < \infty$, where $\nabla_{xx'}k(x,x) = \sum_i \partial_{x_i}\partial_{x'_i} k(x,x')\big|_{x'=x}$, and assume the step size $\epsilon_\ell$ at the $\ell$-th iteration is no larger than $\epsilon^*_\ell := \big(2\sup_x \rho(\nabla\phi^*_{\mu_\ell,p} + \nabla\phi^{*\top}_{\mu_\ell,p})\big)^{-1}$, where $\rho(A)$ denotes the spectral norm of a matrix $A$. If $\mathrm{KL}(\mu^\infty_0 \,\|\, \nu_p) < \infty$ by initialization, then

$\tfrac{1}{\epsilon_\ell}\big[\mathrm{KL}(\mu^\infty_{\ell+1} \,\|\, \nu_p) - \mathrm{KL}(\mu^\infty_\ell \,\|\, \nu_p)\big] \le -(1 - \epsilon_\ell R)\, \mathbb{D}(\mu^\infty_\ell \,\|\, \nu_p)^2, \qquad (12)$

that is, the population SVGD dynamics always decreases the KL divergence when using sufficiently small step sizes, with a decreasing rate upper bounded by the squared Stein discrepancy. Further, if we set the step size $\epsilon_\ell \propto \mathbb{D}(\mu^\infty_\ell \,\|\, \nu_p)^\beta$ for any $\beta > 0$, then (12) implies that $\mathbb{D}(\mu^\infty_\ell \,\|\, \nu_p) \to 0$ as $\ell \to \infty$.

Remark. Assuming $\mathbb{D}(\mu^\infty_\ell \,\|\, \nu_p) \to 0$ implies $\mu^\infty_\ell \Rightarrow \nu_p$ (see (7)), Theorem 3.3(2) implies
$\mu^\infty_\ell \Rightarrow \nu_p$. Further, together with Theorem 3.2, we can establish the weak convergence of the empirical measures of the SVGD particles: $\hat\mu^n_\ell \Rightarrow \nu_p$ as $\ell \to \infty$, $n \to \infty$.

Remark. Theorem 3.3 cannot be applied directly to the empirical measures $\hat\mu^n_\ell$ with finite sample size $n$, since that would give $\mathrm{KL}(\hat\mu^n_\ell \,\|\, \nu_p) = \infty$ from the beginning. It is necessary to use the BL metric and the KL divergence to establish convergence w.r.t. the sample size $n$ and the iteration $\ell$, respectively.

Remark. The requirement $\epsilon_\ell \le \epsilon^*_\ell$ is needed to guarantee that the transform $T_{\mu_\ell,p}(x) = x + \epsilon\, \phi^*_{\mu_\ell,p}(x)$ has a non-singular Jacobian matrix everywhere.
From the bound in Equation A.6 of the Appendix, we can derive an upper bound on the spectral radius:

$\sup_x \rho(\nabla\phi^*_{\mu_\ell,p} + \nabla\phi^{*\top}_{\mu_\ell,p}) \le 2\sup_x \|\nabla\phi^*_{\mu_\ell,p}\|_F \le 2\sup_x \sqrt{\nabla_{xx'}k(x,x)}\; \mathbb{D}(\mu_\ell \,\|\, \nu_p).$

This suggests that the step size should be upper bounded by the inverse of the Stein discrepancy, i.e., $\epsilon^*_\ell \propto \mathbb{D}(\mu_\ell \,\|\, \nu_p)^{-1} = \|\phi^*_{\mu_\ell,p}\|^{-1}_{\mathcal{H}}$, where $\mathbb{D}(\mu_\ell \,\|\, \nu_p)$ can be estimated using (6) (see [7]).

3.3 Continuous Time Limit and Vlasov Process

Many properties can be understood more easily in the continuous-time limit ($\epsilon \to 0$), which reduces our system to a partial differential equation (PDE) on the particle densities (or measures), under which the negative time derivative of the KL divergence exactly equals the squared Stein discrepancy (the limit of (12) as $\epsilon \to 0$).

To be specific, we define a continuous time $t = \epsilon\ell$ and take an infinitesimal step size $\epsilon \to 0$; the evolution of the density $q$ in (10) then formally reduces to the following nonlinear Fokker-Planck equation (see Appendix A.3 for the derivation):

$\frac{\partial}{\partial t} q_t(x) = -\nabla\cdot\big(\phi^*_{q_t,p}(x)\, q_t(x)\big). \qquad (13)$

This PDE is a type of deterministic Fokker-Planck equation that characterizes the movement of particles under deterministic forces, but it is nonlinear in that the velocity field $\phi^*_{q_t,p}(x) = \mathbb{E}_{x'\sim q_t}[\mathcal{S}^{x'}_p \otimes k(x,x')]$ depends on the current particle density $q_t$ through the drift term.

It is not surprising to establish the following continuous version of Theorem 3.3(2), which is of central importance to our gradient flow perspective in Section 3.4:

Theorem 3.4.
Assume $\{\mu_t\}$ are the probability measures whose densities $\{q_t\}$ satisfy the PDE in (13), and $\mathrm{KL}(\mu_0 \,\|\, \nu_p) < \infty$. Then

$\frac{d}{dt}\mathrm{KL}(\mu_t \,\|\, \nu_p) = -\mathbb{D}(\mu_t \,\|\, \nu_p)^2. \qquad (14)$

Remark. This result suggests a path integration formula, $\mathrm{KL}(\mu_0 \,\|\, \nu_p) = \int_0^\infty \mathbb{D}(\mu_t \,\|\, \nu_p)^2\, dt$, which can potentially be useful for estimating KL divergences or normalization constants.

PDE (13) only works for differentiable densities $q_t$. Similar to the case of $\Phi_p$ as a map between (empirical) measures, one can extend (13) to a measure-valued PDE that admits empirical measures as weak solutions. Take a differentiable test function $h$ and integrate both sides of (13):

$\int h(x)\, \frac{\partial}{\partial t} q_t(x)\, dx = -\int h(x)\, \nabla\cdot\big(\phi^*_{q_t,p}(x)\, q_t(x)\big)\, dx.$

Using integration by parts on the right-hand side to "shift" the derivative operator from $\phi^*_{q_t,p} q_t$ to $h$, we get

$\frac{d}{dt}\mathbb{E}_{\mu_t}[h] = \mathbb{E}_{\mu_t}[\nabla h^\top \phi^*_{\mu_t,p}], \qquad (15)$

which depends on $\mu_t$ only through the expectation operator and hence makes sense for empirical measures as well. A set of measures $\{\mu_t\}$ is called a weak solution of (13) if it satisfies (15).

Using results on Fokker-Planck equations, the measure process (13)-(15) can be translated to an ordinary differential equation on random particles $\{x_t\}$ whose distribution is $\mu_t$:

$dx_t = \phi^*_{\mu_t,p}(x_t)\, dt, \quad \text{where } \mu_t \text{ is the distribution of the random variable } x_t, \qquad (16)$

initialized from a random variable $x_0$ with distribution $\mu_0$.
Here the nonlinearity is reflected in the fact that the velocity field depends on the distribution $\mu_t$ of the particle at the current time. In particular, if we initialize (15) with the empirical measure $\hat\mu^n_0$ of a set of finite particles $\{x^i_0\}_{i=1}^n$, (16) reduces to the following continuous-time limit of the $n$-particle SVGD dynamics:

$dx^i_t = \phi^*_{\hat\mu^n_t,p}(x^i_t)\, dt, \quad \forall i = 1,\ldots,n, \quad \text{with} \quad \hat\mu^n_t(dx) = \frac{1}{n}\sum_{i=1}^n \delta(x - x^i_t)\, dx, \qquad (17)$

where $\{\hat\mu^n_t\}$ can be shown to be a weak solution of (13)-(15), parallel to (9) in the discrete-time case; (16) can be viewed as the large-sample limit ($n \to \infty$) of (17).

The process (13)-(17) is a type of Vlasov process [12, 13]: a (deterministic) interacting particle process in which the particles interact with each other through the dependency on their "mean field" $\mu_t$ (or $\hat\mu^n_t$). Such processes have found important applications in physics, biology, and many other areas. There is a vast literature on the theory and applications of interacting particle systems in general; we refer to Spohn [14], Del Moral [15] and references therein as examples. Our particular form of Vlasov process, constructed based on the Stein operator in order to approximate arbitrary given distributions, seems to be new, to the best of our knowledge.

3.4 Gradient Flow, Optimal Transport, Geometry

We develop a geometric view of the Vlasov process in Section 3.3, interpreting it as a gradient flow for minimizing the KL divergence functional, defined on a new type of optimal transport metric on the space of density functions induced by the Stein operator. We focus on the set of "nice" densities $q$ paired with a well-defined Stein operator $\mathcal{S}_q$, acting on a Hilbert space $\mathcal{H}$.
To develop the intuition, consider a density $q$ and a nearby density $q'$ obtained by applying the transform $T(x) = x + \phi(x)\,dt$ to $x \sim q$, with infinitesimal $dt$ and $\phi \in \mathcal{H}$. Then one can show that (see Appendix A.3)

$\log q'(x) = \log q(x) - \mathcal{S}_q\phi(x)\,dt, \qquad q'(x) = q(x) - q(x)\,\mathcal{S}_q\phi(x)\,dt. \qquad (18)$

Because $\mathcal{S}_q\phi = \frac{\nabla\cdot(\phi q)}{q}$ from (2), we define the operator $q\mathcal{S}_q$ by $q\mathcal{S}_q\phi(x) = q(x)\mathcal{S}_q\phi(x) = \nabla\cdot(\phi(x)q(x))$. Eq. (18) suggests that the Stein operator $\mathcal{S}_q$ (resp. $q\mathcal{S}_q$) serves to translate a $\phi$-perturbation on the random variable $x$ to the corresponding change on the log-density (resp. density). This fact plays a central role in our development.

Denote by $\mathcal{H}_q$ (resp. $q\mathcal{H}_q$) the space of functions of the form $\mathcal{S}_q\phi$ (resp. $q\mathcal{S}_q\phi$) with $\phi \in \mathcal{H}$, that is,

$\mathcal{H}_q = \{\mathcal{S}_q\phi : \phi \in \mathcal{H}\}, \qquad q\mathcal{H}_q = \{q\mathcal{S}_q\phi : \phi \in \mathcal{H}\}.$

Equivalently, $q\mathcal{H}_q$ is the space of functions of the form $qf$ with $f \in \mathcal{H}_q$. This allows us to consider the inverse of the Stein operator for functions in $\mathcal{H}_q$. For each $f \in \mathcal{H}_q$, we can identify a unique function $\psi_{q,f} \in \mathcal{H}$ that has minimum $\|\cdot\|_{\mathcal{H}}$ norm among the set of $\psi$ satisfying $\mathcal{S}_q\psi = f$, that is,

$\psi_{q,f} = \arg\min_{\psi\in\mathcal{H}} \big\{\|\psi\|_{\mathcal{H}} \quad \text{s.t.} \quad \mathcal{S}_q\psi = f\big\},$

where $\mathcal{S}_q\psi = f$ is known as the Stein equation.
This allows us to define inner products on $\mathcal{H}_q$ and $q\mathcal{H}_q$ using the inner product on $\mathcal{H}$:

$\langle f_1, f_2\rangle_{\mathcal{H}_q} := \langle qf_1, qf_2\rangle_{q\mathcal{H}_q} := \langle\psi_{q,f_1}, \psi_{q,f_2}\rangle_{\mathcal{H}}. \qquad (19)$

Based on standard results in RKHS theory [e.g., 16], one can show that if $\mathcal{H}$ is an RKHS with kernel $k(x,x')$, then $\mathcal{H}_q$ and $q\mathcal{H}_q$ are both RKHS; the reproducing kernel of $\mathcal{H}_q$ is $\kappa_p(x,x')$ in (6), and correspondingly, the kernel of $q\mathcal{H}_q$ is $q(x)\kappa_p(x,x')q(x')$.

Now consider $q$ and a nearby $q' = q + qf\,dt$, $\forall f \in \mathcal{H}_q$, obtained by an infinitesimal perturbation of the density function using functions in the space $\mathcal{H}_q$. Then $\psi_{q,f}$ can be viewed as the "optimal" transform, in the sense of having minimum $\|\cdot\|_{\mathcal{H}}$ norm, that transports $q$ to $q'$ via $T(x) = x + \psi_{q,f}(x)\,dt$. It is therefore natural to define a notion of distance between $q$ and $q' = q + qf\,dt$ via

$W_{\mathcal{H}}(q, q') := \|\psi_{q,f}\|_{\mathcal{H}}\,dt,$

which, from (18) and (19), is equivalent to

$W_{\mathcal{H}}(q, q') = \|q - q'\|_{q\mathcal{H}_q} = \|\log q' - \log q\|_{\mathcal{H}_q}.$

Under this definition, the infinitesimal neighborhood $\{q' : W_{\mathcal{H}}(q,q') \le dt\}$ of $q$ consists of densities (resp. log-densities) of the form

$q' = q + g\,dt, \quad \forall g \in q\mathcal{H}_q, \quad \|g\|_{q\mathcal{H}_q} \le 1; \qquad \log q' = \log q + f\,dt, \quad \forall f \in \mathcal{H}_q, \quad \|f\|_{\mathcal{H}_q} \le 1.$

Geometrically, this means that $q\mathcal{H}_q$ (resp. $\mathcal{H}_q$) can be viewed as the tangent space around the density $q$ (resp. log-density $\log q$). Therefore, the related inner product $\langle\cdot,\cdot\rangle_{q\mathcal{H}_q}$ (resp. $\langle\cdot,\cdot\rangle_{\mathcal{H}_q}$) forms a Riemannian metric structure that corresponds to $W_{\mathcal{H}}(q,q')$. This also induces a geodesic distance that corresponds to a general, $\mathcal{H}$-dependent form of optimal transport metric between distributions.
Consider two densities $p$ and $q$ that can be transformed from one to the other with functions in $\mathcal{H}$, in the sense that there exists a curve of velocity fields $\{\phi_t : \phi_t \in \mathcal{H},\ t \in [0,1]\}$ in $\mathcal{H}$ that transforms a random variable $x_0 \sim q$ to $x_1 \sim p$ via $dx_t = \phi_t(x_t)\,dt$. This is equivalent to saying that there exists a curve of densities $\{\rho_t : t \in [0,1]\}$ such that

$\partial_t \rho_t = -\nabla\cdot(\phi_t\rho_t), \quad \text{with } \rho_0 = q,\ \rho_1 = p.$

It is therefore natural to define a geodesic distance between $q$ and $p$ via

$W_{\mathcal{H}}(q, p) = \inf_{\{\phi_t,\rho_t\}} \Big\{\int_0^1 \|\phi_t\|_{\mathcal{H}}\,dt \quad \text{s.t.} \quad \partial_t\rho_t = -\nabla\cdot(\phi_t\rho_t),\ \rho_0 = q,\ \rho_1 = p\Big\}. \qquad (20)$

We call $W_{\mathcal{H}}(p,q)$ an $\mathcal{H}$-Wasserstein (or optimal transport) distance between $p$ and $q$, in connection with the typical 2-Wasserstein distance, which can be viewed as a special case of (20) obtained by taking $\mathcal{H}$ to be the $L^2_{\rho_t}$ space equipped with the norm $\|f\|^2_{L^2_{\rho_t}} = \mathbb{E}_{\rho_t}[f^2]$, replacing the cost with $\int \|\phi_t\|_{L^2_{\rho_t}}\,dt$; the 2-Wasserstein distance is widely known to relate to Langevin dynamics, as we discuss further in Section 3.5 [e.g., 17, 18].

Now, for a given functional $F(q)$, this metric structure induces a notion of functional covariant gradient: the covariant gradient $\mathrm{grad}_{\mathcal{H}} F(q)$ of $F(q)$ is defined to be a functional that maps $q$ to an element of the tangent space $q\mathcal{H}_q$ of $q$ and satisfies

$F(q + f\,dt) = F(q) + \langle\mathrm{grad}_{\mathcal{H}} F(q),\, f\,dt\rangle_{q\mathcal{H}_q}, \qquad (21)$

for any $f$ in the tangent space $q\mathcal{H}_q$.

Theorem 3.5.
Following (21), the gradient of the KL divergence functional $F(q) := \mathrm{KL}(q\,\|\,p)$ is

$$\operatorname{grad}_H \mathrm{KL}(q\,\|\,p) = \nabla \cdot (\phi^*_{q,p}\, q).$$

Therefore, the SVGD-Vlasov equation (13) is a gradient flow of the KL divergence under the metric $W_H(\cdot, \cdot)$:

$$\frac{\partial q_t}{\partial t} = -\operatorname{grad}_H \mathrm{KL}(q_t\,\|\,p).$$

In addition, $\|\operatorname{grad}_H \mathrm{KL}(q\,\|\,p)\|_{qH_q} = \mathbb{D}(q\,\|\,p)$.

Remark. We can also define the functional gradient via

$$\operatorname{grad}_H F(q) \propto \operatorname*{arg\,max}_{f :\, \|f\|_{qH_q} \le 1} \Big\{ \lim_{\epsilon \to 0^+} \frac{F(q + \epsilon f) - F(q)}{W_H(q + \epsilon f,\, q)} \Big\},$$

which specifies the steepest ascent direction of $F(q)$ (with unit norm). The result in Theorem 3.5 is consistent with this definition.

3.5 Comparison with Langevin Dynamics

The theory of SVGD parallels that of Langevin dynamics in many respects, but with important differences. We give a brief discussion of their similarities and differences.

Langevin dynamics works by iterative updates of the form

$$x_{\ell+1} \leftarrow x_\ell + \epsilon\, \nabla \log p(x_\ell) + \sqrt{2\epsilon}\, \xi_\ell, \qquad \xi_\ell \sim \mathcal{N}(0, 1),$$

where a single particle $\{x_\ell\}$ moves along the gradient direction, perturbed with random Gaussian noise that plays the role of enforcing the diversity needed to match the variation in $p$ (which is accounted for by the deterministic repulsive force in SVGD). Taking the continuous-time limit ($\epsilon \to 0$), we obtain an Itô stochastic differential equation, $dx_t = \nabla \log p(x_t)\,dt + \sqrt{2}\, dW_t$, where $W_t$ is a standard Brownian motion and $x_0$ is a random variable with initial distribution $q_0$.
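To make the contrast concrete, the following is an illustrative sketch (not code from the paper) placing the Langevin update above next to the SVGD particle update of [1], for a 1D standard normal target; the RBF kernel, bandwidth, step size, and particle count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
score = lambda x: -x                 # grad log p for the target p = N(0, 1)
eps, h, n = 0.05, 1.0, 200
x_lang = rng.normal(5.0, 1.0, n)     # Langevin particles (independent chains)
x_svgd = rng.normal(5.0, 1.0, n)     # SVGD particles (interacting)

for _ in range(500):
    # Langevin: gradient step plus Gaussian noise that enforces diversity
    x_lang = x_lang + eps * score(x_lang) + np.sqrt(2 * eps) * rng.normal(size=n)

    # SVGD: kernel-weighted gradient plus a deterministic repulsive force
    d = x_svgd[:, None] - x_svgd[None, :]
    k = np.exp(-d**2 / (2 * h**2))           # k(x_j, x_i)
    grad_k = -d / h**2 * k                   # grad_{x_j} k(x_j, x_i): repulsion
    phi = (k * score(x_svgd)[:, None] + grad_k).mean(axis=0)
    x_svgd = x_svgd + eps * phi

# both particle clouds should now roughly approximate p = N(0, 1)
```

Note the $O(n^2)$ pairwise kernel evaluations in the SVGD step, versus $n$ independent single-particle Langevin moves.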
Standard results show that the density $q_t$ of the random variable $x_t$ is governed by a linear Fokker-Planck equation, under which the KL divergence to $p$ decreases at a rate equal to the Fisher divergence:

$$\frac{\partial q_t}{\partial t} = -\nabla \cdot (q_t \nabla \log p) + \Delta q_t, \qquad \frac{d}{dt} \mathrm{KL}(q_t\,\|\,p) = -\mathbb{F}(q_t, p), \tag{22}$$

where $\mathbb{F}(q, p) = \|\nabla \log (q/p)\|^2_{L^2_q}$. This result is parallel to Theorem 3.4, with the role of the squared Stein discrepancy (and the RKHS $H$) replaced by the Fisher divergence (and the $L^2_q$ space). Further, parallel to Theorem 3.5, it is well known that (22) can also be treated as a gradient flow of the KL functional $\mathrm{KL}(q\,\|\,p)$, but under the 2-Wasserstein metric $W_2(q, p)$ [17]. The main advantage of using an RKHS over $L^2_q$ is that it allows tractable computation of the optimal transport direction; this is not the case with $L^2_q$, and as a result Langevin dynamics requires a random diffusion term in order to form a proper approximation.

Practically, SVGD has the advantage of being deterministic and of reducing to exact MAP optimization when using only a single particle, while Langevin dynamics has the advantage of being a standard MCMC method, inheriting its statistical properties, and does not require the $O(n^2)$ cost that SVGD incurs to calculate the $n$-body interactions. However, the connections between SVGD and Langevin dynamics may allow us to develop theories and algorithms that unify the two, or combine their advantages.

4 Conclusion and Open Questions

We developed a theoretical framework for analyzing the asymptotic properties of Stein variational gradient descent. Many components of the analysis provide new insights in both theoretical and practical aspects. For example, our new metric structure can be useful for solving other learning problems by leveraging its computational tractability.
Many important problems remain open. For example, an important open problem is to establish explicit convergence rates for SVGD, for which the existing theoretical literature on Langevin dynamics and interacting particle systems may provide insights. Another is to develop finite-sample bounds for SVGD that account for the fact that it reduces to MAP optimization when $n = 1$. It is also an important direction to understand the bias and variance of SVGD particles, or to combine SVGD with traditional Monte Carlo methods whose bias-variance analysis is clearer (see, e.g., [19]).

Acknowledgement This work is supported in part by NSF CRII 1565796. We thank Lester Mackey and the anonymous reviewers for their comments.

References

[1] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.

[2] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

[3] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

[4] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133-288, 2013.

[5] B. Dai, N. He, H. Dai, and L. Song. Provable Bayesian inference via particle mirror descent. In The 19th International Conference on Artificial Intelligence and Statistics, 2016.

[6] C. Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7:i-164, 1986.

[7] Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In International Conference on Machine Learning (ICML), 2016.

[8] K. Chwialkowski, H. Strathmann, and A. Gretton.
A kernel test of goodness-of-fit. In International Conference on Machine Learning (ICML), 2016.

[9] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 2017.

[10] J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), 2017.

[11] C. J. Oates, J. Cockayne, F.-X. Briol, and M. Girolami. Convergence rates for a class of estimators based on Stein's identity. arXiv preprint arXiv:1603.03220, 2016.

[12] W. Braun and K. Hepp. The Vlasov dynamics and its fluctuations in the 1/n limit of interacting classical particles. Communications in Mathematical Physics, 56(2):101-113, 1977.

[13] A. A. Vlasov. On vibration properties of electron gas. J. Exp. Theor. Phys, 8(3):291, 1938.

[14] H. Spohn. Large scale dynamics of interacting particles. Springer Science & Business Media, 2012.

[15] P. Del Moral. Mean field simulation for Monte Carlo integration. CRC Press, 2013.

[16] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[17] F. Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001.

[18] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[19] J. Han and Q. Liu. Stein variational adaptive importance sampling. In Uncertainty in Artificial Intelligence, 2017.