{"title": "Random Feature Stein Discrepancies", "book": "Advances in Neural Information Processing Systems", "page_first": 1899, "page_last": 1909, "abstract": "Computable Stein discrepancies have been deployed for a variety of applications, ranging from sampler selection in posterior inference to approximate Bayesian inference to goodness-of-fit testing. Existing convergence-determining Stein discrepancies admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time Stein discrepancies have been proposed for goodness-of-fit testing, they exhibit avoidable degradations in testing power\u2014even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (\u03a6SDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct \u03a6SDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations\u2014random \u03a6SDs (R\u03a6SDs)\u2014which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, R\u03a6SDs perform as well or better than quadratic-time KSDs while being orders of magnitude faster to compute.", "full_text": "Random Feature Stein Discrepancies\n\nJonathan H. Huggins*\n\nDepartment of Biostatistics, Harvard\n\njhuggins@mit.edu\n\nLester Mackey*\n\nMicrosoft Research New England\n\nlmackey@microsoft.com\n\nAbstract\n\nComputable Stein discrepancies have been deployed for a variety of applications,\nranging from sampler selection in posterior inference to approximate Bayesian\ninference to goodness-of-fit testing. Existing convergence-determining Stein\ndiscrepancies admit strong theoretical guarantees but suffer from a computational cost\nthat grows quadratically in the sample size. 
While linear-time Stein discrepancies\nhave been proposed for goodness-of-fit testing, they exhibit avoidable degradations\nin testing power\u2014even when power is explicitly optimized. To address these\nshortcomings, we introduce feature Stein discrepancies (\u03a6SDs), a new family of\nquality measures that can be cheaply approximated using importance sampling.\nWe show how to construct \u03a6SDs that provably determine the convergence of a\nsample to its target and develop high-accuracy approximations\u2014random \u03a6SDs\n(R\u03a6SDs)\u2014which are computable in near-linear time. In our experiments with\nsampler selection for approximate posterior inference and goodness-of-fit testing,\nR\u03a6SDs perform as well as or better than quadratic-time KSDs while being orders of\nmagnitude faster to compute.\n\n1 Introduction\n\nMotivated by the intractable integration problems arising from Bayesian inference and maximum\nlikelihood estimation [9], Gorham and Mackey [10] introduced the notion of a Stein discrepancy as a\nquality measure that can potentially be computed even when direct integration under the distribution of\ninterest is unavailable. Two classes of computable Stein discrepancies\u2014the graph Stein discrepancy\n[10, 12] and the kernel Stein discrepancy (KSD) [7, 11, 19, 21]\u2014have since been developed to\nassess and tune Markov chain Monte Carlo samplers, test goodness-of-fit, train generative adversarial\nnetworks and variational autoencoders, and more [7, 10\u201312, 16\u201319, 27]. However, in practice, the\ncost of these Stein discrepancies grows quadratically in the size of the sample being evaluated,\nlimiting scalability. Jitkrittum et al. [16] introduced a special form of KSD termed the finite-set Stein\ndiscrepancy (FSSD) to test goodness-of-fit in linear time. 
However, even after an optimization-based\npreprocessing step to improve power, the proposed FSSD experiences an unnecessary degradation of\npower relative to quadratic-time tests in higher dimensions.\nTo address the distinct shortcomings of existing linear- and quadratic-time Stein discrepancies, we\nintroduce a new class of Stein discrepancies we call feature Stein discrepancies (\u03a6SDs). We show\nhow to construct \u03a6SDs that provably determine the convergence of a sample to its target, thus making\nthem attractive for goodness-of-fit testing, measuring sample quality, and other applications. We\nthen introduce a fast importance sampling-based approximation we call random \u03a6SDs (R\u03a6SDs).\nWe provide conditions under which, with an appropriate choice of proposal distribution, an R\u03a6SD\nis close in relative error to the \u03a6SD with high probability. Using an R\u03a6SD, we show how, for any\n\u03b3 > 0, we can compute O_P(N^{-1/2})-precision estimates of an \u03a6SD in O(N^{1+\u03b3}) (near-linear) time\nwhen the \u03a6SD precision is \u03a9(N^{-1/2}). Additionally, to enable applications to goodness-of-fit testing,\nwe (1) show how to construct R\u03a6SDs that can distinguish between arbitrary distributions and (2)\ndescribe the asymptotic null distribution when sample points are generated i.i.d. from an unknown\ndistribution. In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter\nselection and fast goodness-of-fit testing, we obtain high-quality results\u2014which are comparable\nto or better than those produced by quadratic-time KSDs\u2014using only ten features and requiring\norders-of-magnitude less computation.\n\n*Contributed equally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n
Notation For measures \u00b5_1, \u00b5_2 on \u211d^D and functions f : \u211d^D \u2192 \u2102, k : \u211d^D \u00d7 \u211d^D \u2192 \u2102, we\nlet \u00b5_1(f) := \u222b f(x) \u00b5_1(dx), (\u00b5_1 k)(x') := \u222b k(x, x') \u00b5_1(dx), and (\u00b5_1 \u00d7 \u00b5_2)(k) :=\n\u222b\u222b k(x_1, x_2) \u00b5_1(dx_1) \u00b5_2(dx_2). We denote the generalized Fourier transform of f by f\u0302 or \u2131(f) and\nits inverse by \u2131^{-1}(f). For r \u2265 1, let L^r := {f : \u2016f\u2016_{L^r} := (\u222b |f(x)|^r dx)^{1/r} < \u221e} and C^n denote\nthe space of n-times continuously differentiable functions. We let \u21d2_D and \u2192_P denote convergence\nin distribution and in probability, respectively. We let a\u0305 denote the complex conjugate of a. For\nD \u2208 \u2115, define [D] := {1, . . . , D}. The symbol \u2273 indicates greater than up to a universal constant.\n\n2 Feature Stein discrepancies\n\nWhen exact integration under a target distribution P is infeasible, one often appeals to a discrete\nmeasure Q_N = (1/N) \u2211_{n=1}^N \u03b4_{x_n} to approximate expectations, where the sample points x_1, . . . , x_N \u2208\n\u211d^D are generated from a Markov chain or quadrature rule. The aim in sample quality measurement\nis to quantify how well Q_N approximates the target in a manner that (a) recognizes when a sample\nsequence is converging to the target, (b) highlights when a sample sequence is not converging to\nthe target, and (c) is computationally efficient. It is natural to frame this comparison in terms of\nan integral probability metric (IPM) [20], d_H(Q_N, P) := sup_{h\u2208H} |Q_N(h) - P(h)|, measuring the\nmaximum discrepancy between target and sample expectations over a class of test functions. However,\nwhen generic integration under P is intractable, standard IPMs like the 1-Wasserstein distance and\nDudley metric may not be efficiently computable.\nTo address this need, Gorham and Mackey [10] introduced the Stein discrepancy framework for\ngenerating IPM-type quality measures with no explicit integration under P. For any approximating\nprobability measure \u00b5, each Stein discrepancy takes the form\n\nd_{TG}(\u00b5, P) = sup_{g\u2208G} |\u00b5(Tg)| where \u2200g \u2208 G, P(Tg) = 0.\n\nHere, T is an operator that generates mean-zero functions under P, and G is the Stein set of functions\non which T operates. 
For concreteness, we will assume that P has C^1 density p with support \u211d^D and\nrestrict our attention to the popular Langevin Stein operator [10, 21] defined by Tg := \u2211_{d=1}^D T_d g_d\nfor (T_d g_d)(x) := p(x)^{-1} \u2202_{x_d}(p(x) g_d(x)) and g : \u211d^D \u2192 \u211d^D. To date, two classes of computable\nStein discrepancies with strong convergence-determining guarantees have been identified. The graph\nStein discrepancies [10, 12] impose smoothness constraints on the functions g and are computed by\nsolving a linear program, while the kernel Stein discrepancies [7, 11, 19, 21] define G as the unit ball\nof a reproducing kernel Hilbert space and are computed in closed-form. Both classes, however, suffer\nfrom a computational cost that grows quadratically in the number of sample points. Our aim is to\ndevelop alternative discrepancy measures that retain the theoretical and practical benefits of existing\nStein discrepancies at a greatly reduced computational cost.\nOur strategy is to identify a family of convergence-determining discrepancy measures that can be\naccurately and inexpensively approximated with random sampling. To this end, we define a new\ndomain for the Stein operator centered around a feature function \u03a6 : \u211d^D \u00d7 \u211d^D \u2192 \u2102 which, for some\nr \u2208 [1, \u221e) and all x, z \u2208 \u211d^D, satisfies \u03a6(x, \u00b7) \u2208 L^r and \u03a6(\u00b7, z) \u2208 C^1:\n\nG_{\u03a6,r} := {g : \u211d^D \u2192 \u211d^D | g_d(x) = \u222b \u03a6(x, z) f_d(z) dz with \u2211_{d=1}^D \u2016f_d\u2016_{L^s}^2 \u2264 1 for s = r/(r - 1)}.\n\n
When combined with the Langevin Stein operator T, this feature Stein set gives rise to a feature Stein\ndiscrepancy (\u03a6SD) with an appealing explicit form (\u2211_{d=1}^D \u2016\u00b5(T_d \u03a6)\u2016_{L^r}^2)^{1/2}:\n\n\u03a6SD_{\u03a6,r}^2(\u00b5, P) := sup_{g\u2208G_{\u03a6,r}} |\u00b5(Tg)|^2 = sup_{g\u2208G_{\u03a6,r}} |\u2211_{d=1}^D \u00b5(T_d g_d)|^2\n= sup_{f : v_d = \u2016f_d\u2016_{L^s}, \u2016v\u2016_2 \u2264 1} |\u2211_{d=1}^D \u222b \u00b5(T_d \u03a6)(z) f_d(z) dz|^2\n= sup_{v : \u2016v\u2016_2 \u2264 1} |\u2211_{d=1}^D \u2016\u00b5(T_d \u03a6)\u2016_{L^r} v_d|^2\n= \u2211_{d=1}^D \u2016\u00b5(T_d \u03a6)\u2016_{L^r}^2. (1)\n\nIn Section 3.1, we will show how to select the feature function \u03a6 and order r so that \u03a6SD_{\u03a6,r} provably\ndetermines convergence, in line with our desiderata (a) and (b).\nTo achieve efficient computation, we will approximate the \u03a6SD in expression (1) using an importance\nsample of size M drawn from a proposal distribution with (Lebesgue) density \u03bd. We call the resulting\nstochastic discrepancy measure a random \u03a6SD (R\u03a6SD):\n\nR\u03a6SD_{\u03a6,r,\u03bd,M}^2(\u00b5, P) := \u2211_{d=1}^D ((1/M) \u2211_{m=1}^M \u03bd(Z_m)^{-1} |\u00b5(T_d \u03a6)(Z_m)|^r)^{2/r} for Z_1, . . . , Z_M i.i.d. \u223c \u03bd.\n\nImportantly, when \u00b5 is the sample approximation Q_N, the R\u03a6SD can be computed in O(MN)\ntime by evaluating the MND rescaled random features, (T_d \u03a6)(x_n, Z_m)/\u03bd(Z_m)^{1/r}; this computation\nis also straightforwardly parallelized. In Section 3.2, we will show how to choose \u03bd so that\nR\u03a6SD_{\u03a6,r,\u03bd,M} approximates \u03a6SD_{\u03a6,r} with small relative error.\n\nSpecial cases When r = 2, the \u03a6SD is an instance of a kernel Stein discrepancy (KSD) with base\nreproducing kernel k(x, y) = \u222b \u03a6(x, z) \u03a6\u0305(y, z) dz. This follows from the definition [7, 11, 19, 21]\nKSD_k(\u00b5, P)^2 := \u2211_{d=1}^D (\u00b5 \u00d7 \u00b5)((T_d \u2297 T\u0305_d)k) = \u2211_{d=1}^D \u2016\u00b5(T_d \u03a6)\u2016_{L^2}^2 = \u03a6SD_{\u03a6,2}(\u00b5, P)^2. However,\nwe will see in Sections 3 and 5 that there are significant theoretical and practical benefits to using\n\u03a6SDs with r \u2260 2. 
Namely, we will be able to approximate \u03a6SD_{\u03a6,r} with r \u2260 2 more effectively\nwith a smaller sampling budget. If \u03a6(x, z) = e^{-i\u27e8z,x\u27e9} \u03a8\u0302(z)^{1/2} and \u03bd \u221d \u03a8\u0302 for \u03a8 \u2208 L^2, then\nR\u03a6SD_{\u03a6,2,\u03bd,M} is the random Fourier feature (RFF) approximation [22] to KSD_k with k(x, y) =\n\u03a8(x - y). Chwialkowski et al. [6, Prop. 1] showed that the RFF approximation can be an undesirable\nchoice of discrepancy measure, as there exist uncountably many pairs of distinct distributions that,\nwith high probability, cannot be distinguished by the RFF approximation. Following Chwialkowski\net al. [6] and Jitkrittum et al. [16], we show how to select \u03a6 and \u03bd to avoid this property in Section 4.\nThe random finite set Stein discrepancy [FSSD-rand, 16] with proposal \u03bd is an R\u03a6SD_{\u03a6,2,\u03bd,M} with\n\u03a6(x, z) = f(x, z) \u03bd(z)^{1/2} for f a real analytic and C_0-universal [4, Def. 4.1] reproducing kernel. In\nSection 3.1, we will see that features \u03a6 of a different form give rise to strong convergence-determining\nproperties.\n\n3 Selecting a Random Feature Stein Discrepancy\n\nIn this section, we provide guidance for selecting the components of an R\u03a6SD to achieve our\ntheoretical and computational goals. We first discuss the choice of the feature function \u03a6 and order r\nand then turn our attention to the proposal distribution \u03bd. Finally, we detail two practical choices of\nR\u03a6SD that will be used in our experiments. To ease notation, we will present theoretical guarantees\nin terms of the sample measure Q_N, but all results continue to hold if any approximating probability\nmeasure \u00b5 is substituted for Q_N.\n\n3.1 Selecting a feature function \u03a6\nA principal concern in selecting a feature function is ensuring that the \u03a6SD detects non-convergence\u2014\nthat is, Q_N \u21d2_D P whenever \u03a6SD_{\u03a6,r}(Q_N, P) \u2192 0. To ensure this, we will construct \u03a6SDs that\nupper bound a reference KSD known to detect non-convergence. 
This is enabled by the following\ninequality proved in Appendix A.\nProposition 3.1 (KSD-\u03a6SD inequality). If k(x, y) = \u222b \u2131(\u03a6(x, \u00b7))(\u03c9) \u2131\u0305(\u03a6(y, \u00b7))(\u03c9) \u03c1(\u03c9) d\u03c9,\nr \u2208 [1, 2], and \u03c1 \u2208 L^t for t = r/(2 - r), then\n\nKSD_k^2(Q_N, P) \u2264 \u2016\u03c1\u2016_{L^t} \u03a6SD_{\u03a6,r}^2(Q_N, P). (2)\n\nOur strategy is to first pick a KSD that detects non-convergence and then choose \u03a6 and r such that\n(2) applies. Unfortunately, KSDs based on many common base kernels, like the Gaussian and Mat\u00e9rn,\nfail to detect non-convergence when D > 2 [11, Thm. 6]. A notable exception is the KSD with\ninverse multiquadric (IMQ) base kernel.\n\nExample 3.1 (IMQ kernel). The IMQ kernel is given by \u03a8^{IMQ}_{c,\u03b2}(x - y) := (c^2 + \u2016x - y\u2016_2^2)^\u03b2, where\nc > 0 and \u03b2 < 0. Gorham and Mackey [11, Thm. 8] proved that when \u03b2 \u2208 (-1, 0), KSDs with\nan IMQ base kernel determine weak convergence on \u211d^D whenever P \u2208 \ud835\udcab, the set of distantly\ndissipative distributions for which \u2207 log p is Lipschitz.\u00b2\n\n\u00b2We say P satisfies distant dissipativity [8, 12] if \u03ba_0 := lim inf_{r\u2192\u221e} \u03ba(r) > 0 for \u03ba(r) =\ninf{-2\u27e8\u2207 log p(x) - \u2207 log p(y), x - y\u27e9/\u2016x - y\u2016_2^2 : \u2016x - y\u2016_2 = r}.\n\nLet m_N := E_{X\u223cQ_N}[X] denote the mean of Q_N. We would like to consider a broader class of base\nkernels, the form of which we summarize in the following assumption:\nAssumption A. The base kernel has the form k(x, y) = A_N(x) \u03a8(x - y) A_N(y) for \u03a8 \u2208 C^2,\nA \u2208 C^1, and A_N(x) := A(x - m_N), where A > 0 and \u2207 log A is bounded and Lipschitz.\nThe IMQ kernel falls within the class defined by Assumption A (let A = 1 and \u03a8 = \u03a8^{IMQ}_{c,\u03b2}). On the\nother hand, our next result, proved in Appendix B, shows that tilted base kernels with A increasing\nsufficiently quickly also control convergence.\nTheorem 3.2 (Tilted KSDs detect non-convergence). Suppose that P \u2208 \ud835\udcab, Assumption A holds,\n1/A \u2208 L^2, and H(u) := sup_{\u03c9\u2208\u211d^D} e^{-\u2016\u03c9\u2016_2^2/(2u^2)}/\u03a8\u0302(\u03c9) is finite for all u > 0. Then for any sequence\nof probability measures (\u00b5_N)_{N=1}^\u221e, if KSD_k(\u00b5_N, P) \u2192 0 then \u00b5_N \u21d2_D P.\n
Example 3.2 (Tilted hyperbolic secant kernel). The hyperbolic secant (sech) function\nis sech(u) := 2/(e^u + e^{-u}). For x \u2208 \u211d^D and a > 0, define the sech kernel\n\u03a8^{sech}_a(x) := \u220f_{d=1}^D sech(\u221a(\u03c0/2) a x_d). Since \u03a8\u0302^{sech}_a(\u03c9) = \u03a8^{sech}_{1/a}(\u03c9)/a^D, KSD_k from Theorem 3.2\ndetects non-convergence when \u03a8 = \u03a8^{sech}_a and A^{-1} \u2208 L^2. Valid tilting functions include\nA(x) = \u220f_{d=1}^D e^{c\u221a(1+x_d^2)} for any c > 0 and A(x) = (c^2 + \u2016x\u2016_2^2)^b for any b > D/4 (to ensure\nA^{-1} \u2208 L^2).\nWith our appropriate reference KSDs in hand, we will now design upper bounding \u03a6SDs. To\naccomplish this we will have \u03a6 mimic the form of the base kernels in Assumption A:\nAssumption B. Assumption A holds and \u03a6(x, z) = A_N(x) F(x - z), where F \u2208 C^1 is positive,\nand there exist a norm \u2016\u00b7\u2016 and constants s, C > 0 such that\n\n|\u2202_{x_d} log F(x)| \u2264 C(1 + \u2016x\u2016^s), lim_{\u2016x\u2016\u2192\u221e} (1 + \u2016x\u2016^s) F(x) = 0, and F(x - z) \u2264 C F(z)/F(x).\n\nIn addition, there exist a constant c \u2208 (0, 1] and continuous, non-increasing function f such that\nc f(\u2016x\u2016) \u2264 F(x) \u2264 f(\u2016x\u2016).\nAssumption B requires a minimal amount of regularity from F, essentially that F be sufficiently\nsmooth and behave as if it is a function only of the norm of its argument. A conceptually straightforward\nchoice would be to set F = \u2131^{-1}(\u03a8\u0302^{1/2})\u2014that is, to be the square root kernel of \u03a8. We would\nthen have that \u03a8(x - y) = \u222b F(x - z) F(y - z) dz, so in particular \u03a6SD_{\u03a6,2} = KSD_k. Since the\nexact square-root kernel of a base kernel can be difficult to compute in practice, we require only that\nF be a suitable approximation to the square root kernel of \u03a8:\nAssumption C. Assumption B holds, and there exists a smoothness parameter \u03bb\u0304 \u2208 (1/2, 1] such\nthat if \u03bb \u2208 (1/2, \u03bb\u0304), then F\u0302/\u03a8\u0302^{\u03bb/2} \u2208 L^2.\nRequiring that F\u0302/\u03a8\u0302^{\u03bb/2} \u2208 L^2 is equivalent to requiring that F belongs to the reproducing kernel\nHilbert space K_\u03bb induced by the kernel \u2131^{-1}(\u03a8\u0302^\u03bb). The smoothness of the functions in K_\u03bb increases\nas \u03bb increases. 
Hence \u03bb\u0304 quantifies the smoothness of F relative to \u03a8.\nFinally, we would like an assurance that the \u03a6SD detects convergence\u2014that is, \u03a6SD_{\u03a6,r}(Q_N, P) \u2192 0\nwhenever Q_N converges to P in a suitable metric. The following result, proved in Appendix C,\nprovides such a guarantee for both the \u03a6SD and the R\u03a6SD.\nProposition 3.3. Suppose Assumption B holds with F \u2208 L^r, 1/A bounded, x \u21a6 x/A(x) Lipschitz,\nand E_P[A(Z)\u2016Z\u2016_2^2] < \u221e. If the tilted Wasserstein distance\n\nW_{A_N}(Q_N, P) := sup_{h\u2208H} |Q_N(A_N h) - P(A_N h)| (H := {h : \u2016\u2207h(x)\u2016_2 \u2264 1, \u2200x \u2208 \u211d^D})\n\nconverges to zero, then \u03a6SD_{\u03a6,r}(Q_N, P) \u2192 0 and R\u03a6SD_{\u03a6,r,\u03bd_N,M_N}(Q_N, P) \u2192_P 0 for any choices\nof r \u2208 [1, 2], \u03bd_N, and M_N \u2265 1.\nRemark 3.4. When A is constant, W_{A_N} is the familiar 1-Wasserstein distance.\n\n3.2 Selecting an importance sampling distribution \u03bd\nOur next goal is to select an R\u03a6SD proposal distribution \u03bd for which the R\u03a6SD is close to its\nreference \u03a6SD even when the importance sample size M is small. Our strategy is to choose \u03bd so that\nthe second moment of each R\u03a6SD feature, w_d(Z, Q_N) := |(Q_N T_d \u03a6)(Z)|^r/\u03bd(Z), is bounded by a\npower of its mean:\nDefinition 3.5 ((C, \u03b3) second moments). Fix a target distribution P. For Z \u223c \u03bd, d \u2208 [D], and\nN \u2265 1, let Y_{N,d} := w_d(Z, Q_N). If for some C > 0 and \u03b3 \u2208 [0, 2] we have E[Y_{N,d}^2] \u2264 C E[Y_{N,d}]^{2-\u03b3}\nfor all d \u2208 [D] and N \u2265 1, then we say (\u03a6, r, \u03bd) yields (C, \u03b3) second moments for P and Q_N.\nThe next proposition, proved in Appendix D, demonstrates the value of this second moment property.\nProposition 3.6. 
Suppose (\u03a6, r, \u03bd) yields (C, \u03b3) second moments for P and Q_N. If M \u2265\n2C E[Y_{N,d}]^{-\u03b3} log(D/\u03b4)/\u03b5^2 for all d \u2208 [D], then, with probability at least 1 - \u03b4,\n\nR\u03a6SD_{\u03a6,r,\u03bd,M}(Q_N, P) \u2265 (1 - \u03b5)^{1/r} \u03a6SD_{\u03a6,r}(Q_N, P).\n\nUnder the further assumptions of Proposition 3.1, if the reference KSD_k(Q_N, P) \u2273 N^{-1/2},\u00b3 then a\nsample size M \u2273 N^{\u03b3r/2} C \u2016\u03c1\u2016_{L^t}^{\u03b3r/2} log(D/\u03b4)/\u03b5^2 suffices to have, with probability at least 1 - \u03b4,\n\n\u2016\u03c1\u2016_{L^t}^{1/2} R\u03a6SD_{\u03a6,r,\u03bd,M}(Q_N, P) \u2265 (1 - \u03b5)^{1/r} KSD_k(Q_N, P).\n\nNotably, a smaller r leads to substantial gains in the sample complexity M = \u03a9(N^{\u03b3r/2}). For\nexample, if r = 1, it suffices to choose M = \u03a9(N^{1/2}) whenever the weight function w_d is bounded\n(so that \u03b3 = 1); in contrast, existing analyses of random Fourier features [15, 22, 25, 26, 30] require\nM = \u03a9(N) to achieve the same error rates. We will ultimately show how to select \u03bd so that \u03b3 is\narbitrarily close to 0. First, we provide simple conditions and a choice for \u03bd which guarantee (C, 1)\nsecond moments.\nProposition 3.7. Assume that P \u2208 \ud835\udcab, Assumptions A and B hold with s = 0, and there exists a\nconstant C' > 0 such that for all N \u2265 1, Q_N([1 + \u2016\u00b7\u2016]A_N) \u2264 C'. If \u03bd(z) \u221d Q_N([1 + \u2016\u00b7\u2016]\u03a6(\u00b7, z)),\nthen for any r \u2265 1, (\u03a6, r, \u03bd) yields (C, 1) second moments for P and Q_N.\nProposition 3.7, which is proved in Appendix E, is based on showing that the weight function\nw_d(z, Q_N) is uniformly bounded. In order to obtain (C, \u03b3) moments for \u03b3 < 1, we will choose\n\u03bd such that w_d(z, Q_N) decays sufficiently quickly as \u2016z\u2016 \u2192 \u221e. We achieve this by choosing\nan overdispersed \u03bd\u2014that is, we choose \u03bd with heavy tails compared to F. We also require two\nintegrability conditions involving the Fourier transforms of \u03a8 and F.\n
Assumption D. Assumptions A and B hold, \u03c9_1^2 \u03a8\u0302^{1/2}(\u03c9) \u2208 L^1, and for t = r/(2 - r), \u03a8\u0302/F\u0302^2 \u2208 L^t.\nThe L^1 condition is an easily satisfied technical condition while the L^t condition ensures that the\nKSD-\u03a6SD inequality (2) applies to our chosen \u03a6SD.\nTheorem 3.8. Assume that P \u2208 \ud835\udcab, Assumptions A to D hold, and there exists C > 0 such that\n\nQ_N([1 + \u2016\u00b7\u2016 + \u2016\u00b7 - m_N\u2016^s] A_N/F(\u00b7 - m_N)) \u2264 C for all N \u2265 1. (3)\n\nThen there is a constant b \u2208 [0, 1) such that the following holds. For any \u03be \u2208 (0, 1 - b), c > 0, and\n\u03b1 > 2(1 - \u03bb\u0304), if \u03bd(z) \u2265 c \u03a8(z - m_N)^{\u03ber}, then there exists a constant C_\u03b1 > 0 such that (\u03a6, r, \u03bd)\nyields (C_\u03b1, \u03b3_\u03b1) second moments for P and Q_N, where \u03b3_\u03b1 := \u03b1 + (2 - \u03b1)\u03be/(2 - b - \u03be).\nTheorem 3.8 suggests a strategy for improving the importance sample growth rate \u03b3 of an R\u03a6SD:\nincrease the smoothness \u03bb\u0304 of F and decrease the over-dispersion parameter \u03be of \u03bd.\n\n\u00b3Note that KSD_k(Q_N, P) = \u03a9_P(N^{-1/2}) whenever the sample points x_1, . . . , x_N are drawn i.i.d. from a\ndistribution \u00b5, since the scaled V-statistic N KSD_k^2(Q_N, P) diverges when \u00b5 \u2260 P and converges in distribution\nto a non-zero limit when \u00b5 = P [23, Thm. 32]. Moreover, working in a hypothesis testing framework of\nshrinking alternatives, Gretton et al. [13, Thm. 13] showed that KSD_k(Q_N, P) = \u0398(N^{-1/2}) was the smallest\nlocal departure distinguishable by an asymptotic KSD test.\n\nFigure 1: Efficiency of R\u03a6SDs. The L1 IMQ R\u03a6SD displays exceptional efficiency. (a) Efficiency of L1 IMQ. (b) Efficiency of L2 SechExp. (c) M necessary for std(R\u03a6SD)/\u03a6SD < 1/2.\n\n3.3 Example R\u03a6SDs\n\nIn our experiments, we will consider two R\u03a6SDs that determine convergence by Propositions 3.1\nand 3.3 and that yield (C, \u03b3) second moments for any \u03b3 \u2208 (0, 1] using Theorem 3.8.\n
Example 3.3 (L2 tilted hyperbolic secant R\u03a6SD). Mimicking the construction of the hyperbolic\nsecant kernel in Example 3.2 and following the intuition that F should behave like the square root of\n\u03a8, we choose F = \u03a8^{sech}_{2a}. As shown in Appendix I, if we choose r = 2 and \u03bd(z) \u221d \u03a8^{sech}_{4a\u03be}(z - m_N)\nwe can verify all the assumptions necessary for Theorem 3.8 to hold. Moreover, the theorem holds\nfor any b > 0 and hence any \u03be \u2208 (0, 1) may be chosen. Note that \u03bd can be sampled from efficiently\nusing the inverse CDF method.\nExample 3.4 (L^r IMQ R\u03a6SD). We can also parallel the construction of the reference IMQ kernel\nk(x, y) = \u03a8^{IMQ}_{c,\u03b2}(x - y) from Example 3.1, where c > 0 and \u03b2 \u2208 [-D/2, 0). (Recall we have\nA = 1 in Assumption A.) In order to construct a corresponding R\u03a6SD we must choose the constant\n\u03bb\u0304 \u2208 (1/2, 1) that will appear in Assumption C and \u03be\u0332 \u2208 (0, 1/2), the minimum \u03be we will be\nable to choose when constructing \u03bd. We show in Appendix J that if we choose F = \u03a8^{IMQ}_{c',\u03b2'}, then\nAssumptions A to D hold when c' = \u03bb\u0304c/2, \u03b2' \u2208 [-D/(2\u03be\u0332), -\u03b2/(2\u03be\u0332) - D/(2\u03be\u0332)), r = -D/(2\u03b2'\u03be\u0332),\n\u03be \u2208 (\u03be\u0332, 1), and \u03bd(z) \u221d \u03a8^{IMQ}_{c',\u03b2'}(z - m_N)^{\u03ber}. A particularly simple setting is given by \u03b2' = -D/(2\u03be\u0332),\nwhich yields r = 1. Note that \u03bd can be sampled from efficiently since it is a multivariate t-distribution.\n\nIn the future it would be interesting to construct other R\u03a6SDs. We can recommend the following\nfairly simple default procedure for choosing an R\u03a6SD based on a reference KSD admitting the form\nin Assumption A. (1) Choose any \u03b3 > 0, and set \u03b1 = \u03b3/3, \u03bb\u0304 = 1 - \u03b1/2, and \u03be = 4\u03b1/(2 + \u03b1).\nThese are the settings we will use in our experiments. It may be possible to initially skip this step\nand reason about general choices of \u03b3, \u03be, and \u03bb\u0304. (2) Pick any F that satisfies F\u0302/\u03a8\u0302^{\u03bb/2} \u2208 L^2 for some\n\u03bb \u2208 (1/2, \u03bb\u0304) (that is, Assumption C holds) while also satisfying \u03a8\u0302/F\u0302^2 \u2208 L^t for some t \u2208 [1, \u221e].\nThe selection of t induces a choice of r via Assumption D. 
A simple choice for F is F 1 \u02c6 . (3)\nCheck if Assumption B holds (it usually does if F decays no faster than a Gaussian); if it does not, a\nslightly different choice of F should be made. (4) Choose \u03bd(z) \u221d \u03a8(z - m_N)^{\u03ber}.\n\n4 Goodness-of-fit testing with R\u03a6SDs\n\nWe now detail additional properties of R\u03a6SDs relevant to testing goodness of fit. In goodness-of-fit\ntesting, the sample points (X_n)_{n=1}^N underlying Q_N are assumed to be drawn i.i.d. from a distribution\n\u00b5, and we wish to use the test statistic F_{r,N} := R\u03a6SD_{\u03a6,r,\u03bd,M}^2(Q_N, P) to determine whether the null\nhypothesis H_0 : P = \u00b5 or alternative hypothesis H_1 : P \u2260 \u00b5 holds. To this end, we will restrict\nour focus to real analytic \u03a6 and strictly positive analytic \u03bd, as by Chwialkowski et al. [6, Prop. 2 and\nLemmas 1-3], with probability 1, P = \u00b5 \u21d4 R\u03a6SD_{\u03a6,r,\u03bd,M}(\u00b5, P) = 0 when these properties hold.\nThus, analytic R\u03a6SDs do not suffer from the shortcoming of RFFs\u2014which are unable to distinguish\nbetween infinitely many distributions with high probability [6].\nIt remains to estimate the distribution of the test statistic F_{r,N} under the null hypothesis and to verify\nthat the power of a test based on this distribution approaches 1 as N \u2192 \u221e. To state our result, we\nassume that M is fixed. Let \u03be_{r,N,dm}(x) := (T_d \u03a6)(x, Z_{N,m})/(M\u03bd(Z_{N,m}))^{1/r} for r \u2208 [1, 2], where\nZ_{N,m} \u223c \u03bd_N independently, so that \u03be_{r,N}(x) \u2208 \u211d^{DM}. 
The following result, proved in Appendix K, provides the\nbasis for our testing guarantees.\nProposition 4.1 (Asymptotic distribution of R\u03a6SD). Assume \u03a3_{r,N} := Cov_P(\u03be_{r,N}) is finite for all\nN and \u03a3_r := lim_{N\u2192\u221e} \u03a3_{r,N} exists. Let \u03b6 \u223c N(0, \u03a3_r). Then as N \u2192 \u221e: (1) under H_0 : P = \u00b5,\nN F_{r,N} \u21d2_D \u2211_{d=1}^D (\u2211_{m=1}^M |\u03b6_{dm}|^r)^{2/r} and (2) under H_1 : P \u2260 \u00b5, N F_{r,N} \u2192_P \u221e.\nRemark 4.2. The condition \u03a3_r := lim_{N\u2192\u221e} \u03a3_{r,N} holds if \u03bd_N = \u03bd_0(\u00b7 - m_N) for a distribution \u03bd_0.\nOur second asymptotic result provides a roadmap for using R\u03a6SDs for hypothesis testing and is\nsimilar in spirit to Theorem 3 from Jitkrittum et al. [16]. In particular, it furnishes an asymptotic null\ndistribution and establishes asymptotically full power.\n\nTheorem 4.3 (Goodness of fit testing with R\u03a6SD). Let \u00b5\u0302 := N^{-1} \u2211_{n=1}^N \u03be_{r,N}(X'_n) and \u03a3\u0302 :=\nN^{-1} \u2211_{n=1}^N \u03be_{r,N}(X'_n) \u03be_{r,N}(X'_n)^\u22a4 - \u00b5\u0302\u00b5\u0302^\u22a4 with either X'_n = X_n or X'_n i.i.d. \u223c P. Suppose for\nthe test N F_{r,N}, the test threshold \u03c4_\u03b1 is set to the (1 - \u03b1)-quantile of the distribution of\n\u2211_{d=1}^D (\u2211_{m=1}^M |\u03b6_{dm}|^r)^{2/r}, where \u03b6 \u223c N(0, \u03a3\u0302). Then, under H_0 : P = \u00b5, asymptotically the\nfalse positive rate is \u03b1. Under H_1 : P \u2260 \u00b5, the test power P_{H_1}(N F_{r,N} > \u03c4_\u03b1) \u2192 1 as N \u2192 \u221e.\n\nFigure 2: Hyperparameter selection for stochastic gradient Langevin dynamics (SGLD). (a) Step size selection using R\u03a6SDs and quadratic-time KSD baseline. With M \u2265 10, each quality measure\nselects a step size of \u03b5 = .01 or .005. (b) SGLD sample points with equidensity contours of p overlaid. The samples produced by SGLD with \u03b5 = .01\nor .005 are noticeably better than those produced using smaller or larger step sizes.\n\n5 Experiments\n\nWe now investigate the importance-sample and computational efficiency of our proposed R\u03a6SDs\nand evaluate their benefits in MCMC hyperparameter selection and goodness-of-fit testing.\u2074 In our\nexperiments, we considered the R\u03a6SDs described in Examples 3.3 and 3.4: the tilted sech kernel\nusing r = 2 and A(x) = \u220f_{d=1}^D e^{a'\u221a(1+x_d^2)} (L2 SechExp) and the inverse multiquadric kernel using\nr = 1 (L1 IMQ). We selected kernel parameters as follows. 
First we chose a target \u03b3 and then\nselected \u03bb\u0304, \u03b1, and \u03be in accordance with the theory of Section 3 so that (\u03a6, r, \u03bd) yielded (C, \u03b3)\nsecond moments. In particular, we chose \u03b1 = \u03b3/3, \u03bb\u0304 = 1 - \u03b1/2, and \u03be = 4\u03b1/(2 + \u03b1). Except for\nthe importance sample efficiency experiments, where we varied \u03b3 explicitly, all experiments used\n\u03b3 = 1/4. Let dmed_u denote the estimated median of the distance between data points under the\nu-norm, where the estimate is based on using a small subsample of the full dataset. For L2 SechExp,\nwe took a^{-1} = \u221a(2\u03c0) dmed_1, except in the sample quality experiments where we set a^{-1} = \u221a(2\u03c0).\nFinding hyperparameter settings for the L1 IMQ that were stable across dimension and appropriately\ncontrolled the size for goodness-of-fit testing required some care. However, we can offer some basic\nguidelines. We recommend choosing \u03be = D/(D + df), which ensures \u03bd has df degrees of freedom.\nWe specifically suggest using df \u2208 [0.5, 3] so that \u03bd is heavy-tailed no matter the dimension. For\nmost experiments we took \u03b2 = -1/2, c = 4dmed_2, and df = 0.5. The exceptions were in the sample\nquality experiments, where we set c = 1, and the restricted Boltzmann machine testing experiment,\nwhere we set c = 10dmed_2 and df = 2.5. For goodness-of-fit testing, we expect appropriate choices\nfor c and df will depend on the properties of the null distribution.\n\n\u2074See https://bitbucket.org/jhhuggins/random-feature-stein-discrepancies for our code.\n\nImportance sample efficiency To validate the importance\nsample efficiency theory from Sections 3.2 and 3.3,\nwe calculated P[R\u03a6SD > \u03a6SD/4] as the importance\nsample size M was increased. We considered choices\nof the parameters for L2 SechExp and L1 IMQ that\nproduced (C, \u03b3) second moments for varying choices\nof \u03b3. The results, shown in Figs. 
1a and 1b, indicate\ngreater sample efficiency for L1 IMQ than L2 SechExp. L1 IMQ is also more robust to the choice of\n\u03b3. Fig. 1c, which plots the values of M necessary to\nachieve stdev(R\u03a6SD)/\u03a6SD < 1/2, corroborates the\ngreater sample efficiency of L1 IMQ.\n\nFigure 3: Speed of IMQ KSD vs. R\u03a6SDs\nwith M = 10 importance sample points\n(dimension D = 10). Even for moderate\nsample sizes N, the R\u03a6SDs are orders of\nmagnitude faster than the KSD.\n\nComputational complexity We compared the computational\ncomplexity of the R\u03a6SDs (with M = 10) to\nthat of the IMQ KSD. We generated datasets of dimension\nD = 10 with the sample size N ranging from 500 to 5000. As seen in Fig. 3, even for moderate\ndataset sizes, the R\u03a6SDs are computed orders of magnitude faster than the KSD. Other R\u03a6SDs like\nFSSD and RFF obtain similar speed-ups; however, we will see the power benefits of the L1 IMQ and\nL2 SechExp R\u03a6SDs below.\n\nApproximate MCMC hyperparameter selection We follow the stochastic gradient Langevin\ndynamics [SGLD, 28] hyperparameter selection setup from Gorham and Mackey [10, Section 5.3].\nSGLD with constant step size \u03b5 is a biased MCMC algorithm that approximates the overdamped\nLangevin diffusion. No Metropolis-Hastings correction is used, and an unbiased estimate of the score\nfunction from a data subsample is calculated at each iteration. There is a bias-variance tradeoff in the\nchoice of step size parameter: the stationary distribution of SGLD deviates more from its target as \u03b5\ngrows, but as \u03b5 gets smaller the mixing speed of SGLD decreases. Hence, an appropriate choice of \u03b5\nis critical for accurate posterior inference. We target the bimodal Gaussian mixture model (GMM)\nposterior of Welling and Teh [28] and compare the step size selection made by the two R\u03a6SDs to\nthat of IMQ KSD [11] when N = 1000. Fig. 2a shows that L1 IMQ and L2 SechExp agree with\nIMQ KSD (selecting \u03b5 = .005) even with just M = 10 importance samples. 
L1 IMQ continues to\nselect \u03b5 = .005 while L2 SechExp settles on \u03b5 = .01, although the value for \u03b5 = .005 is only slightly\nlarger. Fig. 2b compares the choices of \u03b5 = .005 and .01 to smaller and larger values of \u03b5. The\nvalues of M considered all represent substantial reductions in computation as the R\u03a6SD replaces the\nDN(N + 1)/2 KSD kernel evaluations of the form ((T_d \u2297 T\u0305_d)k)(x_n, x_{n'}) with only DNM feature\nfunction evaluations of the form (T_d \u03a6)(x_n, z_m).\n\nFigure 4: Quadratic-time KSD and linear-time R\u03a6SD, FSSD, and RFF goodness-of-fit tests with\nM = 10 importance sample points (see Section 5 for more details). All experiments used N = 1000\nexcept the multivariate t, which used N = 2000. (a) Size of tests for Gaussian null. (b, c, d) Power\nof tests. Both R\u03a6SDs offer competitive performance. Panels: (a) Gaussian null, (b) Gaussian vs. Laplace, (c) Gauss vs. multivariate t, (d) RBM.\n\nGoodness-of-fit testing Finally, we investigated the performance of R\u03a6SDs for goodness-of-fit\ntesting. In our first two experiments we used a standard multivariate Gaussian p(x) = N(x | 0, I) as\nthe null distribution while varying the dimension of the data. We explored the power of R\u03a6SD-based\ntests compared to FSSD [16] (using the default settings in their code), RFF [22] (Gaussian and Cauchy\nkernels with bandwidth = dmed_2), and KSD-based tests [7, 11, 19] (Gaussian kernel with bandwidth\n= dmed_2 and IMQ kernel \u03a8^{IMQ}_{1,-1/2}). We did not consider other linear-time KSD approximations due\nto relatively poor empirical performance [16]. There are two types of FSSD tests: FSSD-rand uses\nrandom sample locations and fixed hyperparameters while FSSD-opt uses a small subset of the\ndata to optimize sample locations and hyperparameters for a power criterion. All linear-time tests\nused M = 10 features. The target level was \u03b1 = 0.05. 
For each dimension D and R\u03a6SD-based\ntest, we chose the nominal test level by generating 200 p-values from the Gaussian asymptotic null,\nthen setting the nominal level to the minimum of \u03b1 and the 5th percentile of the generated p-values.\nAll other tests had nominal level \u03b1. We verified the size of the FSSD, RFF, and R\u03a6SD-based\ntests by generating 1000 p-values for each experimental setting in the Gaussian case (see Fig. 4a).\nOur first experiment replicated the Gaussian vs. Laplace experiment of Jitkrittum et al. [16] where,\nunder the alternative hypothesis, q(x) = \u220f_{d=1}^{D} Lap(x_d | 0, 1/\u221a2), a product of Laplace distributions\nwith variance 1 (see Fig. 4b). Our second experiment, inspired by the Gaussian vs. multivariate t\nexperiment of Chwialkowski et al. [7], tested the alternative in which q(x) = T(x | 0, 5), a standard\nmultivariate t-distribution with 5 degrees of freedom (see Fig. 4c). Our final experiment replicated the\nrestricted Boltzmann machine (RBM) experiment of Jitkrittum et al. [16] in which each entry of the\nmatrix used to define the RBM was perturbed by independent additive Gaussian noise (see Fig. 4d).\nThe amount of noise was varied from \u03c3_per = 0 (that is, the null held) up to \u03c3_per = 0.06. The L1\nIMQ test performed well across all dimensions and experiments, with power of at least 0.93 in almost\nall experiments. The only exceptions were the Laplace experiment with D = 20 (power \u2248 0.88) and\nthe RBM experiment with \u03c3_per = 0.02 (power \u2248 0.74). The L2 SechExp test performed comparably\nto or better than the FSSD and RFF tests. Despite theoretical issues, the Cauchy RFF was competitive\nwith the other linear-time methods, except for the superior L1 IMQ. 
Given its superior power control\nand computational efficiency, we recommend the L1 IMQ over the L2 SechExp.\n\n6 Discussion and related work\n\nIn this paper, we have introduced feature Stein discrepancies, a family of computable Stein discrepancies\nthat can be cheaply approximated using importance sampling. Our stochastic approximations,\nrandom feature Stein discrepancies (R\u03a6SDs), combine the computational benefits of linear-time discrepancy\nmeasures with the convergence-determining properties of quadratic-time Stein discrepancies.\nWe validated the benefits of R\u03a6SDs on two applications where kernel Stein discrepancies have shown\nexcellent performance: measuring sample quality and goodness-of-fit testing. Empirically, the L1\nIMQ R\u03a6SD performed particularly well: it outperformed existing linear-time KSD approximations\nin high dimensions and performed as well as or better than the state-of-the-art quadratic-time KSDs.\nR\u03a6SDs could also be used as drop-in replacements for KSDs in applications to Monte Carlo variance\nreduction with control functionals [21], probabilistic inference using Stein variational gradient\ndescent [18], and kernel quadrature [2, 3]. Moreover, the underlying principle used to generalize\nthe KSD could also be used to develop fast alternatives to maximum mean discrepancies in two-sample\ntesting applications [6, 13]. Finally, while we focused on the Langevin Stein operator, our\ndevelopment is compatible with any Stein operator, including diffusion Stein operators [12].\n\nAcknowledgments\nPart of this work was done while JHH was a research intern at MSR New England.\n\nReferences\n[1] M. Abramowitz and I. Stegun, editors. Handbook of Mathematical Functions. Dover Publications,\n1964.\n\n[2] F. Bach. On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions.\nJournal of Machine Learning Research, 18:1\u201338, 2017.\n\n[3] F.-X. Briol, C. J. Oates, J. Cockayne, W. Y. Chen, and M. A. 
Girolami. On the Sampling\nProblem for Kernel Quadrature. In International Conference on Machine Learning, 2017.\n\n[4] C. Carmeli, E. De Vito, A. Toigo, and V. Umanit\u00e0. Vector valued reproducing kernel Hilbert\nspaces and universality. Analysis and Applications, 8(01):19\u201361, 2010.\n\n[5] F. Chung and L. Lu. Complex Graphs and Networks, volume 107. American Mathematical\nSociety, Providence, Rhode Island, 2006.\n\n[6] K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast Two-Sample Testing\nwith Analytic Representations of Probability Measures. In Advances in Neural Information\nProcessing Systems, 2015.\n\n[7] K. Chwialkowski, H. Strathmann, and A. Gretton. A Kernel Test of Goodness of Fit. In\nInternational Conference on Machine Learning, 2016.\n\n[8] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and\nRelated Fields, 166(3-4):851\u2013886, 2016.\n\n[9] C. J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. In Computing Science and\nStatistics, Proceedings of the 23rd Symposium on the Interface, pages 156\u2013163, 1991.\n\n[10] J. Gorham and L. Mackey. Measuring Sample Quality with Stein\u2019s Method. In Advances in\nNeural Information Processing Systems, 2015.\n\n[11] J. Gorham and L. Mackey. Measuring Sample Quality with Kernels. In International Conference\non Machine Learning, 2017.\n\n[12] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring Sample Quality with\nDiffusions. arXiv.org, Nov. 2016, 1611.06972v3.\n\n[13] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. J. Smola. A Kernel Two-Sample\nTest. Journal of Machine Learning Research, 13:723\u2013773, 2012.\n\n[14] R. Herb and P. Sally Jr. The Plancherel formula, the Plancherel theorem, and the Fourier\ntransform of orbital integrals. 
In Representation Theory and Mathematical Physics: Conference\nin Honor of Gregg Zuckerman\u2019s 60th Birthday, October 24\u201327, 2009, Yale University, volume\n557, page 1. American Mathematical Soc., 2011.\n\n[15] J. Honorio and Y.-J. Li. The Error Probability of Random Fourier Features is Dimensionality\nIndependent. arXiv.org, Oct. 2017, 1710.09953v1.\n\n[16] W. Jitkrittum, W. Xu, Z. Szab\u00f3, K. Fukumizu, and A. Gretton. A Linear-Time Kernel Goodness-of-Fit\nTest. In Advances in Neural Information Processing Systems, 2017.\n\n[17] Q. Liu and J. D. Lee. Black-box Importance Sampling. In International Conference on Artificial\nIntelligence and Statistics, 2017.\n\n[18] Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference\nAlgorithm. In Advances in Neural Information Processing Systems, 2016.\n\n[19] Q. Liu, J. D. Lee, and M. I. Jordan. A Kernelized Stein Discrepancy for Goodness-of-fit Tests\nand Model Evaluation. In International Conference on Machine Learning, 2016.\n\n[20] A. M\u00fcller. Integral probability metrics and their generating classes of functions. Ann. Appl.\nProbab., 29(2):429\u2013443, 1997.\n\n[21] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration.\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695\u2013718,\n2017.\n\n[22] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in\nNeural Information Processing Systems, 2007.\n\n[23] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based\nand RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263\u20132291,\n2013.\n\n[24] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New\nYork, 1980.\n\n[25] B. K. Sriperumbudur and Z. Szab\u00f3. Optimal rates for random Fourier features. 
In Advances in\nNeural Information Processing Systems, pages 1144\u20131152, 2015.\n\n[26] D. J. Sutherland and J. Schneider. On the error of random Fourier features. In Proceedings of\nthe Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862\u2013871, 2015.\n\n[27] D. Wang and Q. Liu. Learning to Draw Samples - With Application to Amortized MLE for\nGenerative Adversarial Learning. arXiv, stat.ML, 2016.\n\n[28] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In\nInternational Conference on Machine Learning, 2011.\n\n[29] H. Wendland. Scattered Data Approximation. Cambridge University Press, New York, NY,\n2005.\n\n[30] J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample\ntest. Neural Computation, 27(6):1345\u20131372, 2015.\n", "award": [], "sourceid": 959, "authors": [{"given_name": "Jonathan", "family_name": "Huggins", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Lester", "family_name": "Mackey", "institution": "Microsoft Research"}]}