{"title": "Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 37, "page_last": 45, "abstract": "Monte Carlo sampling for Bayesian posterior inference is a common approach used in machine learning. The Markov Chain Monte Carlo procedures that are used are often discrete-time analogues of associated stochastic differential equations (SDEs). These SDEs are guaranteed to leave invariant the required posterior distribution. An area of current research addresses the computational benefits of stochastic gradient methods in this setting. Existing techniques rely on estimating the variance or covariance of the subsampling error, and typically assume constant variance. In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. The proposed method achieves a substantial speedup over popular alternative schemes for large-scale machine learning applications.", "full_text": "Covariance-Controlled Adaptive Langevin\n\nThermostat for Large-Scale Bayesian Sampling\n\nXiaocheng Shang\u2217\nUniversity of Edinburgh\nx.shang@ed.ac.uk\n\nZhanxing Zhu\u2217\n\nUniversity of Edinburgh\n\nzhanxing.zhu@ed.ac.uk\n\nBenedict Leimkuhler\nUniversity of Edinburgh\n\nb.leimkuhler@ed.ac.uk\n\nAmos J. Storkey\n\nUniversity of Edinburgh\na.storkey@ed.ac.uk\n\nAbstract\n\nMonte Carlo sampling for Bayesian posterior inference is a common approach\nused in machine learning. The Markov chain Monte Carlo procedures that are\nused are often discrete-time analogues of associated stochastic differential equa-\ntions (SDEs). These SDEs are guaranteed to leave invariant the required posterior\ndistribution. An area of current research addresses the computational bene\ufb01ts of\nstochastic gradient methods in this setting. 
Existing techniques rely on estimating\nthe variance or covariance of the subsampling error, and typically assume constant\nvariance. In this article, we propose a covariance-controlled adaptive Langevin\nthermostat that can effectively dissipate parameter-dependent noise while main-\ntaining a desired target distribution. The proposed method achieves a substantial\nspeedup over popular alternative schemes for large-scale machine learning appli-\ncations.\n\n1\n\nIntroduction\n\nIn machine learning applications, direct sampling with the entire large-scale dataset is computation-\nally infeasible. For instance, standard Markov chain Monte Carlo (MCMC) methods [16], as well\nas typical hybrid Monte Carlo (HMC) methods [3, 6, 9], require the calculation of the acceptance\nprobability and the creation of informed proposals based on the whole dataset.\nIn order to improve the computational ef\ufb01ciency, a number of stochastic gradient methods [4, 5, 20,\n21] have been proposed in the setting of Bayesian sampling based on random (and much smaller)\nsubsets to approximate the likelihood of the whole dataset, thus substantially reducing the com-\nputational cost in practice. Welling and Teh proposed the so-called stochastic gradient Langevin\ndynamics (SGLD) [21], combining the ideas of stochastic optimization [18] and traditional Brow-\nnian dynamics, with a sequence of stepsizes decreasing to zero. A \ufb01xed stepsize is often adopted\nin practice which is the choice in this article as in Vollmer et al. [20], where a modi\ufb01ed SGLD\n(mSGLD) was also introduced that was designed to reduce the sampling bias.\nSGLD generates samples from \ufb01rst order Brownian dynamics, and thus, with a \ufb01xed timestep, one\ncan show that it is unable to dissipate excess noise in gradient approximations while maintaining the\ndesired invariant distribution [4]. A stochastic gradient Hamiltonian Monte Carlo (SGHMC) method\nwas proposed by Chen et al. 
[4], which relies on second order Langevin dynamics and incorporates a\nparameter-dependent diffusion matrix that is intended to effectively offset the stochastic perturbation\nof the gradient. However, it is dif\ufb01cult to accommodate the additional diffusion term in practice.\n\n\u2217The \ufb01rst and second authors contributed equally, and the listed author order was decided by lot.\n\n1\n\n\fMoreover, as pointed out in [5], poor estimation of it may have a signi\ufb01cant adverse in\ufb02uence on the\nsampling of the target distribution; for example, the effective system temperature may be altered.\nThe \u201cthermostat\u201d idea, which is widely used in molecular dynamics [7, 13], was recently adopted\nin the stochastic gradient Nos\u00b4e-Hoover thermostat (SGNHT) by Ding et al. [5] in order to adjust\nthe kinetic energy during simulation in such a way that the canonical ensemble is preserved (i.e. so\nthat a prescribed constant temperature distribution is maintained). In fact, the SGNHT method is\nessentially equivalent to the adaptive Langevin (Ad-Langevin) thermostat proposed earlier by Jones\nand Leimkuhler [10] in the molecular dynamics setting (see [15] for discussions).\nDespite the substantial interest generated by these methods,\nthe mathematical foundation for\nstochastic gradient methods has been incomplete. The underlying dynamics of the SGNHT\nmethod [5] was taken up by Leimkuhler and Shang [15], together with the design of discretiza-\ntion schemes with high effective order of accuracy. SGNHT methods are designed based on the\nassumption of constant noise variance. In this article, we propose a covariance-controlled adaptive\nLangevin (CCAdL) thermostat, that can handle parameter-dependent noise, improving both robust-\nness and reliability in practice, and which can effectively speed up the convergence to the desired\ninvariant distribution in large-scale machine learning applications.\nThe rest of the article is organized as follows. 
In Section 2, we describe the setting of Bayesian sampling with noisy gradients and briefly review existing techniques. Section 3 considers the construction of the novel CCAdL method that can effectively dissipate parameter-dependent noise while maintaining the correct distribution. Various numerical experiments are performed in Section 4 to verify the usefulness of CCAdL in a wide range of large-scale machine learning applications. Finally, we summarize our findings in Section 5.

2 Bayesian Sampling with Noisy Gradients

In the typical setting of Bayesian sampling [3, 19], one is interested in drawing states from a posterior distribution defined as

π(θ|X) ∝ π(X|θ)π(θ) ,   (1)

where θ ∈ R^{N_d} is the parameter vector of interest, X denotes the entire dataset, and π(X|θ) and π(θ) are the likelihood and prior distributions, respectively. We introduce a potential energy function U(θ) by defining π(θ|X) ∝ exp(−βU(θ)), where β is a positive parameter and can be interpreted as being proportional to the reciprocal temperature in an associated physical system, i.e. β⁻¹ = k_B T (k_B is the Boltzmann constant and T is the temperature). In practice, β is often set to be unity for notational simplicity. Taking the logarithm of (1) yields

U(θ) = −log π(X|θ) − log π(θ) .   (2)

Assuming the data are independent and identically distributed (i.i.d.), the logarithm of the likelihood can be calculated as

log π(X|θ) = Σ_{i=1}^{N} log π(x_i|θ) ,   (3)

where N is the size of the entire dataset.
However, as already mentioned, it is computationally infeasible to deal with the entire large-scale dataset at each timestep as would typically be required in MCMC and HMC methods. Instead, in order to improve the efficiency, a random (and much smaller, i.e. 
n ≪ N) subset is preferred in stochastic gradient methods, in which the likelihood of the dataset for given parameters is approximated by

log π(X|θ) ≈ (N/n) Σ_{i=1}^{n} log π(x_{r_i}|θ) ,   (4)

where {x_{r_i}}_{i=1}^{n} represents a random subset of X. Thus, the "noisy" potential energy can be written as

Ũ(θ) = −(N/n) Σ_{i=1}^{n} log π(x_{r_i}|θ) − log π(θ) ,   (5)

where the negative gradient of the potential is referred to as the "noisy" force, i.e. F̃(θ) = −∇Ũ(θ).
Our goal is to correctly sample the Gibbs distribution ρ(θ) ∝ exp(−βU(θ)) (1). As in [4, 5], the gradient noise is assumed to be Gaussian with mean zero and unknown variance, in which case one may rewrite the noisy force as

F̃(θ) = −∇U(θ) + √(Σ(θ)) M^{1/2} R ,   (6)

where M typically is a diagonal matrix, Σ(θ) represents the covariance matrix of the noise, and R is a vector of i.i.d. standard normal random variables. Note that √(Σ(θ)) M^{1/2} R here is equivalent to N(0, Σ(θ)M).
In a typical setting of numerical integration with associated stepsize h, one has

h F̃(θ) = h(−∇U(θ) + √(Σ(θ)) M^{1/2} R) = −h∇U(θ) + (√(hΣ(θ))) M^{1/2} (√h R) ,   (7)

and therefore, assuming a constant covariance matrix (i.e. Σ = σ²I, where I is the identity matrix), the SGNHT method by Ding et al. 
[5] has the following underlying dynamics, written as a standard Itō stochastic differential equation (SDE) system [15]:

dθ = M⁻¹ p dt ,
dp = −∇U(θ) dt + σ√h M^{1/2} dW − ξ p dt + √(2Aβ⁻¹) M^{1/2} dW_A ,   (8)
dξ = μ⁻¹ [pᵀM⁻¹p − N_d k_B T] dt ,

where, colloquially, dW and dW_A represent vectors of independent Wiener increments, often informally denoted by N(0, dt I) [4]. The coefficient √(2Aβ⁻¹) M^{1/2} represents the strength of artificial noise added into the system to improve ergodicity, and A, which can be termed the "effective friction", is a positive parameter proportional to the variance of the noise. The auxiliary variable ξ ∈ R is governed by a Nosé-Hoover device [8, 17] via a negative feedback mechanism: when the instantaneous temperature (average kinetic energy per degree of freedom), calculated as

k_B T = pᵀM⁻¹p / N_d ,   (9)

is below the target temperature, the "dynamical friction" ξ decreases, allowing an increase of temperature, while ξ increases when the temperature is above the target. μ is a coupling parameter which is referred to as the "thermal mass" in the molecular dynamics setting.
Proposition 1 (See Jones and Leimkuhler [10]). 
The SGNHT method (8) preserves the modified Gibbs (stationary) distribution:

ρ̃_β(θ, p, ξ) = Z⁻¹ exp(−βH(θ, p)) exp(−βμ(ξ − ξ̄)²/2) ,   (10)

where Z is the normalizing constant, H(θ, p) = pᵀM⁻¹p/2 + U(θ) is the Hamiltonian, and

ξ̄ = A + βhσ²/2 .   (11)

Proposition 1 tells us that the SGNHT method can adaptively dissipate excess noise pumped into the system while maintaining the correct distribution. The variance of the gradient noise, σ², does not need to be known a priori. As long as σ² is constant, the auxiliary variable ξ will be able to automatically find its mean value ξ̄ on the fly. However, with a parameter-dependent covariance matrix Σ(θ), the SGNHT method (8) would not produce the required target distribution (10).
Ding et al. [5] claimed that it is reasonable to assume the covariance matrix Σ(θ) is constant when the size of the dataset, N, is large, in which case the variance of the posterior of θ is small. The magnitude of the posterior variance, however, has no bearing on whether Σ is constant, and in general Σ is not constant. Simply assuming constancy of Σ when it in fact varies can have a significant impact on the performance of the method (most notably on the stability, as measured by the largest usable stepsize). Therefore, it is essential to have an approach that can handle parameter-dependent noise. In the following section, we propose a covariance-controlled thermostat that can effectively dissipate parameter-dependent noise while maintaining the target stationary distribution.

3 Covariance-Controlled Adaptive Langevin Thermostat

As mentioned in the previous section, the SGNHT method (8) can only dissipate noise with a constant covariance matrix. 
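For concreteness, one naive Euler-type discretization step of the SGNHT dynamics (8) can be sketched as follows. This is a minimal illustration only (not the higher-order schemes of [15]), with M = I, β = 1, and thermal mass μ = N_d; `grad_noisy` stands for any stochastic-gradient oracle, and all function and variable names are ours:

```python
import numpy as np

def sgnht_step(theta, p, xi, grad_noisy, h=1e-3, A=1.0, rng=None):
    """One Euler-type step of the SGNHT dynamics (8), with M = I, beta = 1,
    and thermal mass mu = Nd (so the target kinetic temperature is 1)."""
    rng = np.random.default_rng() if rng is None else rng
    Nd = theta.size
    theta = theta + p * h                                  # d(theta) = p dt
    p = (p
         - grad_noisy(theta) * h                           # noisy force
         - xi * p * h                                      # adaptive friction
         + np.sqrt(2.0 * A * h) * rng.standard_normal(Nd))  # injected noise
    # Nose-Hoover negative feedback: compare p^T p / Nd with the target temperature 1
    xi = xi + (p @ p / Nd - 1.0) * h
    return theta, p, xi
```

With a constant-variance noisy gradient, ξ drifts toward the mean value ξ̄ of (11) on the fly, consistent with Proposition 1.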
When the covariance matrix becomes parameter-dependent, the required "thermal equilibrium" can in general no longer be maintained, i.e. the system cannot be expected to converge to the desired invariant distribution (10), typically resulting in poor estimation of functions of the parameters of interest. In fact, in that case it is not clear whether or not there exists an invariant distribution at all.
In order to construct a stochastic-dynamical system that preserves the canonical distribution, we suggest adding a suitable damping (viscous) term to effectively dissipate the parameter-dependent gradient noise. To this end, we propose the following covariance-controlled adaptive Langevin (CCAdL) thermostat:

dθ = M⁻¹ p dt ,
dp = −∇U(θ) dt + √(hΣ(θ)) M^{1/2} dW − (h/2)βΣ(θ) p dt − ξ p dt + √(2Aβ⁻¹) M^{1/2} dW_A ,   (12)
dξ = μ⁻¹ [pᵀM⁻¹p − N_d k_B T] dt .

Proposition 2. The CCAdL thermostat (12) preserves the modified Gibbs (stationary) distribution:

ρ̂_β(θ, p, ξ) = Z⁻¹ exp(−βH(θ, p)) exp(−βμ(ξ − A)²/2) .   (13)

Proof. The Fokker-Planck equation corresponding to (12) is

ρ_t = L†ρ := −M⁻¹p·∇_θρ + ∇U(θ)·∇_pρ + (h/2)∇_p·(Σ(θ)M∇_pρ) + (h/2)β∇_p·(Σ(θ)pρ) + ξ∇_p·(pρ) + Aβ⁻¹∇_p·(M∇_pρ) − μ⁻¹[pᵀM⁻¹p − N_d k_B T]∇_ξρ .

Just insert ρ̂_β (13) into the Fokker-Planck operator L† to see that it vanishes.
The incorporation of the parameter-dependent covariance matrix Σ(θ) in (12) is intended to offset the covariance matrix coming from the gradient approximation. 
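Discretized in the same naive Euler fashion, a CCAdL step differs from an SGNHT step only by the extra covariance-controlled damping term −(h/2)βΣ(θ)p. A minimal sketch with our own naming (M = I, β = 1, μ = N_d; `sigma_hat` is the current diagonal estimate of Σ(θ) and `grad_noisy` any stochastic-gradient oracle):

```python
import numpy as np

def ccadl_step(theta, p, xi, grad_noisy, sigma_hat, h=1e-3, A=1.0, rng=None):
    """One Euler-type step of the CCAdL thermostat (12), with M = I, beta = 1.

    sigma_hat: current (diagonal) estimate of the gradient-noise covariance
    Sigma(theta); the -(h/2) * sigma_hat * p term offsets that noise."""
    rng = np.random.default_rng() if rng is None else rng
    Nd = theta.size
    theta = theta + p * h
    p = (p
         - grad_noisy(theta) * h
         - 0.5 * h * sigma_hat * p * h          # covariance-controlled damping
         - xi * p * h                           # adaptive (Nose-Hoover) friction
         + np.sqrt(2.0 * A * h) * rng.standard_normal(Nd))
    xi = xi + (p @ p / Nd - 1.0) * h            # mu = Nd, target temperature 1
    return theta, p, xi
```

Setting `sigma_hat` to zero recovers the SGNHT update, which makes the role of the added damping term explicit.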
However, in practice, one does not know Σ(θ) a priori. Thus instead one must estimate Σ(θ) during the simulation, a task which will be addressed in Section 3.1. This procedure is related to the method used in the SGHMC method proposed by Chen et al. [4], which uses dynamics of the following form:

dθ = M⁻¹ p dt ,
dp = −∇U(θ) dt + √(hΣ(θ)) M^{1/2} dW − Ap dt + √(2β⁻¹(AI − βhΣ(θ)/2)) M^{1/2} dW_A .   (14)

It can be shown that the SGHMC method preserves the Gibbs canonical distribution:

ρ_β(θ, p) = Z⁻¹ exp(−βH(θ, p)) .   (15)

Although both CCAdL (12) and SGHMC (14) preserve their respective invariant distributions, let us note several advantages of the former over the latter in practice:

(i) CCAdL and SGHMC both require estimation of the covariance matrix Σ(θ) during simulation, which can be costly in high dimension. In numerical experiments, we have found that simply using the diagonal of the covariance matrix, at significantly reduced computational cost, works quite well in CCAdL. By contrast, it is difficult to find a suitable value of the parameter A in SGHMC, since one has to make sure the matrix AI − βhΣ(θ)/2 is positive semi-definite. One may attempt to use a large value of the "effective friction" A and/or a small stepsize h. However, too large a friction would essentially reduce SGHMC to SGLD, which is not desirable, as pointed out in [4], while an extremely small stepsize would significantly impact the computational efficiency.

(ii) Estimation of the covariance matrix Σ(θ) unavoidably introduces additional noise in both CCAdL and SGHMC. Nonetheless, CCAdL can still effectively control the system temperature (i.e. 
maintaining the correct distribution of the momenta) due to the use of the stabilizing Nosé-Hoover control, while in SGHMC, poor estimation of the covariance matrix may lead to significant deviations of the system temperature (as well as the distribution of the momenta), resulting in poor sampling of the parameters of interest.

3.1 Covariance Estimation of Noisy Gradients

Under the assumption that the noise of the stochastic gradient follows a normal distribution, we apply a similar method to that of [2] to estimate the covariance matrix associated with the noisy gradient. If we let g(θ; x) = ∇_θ log π(x|θ) and assume that the size of the subset n is large enough for the central limit theorem to hold, we have

(1/n) Σ_{i=1}^{n} g(θ_t; x_{r_i}) ∼ N( E_x[g(θ_t; x)], (1/n) I_t ) ,   (16)

where I_t = Cov[g(θ_t; x)] is the covariance of the gradient at θ_t.

Algorithm 1 Covariance-Controlled Adaptive Langevin (CCAdL) Thermostat
1: Input: h, A, {κ_t}_{t=1}^{T̂}.
2: Initialize θ_0, p_0, I_0, and ξ_0 = A.
3: for t = 1, 2, . . . , T̂ do
4:   θ_t = θ_{t−1} + p_{t−1} h;
5:   Estimate Î_t using Eq. (18);
6:   p_t = p_{t−1} − ∇Ũ(θ_t) h − (h/2)(N²/n) Î_t p_{t−1} h − ξ_{t−1} p_{t−1} h + √(2Ah) N(0, I);
7:   ξ_t = ξ_{t−1} + (p_tᵀ p_t / N_d − 1) h;
8: end for

Given the noisy (stochastic) gradient based on the current subset, ∇Ũ(θ_t) = −(N/n) Σ_{i=1}^{n} g(θ_t; x_{r_i}) − ∇ log π(θ_t), and the clean (full) gradient ∇U(θ_t) = −Σ_{i=1}^{N} g(θ_t; x_i) − ∇ log π(θ_t), we have E_x[∇Ũ(θ_t)] = E_x[∇U(θ_t)], and thus

∇Ũ(θ_t) = ∇U(θ_t) + N( 0, (N²/n) I_t ) ,   (17)

i.e. 
Σ(θ_t) = N² I_t / n. Assuming θ_t does not change dramatically over time, we use the moving average update to estimate I_t:

Î_t = (1 − κ_t) Î_{t−1} + κ_t V(θ_t) ,   (18)

where κ_t = 1/t and

V(θ_t) = 1/(n−1) Σ_{i=1}^{n} (g(θ_t; x_{r_i}) − ḡ(θ_t)) (g(θ_t; x_{r_i}) − ḡ(θ_t))ᵀ   (19)

is the empirical covariance of the gradient, where ḡ(θ_t) represents the mean gradient of the log likelihood computed from the subset. As proved in [2], this estimator has a convergence order of O(1/N).
As already mentioned, estimating the full covariance matrix is computationally infeasible in high dimension. However, we have found that employing a diagonal approximation of the covariance matrix (i.e. estimating the variance only along each dimension of the noisy gradient) works quite well in practice, as demonstrated in Section 4.
The procedure of the CCAdL method is summarized in Algorithm 1, where we simply used M = I, β = 1, and μ = N_d in order to be consistent with the original implementation of SGNHT [5]. Note that this is a simple, first order (in terms of the stepsize) algorithm. A recent article [15] has introduced schemes with higher order of accuracy, but our interest here is in the direct comparison of the underlying machinery of SGHMC, SGNHT, and CCAdL, so we avoid further modifications and enhancements related to timestepping at this stage.
In the following section, we compare the newly established CCAdL method with SGHMC and SGNHT on various machine learning tasks to demonstrate the benefits of CCAdL in Bayesian sampling with a noisy gradient.

4 Numerical Experiments

4.1 Bayesian Inference for Gaussian Distribution

We first compare the performance of the newly established CCAdL method with SGHMC and SGNHT for a simple task using synthetic data, i.e. 
Bayesian inference of both the mean and variance of a one-dimensional normal distribution. We apply the same experimental setting as in [5]. We generated N = 100 samples from a standard normal distribution N(0, 1). We used the likelihood function N(x_i|μ, γ⁻¹) and assigned a Normal-Gamma distribution as the prior, i.e. μ, γ ∼ N(μ|0, γ) Gam(γ|1, 1). Then the corresponding posterior distribution is another Normal-Gamma distribution, i.e. (μ, γ)|X ∼ N(μ|μ_N, (κ_N γ)⁻¹) Gam(γ|α_N, β_N), with

μ_N = N x̄ / (1 + N) ,   κ_N = 1 + N ,   α_N = 1 + N/2 ,   β_N = 1 + (1/2) Σ_{i=1}^{N} (x_i − x̄)² + N x̄² / (2(1 + N)) ,

where x̄ = Σ_{i=1}^{N} x_i / N. A random subset of size n = 10 was selected at each timestep to approximate the full gradient, resulting in the following stochastic gradients:

∇_μ Ũ = (N + 1) μγ − (γN/n) Σ_{i=1}^{n} x_{r_i} ,
∇_γ Ũ = 1 − (N + 1)/(2γ) + μ²/2 + (N/(2n)) Σ_{i=1}^{n} (x_{r_i} − μ)² .

It can be seen that the variance of the stochastic gradient noise is no longer constant and actually depends on the size of the subset, n, and the values of μ and γ in each iteration. This directly violates the constant noise variance assumption of SGNHT [5], while CCAdL adjusts to the varying noise variance.
The marginal distributions of μ and γ obtained from the various methods with different combinations of h and A were compared and plotted in Figure 1, with Table 1 listing the corresponding root mean square error (RMSE) of the distribution and the autocorrelation time from 10^6 samples. 
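These stochastic gradients are easy to check numerically: taking the subset to be the full dataset must recover the full-data gradient. A small illustration with our own function and variable names:

```python
import numpy as np

def noisy_grads(mu, gamma, x, n, rng):
    """Stochastic gradients of U for the Normal-Gamma model, subset size n."""
    N = x.size
    sub = rng.choice(x, size=n, replace=False)   # random subset of the data
    g_mu = (N + 1) * mu * gamma - gamma * N / n * sub.sum()
    g_gamma = (1 - (N + 1) / (2 * gamma) + mu**2 / 2
               + N / (2 * n) * ((sub - mu) ** 2).sum())
    return g_mu, g_gamma

def full_grads(mu, gamma, x):
    """Clean (full-data) gradients, i.e. the n = N case."""
    N = x.size
    g_mu = (N + 1) * mu * gamma - gamma * x.sum()
    g_gamma = (1 - (N + 1) / (2 * gamma) + mu**2 / 2
               + ((x - mu) ** 2).sum() / 2)
    return g_mu, g_gamma
```

For n < N, repeated draws of the subset make the spread of `noisy_grads` around `full_grads` visible, and that spread varies with μ and γ, which is exactly the parameter-dependent noise discussed above.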
In most of the cases, both SGNHT and CCAdL easily outperform the SGHMC method, possibly due to the presence of the Nosé-Hoover device, with SGHMC only showing superiority for a small value of h and a large value of A, neither of which is desirable in practice, as discussed in Section 3. Between SGNHT and the newly proposed CCAdL method, the latter achieves better performance in each of the cases investigated, highlighting the importance of the covariance control with parameter-dependent noise.

Figure 1: Comparisons of marginal distribution (density) of μ (top row) and γ (bottom row) with various values of h and A indicated in each column: (a) h = 0.001, A = 1; (b) h = 0.001, A = 10; (c) h = 0.01, A = 1; (d) h = 0.01, A = 10. The peak region is highlighted in the inset.

Table 1: Comparisons of (RMSE, Autocorrelation time) of (μ, γ) of various methods for Bayesian inference of the mean and variance of a Gaussian distribution.

Methods | h = 0.001, A = 1 | h = 0.001, A = 10 | h = 0.01, A = 1 | h = 0.01, A = 10
SGHMC | (0.0148, 236.12) | (0.0029, 333.04) | (0.0531, 29.78) | (0.0132, 39.33)
SGNHT | (0.0037, 238.32) | (0.0035, 406.71) | (0.0044, 26.71) | (0.0043, 55.00)
CCAdL | (0.0034, 238.06) | (0.0031, 402.45) | (0.0021, 26.71) | (0.0035, 54.43)

4.2 Large-scale Bayesian Logistic Regression

We then consider a Bayesian logistic regression model trained on the benchmark MNIST dataset for binary classification of digits 7 and 9, using 12,214 training data points with a test set of size 2037. A 100-dimensional random projection of the original features was used. We used the likelihood function π({x_i, y_i}_{i=1}^{N}|w) ∝ Π_{i=1}^{N} 1/(1 + exp(−y_i wᵀx_i)) and the prior distribution π(w) ∝ exp(−wᵀw/2). A subset of size n = 500 was used at each timestep. 
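For this model the noisy force (5) has a simple closed form; a minimal sketch under our own naming conventions (labels y_i ∈ {−1, +1}, rows of `X` are the projected features):

```python
import numpy as np

def noisy_force_logreg(w, X, y, n, rng):
    """Noisy potential gradient for Bayesian logistic regression with
    prior pi(w) ∝ exp(-w^T w / 2), using a random subset of size n."""
    N = X.shape[0]
    idx = rng.choice(N, size=n, replace=False)
    z = y[idx] * (X[idx] @ w)                    # margins on the subset
    # -grad log-likelihood on the subset, rescaled by N/n, plus the prior term w
    return -(N / n) * ((y[idx] / (1.0 + np.exp(z))) @ X[idx]) + w
```

This is the `grad_noisy` oracle one would hand to a stochastic-gradient sampler for this experiment; with n = N it reduces to the clean full-data gradient.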
Since the dimensionality of this problem is not that high, a full covariance estimation was used for CCAdL.
We investigate in Figure 2 (top row) the convergence speed of each method by measuring the test log likelihood using the posterior mean against the number of passes over the entire dataset. CCAdL displays significant improvements over SGHMC and SGNHT with different values of h and A: (1) CCAdL converges much faster than the other two, which also indicates its faster mixing speed and shorter burn-in period; (2) CCAdL shows robustness to different values of the effective friction A, with SGHMC and SGNHT relying on a relatively large value of A (especially for the SGHMC method), which is intended to dominate the gradient noise.
To compare the sample quality obtained from each method, Figure 2 (bottom row) plots the two-dimensional marginal posterior distribution in randomly selected dimensions 2 and 5, based on 10^6 samples from each method after the burn-in period (i.e. we start to collect samples when the test log likelihood stabilizes). The true (reference) distribution was obtained by a sufficiently long run of standard HMC. We implemented 10 runs of standard HMC and found there was no variation between these runs, which guarantees its qualification as the true (reference) distribution. Again, CCAdL
Note that the contour of SGHMC does\nnot even \ufb01t in the region of the plot, and in fact it shows signi\ufb01cant deviation even in the estimation\nof the mean.\n\n(a) h = 0.2\u00d710\u22124\n\n(b) h = 0.5\u00d710\u22124\n\n(c) h = 1\u00d710\u22124\n\nFigure 2: Comparisons of Bayesian logistic regression of various methods on the MNIST dataset of digits 7\nand 9 with various values of h and A: (top row) test log likelihood using the posterior mean against the number\nof passes over the entire dataset; (bottom row) two-dimensional marginal posterior distribution in (randomly\nselected) dimensions 2 and 5 with A = 10 \ufb01xed, based on 106 samples from each method after the burn-in\nperiod (i.e. we start to collect samples when the test log likelihood stabilizes). Magenta circle is the true\n(reference) posterior mean obtained from standard HMC and crosses represent the sample means computed\nfrom various methods. Ellipses represent iso-probability contours covering 95% probability mass. Note that\nthe contour of SGHMC is well beyond the scale of the plot especially in the large stepsize regime, in which\ncase we do not include it here.\n4.3 Discriminative Restricted Boltzmann Machine (DRBM)\n\nDRBM [11] is a self-contained non-linear classi\ufb01er, and the gradient of its discriminative objective\ncan be explicitly computed. Due to the limited space, we refer the readers to [11] for more details.\nWe trained a DRBM on different large-scale multi-class datasets from the LIBSVM1 dataset col-\nlection, including connect-4, letter, and SensIT Vehicle acoustic. The detailed information of these\ndatasets are presented in Table 2.\nWe selected the number of hidden units using cross-validation to achieve their best results. Since the\ndimension of parameters, Nd, is relatively high, we used only diagonal covariance matrix estimation\nfor CCAdL to signi\ufb01cantly reduce the computational cost, i.e. estimating the variance only along\neach dimension. 
The size of the subset was chosen as 500–1000 to obtain a reasonable variance estimation. For each dataset, we chose the first 20% of the total number of passes over the entire dataset as the burn-in period and collected the remaining samples for prediction.

Table 2: Datasets used in DRBM with corresponding parameter configurations.

Datasets | training/test set | classes | features | hidden units | total number of parameters Nd
connect-4 | 54,046/13,511 | 3 | 126 | 20 | 2603
letter | 10,500/5,000 | 26 | 16 | 100 | 4326
acoustic | 78,823/19,705 | 3 | 50 | 20 | 1083

1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

The error rates computed by various methods on the test set using the posterior mean against the number of passes over the entire dataset are plotted in Figure 3. It can be observed that SGHMC and SGNHT only work well with a large value of the effective friction A, which corresponds to a strong random walk effect and thus slows down the convergence. On the contrary, CCAdL works reliably (much better than the other two) in a wide range of A, and more importantly in the large stepsize regime, which speeds up the convergence rate in relation to the computational work 
per-\nformed. It can be easily seen that the performance of SGHMC heavily relies on using a small value\nof h and a large value of A, which signi\ufb01cantly limits its usefulness in practice.\n\n(1a) connect-4, h = 0.5\u00d710\u22123\n\n(1b) connect-4, h = 1\u00d710\u22123\n\n(1c) connect-4, h = 2\u00d710\u22123\n\n(2a) letter, h = 1\u00d710\u22123\n\n(2b) letter, h = 2\u00d710\u22123\n\n(2c) letter, h = 5\u00d710\u22123\n\n(3a) acoustic, h = 0.2\u00d710\u22123\n\n(3b) acoustic, h = 0.5\u00d710\u22123\n\n(3c) acoustic, h = 1\u00d710\u22123\n\nFigure 3: Comparisons of DRBM on datasets connect-4 (top row), letter (middle row), and acoustic (bottom\nrow) with various values of h and A indicated: test error rates of various methods using the posterior mean\nagainst the number of passes over the entire dataset.\n\n5 Conclusions and Future Work\n\nIn this article, we have proposed a novel CCAdL formulation that can effectively dissipate\nparameter-dependent noise while maintaining a desired invariant distribution. CCAdL combines\nideas of SGHMC and SGNHT from the literature, but achieves signi\ufb01cant improvements over each\nof these methods in practice. The additional error introduced by covariance estimation is expected\nto be small in a relative sense, i.e. substantially smaller than the error arising from the noisy gradi-\nent. Our \ufb01ndings have been veri\ufb01ed in large-scale machine learning applications. In particular, we\nhave consistently observed that SGHMC relies on a small stepsize h and a large friction A, which\nsigni\ufb01cantly reduces its usefulness in practice as discussed. The techniques presented in this article\ncould be of use in more general settings of large-scale Bayesian sampling and optimization, which\nwe leave for future work.\nA naive nonsymmetric splitting method has been applied for CCAdL for fair comparison in this\narticle. 
However, we point out that the optimal design of splitting methods for ergodic SDE systems has been explored recently in the mathematics community [1, 13, 14]. Moreover, it has been shown in [15] that a certain type of symmetric splitting method for the Ad-Langevin/SGNHT method with a clean (full) gradient inherits the superconvergence property (i.e. fourth-order convergence to the invariant distribution for configurational quantities) recently demonstrated in the setting of Langevin dynamics [12, 14]. We leave further exploration of this direction in the context of noisy gradients for future work.

References

[1] A. Abdulle, G. Vilmart, and K. C. Zygalakis. Long time accuracy of Lie-Trotter splitting methods for Langevin dynamics. SIAM Journal on Numerical Analysis, 53(1):1-16, 2015.

[2] S. Ahn, A. Korattikara, and M. Welling.
Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1591-1598, 2012.

[3] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

[4] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, pages 1683-1691, 2014.

[5] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems 27, pages 3203-3211, 2014.

[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

[7] D. Frenkel and B. Smit. Understanding Molecular Simulation: From Algorithms to Applications, Second Edition. Academic Press, 2001.

[8] W. G. Hoover. Computational Statistical Mechanics, Studies in Modern Thermodynamics. Elsevier Science, 1991.

[9] A. M. Horowitz. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247-252, 1991.

[10] A. Jones and B. Leimkuhler. Adaptive stochastic methods for sampling driven molecular systems. The Journal of Chemical Physics, 135(8):084125, 2011.

[11] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, pages 536-543, 2008.

[12] B. Leimkuhler and C. Matthews. Rational construction of stochastic numerical methods for molecular sampling. Applied Mathematics Research eXpress, 2013(1):34-56, 2013.

[13] B. Leimkuhler and C. Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Springer, 2015.

[14] B. Leimkuhler, C. Matthews, and G. Stoltz.
The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics. IMA Journal of Numerical Analysis, 36(1):13-79, 2016.

[15] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 2016.

[16] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087, 1953.

[17] S. Nosé. A unified formulation of the constant temperature molecular dynamics methods. The Journal of Chemical Physics, 81(1):511, 1984.

[18] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(2):400-407, 1951.

[19] C. Robert and G. Casella. Monte Carlo Statistical Methods, Second Edition. Springer, 2004.

[20] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1501.00438, 2015.

[21] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681-688, 2011.