{"title": "SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 6245, "page_last": 6255, "abstract": "Uncertainty estimation in large deep-learning models is a computationally challenging\ntask, where it is difficult to form even a Gaussian approximation to the\nposterior distribution. In such situations, existing methods usually resort to a diagonal\napproximation of the covariance matrix despite the fact that these matrices\nare known to give poor uncertainty estimates. To address this issue, we propose\na new stochastic, low-rank, approximate natural-gradient (SLANG) method for\nvariational inference in large deep models. Our method estimates a \u201cdiagonal\nplus low-rank\u201d structure based solely on back-propagated gradients of the network\nlog-likelihood. This requires strictly less gradient computations than methods that\ncompute the gradient of the whole variational objective. 
Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.", "full_text": "SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

Aaron Mishkin*
University of British Columbia
Vancouver, Canada
amishkin@cs.ubc.ca

Frederik Kunstner*
Ecole Polytechnique Fédérale de Lausanne
Lausanne, Switzerland
frederik.kunstner@epfl.ch

Didrik Nielsen
RIKEN Center for AI Project
Tokyo, Japan
didrik.nielsen@riken.jp

Mark Schmidt
University of British Columbia
Vancouver, Canada
schmidtm@cs.ubc.ca

Mohammad Emtiyaz Khan
RIKEN Center for AI Project
Tokyo, Japan
emtiyaz.khan@riken.jp

Abstract

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to result in poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a “diagonal plus low-rank” structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly fewer gradient computations than methods that compute the gradient of the whole variational objective. 
Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.

1 Introduction

Deep learning has had enormous recent success in fields such as speech recognition and computer vision. In these problems, our goal is to predict well and we are typically less interested in the uncertainty behind the predictions. However, deep learning is now becoming increasingly popular in applications such as robotics and medical diagnostics, where accurate measures of uncertainty are crucial for reliable decisions. For example, uncertainty estimates are important for physicians who use automated diagnosis systems to choose effective and safe treatment options. Lack of such estimates may lead to decisions with disastrous consequences.

The goal of Bayesian deep learning is to provide uncertainty estimates by integrating over the posterior distribution of the parameters. Unfortunately, the complexity of deep learning models makes it infeasible to perform the integration exactly. Sampling methods such as stochastic-gradient Markov chain Monte Carlo [9] have been applied to deep models, but they usually converge slowly. They also require a large memory to store the samples and often need large preconditioners to mix well [2, 5, 32]. In contrast, variational inference (VI) methods require much less memory and can scale to large problems by exploiting stochastic gradient methods [7, 12, 28]. However, they often make crude simplifications, like the mean-field approximation, to reduce the memory and computation cost. This can result in poor uncertainty estimates [36]. Fast and accurate estimation of uncertainty for large models remains a challenging problem in Bayesian deep learning.

*Equal contributions. This work was conducted during an internship at the RIKEN Center for AI Project.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: This figure illustrates the advantages of the SLANG method over mean-field approaches on the USPS dataset (see Section 4.1 for experimental details). The figure on the left compares our structured covariance approximation with the one obtained by a full Gaussian approximation. For clarity, only off-diagonal entries are shown. We clearly see that our approximation becomes more accurate as the rank is increased. The figures on the right compare the means and variances (the diagonal of the covariance). The means match closely for all methods, but the variance is heavily underestimated by the mean-field method. SLANG’s covariance approximations do not suffer from this problem, which is likely due to the off-diagonal structure it learns.

In this paper, we propose a new variational inference method to estimate Gaussian approximations with a diagonal plus low-rank covariance structure. This gives more accurate and flexible approximations than the mean-field approach. Our method also enables fast estimation by using an approximate natural-gradient algorithm that builds the covariance estimate solely based on the back-propagated gradients of the network log-likelihood. We call our method stochastic low-rank approximate natural-gradient (SLANG). SLANG requires strictly fewer gradient computations than methods that require gradients of the variational objective obtained using the reparameterization trick [24, 26, 35]. 
Our empirical comparisons demonstrate the improvements obtained over mean-field methods (see Figure 1 for an example) and show that SLANG gives comparable results to the state-of-the-art on standard benchmarks. The code to reproduce the experimental results in this paper is available at https://github.com/aaronpmishkin/SLANG.

1.1 Related Work

Gaussian variational distributions with full covariance matrices have been used extensively for shallow models [6, 8, 16, 22, 24, 30, 34, 35]. Several efficient ways of computing the full covariance matrix are discussed by Seeger [31]. Other works have considered various structured covariance approximations, based on the Cholesky decomposition [8, 35], sparse covariance matrices [34], and low-rank plus diagonal structure [6, 26, 30]. Recently, several works [24, 26] have applied stochastic gradient descent on the variational objective to estimate such a structure. These methods often employ an adaptive learning-rate method, such as Adam or RMSprop, which increases the memory cost. All of these methods have only been applied to shallow models, and it remains unclear how they will perform (and whether they can be adapted) for deep models. Moreover, a natural-gradient method is preferable to gradient-based methods when optimizing the parameters of a distribution [3, 15, 18]. Our work shows that a natural-gradient method not only has better convergence properties, but also has lower computation and memory cost than gradient-based methods.

For deep models, a variety of methods have been proposed based on mean-field approximations. These methods optimize the variational objective using stochastic-gradient methods and differ from each other in the way they compute those gradients [7, 12, 14, 19, 28, 38]. 
They all give poor uncertainty estimates in the presence of strong posterior correlations and also shrink variances [36]. SLANG is designed to add extra covariance structure and to ensure better performance than mean-field approaches.

A few recent works have explored structured covariance approximations for deep models. In [38], the Kronecker-factored approximate curvature (K-FAC) method is applied to perform approximate natural-gradient VI. Another recent work has applied K-FAC to find a Laplace approximation [29]. However, the Laplace approximation can perform worse than variational inference in many scenarios, e.g., when the posterior distribution is not symmetric [25]. Other types of approximation methods include Bayesian dropout [10] and methods that use matrix-variate Gaussians [21, 33]. All of these approaches make structural assumptions that are different from our low-rank plus diagonal structure. However, similarly to our work, they provide new ways to improve the speed and accuracy of uncertainty estimation in deep learning.

2 Gaussian Approximation with Natural-Gradient Variational Inference

Our goal is to estimate the uncertainty in deep models using Bayesian inference. Given N data examples D = {D_i}_{i=1}^N, a Bayesian version of a deep model can be specified by using a likelihood p(D_i|θ) parametrized by a deep network with parameters θ ∈ R^D and a prior distribution p(θ). For simplicity, we assume that the prior is a Gaussian distribution, such as an isotropic Gaussian p(θ) = N(θ|0, (1/λ)I) with scalar precision parameter λ > 0. However, the methods presented in this paper can easily be modified to handle many other types of prior distributions. Given such a model, Bayesian approaches compute an estimate of uncertainty by using the posterior distribution p(θ|D) = p(D|θ)p(θ)/p(D). 
This requires computation of the marginal likelihood p(D) = ∫ p(D|θ)p(θ) dθ, which is a high-dimensional integral and difficult to compute.

Variational inference (VI) simplifies the problem by approximating p(θ|D) with a distribution q(θ). In this paper, our focus is on obtaining approximations that have a Gaussian form, i.e., q(θ) = N(θ|μ, Σ) with mean μ and covariance Σ. The parameters μ and Σ are referred to as the variational parameters and can be obtained by maximizing a lower bound on p(D) called the evidence lower bound (ELBO),

    ELBO: L(μ, Σ) := E_q[log p(D|θ)] - D_KL[q(θ) ‖ p(θ)],    (1)

where D_KL[·] denotes the Kullback-Leibler divergence.

A straightforward and popular approach to optimize L is to use stochastic gradient methods [24, 26, 28, 35]. However, natural gradients are preferable when optimizing the parameters of a distribution [3, 15, 18]. This is because natural-gradient methods perform optimization on the Riemannian manifold of the parameters, which can lead to a faster convergence when compared to gradient-based methods. Typically, natural-gradient methods are difficult to implement, but many easy-to-implement updates have been derived in recent works [15, 17, 19, 38]. We build upon the approximate natural-gradient method proposed in [19] and modify it to estimate structured covariance approximations. Specifically, we extend the Variational Online Gauss-Newton (VOGN) method [19]. 
This method uses the following update for μ and Σ (a derivation is in Appendix A),

    μ_{t+1} = μ_t - α_t Σ_{t+1} [ĝ(θ_t) + λμ_t],    Σ^{-1}_{t+1} = (1 - β_t) Σ^{-1}_t + β_t [Ĝ(θ_t) + λI],    (2)

    with ĝ(θ_t) := -(N/M) ∑_{i∈M} g_i(θ_t), and Ĝ(θ_t) := (N/M) ∑_{i∈M} g_i(θ_t) g_i(θ_t)^⊤,

where t is the iteration number, α_t, β_t > 0 are learning rates, θ_t ∼ N(θ|μ_t, Σ_t), g_i(θ_t) := ∇_θ log p(D_i|θ_t) is the back-propagated gradient obtained on the i-th data example, Ĝ(θ_t) is an empirical Fisher (EF) matrix, and M is a minibatch of M data examples. This update is an approximate natural-gradient update obtained by using the EF matrix as an approximation of the Hessian [23] in a method called Variational Online Newton (VON) [19]. This is explained in more detail in Appendix A. As discussed in [19], the VOGN method is an approximate natural-gradient update which may not have the same properties as the exact natural-gradient update. However, an advantage of the update (2) is that it only requires back-propagated gradients, which is a desirable feature when working with deep models.

The update (2) is computationally infeasible for large deep models because it requires the storage and inversion of the D × D covariance matrix. Storage takes O(D^2) memory and inversion requires O(D^3) computation, which makes the update very costly to perform for large models. We cannot form Σ or invert it when D is in the millions. Mean-field approximations avoid this issue by restricting Σ to be a diagonal matrix, but they often give poor Gaussian approximations. 
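To make the cost of the dense update concrete, here is a minimal NumPy sketch of one step of (2). This is an illustrative sketch, not the authors' code: the function name `vogn_step` and the argument `g_batch` (standing in for the per-example back-propagated gradients g_i(θ_t)) are hypothetical.

```python
import numpy as np

def vogn_step(mu, Sigma, g_batch, N, lam, alpha, beta):
    """One dense natural-gradient step as in update (2).

    g_batch: (M, D) array of per-example gradients g_i of the log-likelihood,
             evaluated at a sample theta ~ N(mu, Sigma).
    Forming and inverting the D x D precision costs O(D^3), which is exactly
    what SLANG avoids for large D.
    """
    M, D = g_batch.shape
    g_hat = -(N / M) * g_batch.sum(axis=0)        # stochastic gradient estimate
    G_hat = (N / M) * g_batch.T @ g_batch         # empirical Fisher matrix
    prec = (1 - beta) * np.linalg.inv(Sigma) + beta * (G_hat + lam * np.eye(D))
    Sigma_new = np.linalg.inv(prec)
    mu_new = mu - alpha * Sigma_new @ (g_hat + lam * mu)
    return mu_new, Sigma_new
```

Note that the precision update adds a positive semi-definite matrix plus λI, so the new covariance stays positive definite.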
Our idea is to estimate a low-rank plus diagonal approximation of Σ that reduces the computational cost while preserving some off-diagonal covariance structure. In the next section, we propose modifications to the update (2) to obtain a method whose time and space complexity are both linear in D.

3 Stochastic, Low-rank, Approximate Natural-Gradient (SLANG) Method

Our goal is to modify the update (2) to obtain a method whose time and space complexity is linear in D. We propose to approximate the inverse of Σ_t by a “low-rank plus diagonal” matrix:

    Σ^{-1}_t ≈ Σ̂^{-1}_t := U_t U_t^⊤ + D_t,    (3)

where U_t is a D × L matrix with L ≪ D and D_t is a D × D diagonal matrix. The cost of storing and inverting this matrix is linear in D and reasonable when L is small. We now derive an update for U_t and D_t such that the resulting Σ̂^{-1}_{t+1} closely approximates the update shown in (2). We start by writing an approximation to the update of Σ^{-1}_{t+1} where we replace covariance matrices by their structured approximations:

    Σ̂^{-1}_{t+1} := U_{t+1} U_{t+1}^⊤ + D_{t+1} ≈ (1 - β_t) Σ̂^{-1}_t + β_t [Ĝ(θ_t) + λI].    (4)

This update cannot be performed exactly without potentially increasing the rank of the low-rank component U_{t+1}, since the structured components on the right-hand side are of rank at most L + M, where M is the size of the minibatch. This is shown in (5) below, where we have rearranged the right-hand side of (4) as the sum of a structured component and a diagonal component. 
To obtain a rank-L approximation to the left-hand side of (5), we propose to approximate the structured component by an eigenvalue decomposition, as shown in (6) below:

    (1 - β_t) Σ̂^{-1}_t + β_t [Ĝ(θ_t) + λI] = [(1 - β_t) U_t U_t^⊤ + β_t Ĝ(θ_t)] + [(1 - β_t) D_t + β_t λI],    (5)

    ≈ Q_{1:L} Λ_{1:L} Q_{1:L}^⊤ + [(1 - β_t) D_t + β_t λI],    (6)

where the first bracket in (5) has rank at most L + M and the second bracket is the diagonal component, Q_{1:L} is a D × L matrix containing the first L leading eigenvectors of (1 - β_t) U_t U_t^⊤ + β_t Ĝ(θ_t), and Λ_{1:L} is an L × L diagonal matrix containing the corresponding eigenvalues. Figure 2 visualizes the update from (5) to (6).

The low-rank component U_{t+1} can now be updated to mirror the low-rank component of (6),

    U_{t+1} = Q_{1:L} Λ_{1:L}^{1/2},    (7)

and the diagonal D_{t+1} can be updated to match the diagonal of the left and right sides of (4), i.e.,

    diag[U_{t+1} U_{t+1}^⊤ + D_{t+1}] = diag[(1 - β_t) U_t U_t^⊤ + β_t Ĝ(θ_t) + (1 - β_t) D_t + β_t λI].    (8)

This gives us the following update for D_{t+1} using a diagonal correction Δ_t,

    D_{t+1} = (1 - β_t) D_t + β_t λI + Δ_t,    (9)

    Δ_t = diag[(1 - β_t) U_t U_t^⊤ + β_t Ĝ(θ_t) - U_{t+1} U_{t+1}^⊤].    (10)

This step is cheap since computing the diagonal of the EF matrix is linear in D.

[Figure 2 schematic: (1 - β) U_t U_t^⊤ (a D × L times L × D product) plus β Ĝ(θ_t) (a D × M times M × D product) is compressed by fast_eig into U_{t+1} U_{t+1}^⊤ (a D × L times L × D product).]
Figure 2: This figure illustrates Equations (6) and (7), which are used to derive SLANG.

The new covariance approximation can now be used to update μ_{t+1} according to (2), as shown below:

    SLANG: μ_{t+1} = μ_t - α_t [U_{t+1} U_{t+1}^⊤ + D_{t+1}]^{-1} [ĝ(θ_t) + λμ_t].    (11)

The above update uses a stochastic, low-rank covariance estimate to approximate natural-gradient updates, which is why we use the name SLANG.

When L = D, U_{t+1} U_{t+1}^⊤ is full rank and SLANG is identical to the approximate natural-gradient update (2). When L < D, SLANG produces matrices Σ̂^{-1}_t with diagonals matching (2) at every iteration. The diagonal correction ensures that no diagonal information is lost during the low-rank approximation of the covariance. A formal statement and proof is given in Appendix D.

We also tried an alternative method where U_{t+1} is learned using an exponential moving average of the eigendecompositions of Ĝ(θ). This previous iteration of SLANG is discussed in Appendix B, where we show that it gives worse results than the SLANG update.

Next, we give implementation details of SLANG.

3.1 Details of the SLANG Implementation

The pseudo-code for SLANG is shown in Algorithm 1 in Figure 3.

At every iteration, we generate a sample θ_t ∼ N(θ|μ_t, (U_t U_t^⊤ + D_t)^{-1}). This is implemented in line 4 of Algorithm 1 using the subroutine fast_sample. Pseudocode for this subroutine is given in Algorithm 3. This function uses the Woodbury identity to compute the square-root matrix A_t = (U_t U_t^⊤ + D_t)^{-1/2} [4]. The sample is then computed as θ_t = μ_t + A_t ε, where ε ∼ N(0, I). The function fast_sample requires O(DL^2 + DLS) computation to generate S samples, which is linear in D. 
More details are given in Appendix C.4.

Given a sample, we need to compute and store all the individual stochastic gradients g_i(θ_t) for all examples i in a minibatch M. The standard back-propagation implementation does not allow this. We instead use a version of the back-propagation algorithm outlined in a note by Goodfellow [11], which enables efficient computation of the individual gradients g_i(θ_t). This is shown in line 6 of Algorithm 1, where the subroutine backprop_goodfellow is used (see details of this subroutine in Appendix C.1).

In line 7, we compute the eigenvalue decomposition of (1 - β_t) U_t U_t^⊤ + β_t Ĝ(θ_t) by using the fast_eig subroutine. The subroutine fast_eig implements a randomized eigenvalue decomposition method discussed in [13]. It computes the top-L eigendecomposition of a low-rank matrix in O(DLMS + DL^2). More details on the subroutine are given in Appendix C.2. The matrices U_{t+1} and D_{t+1} are updated using the eigenvalue decomposition in lines 8, 9 and 10.

In lines 11 and 12, we compute the update vector [U_{t+1} U_{t+1}^⊤ + D_{t+1}]^{-1} [ĝ(θ_t) + λμ_t], which requires solving a linear system. We use the subroutine fast_inverse shown in Algorithm 2. This subroutine uses the Woodbury identity to efficiently compute the inverse with a cost of O(DL^2). More details are given in Appendix C.3. Finally, in line 13, we update μ_{t+1}.

The overall computational complexity of SLANG is O(DL^2 + DLMS) and its memory cost is O(DL + DMS). Both are linear in D and M. 
The cost is quadratic in L, but since L ≪ D (e.g., 5 or 10), this only adds a small multiplicative constant to the runtime.

Algorithm 1 SLANG
Require: Data D, hyperparameters M, L, λ, α, β
1: Initialize μ, U, d
2: δ ← (1 - β)
3: while not converged do
4:   θ ← fast_sample(μ, U, d)
5:   M ← sample a minibatch
6:   [g_1, .., g_M] ← backprop_goodfellow(D_M, θ)
7:   V ← fast_eig(δu_1, .., δu_L, βg_1, .., βg_M, L)
8:   U ← V
9:   Δd ← ∑_{i=1}^L δu_i^2 + ∑_{i=1}^M βg_i^2 - ∑_{i=1}^L v_i^2
10:  d ← δd + Δd + λ1
11:  ĝ ← ∑_i g_i + λμ
12:  Δμ ← fast_inverse(ĝ, U, d)
13:  μ ← μ - αΔμ
14: end while
15: return μ, U, d

Algorithm 2 fast_inverse(g, U, d)
1: A ← (I_L + U^⊤ d^{-1} U)^{-1}
2: y ← d^{-1}g - d^{-1} U A U^⊤ d^{-1}g
3: return y

Algorithm 3 fast_sample(μ, U, d)
1: ε ∼ N(0, I_D)
2: V ← d^{-1/2} ⊙ U
3: A ← Cholesky(V^⊤V)
4: B ← Cholesky(I_L + V^⊤V)
5: C ← A^{-⊤}(B - I_L)A^{-1}
6: K ← (C + V^⊤V)^{-1}
7: y ← d^{-1/2} ⊙ (ε - V K V^⊤ ε)
8: return μ + y

Figure 3: Algorithm 1 gives the pseudo-code for SLANG. Here, M is the minibatch size, L is the number of low-rank factors, λ is the prior precision parameter, and α, β are learning rates. The diagonal component is denoted by a vector d, and the columns of the matrices U and V are denoted by u_j and v_j respectively. The algorithm depends on multiple subroutines, described in more detail in Section 3.1. The overall complexity of the algorithm is O(DL^2 + DLM). 
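The linear algebra of one iteration can be sketched in NumPy as follows. The Woodbury solve matches Algorithm 2; the covariance step is a simplified stand-in that uses an exact thin SVD where the paper uses the randomized fast_eig subroutine, and all names (`fast_inverse`, `slang_cov_step`, `G`) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def fast_inverse(g, U, d):
    """Solve (U U^T + diag(d)) x = g in O(D L^2) via the Woodbury identity:
    x = d^{-1} g - d^{-1} U (I_L + U^T d^{-1} U)^{-1} U^T d^{-1} g."""
    L = U.shape[1]
    Ud = U / d[:, None]                       # diag(d)^{-1} U
    A = np.linalg.inv(np.eye(L) + U.T @ Ud)   # small L x L system
    return g / d - Ud @ (A @ (Ud.T @ g))

def slang_cov_step(U, d, G, beta, lam, L):
    """One covariance update, eqs (5)-(10): compress the rank-(L+M) part of the
    new precision to rank L and correct the diagonal.

    G: (D, M) matrix whose columns are the scaled per-example gradients
       sqrt(N/M) g_i from the minibatch, so that G G^T is the EF matrix.
    """
    X = np.hstack([np.sqrt(1.0 - beta) * U, np.sqrt(beta) * G])
    W, s, _ = np.linalg.svd(X, full_matrices=False)  # top eigenpairs of X X^T
    U_new = W[:, :L] * s[:L]                         # Q_{1:L} Lambda_{1:L}^{1/2}
    delta = (X ** 2).sum(axis=1) - (U_new ** 2).sum(axis=1)   # eq (10)
    d_new = (1.0 - beta) * d + beta * lam + delta             # eq (9)
    return U_new, d_new
```

A full SLANG step would then update the mean with something like `mu -= alpha * fast_inverse(g_hat + lam * mu, U_new, d_new)`, mirroring lines 11-13 of Algorithm 1.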
SLANG reduces the cost of the update (2) significantly while preserving some posterior correlations.

4 Experiments

In this section, our goal is to show experimental results in support of the following claims: (1) SLANG gives reasonable posterior approximations, and (2) SLANG performs well on standard benchmarks for Bayesian neural networks. We present evaluations on several LIBSVM datasets, the UCI regression benchmark, and MNIST classification. SLANG beats mean-field methods on almost all tasks considered and performs comparably to state-of-the-art methods. SLANG also converges faster than mean-field methods.

4.1 Bayesian Logistic Regression

We considered four benchmark datasets for our comparison: USPS 3vs5, Australian, Breast-Cancer, and a1a. Details of the datasets are in Table 8 in Appendix E.2, along with the implementation details of the methods we compare to. We use L-BFGS [37] to compute the optimal full-Gaussian variational approximation that maximizes the ELBO using the method described in Marlin et al. [22]. We refer to the optimal full-Gaussian variational approximation as the “Full-Gaussian Exact” method. We also compute the optimal mean-field Gaussian approximation and refer to it as “MF Exact”.

Figure 1 shows a qualitative comparison of the estimated posterior means, variances, and covariances for the USPS-3vs5 dataset (N = 770, D = 256). The figure on the left compares covariance approximations obtained with SLANG to the Full-Gaussian Exact method. Only off-diagonal entries are shown. We see that the approximation becomes more and more accurate as the rank is increased. The figures on the right compare the means and variances. 
The means match closely for all methods, but the variance is heavily underestimated by the MF Exact method; we see that the variances obtained under the mean-field approximation estimate a high variance where Full-Gaussian Exact has a low variance, and vice-versa. This “trend reversal” is due to the typical shrinking behavior of mean-field methods [36]. In contrast, SLANG corrects the trend-reversal problem even when L = 1. Similar results for other datasets are shown in Figure 7 in Appendix E.1.

Table 1: Results on Bayesian logistic regression, where we compare SLANG to three full-Gaussian methods and three mean-field methods. We measure negative ELBO, test log-loss, and the symmetric KL divergence between each approximation and the Full-Gaussian Exact method. Lower values are better. SLANG nearly always gives better results than the mean-field methods, and with L = 10 is comparable to full-Gaussian methods. This shows that our structured covariance approximation is reasonably accurate for Bayesian logistic regression.

                             Mean-Field Methods        SLANG                      Full Gaussian
Dataset     Metric           EF     Hess.  Exact   |   L = 1  L = 5  L = 10  |   EF     Hess.  Exact
Australian  ELBO             0.614  0.613  0.593   |   0.574  0.569  0.566   |   0.560  0.558  0.559
            NLL              0.348  0.347  0.341   |   0.342  0.339  0.338   |   0.340  0.339  0.338
            KL (×10^4)       2.240  2.030  0.195   |   0.033  0.008  0.002   |   0.000  0.000  0.000
Breast      ELBO             0.122  0.121  0.121   |   0.112  0.111  0.111   |   0.111  0.109  0.109
Cancer      NLL              0.095  0.094  0.094   |   0.092  0.092  0.092   |   0.092  0.091  0.091
            KL (×10^0)       8.019  9.071  7.771   |   0.911  0.842  0.638   |   0.637  0.002  0.000
a1a         ELBO             0.384  0.383  0.383   |   0.377  0.374  0.373   |   0.369  0.368  0.368
            NLL              0.339  0.339  0.339   |   0.339  0.339  0.339   |   0.339  0.339  0.339
            KL (×10^2)       2.590  2.208  1.295   |   0.305  0.173  0.118   |   0.014  0.000  0.000
USPS        ELBO             0.268  0.268  0.267   |   0.210  0.198  0.193   |   0.189  0.186  0.186
3vs5        NLL              0.139  0.139  0.138   |   0.132  0.132  0.131   |   0.131  0.130  0.130
            KL (×10^1)       7.684  7.188  7.083   |   1.492  0.755  0.448   |   0.180  0.001  0.000

The complete results for Bayesian logistic regression are summarized in Table 1, where we also compare to four additional methods called “Full-Gaussian EF”, “Full-Gaussian Hessian”, “Mean-Field EF”, and “Mean-Field Hessian”. The Full-Gaussian EF method is the natural-gradient update (2), which uses the EF matrix Ĝ(θ), while the Full-Gaussian Hessian method uses the Hessian instead of the EF matrix (the updates are given in (12) and (13) in Appendix A). The last two methods are the mean-field versions of the Full-Gaussian EF and Full-Gaussian Hessian methods, respectively. We compare negative ELBO, test log-loss using cross-entropy, and symmetric KL divergence between the approximations and the Full-Gaussian Exact method. We report averages over 20 random 50%-50% training-test splits of the dataset. Variances and results for SLANG with L = 2 are omitted here due to space constraints, but are reported in Table 6 in Appendix E.1.

We find that SLANG with L = 1 nearly always produces better approximations than the mean-field methods. As expected, increasing L improves the quality of the variational distribution found by SLANG according to all three metrics. We also note that the Full-Gaussian EF method has similar performance to the Full-Gaussian Hessian method, which indicates that the EF approximation may be acceptable for Bayesian logistic regression.

The left side of Figure 4 shows convergence results on the USPS 3vs5 and Breast Cancer datasets. The three methods SLANG(1, 2, 3) refer to SLANG with L = 1, 5, 10. We compare these three SLANG methods to Mean-Field Hessian and Full-Gaussian Hessian. 
SLANG converges faster than the mean-field method, and matches the convergence of the full-Gaussian method when L is increased.

4.2 Bayesian Neural Networks (BNNs)

An example for Bayesian neural networks on a synthetic regression dataset is given in Appendix F.1, where we illustrate the quality of SLANG’s posterior covariance.

The right side of Figure 4 shows convergence results for the USPS 3vs5 and Breast Cancer datasets. Here, the three methods SLANG(1, 2, 3) refer to SLANG with L = 8, 16, 32. We compare SLANG to a mean-field method called Bayes by Backprop [7]. Similar to the Bayesian logistic regression experiment, SLANG converges much faster than the mean-field method. However, the ELBO convergence for SLANG shows that the optimization procedure does not necessarily converge to a local minimum. This issue does not appear to affect the test log-likelihood. While it might only be due to stochasticity, it is possible that the problem is exacerbated by the replacement of the Hessian with the EF matrix. We have not determined the specific cause and it warrants further investigation in future work.

Figure 4: This figure compares the convergence behavior on two datasets: USPS 3vs5 (top) and Breast Cancer (bottom); and two models: Bayesian logistic regression (left) and Bayesian neural networks (BNN) (right). The three methods SLANG(1, 2, 3) refer to SLANG with L = 1, 5, 10 for logistic regression. For BNN, they refer to SLANG with L = 8, 16, 32. The mean-field method is a natural-gradient mean-field method for logistic regression (see text) and BBB [7] for BNN. This comparison clearly shows that SLANG converges faster than the mean-field method and, for Bayesian logistic regression, matches the convergence of the full-Gaussian method when L is increased.

Table 2: Comparison on UCI datasets using Bayesian neural networks. We repeat the setup used in Gal and Ghahramani [10]. SLANG uses L = 1, and outperforms BBB but gives comparable performance to Dropout.

            Test RMSE                                   Test log-likelihood
Dataset     BBB          Dropout      SLANG         |   BBB           Dropout       SLANG
Boston      3.43 ± 0.20  2.97 ± 0.19  3.21 ± 0.19   |  -2.66 ± 0.06  -2.46 ± 0.06  -2.58 ± 0.05
Concrete    6.16 ± 0.13  5.23 ± 0.12  5.58 ± 0.19   |  -3.25 ± 0.02  -3.04 ± 0.02  -3.13 ± 0.03
Energy      0.97 ± 0.09  1.66 ± 0.04  0.64 ± 0.03   |  -1.45 ± 0.10  -1.99 ± 0.02  -1.12 ± 0.01
Kin8nm      0.08 ± 0.00  0.10 ± 0.00  0.08 ± 0.00   |   1.07 ± 0.00   0.95 ± 0.01   1.06 ± 0.00
Naval       0.00 ± 0.00  0.01 ± 0.00  0.00 ± 0.00   |   4.61 ± 0.01   3.80 ± 0.01   4.76 ± 0.00
Power       4.21 ± 0.03  4.02 ± 0.04  4.16 ± 0.04   |  -2.86 ± 0.01  -2.80 ± 0.01  -2.84 ± 0.01
Wine        0.64 ± 0.01  0.62 ± 0.01  0.65 ± 0.01   |  -0.97 ± 0.01  -0.93 ± 0.01  -0.97 ± 0.01
Yacht       1.13 ± 0.06  1.11 ± 0.09  1.08 ± 0.06   |  -1.56 ± 0.02  -1.55 ± 0.03  -1.88 ± 0.01

Next, we present results on the UCI regression datasets, which are common benchmarks for Bayesian neural networks [14]. We repeat the setup² used in Gal and Ghahramani [10]. Following their work, we use neural networks with one hidden layer of 50 hidden units and ReLU activation functions. We compare SLANG with L = 1 to the Bayes By Backprop (BBB) method [7] and the Bayesian Dropout method of [10]. For the 5 smallest datasets, we used a mini-batch size of 10 and 4 Monte Carlo samples during training. For the 3 larger datasets, we used a mini-batch size of 100 and 2 Monte Carlo samples during training. More details are given in Appendix F.3. We report test RMSE and test log-likelihood in Table 2. SLANG with just one rank outperforms BBB on 7 out of 8 datasets for RMSE and on 5 out of 8 datasets for log-likelihood. 
Moreover, SLANG shows comparable performance to Dropout.

2We use the data splits available at https://github.com/yaringal/DropoutUncertaintyExps

Table 3: Comparison of SLANG on the MNIST dataset. We use a two-layer neural network with 400 units in each layer. SLANG obtains good performance for moderate values of L.

                              SLANG
             BBB     L = 1   L = 2   L = 4   L = 8   L = 16  L = 32
Test Error   1.82%   2.00%   1.95%   1.81%   1.92%   1.77%   1.73%

Finally, we report results for classification on MNIST. We train a BNN with two hidden layers of 400 hidden units each. The training set consists of 50,000 examples and the remaining 10,000 are used as a validation set. The test set is a separate set of 10,000 examples. We use SLANG with L = 1, 2, 4, 8, 16, 32. For each value of L, we choose the prior precision and learning rate based on performance on the validation set. Further details can be found in Appendix F.4. The test errors are reported in Table 3 and compared to the results obtained with BBB in [7]. For SLANG, good performance can be obtained for a moderate L. Note that there might be small differences between our experimental setup and the one used in [7], since the BBB implementation is not publicly available. Therefore, the results might not be directly comparable. Nevertheless, SLANG appears to perform well compared to BBB.

5 Conclusion

We consider the challenging problem of uncertainty estimation in large deep models. For such problems, it is infeasible to form a full-covariance Gaussian approximation to the posterior distribution. We address this issue by estimating a Gaussian approximation that uses a covariance with low-rank plus diagonal structure. We proposed an approximate natural-gradient algorithm to estimate the structured covariance matrix. Our method, called SLANG, relies only on the back-propagated gradients to estimate the covariance structure, which is a desirable feature when working with deep models.
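To illustrate why the low-rank plus diagonal structure is computationally attractive, the sketch below computes the marginal variances of a Gaussian whose precision is Σ⁻¹ = UUᵀ + diag(d) in O(pL²) time via the Woodbury identity, avoiding any dense p × p inverse. This is an illustration under our own notation, not SLANG's actual implementation.

```python
import numpy as np

def marginal_variances(U, d):
    """Marginal variances of N(mu, Sigma) with precision Sigma^{-1} = U U^T + diag(d),
    where U is p x L with L << p and d > 0. Uses the Woodbury identity:
    (D + U U^T)^{-1} = D^{-1} - D^{-1} U (I + U^T D^{-1} U)^{-1} U^T D^{-1}."""
    d_inv = 1.0 / d                                # inverse of the diagonal part
    K = np.eye(U.shape[1]) + (U.T * d_inv) @ U     # L x L capacitance matrix
    W = np.linalg.solve(K, U.T * d_inv)            # K^{-1} U^T D^{-1}, shape L x p
    # Diagonal of D^{-1} - D^{-1} U K^{-1} U^T D^{-1}, without forming the p x p matrix.
    return d_inv - np.einsum('il,li->i', U * d_inv[:, None], W)
```

The only dense linear algebra is on L × L matrices, which is what makes increasing the rank L cheap relative to a full-Gaussian approximation.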
Empirical results strongly suggest that the accuracy of our method is better than that of mean-field methods. Moreover, we observe that, unlike mean-field methods, our method does not drastically shrink the marginal variances. Experiments also show that SLANG is highly flexible and that its accuracy can be improved by increasing the rank of the covariance's low-rank component. Finally, our method converges faster than the mean-field methods and can sometimes converge as fast as VI methods that use a full-Gaussian approximation.
The experiments presented in this paper are restricted to feed-forward neural networks. This is partly because existing deep-learning software packages do not support individual gradient computations. Individual gradients, which are required in line 6 of Algorithm 1, must be manually implemented for other types of architectures. Further work is therefore necessary to establish the usefulness of our method on other types of network architectures.
SLANG is based on a natural-gradient method that employs the empirical Fisher approximation [19]. Our empirical results suggest that this approximation is reasonably accurate. However, it is not clear if this is always the case. It is important to investigate this issue to gain a better understanding of the effect of the approximation, both theoretically and empirically.
During this work, we also found that comparing the quality of covariance approximations is a nontrivial task for deep neural networks. We believe that existing benchmarks are not sufficient to establish the quality of an approximate Bayesian inference method for deep models. An interesting and useful area of further research is the development of good benchmarks that better reflect the quality of posterior approximations. This will facilitate the design of better inference algorithms.

Acknowledgements

We thank the anonymous reviewers for their helpful feedback.
We greatly appreciate useful discussions with Shun-ichi Amari (RIKEN), Rio Yokota (Tokyo Institute of Technology), Kazuki Oosawa (Tokyo Institute of Technology), Wu Lin (University of British Columbia), and Voot Tangkaratt (RIKEN). We are also thankful for the RAIDEN computing system and its support team at the RIKEN Center for Advanced Intelligence Project, which we used extensively for our experiments.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI, pages 265–283, 2016.

[2] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

[3] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[4] Sivaram Ambikasaran and Michael O'Neil. Fast symmetric factorization of hierarchical matrices with applications. CoRR, abs/1405.0223, 2014.

[5] Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 3438–3446, 2015.

[6] David Barber and Christopher M. Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems 10, pages 395–401, 1997.

[7] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks.
CoRR, abs/1505.05424, 2015.

[8] Edward Challis and David Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 199–207, 2011.

[9] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, ICML, pages 1683–1691, 2014.

[10] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 1050–1059, 2016.

[11] Ian J. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799, 2015.

[12] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems, pages 2348–2356, 2011.

[13] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[14] José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pages 1861–1869, 2015.

[15] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[16] Tommi Jaakkola and Michael Jordan. A variational approach to Bayesian logistic regression problems and their extensions.
In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, AISTATS, 1997.

[17] Mohammad Emtiyaz Khan and Wu Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, pages 878–887, 2017.

[18] Mohammad Emtiyaz Khan and Didrik Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models. CoRR, abs/1807.04489, 2018.

[19] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, pages 2611–2620, 2018.

[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[21] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 1708–1716, 2016.

[22] Benjamin M. Marlin, Mohammad Emtiyaz Khan, and Kevin P. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 633–640, 2011.

[23] James Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.

[24] Andrew C. Miller, Nicholas J. Foti, and Ryan P. Adams. Variational boosting: Iteratively refining posterior approximations. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 2420–2429, 2017.

[25] Hannes Nickisch and Carl Edward Rasmussen.
Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.

[26] Victor M.-H. Ong, David J. Nott, and Michael S. Smith. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478, 2018.

[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Autodiff Workshop, during the Annual Conference on Neural Information Processing Systems, 2017.

[28] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 814–822, 2014.

[29] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.

[30] Matthias W. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems 12, pages 603–609, 1999.

[31] Matthias W. Seeger. Gaussian covariance and scalable variational inference. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 967–974, 2010.

[32] Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 642–651, 2016.

[33] Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, pages 1283–1292, 2017.

[34] Linda S. L.
Tan and David J. Nott. Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28(2):259–275, 2018.

[35] Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, ICML, pages 1971–1979, 2014.

[36] Richard E. Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, chapter 5, pages 109–130. Cambridge University Press, 2011.

[37] Stephen J. Wright and J. Nocedal. Numerical Optimization. Springer New York, 1999.

[38] Guodong Zhang, Shengyang Sun, David K. Duvenaud, and Roger B. Grosse. Noisy natural gradient as variational inference. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 5847–5856, 2018.