{"title": "Variational Dropout and the Local Reparameterization Trick", "book": "Advances in Neural Information Processing Systems", "page_first": 2575, "page_last": 2583, "abstract": "We explore an as yet unexploited opportunity for drastically improving the efficiency of stochastic gradient variational Bayes (SGVB) with global model parameters. Regular SGVB estimators rely on sampling of parameters once per minibatch of data, and have variance that is constant w.r.t. the minibatch size. The efficiency of such estimators can be drastically improved upon by translating uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such reparameterizations with local noise can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. We find an important connection with regularization by dropout: the original Gaussian dropout objective corresponds to SGVB with local noise, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout, but with a more flexibly parameterized posterior, often leading to better generalization. The method is demonstrated through several experiments.", "full_text": "Variational Dropout and the Local Reparameterization Trick\n\nDiederik P. 
Kingma*, Tim Salimans\u00d7 and Max Welling*\u2020\n\n* Machine Learning Group, University of Amsterdam\n\n\u00d7 Algoritmica\n\n\u2020 University of California, Irvine, and the Canadian Institute for Advanced Research (CIFAR)\n\nD.P.Kingma@uva.nl, salimans.tim@gmail.com, M.Welling@uva.nl\n\nAbstract\n\nWe investigate a local reparameterization technique for greatly reducing the variance of stochastic gradients for variational Bayesian inference (SGVB) of a posterior over model parameters, while retaining parallelizability. This local reparameterization translates uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such parameterizations can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. Additionally, we explore a connection with dropout: Gaussian dropout objectives correspond to SGVB with local reparameterization, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout where the dropout rates are learned, often leading to better models. The method is demonstrated through several experiments.\n\n1 Introduction\n\nDeep neural networks are a flexible family of models that easily scale to millions of parameters and datapoints, but are still tractable to optimize using minibatch-based stochastic gradient ascent. Due to their high flexibility, neural networks have the capacity to fit a wide diversity of nonlinear patterns in the data. This flexibility often leads to overfitting when left unchecked: spurious patterns are found that happen to fit well to the training data, but are not predictive for new data. 
Various regularization techniques for controlling this overfitting are used in practice; a currently popular and empirically effective technique is dropout [10]. In [22] it was shown that regular (binary) dropout has a Gaussian approximation called Gaussian dropout with virtually identical regularization performance but much faster convergence. In section 5 of [22] it is shown that Gaussian dropout optimizes a lower bound on the marginal likelihood of the data. In this paper we show that a relationship between dropout and Bayesian inference can be extended and exploited to greatly improve the efficiency of variational Bayesian inference on the model parameters. This work has a direct interpretation as a generalization of Gaussian dropout, with the same fast convergence but now with the freedom to specify more flexibly parameterized posterior distributions.\nBayesian posterior inference over the neural network parameters is a theoretically attractive method for controlling overfitting; exact inference is computationally intractable, but efficient approximate schemes can be designed. Markov Chain Monte Carlo (MCMC) is a class of approximate inference methods with asymptotic guarantees, pioneered by [16] for the application of regularizing neural networks. Later useful refinements include [23] and [1].\nAn alternative to MCMC is variational inference [11] or the equivalent minimum description length (MDL) framework. Modern variants of stochastic variational inference have been applied to neural networks with some success [8], but have been limited by high variance in the gradients. Despite their theoretical attractiveness, Bayesian methods for inferring a posterior distribution over neural network weights have not yet been shown to outperform simpler methods such as dropout. 
Even a new crop of efficient variational inference algorithms based on stochastic gradients with minibatches of data [14, 17, 19] has not yet been shown to significantly improve upon simpler dropout-based regularization.\nIn section 2 we explore an as yet unexploited trick for improving the efficiency of stochastic gradient-based variational inference with minibatches of data, by translating uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. The resulting method has an optimization speed on the same level as fast dropout [22], and indeed has the original Gaussian dropout method as a special case. An advantage of our method is that it allows for full Bayesian analysis of the model, and that it is significantly more flexible than standard dropout. The approach presented here is closely related to several popular methods in the literature that regularize by adding random noise; these relationships are discussed in section 4.\n\n2 Efficient and Practical Bayesian Inference\n\nWe consider Bayesian analysis of a dataset $\\mathcal{D}$, containing a set of $N$ i.i.d. observations of tuples $(x, y)$, where the goal is to learn a model with parameters or weights $w$ of the conditional probability $p(y|x, w)$ (standard classification or regression; see footnote 1). Bayesian inference in such a model consists of updating some initial belief over parameters $w$ in the form of a prior distribution $p(w)$, after observing data $\\mathcal{D}$, into an updated belief over these parameters in the form of (an approximation to) the posterior distribution $p(w|\\mathcal{D})$. Computing the true posterior distribution through Bayes\u2019 rule $p(w|\\mathcal{D}) = p(w)p(\\mathcal{D}|w)/p(\\mathcal{D})$ involves computationally intractable integrals, so good approximations are necessary. 
In variational inference, inference is cast as an optimization problem where we optimize the parameters $\\phi$ of some parameterized model $q_\\phi(w)$ such that $q_\\phi(w)$ is a close approximation to $p(w|\\mathcal{D})$ as measured by the Kullback-Leibler divergence $D_{KL}(q_\\phi(w) \\| p(w|\\mathcal{D}))$. This divergence of our posterior $q_\\phi(w)$ to the true posterior is minimized in practice by maximizing the so-called variational lower bound $\\mathcal{L}(\\phi)$ of the marginal likelihood of the data:\n\n$$\\mathcal{L}(\\phi) = -D_{KL}(q_\\phi(w) \\| p(w)) + L_{\\mathcal{D}}(\\phi) \\tag{1}$$\n\n$$\\text{where } L_{\\mathcal{D}}(\\phi) = \\sum_{(x,y) \\in \\mathcal{D}} \\mathbb{E}_{q_\\phi(w)}\\left[\\log p(y|x, w)\\right] \\tag{2}$$\n\nWe\u2019ll call $L_{\\mathcal{D}}(\\phi)$ the expected log-likelihood. The bound $\\mathcal{L}(\\phi)$ plus $D_{KL}(q_\\phi(w) \\| p(w|\\mathcal{D}))$ equals the (conditional) marginal log-likelihood $\\sum_{(x,y) \\in \\mathcal{D}} \\log p(y|x)$. Since this marginal log-likelihood is constant w.r.t. $\\phi$, maximizing the bound w.r.t. $\\phi$ will minimize $D_{KL}(q_\\phi(w) \\| p(w|\\mathcal{D}))$.\n\n2.1 Stochastic Gradient Variational Bayes (SGVB)\n\nVarious algorithms for gradient-based optimization of the variational bound (eq. (1)) with differentiable $q$ and $p$ exist. See section 4 for an overview. A recently proposed efficient method for minibatch-based optimization with differentiable models is the stochastic gradient variational Bayes (SGVB) method introduced in [14] (especially appendix F) and [17]. The basic trick in SGVB is to parameterize the random parameters $w \\sim q_\\phi(w)$ as: $w = f(\\epsilon, \\phi)$ where $f(\\cdot)$ is a differentiable function and $\\epsilon \\sim p(\\epsilon)$ is a random noise variable. In this new parameterization, an unbiased differentiable minibatch-based Monte Carlo estimator of the expected log-likelihood can be formed:\n\n$$L_{\\mathcal{D}}(\\phi) \\simeq L^{SGVB}_{\\mathcal{D}}(\\phi) = \\frac{N}{M} \\sum_{i=1}^{M} \\log p(y_i|x_i, w = f(\\epsilon, \\phi)), \\tag{3}$$\n\nwhere $(x_i, y_i)_{i=1}^{M}$ is a minibatch of data with $M$ random datapoints $(x_i, y_i) \\sim \\mathcal{D}$, and $\\epsilon$ is a noise vector drawn from the noise distribution $p(\\epsilon)$. We\u2019ll assume that the remaining term in the variational lower bound, $D_{KL}(q_\\phi(w) \\| p(w))$, can be computed deterministically, but otherwise it may be approximated similarly. The estimator (3) is differentiable w.r.t. $\\phi$ and unbiased, so its gradient is also unbiased: $\\nabla_\\phi L_{\\mathcal{D}}(\\phi) \\simeq \\nabla_\\phi L^{SGVB}_{\\mathcal{D}}(\\phi)$. We can proceed with variational Bayesian inference by randomly initializing $\\phi$ and performing stochastic gradient ascent on $\\mathcal{L}(\\phi)$ (1).\n\nFootnote 1: Note that the described method is not limited to classification or regression and is straightforward to apply to other modeling settings like unsupervised models and temporal models.\n\n2.2 Variance of the SGVB estimator\n\nThe theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates [18], but in practice the performance of stochastic gradient ascent crucially depends on the variance of the gradients. If this variance is too large, stochastic gradient ascent will fail to make much progress in any reasonable amount of time. Our objective function consists of an expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term $D_{KL}(q_\\phi(w) \\| p(w))$ that we assume can be calculated analytically and otherwise be approximated with Monte Carlo with similar reparameterization.\nAssume that we draw minibatches of datapoints with replacement; see appendix F for a similar analysis for minibatches without replacement. Using $L_i$ as shorthand for $\\log p(y_i|x_i, w = f(\\epsilon_i, \\phi))$, the contribution to the likelihood for the $i$-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as $L^{SGVB}_{\\mathcal{D}}(\\phi) = \\frac{N}{M} \\sum_{i=1}^{M} L_i$, whose variance is given by\n\n$$\\mathrm{Var}\\left[L^{SGVB}_{\\mathcal{D}}(\\phi)\\right] = \\frac{N^2}{M^2}\\left(\\sum_{i=1}^{M} \\mathrm{Var}[L_i] + 2 \\sum_{i=1}^{M} \\sum_{j=i+1}^{M} \\mathrm{Cov}[L_i, L_j]\\right) \\tag{4}$$\n\n$$= N^2\\left(\\frac{1}{M} \\mathrm{Var}[L_i] + \\frac{M-1}{M} \\mathrm{Cov}[L_i, L_j]\\right), \\tag{5}$$\n\nwhere the variances and covariances are w.r.t. 
both the data distribution and $\\epsilon$ distribution, i.e. $\\mathrm{Var}[L_i] = \\mathrm{Var}_{\\epsilon, x_i, y_i}\\left[\\log p(y_i|x_i, w = f(\\epsilon, \\phi))\\right]$, with $x_i, y_i$ drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by $\\mathrm{Var}[L_i]$ is inversely proportional to the minibatch size $M$. However, the total contribution by the covariances does not decrease with $M$. In practice, this means that the variance of $L^{SGVB}_{\\mathcal{D}}(\\phi)$ can be dominated by the covariances for even moderately large $M$.\n\n2.3 Local Reparameterization Trick\n\nWe therefore propose an alternative estimator for which we have $\\mathrm{Cov}[L_i, L_j] = 0$, so that the variance of our stochastic gradients scales as $1/M$. We then make this new estimator computationally efficient by not sampling $\\epsilon$ directly, but only sampling the intermediate variables $f(\\epsilon)$ through which $\\epsilon$ influences $L^{SGVB}_{\\mathcal{D}}(\\phi)$. By doing so, the global uncertainty in the weights is translated into a form of local uncertainty that is independent across examples and easier to sample. We refer to such a reparameterization from global noise to local noise as the local reparameterization trick. Whenever a source of global noise can be translated to local noise in the intermediate states of computation ($\\epsilon \\to f(\\epsilon)$), a local reparameterization can be applied to yield a computationally and statistically efficient gradient estimator.\nSuch local reparameterization applies to a fairly large family of models, but is best explained through a simple example: Consider a standard fully connected neural network containing a hidden layer consisting of 1000 neurons. This layer receives an $M \\times 1000$ input feature matrix $A$ from the layer below, which is multiplied by a $1000 \\times 1000$ weight matrix $W$, before a nonlinearity is applied, i.e. $B = AW$. We then specify the posterior approximation on the weights to be a fully factorized Gaussian, i.e. 
$q_\\phi(w_{i,j}) = N(\\mu_{i,j}, \\sigma^2_{i,j}) \\; \\forall w_{i,j} \\in W$, which means the weights are sampled as $w_{i,j} = \\mu_{i,j} + \\sigma_{i,j}\\epsilon_{i,j}$, with $\\epsilon_{i,j} \\sim N(0, 1)$. In this case we could make sure that $\\mathrm{Cov}[L_i, L_j] = 0$ by sampling a separate weight matrix $W$ for each example in the minibatch, but this is not computationally efficient: we would need to sample $M$ million random numbers for just a single layer of the neural network. Even if this could be done efficiently, the computation following this step would become much harder: Where we originally performed a simple matrix-matrix product of the form $B = AW$, this now turns into $M$ separate local vector-matrix products. The theoretical complexity of this computation is higher, but, more importantly, such a computation can usually not be performed in parallel using fast device-optimized BLAS (Basic Linear Algebra Subprograms). This also happens with other neural network architectures such as convolutional neural networks, where optimized libraries for convolution cannot deal with separate filter matrices per example.\nFortunately, the weights (and therefore $\\epsilon$) only influence the expected log likelihood through the neuron activations $B$, which are of much lower dimension. If we can therefore sample the random activations $B$ directly, without sampling $W$ or $\\epsilon$, we may obtain an efficient Monte Carlo estimator at a much lower cost. For a factorized Gaussian posterior on the weights, the posterior for the activations (conditional on the input $A$) is also factorized Gaussian:\n\n$$q_\\phi(w_{i,j}) = N(\\mu_{i,j}, \\sigma^2_{i,j}) \\; \\forall w_{i,j} \\in W \\implies q_\\phi(b_{m,j}|A) = N(\\gamma_{m,j}, \\delta_{m,j}), \\text{ with } \\gamma_{m,j} = \\sum_{i=1}^{1000} a_{m,i}\\mu_{i,j}, \\text{ and } \\delta_{m,j} = \\sum_{i=1}^{1000} a^2_{m,i}\\sigma^2_{i,j}. \\tag{6}$$\n\nRather than sampling the Gaussian weights and then computing the resulting activations, we may thus sample the activations from their implied Gaussian distribution directly, using $b_{m,j} = \\gamma_{m,j} + \\sqrt{\\delta_{m,j}}\\,\\zeta_{m,j}$, with $\\zeta_{m,j} \\sim N(0, 1)$. 
Here, $\\zeta$ is an $M \\times 1000$ matrix, so we only need to sample $M$ thousand random variables instead of $M$ million: a thousand-fold savings.\nIn addition to yielding a gradient estimator that is more computationally efficient than drawing separate weight matrices for each training example, the local reparameterization trick also leads to an estimator that has lower variance. To see why, consider the stochastic gradient estimate with respect to the posterior parameter $\\sigma^2_{i,j}$ for a minibatch of size $M = 1$. Drawing random weights $W$, we get\n\n$$\\frac{\\partial L^{SGVB}_{\\mathcal{D}}}{\\partial \\sigma^2_{i,j}} = \\frac{\\partial L^{SGVB}_{\\mathcal{D}}}{\\partial b_{m,j}} \\frac{\\epsilon_{i,j} a_{m,i}}{2\\sigma_{i,j}}. \\tag{7}$$\n\nIf, on the other hand, we form the same gradient using the local reparameterization trick, we get\n\n$$\\frac{\\partial L^{SGVB}_{\\mathcal{D}}}{\\partial \\sigma^2_{i,j}} = \\frac{\\partial L^{SGVB}_{\\mathcal{D}}}{\\partial b_{m,j}} \\frac{\\zeta_{m,j} a^2_{m,i}}{2\\sqrt{\\delta_{m,j}}}. \\tag{8}$$\n\nHere, there are two stochastic terms: The first is the backpropagated gradient $\\partial L^{SGVB}_{\\mathcal{D}}/\\partial b_{m,j}$, and the second is the sampled random noise ($\\epsilon_{i,j}$ or $\\zeta_{m,j}$). Estimating the gradient with respect to $\\sigma^2_{i,j}$ then basically comes down to estimating the covariance between these two terms. This is much easier to do for $\\zeta_{m,j}$ as there are far fewer of these: individually they have higher correlation with the backpropagated gradient $\\partial L^{SGVB}_{\\mathcal{D}}/\\partial b_{m,j}$, so the covariance is easier to estimate. In other words, measuring the effect of $\\zeta_{m,j}$ on $\\partial L^{SGVB}_{\\mathcal{D}}/\\partial b_{m,j}$ is easy as $\\zeta_{m,j}$ is the only random variable directly influencing this gradient via $b_{m,j}$. On the other hand, when sampling random weights, there are a thousand $\\epsilon_{i,j}$ influencing each gradient term, so their individual effects get lost in the noise. 
In appendix D we make this argument more rigorous, and in section 5 we show that it holds experimentally.\n\n3 Variational Dropout\n\nDropout is a technique for regularization of neural network parameters, which works by adding multiplicative noise to the input of each layer of the neural network during optimization. Using the notation of section 2.3, for a fully connected neural network dropout corresponds to:\n\n$$B = (A \\circ \\xi)\\theta, \\text{ with } \\xi_{i,j} \\sim p(\\xi_{i,j}) \\tag{9}$$\n\nwhere $A$ is the $M \\times K$ matrix of input features for the current minibatch, $\\theta$ is a $K \\times L$ weight matrix, and $B$ is the $M \\times L$ output matrix for the current layer (before a nonlinearity is applied). The $\\circ$ symbol denotes the elementwise (Hadamard) product of the input matrix with an $M \\times K$ matrix of independent noise variables $\\xi$. By adding noise to the input during training, the weight parameters $\\theta$ are less likely to overfit to the training data, as shown empirically by previous publications. Originally, [10] proposed drawing the elements of $\\xi$ from a Bernoulli distribution with probability $1 - p$, with $p$ the dropout rate. Later it was shown that using a continuous distribution with the same relative mean and variance, such as a Gaussian $N(1, \\alpha)$ with $\\alpha = p/(1 - p)$, works as well or better [20].\nHere, we re-interpret dropout with continuous noise as a variational method, and propose a generalization that we call variational dropout. In developing variational dropout we provide a firm Bayesian justification for dropout training by deriving its implicit prior distribution and variational objective. 
This new interpretation allows us to propose several useful extensions to dropout, such as a principled way of making the normally fixed dropout rates $p$ adaptive to the data.\n\n3.1 Variational dropout with independent weight noise\n\nIf the elements of the noise matrix $\\xi$ are drawn independently from a Gaussian $N(1, \\alpha)$, the marginal distributions of the activations $b_{m,j} \\in B$ are Gaussian as well:\n\n$$q_\\phi(b_{m,j}|A) = N(\\gamma_{m,j}, \\delta_{m,j}), \\text{ with } \\gamma_{m,j} = \\sum_{i=1}^{K} a_{m,i}\\theta_{i,j}, \\text{ and } \\delta_{m,j} = \\alpha \\sum_{i=1}^{K} a^2_{m,i}\\theta^2_{i,j}. \\tag{10}$$\n\nMaking use of this fact, [22] proposed Gaussian dropout, a regularization method where, instead of applying (9), the activations are directly drawn from their (approximate or exact) marginal distributions as given by (10). [22] argued that these marginal distributions are exact for Gaussian noise $\\xi$, and for Bernoulli noise still approximately Gaussian because of the central limit theorem. This ignores the dependencies between the different elements of $B$, as present using (9), but [22] report good results nonetheless.\nAs noted by [22], and explained in appendix B, this Gaussian dropout noise can also be interpreted as arising from a Bayesian treatment of a neural network with weights $W$ that multiply the input to give $B = AW$, where the posterior distribution of the weights is given by a factorized Gaussian with $q_\\phi(w_{i,j}) = N(\\theta_{i,j}, \\alpha\\theta^2_{i,j})$. From this perspective, the marginal distributions (10) then arise through the application of the local reparameterization trick, as introduced in section 2.3. 
The variational objective corresponding to this interpretation is discussed in section 3.3.\n\n3.2 Variational dropout with correlated weight noise\n\nInstead of ignoring the dependencies of the activation noise, as in section 3.1, we may retain the dependencies by interpreting dropout (9) as a form of correlated weight noise:\n\n$$B = (A \\circ \\xi)\\theta, \\; \\xi_{i,j} \\sim N(1, \\alpha) \\iff b^m = a^m W, \\text{ with } W = (w_1', w_2', \\ldots, w_K')', \\text{ and } w_i = s_i\\theta_i, \\text{ with } q_\\phi(s_i) = N(1, \\alpha), \\tag{11}$$\n\nwhere $a^m$ is a row of the input matrix and $b^m$ a row of the output. The $w_i$ are the rows of the weight matrix, each of which is constructed by multiplying a non-stochastic parameter vector $\\theta_i$ by a stochastic scale variable $s_i$. The distribution on these scale variables we interpret as a Bayesian posterior distribution. The weight parameters $\\theta_i$ (and the biases) are estimated using maximum likelihood. The original Gaussian dropout sampling procedure (9) can then be interpreted as arising from a local reparameterization of our posterior on the weights $W$.\n\n3.3 Dropout\u2019s scale-invariant prior and variational objective\n\nThe posterior distributions $q_\\phi(W)$ proposed in sections 3.1 and 3.2 have in common that they can be decomposed into a parameter vector $\\theta$ that captures the mean, and a multiplicative noise term determined by parameters $\\alpha$. Any posterior distribution on $W$ for which the noise enters in this multiplicative way, we will call a dropout posterior. Note that many common distributions, such as univariate Gaussians (with nonzero mean), can be reparameterized to meet this requirement.\nDuring dropout training, $\\theta$ is adapted to maximize the expected log likelihood $\\mathbb{E}_{q_\\alpha}[L_{\\mathcal{D}}(\\theta)]$. For this to be consistent with the optimization of a variational lower bound of the form in (2), the prior on the weights $p(w)$ has to be such that $D_{KL}(q_\\phi(w) \\| p(w))$ does not depend on $\\theta$. 
In appendix C we show that the only prior that meets this requirement is the scale-invariant log-uniform prior:\n\n$$p(\\log(|w_{i,j}|)) \\propto c,$$\n\ni.e. a prior that is uniform on the log-scale of the weights (or the weight-scales $s_i$ for section 3.2). As explained in appendix A, this prior has an interesting connection with the floating point format for storing numbers: From an MDL perspective, the floating point format is optimal for communicating numbers drawn from this prior. Conversely, the KL divergence $D_{KL}(q_\\phi(w) \\| p(w))$ with this prior has a natural interpretation as regularizing the number of significant digits our posterior $q_\\phi$ stores for the weights $w_{i,j}$ in the floating-point format.\nPutting the expected log likelihood and KL-divergence penalty together, we see that dropout training maximizes the following variational lower bound w.r.t. $\\theta$:\n\n$$\\mathbb{E}_{q_\\alpha}[L_{\\mathcal{D}}(\\theta)] - D_{KL}(q_\\alpha(w) \\| p(w)), \\tag{12}$$\n\nwhere we have made the dependence on the $\\theta$ and $\\alpha$ parameters explicit. The noise parameters $\\alpha$ (e.g. the dropout rates) are commonly treated as hyperparameters that are kept fixed during training. For the log-uniform prior this then corresponds to a fixed limit on the number of significant digits we can learn for each of the weights $w_{i,j}$. In section 3.4 we discuss the possibility of making this limit adaptive by also maximizing the lower bound with respect to $\\alpha$.\nFor the choice of a factorized Gaussian approximate posterior with $q_\\phi(w_{i,j}) = N(\\theta_{i,j}, \\alpha\\theta^2_{i,j})$, as discussed in section 3.1, the lower bound (12) is analyzed in detail in appendix C. 
There, it is shown that for this particular choice of posterior the negative KL-divergence $-D_{KL}(q_\\alpha(w) \\| p(w))$ is not analytically tractable, but can be approximated extremely accurately using\n\n$$-D_{KL}[q_\\phi(w_i) \\| p(w_i)] \\approx \\text{constant} + 0.5 \\log(\\alpha) + c_1\\alpha + c_2\\alpha^2 + c_3\\alpha^3,$$\n\nwith\n\n$$c_1 = 1.16145124, \\quad c_2 = -1.50204118, \\quad c_3 = 0.58629921.$$\n\nThe same expression may be used to calculate the corresponding term $-D_{KL}(q_\\alpha(s) \\| p(s))$ for the posterior approximation of section 3.2.\n\n3.4 Adaptive regularization through optimizing the dropout rate\n\nThe noise parameters $\\alpha$ used in dropout training (e.g. the dropout rates) are usually treated as fixed hyperparameters, but now that we have derived dropout\u2019s variational objective (12), making these parameters adaptive is trivial: simply maximize the variational lower bound with respect to $\\alpha$. We can use this to learn a separate dropout rate per layer, per neuron, or even per separate weight. In section 5 we look at the predictive performance obtained by making $\\alpha$ adaptive.\nWe found that very large values of $\\alpha$ correspond to local optima from which it is hard to escape due to large-variance gradients. To avoid such local optima, we found it beneficial to set a constraint $\\alpha \\le 1$ during training, i.e. we cap the posterior variance at the square of the posterior mean, which corresponds to a dropout rate of 0.5.\n\n4 Related Work\n\nPioneering work in practical variational inference for neural networks was done in [8], where a (biased) variational lower bound estimator was introduced with good results on recurrent neural network models. In later work [14, 17] it was shown that even more practical estimators can be formed for most types of continuous latent variables or parameters using a (non-local) reparameterization trick, leading to efficient and unbiased stochastic gradient-based variational inference. 
These works focused on an application to latent-variable inference; extensive empirical results on inference of global model parameters were reported in [6], including successful application to reinforcement learning. These earlier works used the relatively high-variance estimator (3), upon which we improve. Variable reparameterizations have a long history in the statistics literature, but have only recently found use for efficient gradient-based machine learning and inference [4, 13, 19]. Related is also probabilistic backpropagation [9], an algorithm for inferring marginal posterior probabilities; however, it requires certain tractabilities in the network, making it unsuitable for the type of models under consideration in this paper.\nAs we show here, regularization by dropout [20, 22] can be interpreted as variational inference. DropConnect [21] is similar to dropout, but with binary noise on the weights rather than hidden units. DropConnect thus has a similar interpretation as variational inference, with a uniform prior over the weights, and a mixture of two Dirac peaks as posterior. In [2], standout was introduced, a variation of dropout where a binary belief network is learned for producing dropout rates. Recently, [15] proposed another Bayesian perspective on dropout. In recent work [3], a similar reparameterization is described and used for variational inference; their focus is on closed-form approximations of the variational bound, rather than unbiased Monte Carlo estimators. [15] and [7] also investigate a Bayesian perspective on dropout, but focus on the binary variant. [7] reports various encouraging results on the utility of dropout\u2019s implied prediction uncertainty.\n\n5 Experiments\n\nWe compare our method to standard binary dropout and two popular versions of Gaussian dropout, which we\u2019ll denote with type A and type B. 
With Gaussian dropout type A we denote the pre-linear Gaussian dropout from [20]; type B denotes the post-linear Gaussian dropout from [22]. This way, the method names correspond to the matrix names in section 2 ($A$ or $B$) where noise is injected. Models were implemented in Theano [5], and optimization was performed using Adam [12] with default hyper-parameters and temporal averaging.\nTwo types of variational dropout were included. Type A is correlated weight noise as introduced in section 3.2: an adaptive version of Gaussian dropout type A. Variational dropout type B has independent weight uncertainty as introduced in section 3.1, and corresponds to Gaussian dropout type B.\nA de facto standard benchmark for regularization methods is the task of MNIST hand-written digit classification. We choose the same architecture as [20]: a fully connected neural network with 3 hidden layers and rectified linear units (ReLUs). We follow the dropout hyper-parameter recommendations from these earlier publications, namely a dropout rate of $p = 0.5$ for the hidden layers and $p = 0.2$ for the input layer. We used early stopping with all methods, where the number of epochs to run was determined based on performance on a validation set.\n\nVariance. We start out by empirically comparing the variance of the different available stochastic estimators of the gradient of our variational objective. To do this we train the neural network described above for either 10 epochs (test error 3%) or 100 epochs (test error 1.3%), using variational dropout with independent weight noise. After training, we calculate the gradients for the weights of the top and bottom level of our network on the full training set, and compare against the gradient estimates per batch of $M = 1000$ training examples. 
Appendix E contains the same analysis for the case of variational dropout with correlated weight noise.\nTable 1 shows that the local reparameterization trick yields the lowest variance among all variational dropout estimators for all conditions, although it is still substantially higher compared to not having any dropout regularization. The $1/M$ variance scaling achieved by our estimator is especially important early on in the optimization when it makes the largest difference (compare weight sample per minibatch and weight sample per data point). The additional variance reduction obtained by our estimator through drawing fewer random numbers (section 2.3) is about a factor of 2, and this remains relatively stable as training progresses (compare local reparameterization and weight sample per data point).\n\nstochastic gradient estimator | top layer, 10 epochs | bottom layer, 10 epochs | top layer, 100 epochs | bottom layer, 100 epochs\nlocal reparameterization (ours) | 7.8 \u00d7 10^3 | 1.9 \u00d7 10^2 | 1.2 \u00d7 10^3 | 1.1 \u00d7 10^2\nweight sample per data point (slow) | 1.4 \u00d7 10^4 | 4.3 \u00d7 10^2 | 2.6 \u00d7 10^3 | 2.5 \u00d7 10^2\nweight sample per minibatch (standard) | 4.9 \u00d7 10^4 | 8.5 \u00d7 10^2 | 4.3 \u00d7 10^3 | 3.3 \u00d7 10^2\nno dropout noise (minimal var.) | 2.8 \u00d7 10^3 | 1.3 \u00d7 10^2 | 5.9 \u00d7 10^1 | 9.0 \u00d7 10^0\n\nTable 1: Average empirical variance of minibatch stochastic gradient estimates (1000 examples) for a fully connected neural network, regularized by variational dropout with independent weight noise.\n\nSpeed. We compared the regular SGVB estimator, with separate weight samples per datapoint, with the efficient estimator based on local reparameterization, in terms of wall-clock time efficiency. With our implementation on a modern GPU, optimization with the na\u00efve estimator took 1635 seconds per epoch, while the efficient estimator took 7.4 seconds: a speedup of over 200-fold.\n\nClassification error. 
Figure 1 shows test-set classification error for the tested regularization methods, for various choices of number of hidden units. Our adaptive variational versions of Gaussian dropout perform as well as or better than their non-adaptive counterparts and standard dropout under all tested conditions. The difference is especially noticeable for the smaller networks. In these smaller networks, we observe that variational dropout infers dropout rates that are on average far lower than the dropout rates for larger networks. This adaptivity comes at negligible computational cost.\n\n(a) Classification error on the MNIST dataset\n\n(b) Classification error on the CIFAR-10 dataset\n\nFigure 1: Best viewed in color. (a) Comparison of various dropout methods, when applied to fully-connected neural networks for classification on the MNIST dataset. Shown is the classification error of networks with 3 hidden layers, averaged over 5 runs. The variational versions of Gaussian dropout perform as well as or better than their non-adaptive counterparts; the difference is especially large with smaller models, where regular dropout often results in severe underfitting. (b) Comparison of dropout methods when applied to convolutional nets trained on the CIFAR-10 dataset, for different settings of network size $k$. The network has two convolutional layers with $32k$ and $64k$ feature maps, respectively, each with stride 2 and followed by a softplus nonlinearity. This is followed by two fully connected layers with $128k$ hidden units each.\n\nWe found that slightly downscaling the KL divergence part of the variational objective can be beneficial. 
Variational (A2) in figure 1 denotes performance of type A variational dropout but with the KL divergence downscaled by a factor of 3; this small modification seems to prevent underfitting, and beats all other dropout methods in the tested models.\n\n6 Conclusion\n\nEfficiency of posterior inference using stochastic gradient-based variational Bayes (SGVB) can often be significantly improved through a local reparameterization where global parameter uncertainty is translated into local uncertainty per datapoint. By injecting noise locally, instead of globally at the model parameters, we obtain an efficient estimator that has low computational complexity, can be trivially parallelized and has low variance. We show how dropout is a special case of SGVB with local reparameterization, and suggest variational dropout, a straightforward extension of regular dropout where optimal dropout rates are inferred from the data, rather than fixed in advance. We report encouraging empirical results.\n\nAcknowledgments\nWe thank the reviewers and Yarin Gal for valuable feedback. Diederik Kingma is supported by the Google European Fellowship in Deep Learning; Max Welling is supported by research grants from Google and Facebook, and the NWO project in Natural AI (NAI.14.108).\n\nReferences\n[1] Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380.\n\n[2] Ba, J. and Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pages 3084\u20133092.\n\n[3] Bayer, J., Karol, M., Korhammer, D., and Van der Smagt, P. (2015). Fast adaptive weight noise. arXiv preprint arXiv:1507.05331.\n\n[4] Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons.
arXiv preprint arXiv:1305.2982.\n\n[5] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4.\n\n[6] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.\n\n[7] Gal, Y. and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.\n\n[8] Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348\u20132356.\n\n[9] Hern\u00e1ndez-Lobato, J. M. and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv preprint arXiv:1502.05336.\n\n[10] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.\n\n[11] Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5\u201313. ACM.\n\n[12] Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations 2015.\n\n[13] Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable models in auxiliary form. arXiv preprint arXiv:1306.0733.\n\n[14] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations.\n\n[15] Maeda, S.-i. (2014). A Bayesian encourages dropout.
arXiv preprint arXiv:1412.7003.\n\n[16] Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.\n\n[17] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278\u20131286.\n\n[18] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400\u2013407.\n\n[19] Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4).\n\n[20] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958.\n\n[21] Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058\u20131066.\n\n[22] Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118\u2013126.\n\n[23] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681\u2013688.", "award": [], "sourceid": 1515, "authors": [{"given_name": "Durk", "family_name": "Kingma", "institution": "University of Amsterdam"}, {"given_name": "Tim", "family_name": "Salimans", "institution": "Algoritmica"}, {"given_name": "Max", "family_name": "Welling", "institution": "University of California, Irvine"}]}