{"title": "Bayesian Layers: A Module for Neural Network Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 14660, "page_last": 14672, "abstract": "We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with drop-in replacements for common layers. This enables composition via a unified abstraction over deterministic and stochastic functions and allows for scalability via the underlying system. These layers capture uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations (``stochastic output layers''), or the function itself (Gaussian processes). They can also be reversible to propagate uncertainty from input to output. We include code examples for common architectures such as Bayesian LSTMs, deep GPs, and flow-based models. As demonstration, we fit a 5-billion parameter ``Bayesian Transformer'' on 512 TPUv2 cores for uncertainty in machine translation and a Bayesian dynamics model for model-based planning. Finally, we show how Bayesian Layers can be used within the Edward2 language for probabilistic programming with stochastic processes.", "full_text": "Bayesian Layers: A Module for\nNeural Network Uncertainty\n\nDustin Tran\nGoogle Brain\n\nMichael W. Dusenberry\n\nGoogle Brain\u2217\n\nMark van der Wilk\n\nProwler.io\n\nDanijar Hafner\nGoogle Brain\n\nAbstract\n\nWe describe Bayesian Layers, a module designed for fast experimentation with\nneural network uncertainty. It extends neural network libraries with drop-in re-\nplacements for common layers. This enables composition via a uni\ufb01ed abstraction\nover deterministic and stochastic functions and allows for scalability via the under-\nlying system. These layers capture uncertainty over weights (Bayesian neural nets),\npre-activation units (dropout), activations (\u201cstochastic output layers\u201d), or the func-\ntion itself (Gaussian processes). 
They can also be reversible to propagate uncertainty from input to output. We include code examples for common architectures such as Bayesian LSTMs, deep GPs, and flow-based models. As demonstration, we fit a 5-billion parameter \"Bayesian Transformer\" on 512 TPUv2 cores for uncertainty in machine translation and a Bayesian dynamics model for model-based planning. Finally, we show how Bayesian Layers can be used within the Edward2 language for probabilistic programming with stochastic processes.1\n\nlstm = ed.layers.LSTMCellReparameterization(512)\noutput_layer = tf.keras.layers.Dense(10)\n\ndef loss_fn(features, labels, dataset_size):\n  state = lstm.get_initial_state(features)\n  nll = 0.\n  for t in range(features.shape[1]):\n    net, state = lstm(features[:, t], state)\n    logits = output_layer(net)\n    nll += tf.reduce_mean(\n        tf.nn.softmax_cross_entropy_with_logits(\n            labels[:, t], logits))\n  kl = sum(lstm.losses) / dataset_size\n  return nll + kl\n\nFigure 1: Bayesian RNN (Fortunato et al., 2017). Bayesian Layers integrates easily into existing workflows (here, a custom loss function followed by any training loop). Keras' model.fit is also supported. See Appendix A for comparisons to vanilla TensorFlow, Edward1, and Pyro implementations.\n\nFigure 2: Graphical model depiction. 
Default arguments specify learnable distributions over the LSTM's weights and biases; we apply a deterministic output layer.\n\n*Work done during the Google AI residency.\n1All code is available at https://github.com/google/edward2 as part of the edward2 namespace. Code snippets assume import edward2 as ed; import tensorflow as tf; tensorflow==2.0.0.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nThe rise of AI accelerators such as TPUs lets us utilize computation with 10^16 FLOP/s and 4 TB of memory distributed across hundreds of processors (Jouppi et al., 2017). In principle, this lets us fit probabilistic models many orders of magnitude larger than the state of the art. We are particularly inspired by research on uncertainty-aware functions: priors and algorithms for Bayesian neural networks (e.g., Wen et al., 2018; Hafner et al., 2019), scaling up Gaussian processes (e.g., Salimbeni and Deisenroth, 2017; John and Hensman, 2018), and expressive distributions via invertible functions (e.g., Rezende and Mohamed, 2015).\nUnfortunately, while research with uncertainty-aware functions is not limited by hardware, it is limited by software. Modern systems approach this by inventing a probabilistic programming language which encompasses all computable probability models as well as a universal inference engine (Goodman et al., 2012; Carpenter et al., 2016) or with composable inference (Tran et al., 2016; Bingham et al., 2019; Probtorch Developers, 2017). Alternatively, the software may use high-level abstractions in order to specify and fit specific model classes with a hand-derived algorithm (GPy, 2012; Vanhatalo et al., 2013; Matthews et al., 2017). These systems have all met success, but they tend to be monolithic in design. 
This prevents research flexibility such as utilizing low-level communication primitives to truly scale up models to billions of parameters, or composability with the rich abstractions from neural network libraries.\nMost recently, Edward2 provides lower-level flexibility by enabling arbitrary numerical ops with random variables (Tran et al., 2018). However, it remains unclear how to leverage random variables for uncertainty-aware functions. For example, current practices with Bayesian neural networks require explicit network computation and variable management (Tran et al., 2016) or require indirection by intercepting weight instantiations of a deterministic layer (Bingham et al., 2019). Both designs are inflexible for many real-world uses in research (see details in Section 1.1). In practice, researchers often use the lower numerical level\u2014without a unified design for uncertainty-aware functions as there is for deterministic neural networks. This forces researchers to reimplement even basic methods such as Bayes by Backprop (Blundell et al., 2015)\u2014let alone build on more complex baselines.\nContributions. This paper describes Bayesian Layers, an extension of neural network libraries which contributes one idea: instead of only deterministic functions as \"layers\", enable distributions over functions. Bayesian Layers does not invent a new language. It inherits neural network semantics to specify uncertainty models as a composition of layers. Each layer may capture uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations (\"stochastic output layers\"), or the function itself (Gaussian processes). They can also be reversible layers that propagate uncertainty from input to output. 
Bayesian Layers can be used inside typical machine learning workflows (Figure 1) as well as inside a probabilistic programming language (Section 2.5).\nTo the best of our knowledge, Bayesian Layers is the first to: propose a unifying design across uncertainty-aware functions; design uncertainty as part of existing deep learning semantics; and demonstrate practical uncertainty examples on complex environments. We include code examples for common architectures such as Bayesian LSTMs, deep GPs, and flow-based models. We also fit a 5-billion parameter \"Bayesian Transformer\" on 512 TPUv2 cores for uncertainty in machine translation and a Bayesian dynamics model for model-based planning.\n\n1.1 Related Work\n\nThere have been many software developments for distributions over functions. Our work takes classic inspiration from Radford Neal's software in 1995, which enabled flexible modeling with both Bayesian neural nets and GPs (Neal, 1995). Modern software typically focuses on only one of these directions. For Bayesian neural nets, researchers have commonly coupled variational sampling with neural net layers (e.g., code from Gal and Ghahramani (2016); Louizos and Welling (2017)). For Gaussian processes, there have been significant developments in libraries (Rasmussen and Nickisch, 2010; GPy, 2012; Vanhatalo et al., 2013; Matthews et al., 2017; Al-Shedivat et al., 2017; Gardner et al., 2018), although flexible composability in the spirit of deep learning libraries has remained a challenge.\nPerhaps most similar to our work, Aboleth (Aboleth Developers, 2017) features variational BNNs and GPs. They use a different design from Bayesian Layers, which we believe results in a less flexible framework that is more challenging to use for research. 
For example, their BNNs do not support non-Gaussian priors or posterior approximations, different estimators, or probabilistic programming with a model-inference separation; their GPs only support random feature approximations; and they create a new neural network language instead of building on an existing one.\nA closely related concept is MXFusion's probabilistic module (Dai et al., 2018), a module which implements a set of random variables alongside a dedicated inference algorithm. This has remarkable similarity to the way composing layers in Bayesian Layers ties estimation to the model specification (e.g., variational inference with deep GPs). Unlike MXFusion, Bayesian Layers enables a higher degree of compositionality to form the overall model, ultimately exploiting conditional independence relationships where, e.g., variational inference can be written as a series of layer-wise integral estimation problems. For example, deep GPs with variational inference in MXFusion involve a custom class, whereas Bayesian Layers simply composes variational GP layers.\nAnother related concept is Pyro's random module (Bingham et al., 2019), a design pattern which lifts deterministic neural layers to Bayesian ones. This is done with effect handlers which replace weight instantiations with a Pyro primitive (typically sample on a distribution). Pyro's random module is effective for implementing Bayes by Backprop (Blundell et al., 2015), but it does not enable more recent estimators which avoid the high variance of weight sampling, such as local reparameterization (Kingma et al., 2015), Flipout (Wen et al., 2018), and deterministic variational inference (Wu et al., 2018). 
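To make the distinction concrete, the local reparameterization identity that these estimators exploit can be sketched in a few lines of plain Python (a toy, self-contained illustration; none of these names come from Pyro or Bayesian Layers): for a dense layer with a fully factorized Gaussian over its weights, each pre-activation is itself Gaussian, so one can sample the pre-activation directly instead of sampling every weight.

```python
import math
import random

def local_reparam_sample(x, w_mean, w_std, rng):
    """Sample a pre-activation b = x . w directly, where w_i ~ N(w_mean[i], w_std[i]^2).

    A linear combination of independent Gaussians is Gaussian, with
    mean sum_i x_i * mu_i and variance sum_i x_i^2 * sigma_i^2, so sampling b
    needs a single Gaussian draw rather than one draw per weight.
    """
    mean = sum(xi * mu for xi, mu in zip(x, w_mean))
    var = sum(xi ** 2 * sd ** 2 for xi, sd in zip(x, w_std))
    return rng.gauss(mean, math.sqrt(var)), mean, var

rng = random.Random(0)
sample, mean, var = local_reparam_sample(
    x=[1.0, 2.0], w_mean=[0.5, -0.25], w_std=[0.1, 0.2], rng=rng)
# mean = 1.0*0.5 + 2.0*(-0.25) = 0.0; var = 1.0*0.01 + 4.0*0.04 = 0.17
```

Sampling in this lower-variance activation space is the idea that the layer-level estimators below package up behind a common interface.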
More importantly, the random module focuses strictly on weight uncertainty, whereas Bayesian Layers provides a unifying design across distributions over functions, where uncertainty may exist anywhere in the computation\u2014whether it be the weights, pre-activation units, activations, function, or propagating uncertainty from input to output.\nAnother related thread is probabilistic programming languages which build on the semantics of an existing functional programming language. Examples include HANSEI on OCaml, Church on Lisp, and Hakaru on Haskell (Kiselyov and Shan, 2009; Goodman et al., 2012; Narayanan et al., 2016). Neural network libraries can be thought of as a (fairly simple) functional programming language, with limited higher-order logic and a type system of (finite lists of) n-dimensional arrays. Similar to these works, Bayesian Layers augments the host language with methods for stochasticity.\n\n2 Bayesian Layers\n\nIn neural network libraries, architectures decompose as a composition of \"layer\" objects as the core building block (Collobert et al., 2011; Al-Rfou et al., 2016; Jia et al., 2014; Chollet, 2016; Chen et al., 2015; Abadi et al., 2015; S. and N., 2016). These layers capture both the parameters and computation of a mathematical function in a programmable class.\nIn our work, we extend layers to capture \"distributions over functions\", which we describe as a layer with uncertainty about some state in its computation\u2014be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.\n\n2.1 Bayesian Neural Network Layers\n\nThe Bayesian extension of any deterministic layer is to place a prior distribution over its weights and biases. 
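Before the details, a toy sketch of this core idea in plain Python (our own illustrative code, not the library's API): a dense layer whose weights are drawn from a learned factorized Gaussian, so that every forward pass instantiates a different function.

```python
import random

class ToyBayesianDense:
    """1-output dense layer with a factorized Gaussian over its weight vector and bias."""

    def __init__(self, in_dim, rng):
        self.rng = rng
        # Variational posterior parameters; trained by gradient descent in practice.
        self.w_mean, self.w_std = [0.0] * in_dim, [0.1] * in_dim
        self.b_mean, self.b_std = 0.0, 0.1

    def __call__(self, x):
        # Reparameterization: w = mu + sigma * eps with eps ~ N(0, 1).
        w = [mu + sd * self.rng.gauss(0.0, 1.0)
             for mu, sd in zip(self.w_mean, self.w_std)]
        b = self.b_mean + self.b_std * self.rng.gauss(0.0, 1.0)
        return sum(wi * xi for wi, xi in zip(w, x)) + b

layer = ToyBayesianDense(3, random.Random(0))
y1 = layer([1.0, 2.0, 3.0])
y2 = layer([1.0, 2.0, 3.0])
# The two outputs generally differ: each call samples a fresh weight configuration.
```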
Bayesian neural networks have been shown to help address important challenges such as indicating model misfit (Dusenberry et al., 2019), generalization to out-of-distribution examples (Louizos and Welling, 2017), balancing exploration and exploitation in sequential decision-making (Hafner et al., 2019), and transferring knowledge across a collection of datasets (Nguyen et al., 2017). Bayesian neural net layers require several considerations. Figure 1 implements a Bayesian RNN; Appendix B implements a Bayesian CNN (ResNet-50).\n\nComputing the integral. We need to compute often-intractable integrals over weights and biases \u03b8. Consider for example two cases, the variational objective for training and the approximate predictive distribution for testing:\n\nELBO = \u222b q(\u03b8) log p(y | f_\u03b8(x)) d\u03b8 \u2212 KL[q(\u03b8) \u2016 p(\u03b8)],\nq(y | x) = \u222b q(\u03b8) p(y | f_\u03b8(x)) d\u03b8.\n\nHere, x may be a real-valued tensor of input features, y may be a vector-valued output for each data point, and the function f encompasses the overall network as a composition of layers.\n\nclass DenseReparameterization(tf.keras.layers.Dense):\n  \"\"\"Variational Bayesian dense layer.\"\"\"\n\n  def __init__(self,\n               units,\n               activation=None,\n               use_bias=True,\n               kernel_initializer='trainable_normal',\n               bias_initializer='zero',\n               kernel_regularizer='normal_kl_divergence',\n               bias_regularizer=None,\n               activity_regularizer=None,\n               **kwargs):\n    super(DenseReparameterization, self).__init__(..., **kwargs)\n\nFigure 3: Bayesian layers are modularized to fit existing neural net semantics of initializers, regularizers, and layers. Here, a Bayesian layer with reparameterization (Kingma and Welling, 2014; Blundell et al., 2015) is the same as its deterministic implementation. The only change is the default for kernel_{initializer,regularizer}; no additional methods are added.\n\nif FLAGS.be_bayesian:\n  Conv2D = ed.layers.Conv2DFlipout\nelse:\n  Conv2D = tf.keras.layers.Conv2D\n\nmodel = tf.keras.Sequential([\n    Conv2D(32, 5, 1, padding='same'),\n    tf.keras.layers.BatchNormalization(),\n    tf.keras.layers.Activation('relu'),\n    Conv2D(32, 5, 2, padding='same'),\n    tf.keras.layers.BatchNormalization(),\n    ...\n])\n\nFigure 4: Bayesian Layers are drop-in replacements for their deterministic counterparts.\n\nTo enable different methods to estimate these integrals, we implement each estimator as its own Layer. The same Bayesian neural net can use entirely different computational graphs depending on the estimation (and therefore entirely different code). For example, sampling from q(\u03b8) with reparameterization and running the deterministic layer computation is a generic way to evaluate layer-wise integrals (Kingma and Welling, 2014) and is used in Edward and Pyro. Alternatively, one could approximate the integral deterministically (Wu et al., 2018), and having the flexibility to vary the estimator as we do in Bayesian Layers is important when fitting the models in practice.\n\nSignature. We'd like the Bayesian extension of a deterministic layer to retain its mandatory constructor arguments as well as its type signature of tensor-dimensional inputs and tensor-dimensional outputs. This enables compositionality, letting one easily combine deterministic and stochastic layers (Figure 4; Laumann and Shridhar (2018)). For example, a dense (feedforward) layer requires a units argument determining its output dimensionality; a convolutional layer also includes kernel_size.\n\nDistributions over parameters. To specify distributions, a natural idea is to overload the existing parameter initialization arguments in a Layer's constructor; in Keras, these are kernel_initializer and bias_initializer. These arguments are extended to accept callables that take metadata such as input shape and return a distribution over the parameter. Distribution initializers may carry trainable parameters, each with their own initializers.\nFor the distribution abstraction, we use Edward RandomVariables (Tran et al., 2018). Layers perform forward passes using deterministic ops and the RandomVariables. The default initializer represents a trainable approximate posterior in a variational inference scheme (Figure 3). By convention, it is a fully factorized normal distribution with a reasonable initialization scheme, but note Bayesian Layers supports arbitrarily flexible posterior approximations.2\n\n2 The only requirement for a distribution initializer is to return a sample (or most broadly, a Tensor of compatible shape and dtype). There is no restriction of independence across layers or tractable densities; hierarchical variational models (Ranganath et al., 2016) and implicit posteriors (Pawlowski et al., 2017) are compatible.\n\nDistribution regularizers. The variational training objective requires the evaluation of a KL term, which penalizes deviations of the learned q(\u03b8) from the prior p(\u03b8). Similar to distribution initializers, we overload the existing parameter regularization arguments in a layer's constructor; in Keras, these are kernel_regularizer and bias_regularizer (Figure 3). These arguments are extended to accept callables that take in the kernel or bias RandomVariables and return a scalar Tensor. 
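For the common case of a factorized normal posterior regularized toward a standard normal prior, such a regularizer callable reduces to a closed-form KL divergence per weight. A pure-Python sketch of that closed form (our own helper for exposition, not the library's normal_kl_divergence implementation):

```python
import math

def kl_normal_to_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ): the per-weight penalty, summed over weights."""
    return math.log(1.0 / sigma) + 0.5 * (sigma ** 2 + mu ** 2) - 0.5

penalty = kl_normal_to_std_normal(0.5, 0.5)  # log 2 + 0.25 - 0.5, about 0.443 nats
```

The penalty is zero exactly when the posterior matches the prior and grows as the learned mean or scale drifts away from it.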
By default, we use a KL divergence toward the standard normal distribution, which represents the penalty term common in variational Bayesian neural network training.\nImportantly, note that Bayesian Layers does not have explicit notions of \"prior\" and \"posterior\". Instead, the layer reflects the actual computation within an algorithm and overloads existing semantics such as \"initialization\" (now the variational posterior) and \"regularization\" (now a KL divergence toward the prior). This is a tradeoff we made deliberately: we lose the separation of model and inference, but we benefit from the rich composability of network layers and integration with third-party libraries. (However, see Section 2.5 for how we might keep the separation if desired.)\n\n2.2 Gaussian Process Layers\n\nAs opposed to representing distributions over functions through the weights, Gaussian processes represent distributions over functions by specifying the value of the function at different inputs. Recent advances have made Gaussian process inference computationally similar to Bayesian neural networks (Hensman et al., 2013). We only require a method to sample the function value at a new input and to evaluate KL regularizers. This allows GPs to be placed in the same framework as above.3 Figure 5 implements a deep GP.\n\nComputing the integral. Each Gaussian process prior in a model is represented as a separate Layer, which can be composed together. GaussianProcess implements exact (but expensive) conditioning. Approximations are given in the form of SparseGaussianProcess for inducing points (leading to Salimbeni and Deisenroth (2017)) and RandomFourierFeatures for finite trigonometric basis function approximations (used by Cutajar et al. (2017)). 
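To ground the exact-conditioning case, here is a self-contained pure-Python sketch of GP posterior-mean prediction with a squared exponential kernel on a two-point training set (entirely illustrative; the actual GaussianProcess layer handles batched tensors, hyperparameters, and full predictive covariances):

```python
import math

def sq_exp(x1, x2, lengthscale=1.0):
    """Squared exponential (RBF) kernel for scalar inputs."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / lengthscale ** 2)

def gp_posterior_mean(x_train, y_train, x_test, jitter=1e-8):
    """Exact GP conditioning, mean only: k(x*, X) K(X, X)^-1 y for two training points."""
    (a, b), (ya, yb) = x_train, y_train
    k11 = sq_exp(a, a) + jitter
    k22 = sq_exp(b, b) + jitter
    k12 = sq_exp(a, b)
    det = k11 * k22 - k12 * k12
    # Solve the 2x2 system K alpha = y by hand.
    alpha1 = (k22 * ya - k12 * yb) / det
    alpha2 = (k11 * yb - k12 * ya) / det
    return sq_exp(x_test, a) * alpha1 + sq_exp(x_test, b) * alpha2

mean = gp_posterior_mean([0.0, 1.0], [1.0, 2.0], x_test=0.0)
# With negligible jitter the posterior mean interpolates the data, so mean is ~1.0.
```

The sparse and random-feature layers replace this exact (cubic-cost) solve with cheaper approximations while keeping the same input/output signature.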
Both these approximations allow sampling from the predictive distribution of the function at particular inputs, which can be used for obtaining an unbiased estimate of the ELBO.\n\nSignature. As with the equivalent deterministic layer, we maintain its mandatory arguments as well as tensor-dimensional inputs and outputs. For example, units in a Gaussian process layer determines the GP's output dimensionality, where ed.layers.GaussianProcess(32) is the Bayesian nonparametric extension of tf.keras.layers.Dense(32). Instead of an activation function argument, GP layers have mean and covariance function arguments, which default to the zero function and squared exponential kernel respectively. Any state in the layer's computational graph may be trainable, such as kernel hyperparameters or the inputs and outputs that the function conditions on.\n\nDistribution regularizers. We use defaults which reflect each inference method's standard for training, e.g., no regularizer for exact GPs, a KL divergence regularizer on the inducing output distribution for sparse GPs, and a KL regularizer on weights for random projection approximations.\n\n2.3 Stochastic Output Layers\n\nIn addition to uncertainty over the mapping defined by a layer, we may want to simply add stochasticity to the output. These outputs have a tractable distribution, and we often would like to access its properties: for example, auto-encoding with stochastic encoders and decoders (Figure 6); or a dynamics model whose network output is a discretized mixture density (Appendix C).4\n\nSignature. To implement stochastic output layers, we perform deterministic computations given a tensor-dimensional input and return a RandomVariable. Because RandomVariables are Tensor-like objects, one can operate on them as if they were Tensors: composing stochastic output layers is valid. 
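As a toy stand-in for what such a returned random variable carries, consider this pure-Python sketch (our own mock of the idea, not Edward's RandomVariable API): the returned object holds its distribution, so downstream code can query log_prob or entropy.

```python
import math

class ToyNormalOutput:
    """Mock of a stochastic output layer's return value: a normal 'random variable'
    that exposes its distribution's log_prob and entropy."""

    def __init__(self, loc, scale):
        self.loc, self.scale = loc, scale

    def log_prob(self, value):
        # Log-density of N(loc, scale^2) at value.
        z = (value - self.loc) / self.scale
        return -0.5 * z * z - math.log(self.scale) - 0.5 * math.log(2.0 * math.pi)

    def entropy(self):
        # Differential entropy of a Gaussian: 0.5 * log(2 * pi * e * scale^2).
        return 0.5 * math.log(2.0 * math.pi * math.e * self.scale ** 2)

out = ToyNormalOutput(loc=0.0, scale=1.0)
nll = -out.log_prob(0.0)  # 0.5 * log(2*pi): the standard normal NLL at its mode
```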
In addition, using such a layer as the last one in a network allows one to compute properties such as a network's entropy or likelihood given data.\nStochastic output layers typically don't have mandatory constructor arguments. An optional units argument determines the output dimensionality (operated on via a trainable linear projection); the default maintains the input shape and has no such projection.\n\n3 More broadly, these ideas extend to stochastic processes. Figure 8 uses a Poisson process.\n4 In previous figures, we used loss functions such as mean_squared_error. With stochastic output layers, we can replace them with a layer returning the likelihood and calling log_prob.\n\nmodel = tf.keras.Sequential([\n    tf.keras.layers.Flatten(),\n    ed.layers.SparseGaussianProcess(units=256, num_inducing=512),\n    ed.layers.SparseGaussianProcess(units=256, num_inducing=512),\n    ed.layers.SparseGaussianProcess(units=10, num_inducing=512),\n])\n\ndef loss_fn(features, labels):\n  predictions = model(features)\n  nll = tf.reduce_mean(\n      tf.math.squared_difference(labels, predictions.mean()))\n  kl = sum(model.losses)\n  return nll + kl / dataset_size\n\nFigure 5: Three-layer deep GP with variational inference (Salimbeni and Deisenroth, 2017; Damianou and Lawrence, 2013). We apply it for regression given batches of spatial inputs and vector-valued outputs. 
We flatten inputs to use the default squared exponential kernel; this naturally extends to passing in a more sophisticated kernel function.\n\nConv2D = functools.partial(\n    tf.keras.layers.Conv2D, padding='same', activation='relu')\nDeconv2D = functools.partial(\n    tf.keras.layers.Conv2DTranspose, padding='same', activation='relu')\n\nencoder = tf.keras.Sequential([\n    Conv2D(128, 5, 1),\n    Conv2D(128, 5, 2),\n    Conv2D(512, 7, 1, padding='valid'),\n    ed.layers.Normal(name='latent_code'),\n])\n\ndecoder = tf.keras.Sequential([\n    Deconv2D(256, 7, 1, padding='valid'),\n    Deconv2D(128, 5, 2),\n    Deconv2D(128, 5, 1),\n    Conv2D(3*256, 5, 1, activation=None),\n    tf.keras.layers.Reshape([256, 256, 3, -1]),\n    ed.layers.Categorical(name='image'),\n])\n\ndef loss_fn(features):\n  encoding = encoder(features)\n  nll = -decoder(encoding).log_prob(features)\n  kl = encoding.kl_divergence(ed.Normal(0., 1.))\n  return tf.reduce_mean(nll + kl)\n\nFigure 6: A variational auto-encoder for compressing 256x256x3 ImageNet into a 32x32x3 latent code. Stochastic output layers are a natural approach for specifying stochastic encoders and decoders, and utilizing their log-probability or KL divergence.\n\nmodel = tf.keras.Sequential([\n    ed.layers.RealNVP(ed.layers.MADE([512, 512])),\n    ed.layers.RealNVP(ed.layers.MADE([512, 512], order='right-to-left')),\n    ed.layers.RealNVP(ed.layers.MADE([512, 512])),\n])\n\ndef loss_fn(features):\n  base = ed.Normal(loc=tf.zeros([batch_size, 32*32*3]), scale=1.)\n  outputs = model(base)\n  return -tf.reduce_sum(outputs.distribution.log_prob(features))\n\nFigure 7: A flow-based model for image generation (Dinh et al., 2017).\n\n2.4 Reversible Layers\n\nWith random variables in layers, one can naturally capture invertible neural networks which propagate uncertainty from input to output. 
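The contract a reversible layer must satisfy is small: a forward call, a reverse, and a log_det_jacobian. A minimal pure-Python sketch of that contract with a scalar affine transform (illustrative only; RealNVP couples such transforms across dimensions with a conditioner network):

```python
import math

class ToyAffineFlow:
    """Invertible scalar layer y = exp(log_scale) * x + shift, in the spirit of
    a reversible layer with call/reverse/log_det_jacobian methods."""

    def __init__(self, log_scale, shift):
        self.log_scale, self.shift = log_scale, shift

    def __call__(self, x):
        return math.exp(self.log_scale) * x + self.shift

    def reverse(self, y):
        # Exact inverse of __call__.
        return (y - self.shift) * math.exp(-self.log_scale)

    def log_det_jacobian(self, x):
        # dy/dx = exp(log_scale), constant in x for an affine map.
        return self.log_scale

flow = ToyAffineFlow(log_scale=math.log(2.0), shift=1.0)
y = flow(3.0)        # 2 * 3 + 1 = 7
x = flow.reverse(y)  # round-trips back to 3
```

Under the change-of-variables formula, the log-density of y is the base log-density of x minus log_det_jacobian, which is exactly what a flow-based loss like Figure 7's accumulates.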
This allows one to perform transformations of random variables, ranging from simple transformations such as for a log-normal distribution to high-dimensional transformations for flow-based models. We recommend using these layers for generative modeling with normalizing flows (Dinh et al., 2017) or for understanding how networks make predictions (Jacobsen et al., 2018).\nWe make two considerations to design reversible layers:\n\nInversion. Invertible neural networks are not possible with current libraries. A natural idea is to design a new abstraction for invertible functions such as TensorFlow's Bijectors (Dillon et al., 2017). Unfortunately, this prevents interoperability with existing layer and model abstractions. Instead, we simply overload the notion of a \"layer\" by adding an additional method reverse which performs the inverse computation of its call and, optionally, log_det_jacobian. A higher-order layer called ed.layers.Reverse takes a layer as input and returns another layer swapping the forward and reverse computation; by ducktyping, the reverse layer raises an error during its call only if reverse\n\ndef model(input_shape):\n  \"\"\"Spatial point process.\"\"\"\n  rate = tf.keras.Sequential([\n      ed.layers.GaussianProcess(64),\n      ed.layers.GaussianProcess(input_shape),\n      tf.keras.layers.Activation('softplus'),\n  ])\n  return ed.layers.PoissonProcess(rate)\n\ndef posterior():\n  \"\"\"Approximate posterior of rate function.\"\"\"\n  rate = tf.keras.Sequential([\n      ed.layers.SparseGaussianProcess(units=64, num_inducing=512),\n      ed.layers.SparseGaussianProcess(units=1, num_inducing=512),\n      tf.keras.layers.Activation('softplus'),\n  ])\n  return rate\n\nFigure 8: Cox process with a deep GP prior and a sparse GP posterior approximation. 
Unlike previous examples, using Bayesian Layers in a probabilistic programming language allows for a clean separation of model and inference, as well as more flexible inference algorithms.\n\nis not implemented. Avoiding a new abstraction both simplifies usage and makes reversible layers compatible with other higher-order layers such as tf.keras.Sequential, which returns a composition of a sequence of layers.\n\nPropagating Uncertainty. As with other deterministic layers, reversible layers take a tensor-dimensional input and return a tensor-dimensional output. In order to propagate uncertainty from input to output, reversible layers may also take a RandomVariable as input and return a transformed RandomVariable determined by its call, reverse, and log_det_jacobian.5 Figure 7 implements RealNVP (Dinh et al., 2017), which is a reversible layer parameterized by another network (here, MADE (Germain et al., 2015)). These ideas also extend to reversible networks that enable backpropagation without storing intermediate activations in memory during the forward pass (Gomez et al., 2017).\n\n2.5 Probabilistic Programming with Bayesian Layers\n\nSo far, the framework we laid out tightly integrates deep Bayesian modelling into existing ecosystems, but we have deliberately limited our scope. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). A core assumption for this to work is the modularization of inference per layer. This makes iterative procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework (but note that, e.g., variational distributions with correlations across layers are possible because the layer integrals decompose conditionally).\nFigure 8 shows that one can utilize Bayesian Layers in the Edward2 probabilistic programming language for more flexible modeling and inference. 
It does this by first specifying the prior generative process in the model program; any layers with approximations are moved into a separate program, the approximate posterior.6 We could use, e.g., expectation propagation (Bui et al., 2016), which is possible with Edward2's tracing mechanism to manipulate the individual random variables within the model and posterior. Importantly, Bayesian Layers provides modeling semantics to enable arbitrary and scalable probabilistic programming in function space.\n\n3 Experiments\n\nWe described a design for uncertainty models built on top of neural network libraries. In experiments, we aim to illustrate one point: Bayesian Layers is efficient and makes possible new model classes that haven't been tried before (in either scale or flexibility). The first experiment is machine translation, where training a model-parallel Bayesian model requires compatibility with Mesh TensorFlow's low-level communication operations. The second experiment is model-based reinforcement learning, where using a Bayesian dynamics model requires finetuning model updates across sequences of posterior actions using the TF Agents API (Guadarrama et al., 2018).\n\n5 We implement ed.layers.Discretize this way in Appendix C. It takes a continuous RandomVariable as input and returns a transformed variable with probabilities integrated over bins.\n6 Above we used GaussianProcess for function priors. To specify function priors with Bayesian neural net layers, set the initializer to return the desired weight prior and remove the default regularizer.\n\nModel        BLEU  Calibration Error\nBaseline     43.9  90.3%\nVariational  43.8  20.8%\n\nFigure 9: Bayesian Transformer implemented with model parallelism ranging from 8 TPUv2 shards (cores) to 512. As desired, the model's training performance scales linearly as the number of cores increases. 
It achieves the same BLEU score while also being well-calibrated.\n\nFigure 10: Results of the Bayesian PlaNet agent. The score shows the task median performance over 5 seeds and 10 episodes each, with percentiles 5 to 95 shaded. Our Bayesian version of the method reaches the same task performance. The graph of the weight KL shows that the weight posterior learns a non-trivial function. The open-loop video predictions show that the agent can accurately make predictions into the future for 50 time steps.\n\n3.1 Model-Parallel Bayesian Transformer for Machine Translation\n\nWe implemented a \"Bayesian Transformer\" for the WMT14 EN-FR translation task. Using Mesh TensorFlow (Shazeer et al., 2018), we took a 2.8-billion parameter Transformer which reports a state-of-the-art BLEU score of 43.9. We then augmented the model by being Bayesian over the attention layers (using a stochastic layer with the Flipout estimator) and over the feedforward layers (using a stochastic layer with the local reparameterization estimator). Figure 9 shows that we can fit models with over 5 billion parameters (roughly twice as many due to a mean and standard deviation parameter), utilizing up to 2500 TFLOPs on 512 TPUv2 cores. Training the deterministic Transformer takes roughly 13 hours; the Bayesian Transformer takes 16 hours and 2 extra GB per TPU.\nIn attempting these scales, we were able to reach state-of-the-art BLEU scores while achieving lower calibration error according to the sequence-level calibration error metric (Kumar and Sarawagi, 2019). This suggests the Bayesian Transformer better accounts for predictive uncertainty, given that the dataset is fairly small relative to the size of the model.\n\n3.2 Bayesian Dynamics Model for Model-Based Reinforcement Learning\n\nIn reinforcement learning, uncertainty estimates can allow for directed exploration, safe exploration, and robust control. 
Still relatively few works leverage deep Bayesian models for control (Gal et al., 2016; Azizzadenesheli et al., 2018). We argue that this might be because implementing and training these models can be difficult and time-consuming.

Figure 11: We use the Bayesian PlaNet agent to predict the true velocities of the reinforcement learning environment from its encoded latent states. Compared to Figure 7 of Hafner et al. (2018), Bayesian PlaNet appears to capture more information about the environment in its latent codes, resulting in more precise velocity predictions.

To demonstrate our module, we implement Bayesian PlaNet, based on the work of Hafner et al. (2018). The original PlaNet agent learns a latent dynamics model as a sequential VAE on image observations. A sample-based planner then searches for the most promising action sequence in the latent space of the model.
We extend this agent by changing the feedforward layers of the transition function to their Bayesian counterparts, DenseReparameterization. Bayesian PlaNet reaches a score of 614 on the cheetah task, matching the performance of the original agent (Figure 10). Training the deterministic dynamics model takes 20 hours and 8 GB of memory; the Bayesian dynamics model takes 22 hours and 8 GB. We monitor the KL divergence of the weight posterior to verify that the model indeed learns a non-trivial belief.
The result opens up many potential benefits for exploration and robust control; see Figure 11 for an example. It also demonstrates that incorporating uncertainty into agents can be straightforward given the right composability of software abstractions.

4 Discussion
We described Bayesian Layers, a module designed for fast experimentation with neural network uncertainty.
By capturing uncertainty-aware functions, Bayesian Layers lets one naturally experiment with and scale up Bayesian neural networks, GPs, and flow-based models.
In future work, we are applying Bayesian Layers in our methodological and applied research, further expanding its support and examples. We are also exploring the use of uncertainty models in healthcare production systems, where the goal is to improve clinical decision-making by providing AI-guided clinical tools and diagnostics.
In Bayesian Layers, we encapsulated probabilistic notions as part of existing neural network abstractions such as layers, initializers, and regularizers. One question is whether this should also be done for other deep learning abstractions such as optimizers. Stochastic gradient MCMC fits naturally on top of gradient-based optimizers by adding noise to their updates, as do certain variational inference algorithms (Zhang et al., 2017; Khan et al., 2018). Further understanding this space, and how it interacts with probabilistic layers in both flexibility and inductive biases, is a potentially interesting direction.

References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Aboleth Developers (2017). Aboleth.
https://github.com/data61/aboleth.

Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V., Bleecher Snyder, J., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brébisson, A., Breuleux, O., Carrier, P.-L., Cho, K., Chorowski, J., Christiano, P., Cooijmans, T., Côté, M.-A., Côté, M., Courville, A., Dauphin, Y. N., Delalleau, O., Demouth, J., Desjardins, G., Dieleman, S., Dinh, L., Ducoffe, M., Dumoulin, V., Ebrahimi Kahou, S., Erhan, D., Fan, Z., Firat, O., Germain, M., Glorot, X., Goodfellow, I., Graham, M., Gulcehre, C., Hamel, P., Harlouchet, I., Heng, J.-P., Hidasi, B., Honari, S., Jain, A., Jean, S., Jia, K., Korobov, M., Kulkarni, V., Lamb, A., Lamblin, P., Larsen, E., Laurent, C., Lee, S., Lefrancois, S., Lemieux, S., Léonard, N., Lin, Z., Livezey, J. A., Lorenz, C., Lowin, J., Ma, Q., Manzagol, P.-A., Mastropietro, O., McGibbon, R. T., Memisevic, R., van Merriënboer, B., Michalski, V., Mirza, M., Orlandi, A., Pal, C., Pascanu, R., Pezeshki, M., Raffel, C., Renshaw, D., Rocklin, M., Romero, A., Roth, M., Sadowski, P., Salvatier, J., Savard, F., Schlüter, J., Schulman, J., Schwartz, G., Serban, I. V., Serdyuk, D., Shabanian, S., Simon, E., Spieckermann, S., Subramanyam, S. R., Sygnowski, J., Tanguay, J., van Tulder, G., Turian, J., Urban, S., Vincent, P., Visin, F., de Vries, H., Warde-Farley, D., Webb, D. J., Willson, M., Xu, K., Xue, L., Yao, L., Zhang, S., and Zhang, Y. (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

Al-Shedivat, M., Wilson, A. G., Saatchi, Y., Hu, Z., and Xing, E. P. (2017). Learning scalable deep kernels with recurrent structure. Journal of Machine Learning Research, 18(1).

Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018).
E\ufb03cient exploration through\n\nbayesian deep q-networks. arXiv preprint arXiv:1802.04412.\n\nBingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szer-\nlip, P., Horsfall, P., and Goodman, N. D. (2019). Pyro: Deep universal probabilistic programming.\nThe Journal of Machine Learning Research, 20(1):973\u2013978.\n\nBlundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural\n\nnetworks. In International Conference on Machine Learning.\n\nBui, T., Hernandez-Lobato, D., Hernandez-Lobato, J., Li, Y., and Turner, R. (2016). Deep gaussian\nprocesses for regression using approximate expectation propagation. In Proceedings of The 33rd\nInternational Conference on Machine Learning, volume 48 of Proceedings of Machine Learning\nResearch, pages 1472\u20131481.\n\nCarpenter, B., Gelman, A., Ho\ufb00man, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.,\nGuo, J., Li, P., and Riddell, A. (2016). Stan: A probabilistic programming language. Journal of\nStatistical Software.\n\nChen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z.\n(2015). MXNet: A \ufb02exible and e\ufb03cient machine learning library for heterogeneous distributed\nsystems. arXiv preprint arXiv:1512.01274.\nChollet, F. (2016). Keras. https://github.com/fchollet/keras.\nCollobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like environment for\n\nmachine learning. In BigLearn, NIPS Workshop.\n\nCutajar, K., Bonilla, E. V., Michiardi, P., and Filippone, M. (2017). Random feature expansions\nfor deep Gaussian processes. In Proceedings of the 34th International Conference on Machine\nLearning, volume 70 of Proceedings of Machine Learning Research, pages 884\u2013893.\n\n10\n\n\fDai, Z., Meissner, E., and Lawrence, N. D. (2018). Mxfusion: A modular deep probabilistic pro-\n\ngramming library.\n\nDamianou, A. and Lawrence, N. (2013). Deep gaussian processes. 
In Artificial Intelligence and Statistics, pages 207–215.

Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. (2017). TensorFlow Distributions. arXiv preprint arXiv:1711.10604.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.

Dusenberry, M. W., Tran, D., Choi, E., Kemp, J., Nixon, J., Jerfel, G., Heller, K., and Dai, A. M. (2019). Analyzing the role of model uncertainty for electronic health records. arXiv preprint arXiv:1906.03842.

Fortunato, M., Blundell, C., and Vinyals, O. (2017). Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798.

Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.

Gal, Y., McAllister, R., and Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, volume 4.

Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., and Wilson, A. G. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In NeurIPS.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. (2017). The reversible residual network: Backpropagation without storing activations. In Neural Information Processing Systems.

Goodman, N., Mansinghka, V., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. (2012). Church: A language for generative models. arXiv preprint arXiv:1206.3255.

GPy (since 2012). GPy: A Gaussian process framework in Python.
http://github.com/SheffieldML/GPy.

Guadarrama, S., Korattikara, A., Ramirez, O., Castro, P., Holly, E., Fishman, S., Wang, K., Gonina, E., Harris, C., Vanhoucke, V., and Brevdo, E. (2018). TF-Agents: A library for reinforcement learning in TensorFlow. https://github.com/tensorflow/agents. [Online; accessed 30-November-2018].

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.

Hafner, D., Tran, D., Irpan, A., Lillicrap, T., and Davidson, J. (2019). Reliable uncertainty estimates in deep neural networks using noise contrastive priors.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence.

Jacobsen, J.-H., Smeulders, A., and Oyallon, E. (2018). i-RevNet: Deep invertible networks. arXiv preprint arXiv:1802.07088.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM.

John, S. T. and Hensman, J. (2018). Large-scale Cox process inference using variational Fourier features. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2362–2370.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. (2017). In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture.

Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and scalable Bayesian deep learning by weight-perturbation in Adam.
arXiv preprint arXiv:1806.04854.

Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Kiselyov, O. and Shan, C.-C. (2009). Embedded probabilistic programming. In DSL, volume 5658, pages 360–384. Springer.

Kumar, A. and Sarawagi, S. (2019). Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.

Laumann, F. and Shridhar, K. (2018). Bayesian convolutional neural networks. arXiv preprint arXiv:1806.05978.

Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961.

Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6.

Narayanan, P., Carette, J., Romano, W., Shan, C.-c., and Zinkov, R. (2016). Probabilistic Inference by Program Transformation in Hakaru (System Description). In International Symposium on Functional and Logic Programming, pages 62–79, Cham. Springer, Cham.

Neal, R. (1995). Software for flexible Bayesian modeling and Markov chain sampling. https://www.cs.toronto.edu/~radford/fbm.software.html.

Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2017). Variational continual learning. arXiv preprint arXiv:1710.10628.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. (2018). Image transformer. In International Conference on Machine Learning.

Pawlowski, N., Brock, A., Lee, M. C., Rajchl, M., and Glocker, B. (2017).
Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297.

Probtorch Developers (2017). Probtorch. https://github.com/probtorch/probtorch.

Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.

Rasmussen, C. E. and Nickisch, H. (2010). Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11(Nov):3011–3015.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.

Guadarrama, S. and Silberman, N. (2016). TensorFlow-Slim: A lightweight library for defining, training and evaluating complex models in TensorFlow.

Salimbeni, H. and Deisenroth, M. (2017). Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. (2018). Mesh-TensorFlow: Deep learning for supercomputers. In Neural Information Processing Systems.

Tran, D., Hoffman, M. D., Moore, D., Suter, C., Vasudevan, S., Radul, A., Johnson, M., and Saurous, R. A. (2018). Simple, distributed, and accelerated probabilistic programming. In Neural Information Processing Systems.

Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.

van den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315.

Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes.
Journal of Machine Learning Research, 14(Apr):1175–1179.

Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. (2018). Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416.

Wen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. (2018). Flipout: Efficient pseudo-independent weight perturbations on mini-batches. In International Conference on Learning Representations.

Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., and Gaunt, A. L. (2018). Fixing variational Bayes: Deterministic variational inference for Bayesian neural networks. arXiv preprint arXiv:1810.03958.

Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. (2017). Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390.