{"title": "Doubly Stochastic Variational Inference for Deep Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4588, "page_last": 4599, "abstract": "Deep Gaussian processes (DGPs) are multi-layer generalizations of GPs, but inference in these models has proved challenging. Existing approaches to inference in DGP models assume approximate posteriors that force independence between the layers, and do not work well in practice. We present a doubly stochastic variational inference algorithm, which does not force independence between layers. With our method of inference we demonstrate that a DGP model can be used effectively on data ranging in size from hundreds to a billion points. We provide strong empirical evidence that our inference scheme for DGPs works well in practice in both classification and regression.", "full_text": "Doubly Stochastic Variational Inference\n\nfor Deep Gaussian Processes\n\nImperial College London and PROWLER.io\n\nImperial College London and PROWLER.io\n\nHugh Salimbeni\n\nhrs13@ic.ac.uk\n\nMarc Peter Deisenroth\n\nm.deisenroth@imperial.ac.uk\n\nAbstract\n\nGaussian processes (GPs) are a good choice for function approximation as they are\n\ufb02exible, robust to over\ufb01tting, and provide well-calibrated predictive uncertainty.\nDeep Gaussian processes (DGPs) are multi-layer generalizations of GPs, but\ninference in these models has proved challenging. Existing approaches to inference\nin DGP models assume approximate posteriors that force independence between the\nlayers, and do not work well in practice. We present a doubly stochastic variational\ninference algorithm that does not force independence between layers. With our\nmethod of inference we demonstrate that a DGP model can be used effectively\non data ranging in size from hundreds to a billion points. 
We provide strong empirical evidence that our inference scheme for DGPs works well in practice in both classification and regression.

1 Introduction

Gaussian processes (GPs) achieve state-of-the-art performance in a range of applications including robotics (Ko and Fox, 2008; Deisenroth and Rasmussen, 2011), geostatistics (Diggle and Ribeiro, 2007), numerics (Briol et al., 2015), active sensing (Guestrin et al., 2005) and optimization (Snoek et al., 2012). A Gaussian process is defined by its mean and covariance function. In some situations prior knowledge can be readily incorporated into these functions. Examples include periodicities in climate modelling (Rasmussen and Williams, 2006), change-points in time series data (Garnett et al., 2009) and simulator priors for robotics (Cutler and How, 2015). In other settings, GPs are used successfully as black-box function approximators. There are compelling reasons to use GPs, even when little is known about the data: a GP grows in complexity to suit the data; a GP is robust to overfitting while providing reasonable error bars on predictions; and a GP can model a rich class of functions with few hyperparameters.

Single-layer GP models are limited by the expressiveness of the kernel/covariance function. To some extent kernels can be learned from data, but inference over a large and richly parameterized space of kernels is expensive, and approximate methods may be at risk of overfitting. Optimization of the marginal likelihood with respect to hyperparameters approximates Bayesian inference only if the number of hyperparameters is small (MacKay, 1999). Attempts to use, for example, a highly parameterized neural network as a kernel function (Calandra et al., 2016; Wilson et al., 2016) incur the downsides of deep learning, such as the need for application-specific architectures and regularization techniques.
Kernels can be combined through sums and products (Duvenaud et al., 2013) to create more expressive compositional kernels, but this approach is limited to simple base kernels, and their optimization is expensive.

A Deep Gaussian Process (DGP) is a hierarchical composition of GPs that can overcome the limitations of standard (single-layer) GPs while retaining the advantages. DGPs are richer models than standard GPs, just as deep networks are richer than generalized linear models. In contrast to models with highly parameterized kernels, DGPs learn a representation hierarchy non-parametrically with very few hyperparameters to optimize.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Unlike their single-layer counterparts, DGPs have proved difficult to train. The mean-field variational approaches used in previous work (Damianou and Lawrence, 2013; Mattos et al., 2016; Dai et al., 2016) make strong independence and Gaussianity assumptions. The true posterior is likely to exhibit high correlations between layers, but mean-field variational approaches are known to severely underestimate the variance in these situations (Turner and Sahani, 2011).

In this paper, we present a variational algorithm for inference in DGP models that does not force independence or Gaussianity between the layers. In common with many state-of-the-art GP approximation schemes, we start from a sparse inducing point variational framework (Matthews et al., 2016) to achieve computational tractability within each layer, but we do not force independence between the layers. Instead, we use the exact model conditioned on the inducing points as a variational posterior. This posterior has the same structure as the full model, and in particular it maintains the correlations between layers. Since we preserve the non-linearity of the full model in our variational posterior we lose analytic tractability.
We overcome this difficulty by sampling from the variational posterior, introducing the first source of stochasticity. This is computationally straightforward due to an important property of the sparse variational posterior marginals: the marginals conditioned on the layer below depend only on the corresponding inputs. It follows that samples from the marginals at the top layer can be obtained without computing the full covariance within the layers. We are primarily interested in large data applications, so we further subsample the data in minibatches. This second source of stochasticity allows us to scale to arbitrarily large data.

We demonstrate through extensive experiments that our approach works well in practice. We provide results on benchmark regression and classification problems, and also demonstrate the first DGP application to a dataset with a billion points. Our experiments confirm that DGP models are never worse than single-layer GPs, and in many cases significantly better. Crucially, we show that additional layers do not incur overfitting, even with small data.

2 Background

In this section, we present necessary background on single-layer Gaussian processes and sparse variational inference, followed by the definition of the deep Gaussian process model. Throughout we emphasize a particular property of sparse approximations: the sparse variational posterior is itself a Gaussian process, so the marginals depend only on the corresponding inputs.

2.1 Single-layer Gaussian Processes

We consider the task of inferring a stochastic function f : R^D -> R, given a likelihood p(y|f) and a set of N observations y = (y_1, ..., y_N)^T at design locations X = (x_1, ..., x_N)^T. We place a GP prior on the function f that models all function values as jointly Gaussian, with a covariance function k : R^D x R^D -> R and a mean function m : R^D -> R.
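For concreteness, the covariance function evaluated at every pair of inputs yields the Gram matrices used throughout this section. A minimal sketch, assuming an illustrative RBF kernel and a zero mean function (the paper's experiments do use an RBF kernel, but the parameterization below is only for illustration):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2)),
    # evaluated for every pair of rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def zero_mean(X):
    # m(x) = 0 for every row of X.
    return np.zeros(len(X))

# [k(X, X')]_ij = k(x_i, x'_j): the Gram matrix between two sets of inputs.
X = np.random.default_rng(0).normal(size=(4, 3))
K = rbf_kernel(X, X)
```

Any positive-definite kernel could be substituted here; only the Gram-matrix evaluations that follow depend on this choice.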
We further define an additional set of M inducing locations Z = (z_1, ..., z_M)^T. We use the notation f = f(X) and u = f(Z) for the function values at the design and inducing points, respectively. We define also [m(X)]_i = m(x_i) and [k(X, Z)]_ij = k(x_i, z_j). By the definition of a GP, the joint density p(f, u) is a Gaussian whose mean is given by the mean function evaluated at every input (X, Z)^T, and the corresponding covariance is given by the covariance function evaluated at every pair of inputs. The joint density of y, f and u is

p(y, f, u) = \underbrace{p(f | u; X, Z) p(u; Z)}_{\text{GP prior}} \underbrace{\prod_{i=1}^N p(y_i | f_i)}_{\text{likelihood}} .    (1)

In (1) we factorized the joint GP prior p(f, u; X, Z)^1 into the prior p(u) = N(u | m(Z), k(Z, Z)) and the conditional p(f | u; X, Z) = N(f | \mu, \Sigma), where for i, j = 1, ..., N

[\mu]_i = m(x_i) + \alpha(x_i)^T (u - m(Z)) ,    (2)
[\Sigma]_{ij} = k(x_i, x_j) - \alpha(x_i)^T k(Z, Z) \alpha(x_j) ,    (3)

with \alpha(x_i) = k(Z, Z)^{-1} k(Z, x_i). Note that the conditional mean \mu and covariance \Sigma defined via (2) and (3), respectively, take the form of mean and covariance functions of the inputs x_i. Inference in the model (1) is possible in closed form when the likelihood p(y|f) is Gaussian, but the computation scales cubically with N.

We are interested in large datasets with non-Gaussian likelihoods. Therefore, we seek a variational posterior to overcome both these difficulties simultaneously.

^1 Throughout this paper we use the semi-colon notation to clarify the input locations of the corresponding function values, which will become important later when we discuss multi-layer GP models. For example, p(f | u; X, Z) indicates that the input locations for f and u are X and Z, respectively.
Variational inference seeks an approximate posterior q(f, u) by minimizing the Kullback-Leibler divergence KL[q || p] between the variational posterior q and the true posterior p. Equivalently, we maximize the lower bound on the marginal likelihood (evidence)

L = E_{q(f, u)} [ \log \frac{p(y, f, u)}{q(f, u)} ] ,    (4)

where p(y, f, u) is given in (1). We follow Hensman et al. (2013) and choose a variational posterior

q(f, u) = p(f | u; X, Z) q(u) ,    (5)

where q(u) = N(u | m, S). Since both terms in the variational posterior are Gaussian, we can analytically marginalize u, which yields

q(f | m, S; X, Z) = \int p(f | u; X, Z) q(u) du = N(f | \tilde\mu, \tilde\Sigma) .    (6)

Similar to (2) and (3), the expressions for \tilde\mu and \tilde\Sigma can be written as mean and covariance functions of the inputs. To emphasize this point we define

\mu_{m,Z}(x_i) = m(x_i) + \alpha(x_i)^T (m - m(Z)) ,    (7)
\Sigma_{S,Z}(x_i, x_j) = k(x_i, x_j) - \alpha(x_i)^T (k(Z, Z) - S) \alpha(x_j) .    (8)

With these functions we define [\tilde\mu]_i = \mu_{m,Z}(x_i) and [\tilde\Sigma]_{ij} = \Sigma_{S,Z}(x_i, x_j). We have written the mean and covariance in this way to make the following observation clear.

Remark 1. The f_i marginals of the variational posterior (6) depend only on the corresponding inputs x_i. Therefore, we can write the ith marginal of q(f | m, S; X, Z) as

q(f_i | m, S; X, Z) = q(f_i | m, S; x_i, Z) = N(f_i | \mu_{m,Z}(x_i), \Sigma_{S,Z}(x_i, x_i)) .    (9)

Using our variational posterior (5) the lower bound (4) simplifies considerably, since (a) the conditionals p(f | u; X, Z) inside the logarithm cancel and (b) the likelihood expectation requires only the variational marginals.
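Equations (7)-(8) express the variational posterior's mean and covariance as ordinary functions of the inputs. A minimal numpy sketch of these two functions, assuming an illustrative RBF kernel and jitter value (not the paper's exact configuration):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Illustrative RBF kernel: k(a, b) = variance * exp(-||a - b||^2 / (2 l^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def sparse_variational_marginal(Xnew, Z, m, S,
                                mean_fn=lambda A: np.zeros(len(A))):
    """Mean and covariance of q(f | m, S; Xnew, Z), following (7)-(8):
        mu(x)        = m(x) + alpha(x)^T (m - m(Z))
        Sigma(x, x') = k(x, x') - alpha(x)^T (k(Z, Z) - S) alpha(x')
    with alpha(x) = k(Z, Z)^{-1} k(Z, x)."""
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))  # jitter for numerical stability
    Kzx = rbf(Z, Xnew)
    alpha = np.linalg.solve(Kzz, Kzx)        # M x N; column i is alpha(x_i)
    mu = mean_fn(Xnew) + alpha.T @ (m - mean_fn(Z))
    Sigma = rbf(Xnew, Xnew) - alpha.T @ (Kzz - S) @ alpha
    return mu, Sigma
```

Setting S = k(Z, Z) and m = m(Z) recovers the prior, which gives a quick correctness check on an implementation.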
We obtain

L = \sum_{i=1}^N E_{q(f_i | m, S; x_i, Z)} [\log p(y_i | f_i)] - KL[q(u) || p(u)] .    (10)

The final (univariate) expectation of the log-likelihood can be computed analytically in some cases, with quadrature (Hensman et al., 2015), or through Monte Carlo sampling (Bonilla et al., 2016; Gal et al., 2015). Since the bound is a sum over the data, an unbiased estimator can be obtained through minibatch subsampling. This permits inference on large datasets. In this work we refer to a GP with this method of inference as a sparse GP (SGP).

The variational parameters (Z, m and S) are found by maximizing the lower bound (10). This maximization is guaranteed to converge since L is a lower bound to the marginal likelihood p(y|X). We can also learn model parameters (hyperparameters of the kernel or likelihood) through the maximization of this bound, though we should exercise caution as this introduces bias, because the bound is not uniformly tight for all settings of hyperparameters (Turner and Sahani, 2011).

So far we have considered scalar outputs y_i in R. In the case of D-dimensional outputs y_i in R^D we define Y as the matrix with ith row containing the ith observation y_i. Similarly, we define F and U. If each output is an independent GP we have the GP prior \prod_{d=1}^D p(F_d | U_d; X, Z) p(U_d; Z), which we abbreviate as p(F | U; X, Z) p(U; Z) to lighten the notation.

2.2 Deep Gaussian Processes

A DGP (Damianou and Lawrence, 2013) defines a prior recursively on vector-valued stochastic functions F^1, ..., F^L. The prior on each function F^l is an independent GP in each dimension, with input locations given by the noisy corruptions of the function values at the layer below: the outputs of the GPs at layer l are F^l_d, and the corresponding inputs are F^{l-1}. The noise between layers is assumed i.i.d. Gaussian. Most presentations of DGPs (see, e.g. 
Damianou and Lawrence, 2013; Bui et al., 2016) explicitly parameterize the noisy corruptions separately from the outputs of each GP. Our method of inference does not require us to parameterize these variables separately. For notational convenience, we therefore absorb the noise into the kernel k_noisy(x_i, x_j) = k(x_i, x_j) + \sigma_l^2 \delta_{ij}, where \delta_{ij} is the Kronecker delta and \sigma_l^2 is the noise variance between layers. We use D_l for the dimension of the outputs at layer l. As with the single-layer case, we have inducing locations Z^{l-1} at each layer and inducing function values U^l for each dimension.

An instantiation of the process has the joint density

p(Y, {F^l, U^l}_{l=1}^L) = \underbrace{\prod_{i=1}^N p(y_i | f_i^L)}_{\text{likelihood}} \underbrace{\prod_{l=1}^L p(F^l | U^l; F^{l-1}, Z^{l-1}) p(U^l; Z^{l-1})}_{\text{DGP prior}} ,    (11)

where we define F^0 = X. Inference in this model is intractable, so approximations must be used.

The original DGP presentation (Damianou and Lawrence, 2013) uses a variational posterior that maintains the exact model conditioned on U^l, but further forces the inputs to each layer to be independent from the outputs of the previous layer. The noisy corruptions are parameterized separately, and the variational distribution over these variables is a fully factorized Gaussian. This approach requires 2N(D_1 + ... + D_{L-1}) variational parameters but admits a tractable lower bound on the log marginal likelihood if the kernel is of a particular form. A further problem of this bound is that the density over the outputs is simply a single layer GP with independent Gaussian inputs. Since the posterior loses all the correlations between layers, it cannot express the complexity of the full model and so is likely to underestimate the variance.
In practice, we found that optimizing the objective in Damianou and Lawrence (2013) results in layers being 'turned off' (the signal-to-noise ratio tends to zero). In contrast, our posterior retains the full conditional structure of the true model. We sacrifice analytical tractability, but due to the sparse posterior within each layer we can sample the bound using univariate Gaussians.

3 Doubly Stochastic Variational Inference

In this section, we propose a novel variational posterior and demonstrate a method to obtain unbiased samples from the resulting lower bound. The difficulty with inferring the DGP model is that there are complex correlations both within and between layers. Our approach is straightforward: we use sparse variational inference to simplify the correlations within layers, but we maintain the correlations between layers. The resulting variational lower bound cannot be evaluated analytically, but we can draw unbiased samples efficiently using univariate Gaussians. We optimize our bound stochastically.

We propose a posterior with three properties. Firstly, the posterior maintains the exact model, conditioned on U^l. Secondly, we assume that the posterior distribution of {U^l}_{l=1}^L is factorized between layers (and dimensions, but we suppress this from the notation). Therefore, our posterior takes the simple factorized form

q({F^l, U^l}_{l=1}^L) = \prod_{l=1}^L p(F^l | U^l; F^{l-1}, Z^{l-1}) q(U^l) .    (12)

Thirdly, and to complete the specification of the posterior, we take q(U^l) to be a Gaussian with mean m^l and variance S^l.
A similar posterior was used in Hensman and Lawrence (2014) and Dai et al. (2016), but each of these works contained additional terms for the noisy corruptions at each layer.

As in the single-layer SGP, we can marginalize the inducing variables from each layer analytically. After this marginalization we obtain the following distribution, which is fully coupled within and between layers:

q({F^l}_{l=1}^L) = \prod_{l=1}^L q(F^l | m^l, S^l; F^{l-1}, Z^{l-1}) = \prod_{l=1}^L N(F^l | \tilde\mu^l, \tilde\Sigma^l) .    (13)

Here, q(F^l | m^l, S^l; F^{l-1}, Z^{l-1}) is as in (6). Specifically, it is a Gaussian with mean and covariance \tilde\mu^l and \tilde\Sigma^l, where [\tilde\mu^l]_i = \mu_{m^l, Z^{l-1}}(f_i^{l-1}) and [\tilde\Sigma^l]_{ij} = \Sigma_{S^l, Z^{l-1}}(f_i^{l-1}, f_j^{l-1}) (recall that f_i^l is the ith row of F^l). Since (12) is a product of terms that each take the form of the SGP variational posterior (5), we have again the property that within each layer the marginals depend only on the corresponding inputs. In particular, f_i^L depends only on f_i^{L-1}, which in turn depends only on f_i^{L-2}, and so on. Therefore, we have the following property:

Remark 2. The ith marginal of the final layer of the variational DGP posterior (12) depends only on the ith marginals of all the other layers. That is,

q(f_i^L) = \int q(f_i^L | m^L, S^L; f_i^{L-1}, Z^{L-1}) \prod_{l=1}^{L-1} q(f_i^l | m^l, S^l; f_i^{l-1}, Z^{l-1}) df_i^l .    (14)

The consequence of this property is that taking a sample from q(f_i^L) is straightforward, and furthermore we can perform the sampling using only univariate unit Gaussians via the 're-parameterization trick' (Rezende et al., 2014; Kingma et al., 2015). Specifically, we first sample \epsilon_i^l ~ N(0, I_{D_l}) and then recursively draw the sampled variables \hat{f}_i^l ~ q(f_i^l | m^l, S^l; \hat{f}_i^{l-1}, Z^{l-1}) for l = 1, ..., L-1 as

\hat{f}_i^l = \mu_{m^l, Z^{l-1}}(\hat{f}_i^{l-1}) + \epsilon_i^l \odot \sqrt{\Sigma_{S^l, Z^{l-1}}(\hat{f}_i^{l-1}, \hat{f}_i^{l-1})} ,    (15)

where the terms in (15) are D_l-dimensional and the square root is element-wise. For the first layer we define \hat{f}_i^0 := x_i.

Efficient computation of the evidence lower bound. The evidence lower bound of the DGP is

L_DGP = E_{q({F^l, U^l}_{l=1}^L)} [ \log \frac{p(Y, {F^l, U^l}_{l=1}^L)}{q({F^l, U^l}_{l=1}^L)} ] .    (16)

Using (11) and (12) for the corresponding expressions in (16), we obtain after some re-arranging

L_DGP = \sum_{i=1}^N E_{q(f_i^L)}[\log p(y_i | f_i^L)] - \sum_{l=1}^L KL[q(U^l) || p(U^l; Z^{l-1})] ,    (17)

where we exploited the exact marginalization of the inducing variables (13) and the property of the marginals of the final layer (14). A detailed derivation is provided in the supplementary material. This bound has complexity O(N M^2 (D_1 + ... + D_L)) to evaluate.

We evaluate the bound (17) approximately using two sources of stochasticity. Firstly, we approximate the expectation with a Monte Carlo sample from the variational posterior (14), which we compute according to (15). Since we have parameterized this sampling procedure in terms of isotropic Gaussians, we can compute unbiased gradients of the bound (17). Secondly, since the bound factorizes over the data we achieve scalability through sub-sampling the data. Both stochastic approximations are unbiased.

Predictions. To predict, we sample from the variational posterior, changing the input locations to the test location x_*. We denote the function values at the test location as f_*^l.
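The recursion in (15) only ever touches per-point means and marginal variances, so one posterior sample costs a single univariate Gaussian draw per output dimension per layer. A minimal sketch of this sampler, with toy mean and variance functions standing in for (7)-(8) (the concrete functions below are assumptions for illustration, not fitted quantities):

```python
import numpy as np

def sample_dgp(x, layers, rng):
    # One sample \hat f^l for l = 1..L, per (15): propagate the input through
    # each layer's marginal posterior with the reparameterization trick,
    # eps ~ N(0, I_{D_l}); only univariate Gaussians are needed.
    f = x
    for mean_fn, var_fn in layers:       # stand-ins for (7)-(8) at layer l
        mu, var = mean_fn(f), var_fn(f)  # each of shape (D_l,)
        eps = rng.standard_normal(mu.shape)
        f = mu + eps * np.sqrt(var)      # element-wise square root, as in (15)
    return f

# Illustrative 2-layer chain with toy mean/variance functions:
rng = np.random.default_rng(0)
layers = [
    (lambda f: np.tanh(f), lambda f: 0.1 * np.ones_like(f)),
    (lambda f: f ** 2,     lambda f: 0.01 * np.ones_like(f)),
]
sample = sample_dgp(np.array([0.3, -1.2]), layers, rng)
```

Because the sample is a smooth function of the variational parameters through (15), gradients of the sampled bound can be back-propagated, which is what makes the first source of stochasticity usable for optimization.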
To obtain the density over f_*^L we use the Gaussian mixture

q(f_*^L) \approx \frac{1}{S} \sum_{s=1}^S q(f_*^L | m^L, S^L; \hat{f}_*^{(s), L-1}, Z^{L-1}) ,    (18)

where we draw S samples \hat{f}_*^{(s), L-1} using (15), but replacing the inputs x_i with the test location x_*.

Further Model Details. While GPs are often used with a zero mean function, we consider such a choice inappropriate for the inner layers of a DGP. Using a zero mean function causes difficulties with the DGP prior, as each GP mapping is highly non-injective. This effect was analyzed in Duvenaud et al. (2014), where the authors suggest adding the original input X to each layer. Instead, we consider an alternative approach and include a linear mean function m(X) = XW for all the inner layers. If the input and output dimension are the same we use the identity matrix for W; otherwise we compute the SVD of the data and use the top D_l left eigenvectors sorted by singular value (i.e. the PCA mapping). With these choices it is effective to initialize all inducing mean values m^l = 0. This choice of mean function is partly inspired by the 'skip layer' approach of the ResNet architecture (He et al., 2016).

Figure 1: Regression test log-likelihood results on benchmark datasets. Higher (to the right) is better. The sparse GP with the same number of inducing points is highlighted as a baseline.

4 Results

We evaluate our inference method on a number of benchmark regression and classification datasets. We stress that we are interested in models that can operate in both the small and large data regimes, with little or no hand tuning. All our experiments were run with exactly the same hyperparameters and initializations. See the supplementary material for details.
We use a dimension of min(30, D_0) for all the inner layers of our DGP models, where D_0 is the input dimension, and the RBF kernel for all layers.

Regression Benchmarks. We compare our approach to other state-of-the-art methods on 8 standard small to medium-sized UCI benchmark datasets. Following common practice (e.g. Hernández-Lobato and Adams, 2015), we use 20-fold cross validation with a 10% randomly selected held-out test set and scale the inputs and outputs to zero mean and unit standard deviation within the training set (we restore the output scaling for evaluation). While we could use any kernel, we choose the RBF kernel with a lengthscale for each dimension for direct comparison with Bui et al. (2016). The test log-likelihood results are shown in Fig. 1. We compare our models of 2, 3, 4 and 5 layers (DGP 2-5), each with 100 inducing points, with (stochastically optimized) sparse GPs (Hensman et al., 2013) with 100 and 500 inducing points (SGP, SGP 500). We compare also to a two-layer Bayesian neural network with ReLU activations and 50 hidden units (100 for protein and year), with inference by probabilistic backpropagation (Hernández-Lobato and Adams, 2015) (PBP). The results are taken from Hernández-Lobato and Adams (2015), where this method was found to be the most effective of several methods for inferring Bayesian neural networks. We compare also with a DGP model with approximate expectation propagation (EP) for inference (Bui et al., 2016). Using the authors' code [2] we ran a DGP model with 1 hidden layer using approximate expectation propagation (Bui et al., 2016) (AEPDGP 2). We used the input dimension for the hidden layer for a fair comparison with our models [3]. We found the time requirements to train a 3-layer model with this inference prohibitive. Plots for test RMSE and further results tables can be found in the supplementary material.

On five of the eight datasets, the deepest DGP model is the best.
On 'wine', 'naval' and 'boston' our DGP recovers the single-layer GP, which is not surprising: 'boston' is very small, 'wine' is near-linear (note the proximity of the linear model and the scale), and 'naval' is characterized by extremely high test likelihoods (the RMSE on this dataset is less than 0.001 for all SGP and DGP models), i.e. it is a very 'easy' dataset for a GP. The Bayesian network is not better than the sparse GP for any dataset and is significantly worse for six. The approximate EP inference for the DGP models is also not competitive with the sparse GP for many of the datasets, but this may be because the initializations were designed for lower dimensional hidden layers than we used.

Our results on these small and medium sized datasets confirm that overfitting is not observed with the DGP model, and that the DGP is never worse and often better than the single layer GP.

[2] https://github.com/thangbui/deepGP_approxEP
[3] We note however that in Bui et al. (2016) the inner layers were 2D, so the results we obtained are not directly comparable to those reported in Bui et al. (2016).
We note in particular that on the 'power', 'protein' and 'kin8nm' datasets all the DGP models outperform the SGP with five times the number of inducing points.

Rectangles Benchmark. We use the Rectangles-Images dataset [4], which is specifically designed to distinguish deep and shallow architectures. The dataset consists of 12,000 training and 50,000 testing examples of size 28 x 28, where each image consists of a (non-square) rectangular image against a different background image. The task is to determine which of the height and width is greatest. We run 2, 3 and 4 layer DGP models, and observe increasing performance with each layer. Table 1 contains the results. Note that the 500 inducing point single-layer GP is significantly less effective than any of the deep models. Our 4-layer model achieves 77.9% classification accuracy, exceeding the best result of 77.5% reported in Larochelle et al. (2007) with a three-layer deep belief network. We also exceed the best result of 76.4% reported in Krauth et al. (2016) using a sparse GP with an Arcsine kernel, a leave-one-out objective, and 1000 inducing points.

Table 1: Results on the Rectangles-Images dataset (N = 12000, D = 784)

                 Single layer GP     Ours                       Larochelle [2007]   Krauth [2016]
                 SGP      SGP 500    DGP 2    DGP 3    DGP 4    DBN-3    SVM        SGP 1000
Accuracy (%)     76.1     76.4       77.3     77.8     77.9     77.5     76.96      76.4
Likelihood       -0.493   -0.485     -0.475   -0.460   -0.460   -        -          -0.478

[4] http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData

Large-Scale Regression. To demonstrate our method on a large-scale regression problem we use the UCI 'year' dataset and the 'airline' dataset, which has been commonly used by the large-scale GP community. For the 'airline' dataset we take the first 700K points for training and the next 100K for testing. We use a random 10% split for the 'year' dataset.
Results are shown in Table 2, with the log-likelihoods reported in the supplementary material. On both datasets we see that the DGP models perform better with increased depth, significantly improving in both log likelihood and RMSE over the single-layer model, even with 500 inducing points.

Table 2: Regression test RMSE results for large datasets

            N        D    SGP     SGP 500   DGP 2   DGP 3   DGP 4   DGP 5
year        463810   90   10.67   9.89      9.58    8.98    8.93    8.87
airline     700K     8    25.6    25.1      24.6    24.3    24.2    24.1
taxi        1B       9    337.5   330.7     281.4   270.4   268.0   266.4

MNIST Multiclass Classification. We apply the DGP with 2 and 3 layers to the MNIST multiclass classification problem. We use the robust-max multiclass likelihood (Hernández-Lobato et al., 2011) and use the full unprocessed data with the standard training/test split of 60K/10K. The single-layer GP with 100 inducing points achieves a test accuracy of 97.48%, and this is increased to 98.06% and 98.11% with two- and three-layer DGPs, respectively. The 500 inducing point single-layer model achieved 97.9% in our implementation, though slightly higher results for this model have previously been reported: 98.1% (Hensman et al., 2013), and 98.4% (Krauth et al., 2016) for the same model with 1000 inducing points. We attribute this difference to different hyperparameter initialization and training schedules, and stress that we use exactly the same initialization and learning schedule for all our models. The only other DGP result in the literature on this dataset is 94.24% (Wang et al., 2016) for a two layer model with a two dimensional latent space.

Large-Scale Classification. We use the HIGGS (N = 11M, D = 28) and SUSY (N = 5.5M, D = 18) datasets for large-scale binary classification.
These datasets have been constructed from Monte Carlo physics simulations to detect the presence of the Higgs boson and super-symmetry (Baldi et al., 2014). We take a 10% random sample for testing and use the rest for training. We use the AUC metric for comparison with Baldi et al. (2014). Our DGP models are the highest performing on the SUSY dataset (AUC of 0.877 for all the DGP models) compared to shallow neural networks (NN, 0.875), deep neural networks (DNN, 0.876) and boosted decision trees (BDT, 0.863). On the HIGGS dataset we see a steady improvement with additional layers (0.830, 0.837, 0.841 and 0.846 for DGP 2-5, respectively). On this dataset the DGP models exceed the performance of BDT (0.810) and NN (0.816) and of both single layer GP models, SGP (0.785) and SGP 500 (0.794). The best performing model on this dataset is a 5-layer DNN (0.885). Full results are reported in the supplementary material.

Massive-Scale Regression. To demonstrate the efficacy of our model on massive data we use the New York city yellow taxi trip dataset of 1.21 billion journeys [5]. Following Peng et al. (2017) we use 9 features: time of day; day of the week; day of the month; month; pick-up latitude and longitude; drop-off latitude and longitude; travel distance. The target is to predict the journey time. We randomly select 1B (10^9) examples for training and use 1M examples for testing, and we scale both inputs and outputs to zero mean and unit standard deviation in the training data. We discard journeys that are less than 10 s or greater than 5 h, or that start/end outside the New York region, which we estimate to have squared distance less than 5° from the center of New York. The test RMSE results are in the bottom row of Table 2 and the test log likelihoods are in the supplementary material. We note the significant jump in performance from the single layer models to the DGP.
As with all the large-scale experiments, we see a consistent improvement with extra layers, but on this dataset the improvement is particularly striking (DGP 5 achieves a 21% reduction in RMSE compared to SGP).

Table 3: Typical computation time in seconds for a single gradient step

            CPU    GPU
  SGP       0.14   0.018
  SGP 500   1.71   0.11
  DGP 2     0.36   0.030
  DGP 3     0.49   0.045
  DGP 4     0.65   0.056
  DGP 5     0.87   0.069

5 Related Work

The first example of the outputs of a GP being used as the inputs to another GP can be found in Lawrence and Moore (2007), where a MAP approximation was used for inference. The seminal work of Titsias and Lawrence (2010) demonstrated how sparse variational inference could be used to propagate Gaussian inputs through a GP with a Gaussian likelihood. This approach was extended in Damianou et al. (2011) to perform approximate inference in the model of Lawrence and Moore (2007), and shortly afterwards in the similar model of Lázaro-Gredilla (2012), which also included a linear mean function. The key idea of both these approaches is the factorization of the variational posterior between layers. A more general model (flexible in depth and dimensions of the hidden layers) introduced the term 'DGP' (Damianou and Lawrence, 2013) and used a posterior that also factorized between layers. These approaches require a number of variational parameters that grows linearly with the number of data points. For high-dimensional observations, it is possible to amortize the cost of this optimization with an auxiliary model. This approach is pursued in Dai et al. (2016), and with a recurrent architecture in Mattos et al. (2016). Another approach to inference in the exact model was presented in Hensman and Lawrence (2014), where a sparse approximation was used within layers for the GP outputs, similar to Damianou and Lawrence (2013), but with a projected distribution over the inputs to the next layer.
The particular form of the variational distribution was chosen to admit a tractable bound, but imposes a constraint on the flexibility of the posterior.

An alternative approach is to modify the DGP prior directly and perform inference in a parametric model. This is achieved in Bui et al. (2016) with an inducing point approximation within each layer, and in Cutajar et al. (2017) with an approximation to the spectral density of the kernel. Both approaches then apply additional approximations to achieve tractable inference. In Bui et al. (2016), an approximation to expectation propagation is used, with additional Gaussian approximations to the log partition function to propagate uncertainty through the non-linear GP mapping. In Cutajar et al. (2017) a fully factorized variational approximation is used for the spectral components. Both of these approaches require specific kernels: in Bui et al. (2016) the kernel must have analytic expectations under a Gaussian, and in Cutajar et al. (2017) the kernel must have an analytic spectral density. Vafa (2016) also uses the same initial approximation as Bui et al. (2016), but applies MAP inference for the inducing points, such that the uncertainty propagated through the layers only represents the quality of the approximation. In the limit of infinitely many inducing points this approach recovers a deterministic radial basis function network. A particle method is used in Wang et al. (2016), again employing an online version of the sparse approximation of Bui et al. (2016) within each layer. Similarly to our approach, in Wang et al. (2016) samples are taken through the conditional model, but differently from us they then use a point estimate for the latent variables.

[5] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
It is not clear how this approach propagates uncertainty through the layers, since the GPs at each layer have point-estimate inputs and outputs.

A pathology of the DGP with a zero mean function for the inner layers was identified in Duvenaud et al. (2014), who suggested concatenating the original inputs at each layer. This approach is followed in Dai et al. (2016) and Cutajar et al. (2017). The linear mean function was originally used by Lázaro-Gredilla (2012), though in the special case of a two-layer DGP with a 1D hidden layer. To the best of our knowledge there has been no previous attempt to use a linear mean function for all inner layers.

6 Discussion

Our experiments show that on a wide range of tasks the DGP model with our doubly stochastic inference is both effective and scalable. Crucially, we observe that on the small datasets the DGP does not overfit, while on the large datasets additional layers generally increase performance and never deteriorate it. In particular, we note that the largest gain from increasing the number of layers is achieved on the largest dataset (the taxi dataset, with 1B points). We note also that in all the large-scale experiments the SGP 500 model is outperformed by all the DGP models. Therefore, for the same computational budget, increasing the number of layers can be significantly more effective than increasing the accuracy of approximate inference in the single-layer model. Other than the additional computation time, which is fairly modest (see Table 3), we do not see downsides to using a DGP over a single-layer GP, but substantial advantages.

While we have considered simple kernels and black-box applications, any domain-specific kernel could be used in any layer. This is in contrast to other methods (Damianou and Lawrence, 2013; Bui et al., 2016; Cutajar et al., 2017) that require specific kernels and intricate implementations.
Our implementation is simple (< 200 lines), publicly available [6], and is integrated with GPflow (Matthews et al., 2017), an open-source GP framework built on top of TensorFlow (Abadi et al., 2015).

7 Conclusion

We have presented a new method for inference in Deep Gaussian Process (DGP) models. With our inference we have shown that the DGP can be used on a range of regression and classification tasks with no hand-tuning. Our results show that in practice the DGP always exceeds or matches the performance of a single-layer GP. Further, we have shown that the DGP often exceeds the single-layer model significantly, even when the quality of the approximation to the single layer is improved. Our approach is highly scalable and benefits from GPU acceleration.

The most significant limitation of our approach is dealing with high-dimensional inner layers. We used a linear mean function for the high-dimensional datasets but left this mean function fixed, as optimizing its parameters would go against our non-parametric paradigm. It would be possible to treat this mapping probabilistically, following the work of Titsias and Lázaro-Gredilla (2013).

Acknowledgments

We have greatly appreciated valuable discussions with James Hensman and Steindor Saemundsson in the preparation of this work. We thank Vincent Dutordoir and the anonymous reviewers for helpful feedback on the manuscript. We are grateful for a Microsoft Azure Scholarship and support through a Google Faculty Research Award to Marc Deisenroth.

[6] https://github.com/ICL-SML/Doubly-Stochastic-DGP

References

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, J. Shlens, B. Steiner, I. Sutskever, P. Tucker, V. Vanhoucke, V. Vasudevan, O. Vinyals, P. Warden, M. Wicke, Y.
Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint:1603.04467, 2015.

P. Baldi, P. Sadowski, and D. Whiteson. Searching for Exotic Particles in High-Energy Physics with Deep Learning. Nature Communications, 2014.

E. V. Bonilla, K. Krauth, and A. Dezfouli. Generic Inference in Latent Gaussian Process Models. arXiv preprint:1609.00577, 2016.

F.-X. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic Integration: A Role for Statisticians in Numerical Analysis? arXiv preprint:1512.00933, 2015.

T. D. Bui, D. Hernández-Lobato, Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. International Conference on Machine Learning, 2016.

R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for Regression. IEEE International Joint Conference on Neural Networks, 2016.

K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Random Feature Expansions for Deep Gaussian Processes. International Conference on Machine Learning, 2017.

M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. IEEE International Conference on Robotics and Automation, 2015.

Z. Dai, A. Damianou, J. González, and N. Lawrence. Variational Auto-encoded Deep Gaussian Processes. International Conference on Learning Representations, 2016.

A. C. Damianou and N. D. Lawrence. Deep Gaussian Processes. International Conference on Artificial Intelligence and Statistics, 2013.

A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian Process Dynamical Systems. Advances in Neural Information Processing Systems, 2011.

M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
International Conference on Machine Learning, 2011.

P. J. Diggle and P. J. Ribeiro. Model-based Geostatistics. Springer, 2007.

D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. International Conference on Machine Learning, 2013.

D. Duvenaud, O. Rippel, R. P. Adams, and Z. Ghahramani. Avoiding Pathologies in Very Deep Networks. Artificial Intelligence and Statistics, 2014.

Y. Gal, Y. Chen, and Z. Ghahramani. Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data. International Conference on Machine Learning, 2015.

R. Garnett, M. Osborne, and S. Roberts. Sequential Bayesian Prediction in the Presence of Change-points. International Conference on Machine Learning, 2009.

C. Guestrin, A. Krause, and A. P. Singh. Near-optimal Sensor Placements in Gaussian Processes. International Conference on Machine Learning, 2005.

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016.

J. Hensman and N. D. Lawrence. Nested Variational Compression in Deep Gaussian Processes. arXiv preprint:1412.1370, 2014.

J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, 2013.

J. Hensman, A. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. Advances in Neural Information Processing Systems, 2015.

D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont. Robust Multi-class Gaussian Process Classification. Advances in Neural Information Processing Systems, 2011.

J. M. Hernández-Lobato and R. Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. International Conference on Machine Learning, 2015.

D. P. Kingma, T. Salimans, and M.
Welling. Variational Dropout and the Local Reparameterization Trick. Advances in Neural Information Processing Systems, 2015.

J. Ko and D. Fox. GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. IEEE Intelligent Robots and Systems, 2008.

K. Krauth, E. V. Bonilla, K. Cutajar, and M. Filippone. AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models. arXiv preprint:1610.05392, 2016.

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. International Conference on Machine Learning, 2007.

N. D. Lawrence and A. J. Moore. Hierarchical Gaussian Process Latent Variable Models. International Conference on Machine Learning, 2007.

M. Lázaro-Gredilla. Bayesian Warped Gaussian Processes. Advances in Neural Information Processing Systems, 2012.

D. J. C. MacKay. Comparison of Approximate Methods for Handling Hyperparameters. Neural Computation, 1999.

A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 2017.

A. G. d. G. Matthews, J. Hensman, R. E. Turner, and Z. Ghahramani. On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. Artificial Intelligence and Statistics, 2016.

C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence. Recurrent Gaussian Processes. International Conference on Learning Representations, 2016.

H. Peng, S. Zhe, and Y. Qi. Asynchronous Distributed Variational Gaussian Processes. arXiv preprint:1704.06735, 2017.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
International Conference on Machine Learning, 2014.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.

M. K. Titsias and N. D. Lawrence. Bayesian Gaussian Process Latent Variable Model. International Conference on Artificial Intelligence and Statistics, 2010.

M. K. Titsias and M. Lázaro-Gredilla. Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression. Advances in Neural Information Processing Systems, 2013.

R. Turner and M. Sahani. Two Problems with Variational Expectation Maximisation for Time-Series Models. Bayesian Time Series Models, 2011.

K. Vafa. Training Deep Gaussian Processes with Sampling. Advances in Approximate Bayesian Inference Workshop, Neural Information Processing Systems, 2016.

Y. Wang, M. Brubaker, B. Chaib-Draa, and R. Urtasun. Sequential Inference for Deep Gaussian Process. Artificial Intelligence and Statistics, 2016.

A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep Kernel Learning. Artificial Intelligence and Statistics, 2016.