{"title": "MCMC for Variationally Sparse Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1648, "page_last": 1656, "abstract": "Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in sup- port of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs.", "full_text": "MCMC for Variationally Sparse Gaussian Processes\n\nJames Hensman\n\nCHICAS, Lancaster University\n\njames.hensman@lancaster.ac.uk\n\nMaurizio Filippone\n\nEURECOM\n\nmaurizio.filippone@eurecom.fr\n\nAlexander G. de G. Matthews\n\nUniversity of Cambridge\nam554@cam.ac.uk\n\nZoubin Ghahramani\nUniversity of Cambridge\nzoubin@cam.ac.uk\n\nAbstract\n\nGaussian process (GP) models form a core part of probabilistic machine learning.\nConsiderable research effort has been made into attacking three issues with GP\nmodels: how to compute ef\ufb01ciently when the number of data is large; how to ap-\nproximate the posterior when the likelihood is not Gaussian and how to estimate\ncovariance function parameter posteriors. This paper simultaneously addresses\nthese, using a variational approximation to the posterior which is sparse in sup-\nport of the function but otherwise free-form. 
The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper is available at github.com/sparseMCMC.

1 Introduction

Gaussian process models are attractive for machine learning because of their flexible nonparametric nature. By combining a GP prior with different likelihoods, a multitude of machine learning tasks can be tackled in a probabilistic fashion [1]. There are three things to consider when using a GP model: approximation of the posterior function (especially if the likelihood is non-Gaussian); computation, storage and inversion of the covariance matrix, which scales poorly in the number of data; and estimation (or marginalization) of the covariance function parameters. A multitude of approximation schemes have been proposed for efficient computation when the number of data is large. Early strategies were based on retaining a subset of the data [2]. Snelson and Ghahramani [3] introduced an inducing point approach, where the model is augmented with additional variables, and Titsias [4] used these ideas in a variational approach. Other authors have introduced approximations based on the spectrum of the GP [5, 6], or which exploit specific structures within the covariance matrix [7, 8], or by making unbiased stochastic estimates of key computations [9].
In this work, we extend the variational inducing point framework, which we prefer for its general applicability (no specific requirements are made of the data or covariance function), and because the variational inducing point approach can be shown to minimize the KL divergence to the posterior process [10].

To approximate the posterior function and covariance parameters, Markov chain Monte-Carlo (MCMC) approaches provide asymptotically exact approximations. Murray and Adams [11] and Filippone et al. [12] examine schemes which iteratively sample the function values and covariance parameters. Such sampling schemes require computation and inversion of the full covariance matrix at each iteration, making them unsuitable for large problems. Computation may be reduced somewhat by considering variational methods, approximating the posterior using some fixed family of distributions [13, 14, 15, 16, 1, 17], though many covariance matrix inversions are generally required. Recent works [18, 19, 20] have proposed inducing point schemes which can reduce the computation required substantially, though the posterior is assumed Gaussian and the covariance parameters are estimated by (approximate) maximum likelihood.

Table 1: Existing variational approaches

Reference                            | p(y | f)       | Sparse | Posterior            | Hyperparam.
Williams & Barber [21] [also 14, 17] | probit/logit   | ✗      | Gaussian (assumed)   | point estimate
Titsias [4]                          | Gaussian       | ✓      | Gaussian (optimal)   | point estimate
Chai [18]                            | softmax        | ✓      | Gaussian (assumed)   | point estimate
Nguyen and Bonilla [1]               | any factorized | ✗      | Mixture of Gaussians | point estimate
Hensman et al. [20]                  | probit         | ✓      | Gaussian (assumed)   | point estimate
This work                            | any factorized | ✓      | free-form            | free-form
Table 1 places our work in the\ncontext of existing variational methods for GPs.\nThis paper presents a general inference scheme, with the only concession to approximation being\nthe variational inducing point assumption. Non-Gaussian posteriors are permitted through MCMC,\nwith the computational bene\ufb01ts of the inducing point framework. The scheme jointly samples the\ninducing-point representation of the function with the covariance function parameters; with suf\ufb01-\ncient inducing points our method approaches full Bayesian inference over GP values and the covari-\nance parameters. We show empirically that the number of required inducing points is substantially\nsmaller than the dataset size for several real problems.\n\n2 Stochastic process posteriors\nThe model is set up as follows. We are presented with some data inputs X = {xn}N\nn=1 and responses\ny = {yn}N\nn=1. A latent function is assumed drawn from a GP with zero mean and covariance\nfunction k(x, x(cid:48)) with (hyper-) parameters \u03b8. Consistency of the GP means that only those points\nwith data are considered: the latent vector f represents the values of the function at the observed\npoints f = {f (xn)}N\nn=1, and has conditional distribution p(f | X, \u03b8) = N (f | 0, Kf f ), where Kf f\nis a matrix composed of evaluating the covariance function at all pairs of points in X. The data\nlikelihood depends on the latent function values: p(y | f ). 
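The model setup above (a zero-mean GP prior with covariance function k evaluated at the observed inputs, giving p(f | X, θ) = N(f | 0, Kff)) can be sketched numerically. This is an illustrative snippet rather than the paper's code; the RBF kernel choice matches the experiments later in the paper, but the jitter value and variable names are our own assumptions:

```python
import numpy as np

def rbf_kernel(X, X2, variance=1.0, lengthscale=1.0):
    """Evaluate k(x, x') = variance * exp(-0.5 * ||x - x'||^2 / lengthscale^2)."""
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))   # data inputs {x_n}
Kff = rbf_kernel(X, X)                # covariance of f evaluated at all pairs in X
# Draw the latent vector f ~ N(0, Kff); a small jitter keeps the Cholesky stable.
L = np.linalg.cholesky(Kff + 1e-8 * np.eye(50))
f = L @ rng.standard_normal(50)
```

The likelihood p(y | f) is then applied elementwise to this latent vector, whatever its form.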
To make a prediction for latent function values at test points f⋆ = {f(x⋆)}_{x⋆ ∈ X⋆}, the posterior function values and parameters are integrated:

    p(f⋆ | y) = ∫∫ p(f⋆ | f, θ) p(f, θ | y) dθ df.                                (1)

In order to make use of the computational savings offered by the variational inducing point framework [4], we introduce additional input points to the function Z and collect the responses of the function at those points into the vector u = {u_m = f(z_m)}_{m=1}^{M}. With some variational posterior q(u, θ), new points are predicted similarly to the exact solution

    q(f⋆) = ∫∫ p(f⋆ | u, θ) q(u, θ) dθ du.                                        (2)

This makes clear that the approximation is a stochastic process in the same fashion as the true posterior: the length of the predictions vector f⋆ is potentially unbounded, covering the whole domain.

To obtain a variational objective, first consider the support of u under the true posterior, and of f under the approximation. In the above, these points are subsumed into the prediction vector f⋆: from here we shall be more explicit, letting f be the points of the process at X, u be the points of the process at Z and f⋆ be a large vector containing all other points of interest¹. All of the free parameters of the model are then f⋆, f, u, θ, and using a variational framework, we aim to minimize the Kullback-Leibler divergence between the approximate and true posteriors:

    K ≜ KL[q(f⋆, f, u, θ) || p(f⋆, f, u, θ | y)]
      = −E_{q(f⋆, f, u, θ)} [ log ( p(f⋆ | u, f, θ) p(u | f, θ) p(f, θ | y) ) / ( p(f⋆ | u, f, θ) p(f | u, θ) q(u, θ) ) ]     (3)

¹The vector f⋆ here is considered finite but large enough to contain any point of interest for prediction.
The infinite case follows Matthews et al. [10], is omitted here for brevity, and results in the same solution.

In (3) the conditional distributions for f⋆ have been expanded to make clear that they are the same under the true and approximate posteriors, and X, Z and X⋆ have been omitted for clarity. Straightforward identities simplify the expression,

    K = −E_{q(f, u, θ)} [ log ( p(u | f, θ) p(f | θ) p(θ) p(y | f) / p(y) ) / ( p(f | u, θ) q(u, θ) ) ]
      = −E_{q(f, u, θ)} [ log ( p(u | θ) p(θ) p(y | f) ) / q(u, θ) ] + log p(y),                (4)

resulting in the variational inducing-point objective investigated by Titsias [4], aside from the inclusion of θ. This can be rearranged to give the following informative expression

    K = KL[ q(u, θ) || p(u | θ) p(θ) exp{E_{p(f | u, θ)}[log p(y | f)]} / C ] − log C + log p(y).     (5)

Here C is an intractable constant which normalizes the distribution and is independent of q. Minimizing the KL divergence on the right hand side reveals that the optimal variational distribution is

    log ˆq(u, θ) = E_{p(f | u, θ)}[log p(y | f)] + log p(u | θ) + log p(θ) − log C.              (6)

For general likelihoods, since the optimal distribution does not take any particular form, we intend to sample from it using MCMC, thus combining the benefits of variationally-sparse Gaussian processes with a free-form posterior. Sampling is feasible using standard methods since log ˆq is computable up to a constant, using O(N M²) computations. After completing this work, it was brought to our attention that a similar suggestion had been made in [22], though the idea was dismissed because “prediction in sparse GP models typically involves some additional approximations”.
Our presentation of the approximation consisting of the entire stochastic process makes clear that no additional approximations are required. To sample effectively, the following are proposed.

Whitening the prior  Noting that the problem (6) appears similar to a standard GP for u, albeit with an interesting ‘likelihood’, we make use of an ancillary augmentation u = Rv, with RRᵀ = Kuu, v ∼ N(0, I). This results in the optimal variational distribution

    log ˆq(v, θ) = E_{p(f | u=Rv)}[log p(y | f)] + log p(v) + log p(θ) − log C.               (7)

Previously [11, 12] this parameterization has been used with schemes which alternate between sampling the latent function values (represented by v or u) and the parameters θ. Our scheme uses HMC across v and θ jointly, whose effectiveness is examined throughout the experiment section.

Quadrature  The first term in (6) is the expected log-likelihood. In the case of factorization across the data-function pairs, this results in N one-dimensional integrals. For Gaussian or Poisson likelihoods these integrals are tractable; otherwise they can be approximated by Gauss-Hermite quadrature. Given the current sample v, the expectations are computed w.r.t. p(f_n | v, θ) = N(µ_n, γ_n), with:

    µ = Aᵀv;  γ = diag(Kff − AᵀA);  A = R⁻¹Kuf;  RRᵀ = Kuu,                      (8)

where the kernel matrices Kuf, Kuu are computed similarly to Kff, but over the pairs in (X, Z), (Z, Z) respectively.
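Equation (8) and the Gauss-Hermite expectation can be sketched as follows; the function names are hypothetical and this is a minimal illustration, not the authors' implementation:

```python
import numpy as np

def conditional_moments(v, Kuf, Kuu, Kff_diag):
    """Equation (8): mu = A^T v, gamma = diag(Kff - A^T A), with A = R^{-1} Kuf."""
    R = np.linalg.cholesky(Kuu)          # R R^T = Kuu
    A = np.linalg.solve(R, Kuf)          # A = R^{-1} Kuf, O(N M^2)
    mu = A.T @ v
    gamma = Kff_diag - (A * A).sum(0)    # marginal conditional variances
    return mu, gamma

def expected_loglik(mu, gamma, y, loglik, n_gh=20):
    """E_{N(f_n | mu_n, gamma_n)}[log p(y_n | f_n)] by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_gh)
    # change of variables f = mu + sqrt(2 * gamma) * x
    f = mu[:, None] + np.sqrt(2.0 * gamma)[:, None] * x[None, :]
    return (w[None, :] * loglik(f, y[:, None])).sum(1) / np.sqrt(np.pi)
```

For a Gaussian likelihood the expectation is available in closed form, which provides a convenient correctness check on the quadrature.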
From here, one can compute the expected likelihood and it is subse-\nquently straightforward to compute derivatives in terms of Kuf , diag(Kf f ) and R.\n\n\u00b5 = A(cid:62)v; \u03b3 = diag(Kf f \u2212 A(cid:62)A); A = R\u22121Kuf ; RR(cid:62) = Kuu,\n\nReverse mode differentiation of Cholesky To compute derivatives with respect to \u03b8 and Z we use\nreverse-mode differentiation (backpropagation) of the derivative through the Cholesky matrix de-\ncomposition, transforming \u2202 log \u02c6q(v, \u03b8)/\u2202R into \u2202 log \u02c6q(v, \u03b8)/\u2202Kuu, and then \u2202 log \u02c6q(v, \u03b8)/\u2202\u03b8.\nThis is discussed by Smith [23], and results in a O(M 3) operation; an ef\ufb01cient Cython implemen-\ntation is provided in the supplement.\n\n3 Treatment of inducing point positions & inference strategy\n\nA natural question is, what strategy should be used to select the inducing points Z? In the original in-\nducing point formulation [3], the positions Z were treated as parameters to be optimized. One could\ninterpret them as parameters of the approximate prior covariance [24]. The variational formulation\n\n3\n\n\f[4] treats them as parameters of the variational approximation, thus protecting from over-\ufb01tting as\nthey form part of the variational posterior. In this work, since we propose a Bayesian treatment of\nthe model, we question whether it is feasible to treat Z in a Bayesian fashion.\nSince u and Z are auxiliary parameters, the form of their distribution does not affect the marginals of\nthe model. The term p(u| Z) has been de\ufb01ned by the consistency with the GP in order to preserve\nthe posterior-process interpretation above (i.e. u should be points on the GP), but we are free to\nchoose p(Z). Omitting dependence on \u03b8 for clarity, and choosing w.l.o.g. 
q(u, Z) = q(u | Z) q(Z), the bound on the marginal likelihood, similarly to (4), is given by

    L = E_{p(f | u, Z) q(u | Z) q(Z)} [ log ( p(y | f) p(u | Z) p(Z) ) / ( q(u | Z) q(Z) ) ].         (9)

The bound can be maximized w.r.t. p(Z) by noting that this term only appears inside a (negative) KL divergence: −E_{q(Z)}[log q(Z)/p(Z)]. Substituting the optimal p(Z) = q(Z) reduces (9) to

    L = E_{q(Z)} [ E_{p(f | u, Z) q(u | Z)} [ log ( p(y | f) p(u | Z) ) / q(u | Z) ] ],              (10)

which can now be optimized w.r.t. q(Z). Since no entropy term appears for q(Z), the bound is maximized when the distribution becomes a Dirac delta. In summary, since we are free to choose a prior for Z which maximizes the amount of information captured by u, the optimal distribution becomes p(Z) = q(Z) = δ(Z − Ẑ). This formally motivates optimizing the inducing points Z.

Derivatives for Z  For completeness we also include the derivative of the free-form objective with respect to the inducing point positions. Substituting the optimal distribution ˆq(u, θ) into (4) to give ˆK and then differentiating, we obtain

    ∂ˆK/∂Z = −∂ log C/∂Z = −E_{ˆq(v, θ)} [ (∂/∂Z) E_{p(f | u=Rv)}[log p(y | f)] ].           (11)

Since we aim to draw samples from ˆq(v, θ), evaluating this free-form inducing point gradient using samples seems plausible but challenging. Instead we use the following strategy.

1. Fit a Gaussian approximation to the posterior. We follow [20] in fitting a Gaussian approximation to the posterior. The positions of the inducing points are initialized using k-means clustering of the data. The values of the latent function are represented by a mean vector (initialized randomly) and a lower-triangular matrix L forms the approximate posterior covariance as LLᵀ.
For large problems (such as the MNIST experiment), stochastic optimization using AdaDelta is used. Otherwise, LBFGS is used. After a few hundred iterations with the inducing point positions fixed, they are optimized in free form alongside the variational parameters and covariance function parameters.

2. Initialize the model using the approximation. Having found a satisfactory approximation, the HMC strategy takes the optimized inducing point positions from the Gaussian approximation. The initial value of v is drawn from the Gaussian approximation, and the covariance parameters are initialized at the (approximate) MAP value.

3. Tuning HMC. The HMC algorithm has two free parameters to tune, the number of leapfrog steps and the step-length. We follow a strategy inspired by Wang et al. [25], where the number of leapfrog steps is drawn randomly from 1 to Lmax, and Bayesian optimization is used to maximize the expected square jump distance (ESJD), penalized by √Lmax. Rather than allow an adaptive (but convergent) scheme as in [25], we run the optimization for 30 iterations of 30 samples each, and use the best parameters for a long run of HMC.

4. Run tuned HMC to obtain predictions. Having tuned the HMC, it is run for several thousand iterations to obtain a good approximation to ˆq(v, θ). The samples are used to estimate the integral in equation (2). The following section investigates the effectiveness of the proposed sampling scheme.

4 Experiments

4.1 Efficient sampling using Hamiltonian Monte Carlo

This section illustrates the effectiveness of Hamiltonian Monte Carlo in sampling from ˆq(v, θ).
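For reference, a minimal HMC sampler with the randomized number of leapfrog steps described in step 3, run here on a standard-normal target rather than ˆq(v, θ); the returned ESJD estimate is the quantity the Bayesian optimization would maximize. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def hmc(logp, grad_logp, x0, n_samples, eps, L_max, rng):
    """Basic HMC with an identity mass matrix; the number of leapfrog steps is
    drawn uniformly from 1..L_max at each iteration, as in tuning step 3."""
    x = np.array(x0, dtype=float)
    samples, sq_jumps = [], []
    for _ in range(n_samples):
        p = rng.standard_normal(x.size)            # resample momentum
        x_new, p_new = x.copy(), p.copy()
        for _ in range(rng.integers(1, L_max + 1)):
            p_new += 0.5 * eps * grad_logp(x_new)  # leapfrog half-step
            x_new += eps * p_new
            p_new += 0.5 * eps * grad_logp(x_new)
        log_accept = (logp(x_new) - 0.5 * p_new @ p_new) - (logp(x) - 0.5 * p @ p)
        if np.log(rng.uniform()) < log_accept:
            sq_jumps.append(((x_new - x) ** 2).sum())
            x = x_new
        else:
            sq_jumps.append(0.0)
        samples.append(x.copy())
    return np.array(samples), np.mean(sq_jumps)    # samples and ESJD estimate
```

Penalizing the ESJD by √Lmax then trades raw jump distance against the cost of longer trajectories.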
As already pointed out, the form assumed by the optimal variational distribution ˆq(v, θ) in equation (6) resembles the joint distribution in a GP model with a non-Gaussian likelihood.

For a fixed θ, sampling v is relatively straightforward, and this can be done efficiently using HMC [12, 26, 27] or Elliptical Slice Sampling [28]. A well-tuned HMC has been reported to be extremely efficient in sampling the latent variables, and this motivates our effort to extend this efficiency to the sampling of hyper-parameters as well. This is also particularly appealing due to the convenience offered by the proposed representation of the model.

The problem of drawing samples from the posterior distribution over v, θ has been investigated in detail in [11, 12]. In these works, it has been advocated to alternate between the sampling of v and θ in a Gibbs sampling fashion, and to condition the sampling of θ on a suitably chosen transformation of the latent variables. For each likelihood model, we compare efficiency and convergence speed of the proposed HMC sampler with a Gibbs sampler where v is sampled using HMC and θ is sampled using the Metropolis-Hastings algorithm. To make the comparison fair, we constrained the mass matrix in HMC and the covariance in MH to be isotropic, and any parameters of the proposal were tuned using Bayesian optimization. Unlike in the proposed HMC sampler, for the Gibbs sampler we did not penalize the objective function of the Bayesian optimization for large numbers of leapfrog steps, as in this case HMC proposals on the latent variables are computationally cheaper than those on the hyper-parameters. We report efficiency in sampling from ˆq(v, θ) using Effective Sample Size (ESS) and Time Normalized (TN)-ESS.
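ESS can be estimated from the chain's autocorrelations; the paper does not specify its ESS estimator, so the truncation rule below (sum autocorrelations until they first turn negative) is an assumption, shown as a simplified sketch:

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncating the sum at the
    first negative sample autocorrelation (a common simple estimator)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)  # acf[0] == 1
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho < 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```

TN-ESS is then simply this value divided by the wall-clock time of the run.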
In the supplement we include convergence plots based on the Potential Scale Reduction Factor (PSRF), computed from ten parallel chains; each chain is initialized from the VB solution and individually tuned using Bayesian optimization.

4.2 Binary Classification

We first use the image dataset [29] to investigate the benefits of the approach over a Gaussian approximation, and to investigate the effect of changing the number of inducing points, as well as optimizing the inducing points under the Gaussian approximation. The data are 18 dimensional: we investigated the effect of our approximation using both ARD (one lengthscale per dimension) and an isotropic RBF kernel. The data were split randomly into 1000/1019 train/test sets; the log predictive density over ten random splits is shown in Figure 1.

Following the strategy outlined above, we fitted a Gaussian approximation to the posterior, with Z initialized with k-means. Figure 1 investigates the difference in performance when Z is optimized using the Gaussian approximation, compared to just using k-means for Z. Whilst our strategy is not guaranteed to find the global optimum, it is clear that it improves the performance.

The second part of Figure 1 shows the performance improvement of our sampling approach over the Gaussian approximation. We drew 10,000 samples, discarding the first 1000: we see a consistent improvement in performance once M is large enough. For small M, the Gaussian approximation appears to work very well. The supplement contains a similar figure for the case where a single lengthscale is shared: there, the improvement of the MCMC method over the Gaussian approximation is smaller but consistent. We speculate that the larger gains for ARD are due to posterior uncertainty in the lengthscales, which is poorly represented by a point in the Gaussian/MAP approximation.

The ESS and TN-ESS are comparable between HMC and the Gibbs sampler.
In particular, for 100 inducing points and the RBF covariance, ESS and TN-ESS for HMC are 11 and 1.0 · 10⁻³ and for the Gibbs sampler are 53 and 5.1 · 10⁻³. For the ARD covariance, ESS and TN-ESS for HMC are 14 and 5.1 · 10⁻³ and for the Gibbs sampler are 1.6 and 1.5 · 10⁻⁴. Convergence, however, seems to be faster for HMC, especially for the ARD covariance (see the supplement).

4.3 Log Gaussian Cox Processes

We apply our methods to Log Gaussian Cox processes [30]: doubly stochastic models where the rate of an inhomogeneous Poisson process is given by a Gaussian process. The main difficulty for inference lies in that the likelihood of the GP requires an integral over the domain, which is typically intractable. For low dimensional problems, this integral can be approximated on a grid; assuming that the GP is constant over the width of the grid leads to a factorizing Poisson likelihood for each of the grid points. Whilst some recent approaches allow for a grid-free approach [19], these usually require concessions in the model, such as an alternative link function, and do not approach full Bayesian inference over the covariance function parameters.

Figure 1: Performance of the method on the image dataset, with one lengthscale per dimension. Left: box-plots show performance for varying numbers of inducing points and Z strategies. Optimizing Z using the Gaussian approximation offers significant improvement over the k-means strategy. Right: improvement of the MCMC method over the Gaussian approximation, with the same inducing points. The method offers consistent performance gains when the number of inducing points is larger. The supplement contains a similar figure with only a single lengthscale.

Figure 2: The posterior of the rates for the coal mining disaster data. Left: posterior rates using our variational MCMC method and a Gaussian approximation.
Data are shown as vertical bars. Right: posterior samples for the covariance function parameters using MCMC. The Gaussian approximation estimated the parameters as (12.06, 0.55).

Coal mining disasters  We consider the one-dimensional coal-mining disaster data. We held out 50% of the data at random and, using a grid of 100 points with 30 evenly spaced inducing points Z, fitted a Gaussian approximation to the posterior process with an (approximate) MAP estimate for the covariance function parameters (variance and lengthscale of an RBF kernel). With Gamma priors on the covariance parameters, we ran our sampling scheme using HMC, drawing 3000 samples. The resulting posterior approximations are shown in Figure 2, alongside the true posterior using a sampling scheme similar to ours (but without the inducing point approximation). The free-form variational approximation matches the true posterior closely, whilst the Gaussian approximation misses important detail. The approximate and true posteriors over covariance function parameters are shown in the right hand part of Figure 2; there is minimal discrepancy in the distributions.

Over 10 random splits of the data, the average held-out log-likelihood was −1.229 for the Gaussian approximation and −1.225 for the free-form MCMC variant; the average difference was 0.003, and the MCMC variant was always better than the Gaussian approximation. We attribute this improved performance to marginalization of the covariance function parameters.

Efficiency of HMC is greater than for the Gibbs sampler; ESS and TN-ESS for HMC are 6.7 and 3.1 · 10⁻² and for the Gibbs sampler are 9.7 and 1.9 · 10⁻².
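The grid approximation described at the start of this section can be sketched as a factorizing Poisson log-likelihood; the function name and cell parameterization below are our own illustration, not the paper's code:

```python
import numpy as np
from math import lgamma

def lgcp_grid_loglik(f, counts, cell_width):
    """Grid approximation for a log Gaussian Cox process: the GP f is assumed
    constant over each cell, so each cell i contributes an independent Poisson
    term with rate lambda_i = exp(f_i) * cell_width:
        log p(y | f) = sum_i [ y_i * log(lambda_i) - lambda_i - log(y_i!) ]."""
    lam = np.exp(f) * cell_width
    return float((counts * np.log(lam) - lam).sum()
                 - sum(lgamma(c + 1.0) for c in counts))
```

Because the terms factorize across cells, this likelihood fits directly into the quadrature framework of Section 2 (and is in fact tractable without quadrature, as noted there for Poisson likelihoods).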
Also, chains converge within a few thousand iterations for both methods, although convergence for HMC is faster (see the supplement).

Figure 3: Pine sapling data. From left to right: reported locations of pine saplings; posterior mean intensity on a 32x32 grid using full MCMC; posterior mean intensity on a 32x32 grid (with sparsity, using 225 inducing points); posterior mean intensity on a 64x64 grid (using 225 inducing points).

Pine saplings  The advantages of the proposed approximation are prominent as the number of grid points becomes larger, an effect emphasized with increasing dimension of the domain. We fitted a similar model to the above to the pine sapling data [30].

We compared the sampling solution obtained using 225 inducing points on a 32 x 32 grid to the gold standard full MCMC run with the same prior and grid size. Figure 3 shows that the agreement between the variational sampling and full sampling is very close. However, the variational method was considerably faster: using a single core on a desktop computer required 3.4 seconds to obtain one effective sample for a well-tuned variational method, whereas it took 554 seconds for well-tuned full MCMC. This effect becomes even larger as we increase the resolution of the grid to 64 x 64, which gives a better approximation to the underlying smooth function, as can be seen in Figure 3.
It took 4.7 seconds to obtain one effective sample for the variational method, but the gold standard MCMC comparison was now computationally extremely challenging to run for even a single HMC step. This is because it requires linear algebra operations using O(N³) flops with N = 4096.

4.4 Multi-class Classification

To do multi-class classification with Gaussian processes, one latent function is defined for each of the classes. The functions are defined a-priori independent, but covary a posteriori because of the likelihood. Chai [18] studies a sparse variational approximation to the softmax multi-class likelihood restricted to a Gaussian approximation. Here, following [31, 32, 33], we use a robust-max likelihood. Given a vector fn containing K latent functions evaluated at the point xn, the probability that the label takes the integer value yn is 1 − ε if yn = argmax fn and ε/(K − 1) otherwise. As Girolami and Rogers [31] discuss, the ‘soft’ probit-like behaviour is recovered by adding a diagonal ‘nugget’ to the covariance function. In this work, ε was fixed to 0.001, though it would also be possible to treat this as a parameter for inference. The expected log-likelihood is E_{p(fn | v, θ)}[log p(yn | fn)] = p log(1 − ε) + (1 − p) log(ε/(K − 1)), where p is the probability that the labelled function is largest, which is computable using one-dimensional quadrature. An efficient Cython implementation is contained in the supplement.

Toy example  To investigate the proposed posterior approximation for the multivariate classification case, we turn to the toy data shown in Figure 4. We drew 750 data points from three Gaussian distributions. The synthetic data was chosen to include non-linear decision boundaries and ambiguous decision areas.
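The probability p that the labelled function is largest can be computed with one-dimensional quadrature, as stated in the previous section; a minimal sketch, assuming independent Gaussian marginals N(µ_k, γ_k) for the K latents and with hypothetical function names:

```python
import numpy as np
from math import erf

def prob_largest(mu, gamma, y, n_gh=30):
    """P(f_y = argmax_k f_k) under independent marginals N(mu_k, gamma_k):
    p = ∫ N(t | mu_y, gamma_y) * prod_{k != y} Phi((t - mu_k) / sqrt(gamma_k)) dt,
    evaluated by one-dimensional Gauss-Hermite quadrature over f_y."""
    x, w = np.polynomial.hermite.hermgauss(n_gh)
    t = mu[y] + np.sqrt(2.0 * gamma[y]) * x        # quadrature points for f_y
    phi = lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0)))  # standard normal CDF
    prod = np.ones_like(t)
    for k in range(len(mu)):
        if k != y:
            prod *= np.array([phi((ti - mu[k]) / np.sqrt(gamma[k])) for ti in t])
    return float((w * prod).sum() / np.sqrt(np.pi))

def robust_max_ell(mu, gamma, y, K, eps=1e-3):
    """Expected log-likelihood p log(1 - eps) + (1 - p) log(eps / (K - 1))."""
    p = prob_largest(mu, gamma, y)
    return p * np.log(1.0 - eps) + (1.0 - p) * np.log(eps / (K - 1.0))
```

With two exchangeable latents the probability is exactly one half, which provides a simple sanity check on the quadrature.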
Figure 4 shows that there are differences between the variational and sampling solutions, with the sampling solution being more conservative in general (the contours of 95% confidence are smaller). As one would expect, at the decision boundary there are strong correlations between the functions which could not be captured by the Gaussian approximation we are using. Note the movement of inducing points away from k-means and towards the decision boundaries.

Efficiency of HMC and the Gibbs sampler is comparable. In the RBF case, ESS and TN-ESS for HMC are 1.9 and 3.8 · 10⁻⁴ and for the Gibbs sampler are 2.5 and 3.6 · 10⁻⁴. In the ARD case, ESS and TN-ESS for HMC are 1.2 and 2.8 · 10⁻³ and for the Gibbs sampler are 5.1 and 6.8 · 10⁻⁴. In both cases, the Gibbs sampler struggles to reach convergence even though the average acceptance rates are similar to those recommended for the two samplers individually.

Figure 4: A toy multiclass problem. Left: the Gaussian approximation; colored points show the simulated data, lines show posterior probability contours at 0.3, 0.95, 0.99. Inducing point positions shown as black points. Middle: the free-form solution with 10,000 posterior samples. The free-form solution is more conservative (the contours are smaller). Right: posterior samples for v at the same position but across different latent functions. The posterior exhibits strong correlations and edges.

MNIST  The MNIST dataset is a well studied benchmark with a defined training/test split. We used 500 inducing points, initialized from the training data using k-means.

Figure 5: Left: three k-means centers used to initialize the inducing point positions. Center: the positions of the same inducing points after optimization. Right: the difference.

A Gaussian approximation was optimized using minibatch-based optimization over the means and variances of q(u), as well as the inducing points and covariance function parameters. The accuracy on the held-out data was 98.04%, significantly improving on previous approaches to classify these digits using GP models.

For binary classification, Hensman et al. [20] reported that their Gaussian approximation resulted in movement of the inducing point positions toward the decision boundary. The same effect appears in the multivariate case, as shown in Figure 5, which shows three of the 500 inducing points used in the MNIST problem. The three examples were initialized close to the many six digits, and after optimization have moved close to other digits (five and four). The last example still appears to be a six, but has moved to a more ‘unusual’ six shape, supporting the function at another extremity. Similar effects are observed for all inducing-point digits. Having optimized the inducing point positions with the approximate q(v) and estimate for θ, we used these optimal inducing points to draw samples of v and θ. This did not result in an increase in accuracy, but did improve the log-density on the test set from −0.068 to −0.064. Evaluating the gradients for the sampler took approximately 0.4 seconds on a desktop machine, and we were easily able to draw 1000 samples. This dataset size has generally been viewed as challenging in the GP community and consequently there are not many published results to compare with. One recent work [34] reports a 94.05% accuracy using variational inference and a GP latent variable model.

5 Discussion

We have presented an inference scheme for general GP models. The scheme significantly reduces the computational cost whilst approaching exact Bayesian inference, making minimal assumptions about the form of the posterior.
The improvements in accuracy in comparison with the Gaussian approximation of previous works have been demonstrated, as has the quality of the approximation to the hyper-parameter distribution. Our MCMC scheme was shown to be effective for several likelihoods, and we note that the automatic tuning of the sampling parameters worked well over hundreds of experiments. This paper shows that MCMC methods are feasible for inference in large GP problems, addressing the unfair stereotype of ‘slow’ MCMC.

Acknowledgments JH was funded by an MRC fellowship, AM and ZG by EPSRC grant EP/I036575/1 and a Google Focussed Research award.

References

[1] T. V. Nguyen and E. V. Bonilla. Automated variational inference for Gaussian process models. In NIPS, pages 1404–1412, 2014.

[2] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Comp., 14(3):641–668, 2002.

[3] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages 1257–1264, 2005.

[4] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, pages 567–574, 2009.

[5] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. Figueiras-Vidal. Sparse spectrum Gaussian process regression. JMLR, 11:1865–1881, 2010.

[6] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. arXiv preprint 1401.5508, 2014.

[7] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In NIPS, pages 3626–3634, 2014.

[8] S. Särkkä. Bayesian Filtering and Smoothing, volume 3. Cambridge University Press, 2013.

[9] M. Filippone and R. Engler. Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE). In ICML, 2015.

[10] A. G. D. G.
Matthews, J. Hensman, R. E. Turner, and Z. Ghahramani. On sparse variational methods and the KL divergence between stochastic processes. arXiv preprint 1504.07027, 2015.

[11] I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In NIPS, pages 1732–1740, 2010.

[12] M. Filippone, M. Zhong, and M. Girolami. A comparative evaluation of stochastic-based inference methods for Gaussian process models. Mach. Learn., 93(1):93–114, 2013.

[13] M. N. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. IEEE Trans. Neural Netw., 11(6):1458–1464, 2000.

[14] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Comp., 21(3):786–792, 2009.

[15] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. JMLR, 6:1679–1704, 2005.

[16] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. JMLR, 9:2035–2078, 2008.

[17] E. Khan, S. Mohamed, and K. P. Murphy. Fast Bayesian inference for non-conjugate Gaussian process regression. In NIPS, pages 3140–3148, 2012.

[18] K. M. A. Chai. Variational multinomial logit Gaussian process. JMLR, 13(1):1745–1808, 2012.

[19] C. Lloyd, T. Gunter, M. A. Osborne, and S. J. Roberts. Variational inference for Gaussian process modulated Poisson processes. In ICML, 2015.

[20] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351–360, 2014.

[21] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell., 20(12):1342–1351, 1998.

[22] M. K. Titsias, N. Lawrence, and M. Rattray. Markov chain Monte Carlo algorithms for Gaussian processes. In D. Barber, A. T. Chiappa, and S.
Cemgil, editors, Bayesian Time Series Models, 2011.

[23] S. P. Smith. Differentiation of the Cholesky algorithm. J. Comp. Graph. Stat., 4(2):134–147, 1995.

[24] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

[25] Z. Wang, S. Mohamed, and N. De Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In ICML, volume 28, pages 1462–1470, 2013.

[26] J. Vanhatalo and A. Vehtari. Sparse log Gaussian processes via MCMC for spatial epidemiology. In Gaussian Processes in Practice, volume 1, pages 73–89, 2007.

[27] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis-Hastings algorithms. JRSS:B, 67(2):253–268, 2005.

[28] I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling. In AISTATS, volume 9, 2010.

[29] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Mach. Learn., 42(3):287–320, 2001.

[30] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scand. Stat., 25(3):451–482, 1998.

[31] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comp., 18:2006, 2005.

[32] H. Kim and Z. Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE TPAMI, 28(12):1948–1959, 2006.

[33] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont. Robust multi-class Gaussian process classification. In NIPS, pages 280–288, 2011.

[34] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS,
2014.
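Supplementary sketch: the ESS and TN-ESS figures quoted for the toy multiclass experiment can be estimated from a one-dimensional MCMC trace with a standard autocorrelation-based estimator. The code below is a generic sketch (initial-positive-sequence truncation of the autocorrelation sum), not the paper's own implementation; the function names are illustrative, and the actual code at github.com/sparseMCMC may use a different estimator.

```python
import numpy as np

def effective_sample_size(chain):
    """Estimate the effective sample size (ESS) of a 1-D MCMC trace.

    Uses the FFT to compute the autocorrelation function, then sums
    lags until the autocorrelation first goes negative (a common
    truncation heuristic). This is a sketch, not the paper's code.
    """
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    x = chain - chain.mean()
    # Zero-padded FFT gives the linear (non-circular) autocorrelation.
    f = np.fft.rfft(x, n=2 * n)
    acf = np.fft.irfft(f * np.conj(f), n=2 * n)[:n]
    acf /= acf[0]
    # Integrated autocorrelation time: tau = 1 + 2 * sum_k rho(k).
    tau = 1.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

def tn_ess(chain, wall_clock_seconds):
    """Time-normalized ESS: effective samples per second of compute."""
    return effective_sample_size(chain) / wall_clock_seconds
```

An i.i.d. trace has ESS close to its length, while a strongly autocorrelated chain (such as the Gibbs sampler's slowly mixing trace described above) has a much smaller ESS; dividing by wall-clock time makes the HMC and Gibbs figures comparable.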
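Supplementary sketch: the held-out log-densities reported for MNIST (improving from -0.068 to -0.064 after sampling) are per-test-point averages of the Monte Carlo predictive density, log (1/S) Σ_s p_s(y_n | x_n), averaged over test points. A minimal version of that estimator is below; the function and argument names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def mc_test_log_density(prob_samples):
    """Monte Carlo estimate of the average held-out log predictive density.

    prob_samples: array of shape (S, N), where prob_samples[s, n] is the
    predicted probability of the n-th test label under the s-th posterior
    sample of (v, theta). Shapes and names are illustrative.
    """
    p = np.asarray(prob_samples, dtype=float)
    # Average the predictive probabilities over samples first, then take
    # logs: log (1/S) sum_s p_s(y_n | x_n), and average over test points.
    per_point = np.log(p.mean(axis=0))
    return per_point.mean()
```

Averaging probabilities before taking the log (rather than averaging log-probabilities) is what makes this a proper Monte Carlo estimate of the posterior predictive density, and it is why sampling can improve the test log-density even when accuracy is unchanged.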