{"title": "Slice sampling covariance hyperparameters of latent Gaussian models", "book": "Advances in Neural Information Processing Systems", "page_first": 1732, "page_last": 1740, "abstract": "The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes.", "full_text": "Slice sampling covariance hyperparameters of latent Gaussian models

Iain Murray, School of Informatics, University of Edinburgh
Ryan Prescott Adams, Dept. Computer Science, University of Toronto

Abstract

The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes.

1 Introduction

Many probabilistic models incorporate multivariate Gaussian distributions to explain dependencies between variables. 
Gaussian process (GP) models and generalized linear mixed models are common examples. For non-Gaussian observation models, inferring the parameters that specify the covariance structure can be difficult. Existing computational methods can be split into two complementary classes: deterministic approximations and Monte Carlo simulation. This work presents a method to make the sampling approach easier to apply.

In recent work Murray et al. [1] developed a slice sampling [2] variant, elliptical slice sampling, for updating strongly coupled a-priori Gaussian variates given non-Gaussian observations. Previously, Agarwal and Gelfand [3] demonstrated the utility of slice sampling for updating covariance parameters, conventionally called hyperparameters, with a Gaussian observation model, and questioned the possibility of slice sampling in more general settings. In this work we develop a new slice sampler for updating covariance hyperparameters. Our method uses a robust representation that should work well on a wide variety of problems, has very few technical requirements and little need for tuning, and so should be easy to apply.

1.1 Latent Gaussian models

We consider generative models of data that depend on a vector of latent variables f that are Gaussian distributed with covariance Σ_θ set by unknown hyperparameters θ. These models are common in the machine learning Gaussian process literature [e.g. 4] and throughout the statistical sciences. 
We use standard notation for a Gaussian distribution with mean m and covariance Σ,

    N(f; m, Σ) ≡ |2πΣ|^{−1/2} exp(−(1/2)(f − m)ᵀ Σ^{−1} (f − m)),        (1)

and use f ∼ N(m, Σ) to indicate that f is drawn from a distribution with the density in (1).

Figure 1: (a) Prior draws: shows draws from the prior over f using three different lengthscales in the squared-exponential covariance (2). (b) Lengthscale given f: shows the posteriors over log-lengthscale for these three draws.

The generic form of the generative models we consider is summarized by

    covariance hyperparameters θ ∼ p_h,
    latent variables f ∼ N(0, Σ_θ),
    and a conditional likelihood P(data | f) = L(f).

The methods discussed in this paper apply to covariances Σ_θ that are arbitrary positive definite functions parameterized by θ. However, our experiments focus on the popular case where the covariance is associated with N input vectors {x_n}_{n=1}^N through the squared-exponential kernel,

    (Σ_θ)_{ij} = k(x_i, x_j) = σ_f² exp(−(1/2) Σ_{d=1}^D (x_{d,i} − x_{d,j})² / ℓ_d²),        (2)

with hyperparameters θ = {σ_f², {ℓ_d}}. Here σ_f² is the 'signal variance' controlling the overall scale of the latent variables f. The ℓ_d give characteristic lengthscales for converting the distances between inputs into covariances between the corresponding latent values f.

For non-Gaussian likelihoods we wish to sample from the joint posterior over unknowns,

    P(f, θ | data) = (1/Z) L(f) N(f; 0, Σ_θ) p_h(θ).        (3)

We would like to avoid implementing new code or tuning algorithms for different covariances Σ_θ and conditional likelihood functions L(f).

2 Markov chain inference

A Markov chain transition operator T(z′ ← z) defines a conditional distribution on a new position z′ given an initial position z. The operator is said to leave a target distribution π invariant if π(z′) = ∫ T(z′ ← z) π(z) dz. A standard way to sample from the joint posterior (3) is to alternately simulate transition operators that leave its conditionals, P(f | data, θ) and P(θ | f), invariant. Under fairly mild conditions the Markov chain will equilibrate towards the target distribution [e.g. 5].

Recent work has focused on transition operators for updating the latent variables f given data and a fixed covariance Σ_θ [6, 1]. Updates to the hyperparameters for fixed latent variables f need to leave the conditional posterior,

    P(θ | f) ∝ N(f; 0, Σ_θ) p_h(θ),        (4)

invariant. The simplest algorithm for this is the Metropolis–Hastings operator, see Algorithm 1. Other possibilities include slice sampling [2] and Hamiltonian Monte Carlo [7, 8].

Alternately fixing the unknowns f and θ is appealing from an implementation standpoint. However, the resulting Markov chain can be very slow in exploring the joint posterior distribution. Figure 1a shows latent vector samples using squared-exponential covariances with different lengthscales. 
These samples are highly informative about the lengthscale hyperparameter that was used, especially for short lengthscales. The sharpness of P(θ | f), Figure 1b, dramatically limits the amount that any Markov chain can update the hyperparameters θ for fixed latent values f.

Algorithm 1 M–H transition for fixed f
  Input: current f and hyperparameters θ; proposal dist. q; covariance function Σ().
  Output: next hyperparameters
  1: Propose: θ′ ∼ q(θ′; θ)
  2: Draw u ∼ Uniform(0, 1)
  3: if u < [N(f; 0, Σ_θ′) p_h(θ′) q(θ; θ′)] / [N(f; 0, Σ_θ) p_h(θ) q(θ′; θ)] then
  4:   return θ′        ▷ accept new state
  5: else
  6:   return θ         ▷ keep current state

Algorithm 2 M–H transition for fixed ν
  Input: current state θ, f; proposal dist. q; covariance function Σ(); likelihood L().
  Output: next θ, f
  1: Solve for N(0, I) variate: ν = L_{Σ_θ}^{−1} f
  2: Propose θ′ ∼ q(θ′; θ)
  3: Compute implied values: f′ = L_{Σ_θ′} ν
  4: Draw u ∼ Uniform(0, 1)
  5: if u < [L(f′) p_h(θ′) q(θ; θ′)] / [L(f) p_h(θ) q(θ′; θ)] then
  6:   return θ′, f′    ▷ accept new state
  7: else
  8:   return θ, f      ▷ keep current state

2.1 Whitening the prior

Often the conditional likelihood is quite weak; this is why strong prior smoothing assumptions are often introduced in latent Gaussian models. In the extreme limit in which there is no data, i.e. L is constant, the target distribution is the prior model, P(f, θ) = N(f; 0, Σ_θ) p_h(θ). Sampling from the prior should be easy, but alternately fixing f and θ does not work well because they are strongly coupled. One strategy is to reparameterize the model so that the unknown variables are independent under the prior.

Independent random variables can be identified from a commonly-used generative procedure for the multivariate Gaussian distribution. A vector of independent normals, ν, is drawn independently of the hyperparameters and then deterministically transformed:

    ν ∼ N(0, I),    f = L_{Σ_θ} ν,    where L_{Σ_θ} L_{Σ_θ}ᵀ = Σ_θ.        (5)

Notation: Throughout this paper L_C will be any user-chosen square root of covariance matrix C. While any matrix square root can be used, the lower-diagonal Cholesky decomposition is often the most convenient. We would reserve C^{1/2} for the principal square root, because other square roots do not behave like powers: for example, chol(C)^{−1} ≠ chol(C^{−1}).

We can choose to update the hyperparameters θ for fixed ν instead of fixed f. As the original latent variables f are deterministically linked to the hyperparameters θ in (5), these updates will actually change both θ and f. The samples in Figure 1a resulted from using the same whitened variable ν with different hyperparameters. 
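The whitened construction in (5) is easy to see numerically. The following sketch (ours, with a 1-D squared-exponential covariance) pushes one fixed draw of ν through Cholesky factors for three lengthscales, reproducing the coupled draws of Figure 1a:

```python
import numpy as np

def sq_exp_cov(x, ell, sigma_f=1.0, jitter=1e-6):
    """Squared-exponential covariance (2) on 1-D inputs x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2) + jitter * np.eye(len(x))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
nu = rng.standard_normal(100)  # whitened variables; independent of theta under the prior

# Changing the hyperparameters for fixed nu deterministically changes f = L_Sigma nu:
# the three latent vectors share the same underlying randomness but differ in smoothness.
fs = {ell: np.linalg.cholesky(sq_exp_cov(x, ell)) @ nu for ell in (0.1, 0.5, 2.0)}
```

The short-lengthscale draw is visibly rougher than the long-lengthscale one, even though both are built from the same ν, which is why updating θ for fixed ν moves f in a prior-respecting way.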
They follow the same general trend, but vary over the lengthscales used to construct them.

The posterior over hyperparameters for fixed ν is apparent by applying Bayes rule to the generative procedure in (5), or one can laboriously obtain it by changing variables in (3):

    P(θ | ν, data) ∝ P(θ, ν, data) = P(θ, f = L_{Σ_θ} ν, data) |L_{Σ_θ}| ∝ ··· ∝ L(f(θ, ν)) p_h(θ).        (6)

Algorithm 2 is the Metropolis–Hastings operator for this distribution. The acceptance rule now depends on the latent variables through the conditional likelihood L(f) instead of the prior N(f; 0, Σ_θ), and these variables are automatically updated to respect the prior. In the no-data limit, new hyperparameters proposed from the prior are always accepted.

3 Surrogate data model

Neither of the previous two algorithms is ideal for statistical applications, as illustrated in Figure 2. Algorithm 2 is ideal in the "weak data" limit where the latent variables f are distributed according to the prior. In the example, the likelihoods are too restrictive for Algorithm 2's proposal to be acceptable. In the "strong data" limit, where the latent variables f are fixed by the likelihood L, Algorithm 1 would be ideal. However, the likelihood terms in the example are not so strong that the prior can be ignored.

For regression problems with Gaussian noise the latent variables can be marginalised out analytically, allowing hyperparameters to be accepted or rejected according to their marginal posterior P(θ | data). If latent variables are required they can be sampled directly from the conditional posterior P(f | θ, data). 
To build a method that applies to non-Gaussian likelihoods, we create an auxiliary variable model that introduces surrogate Gaussian observations that will guide joint proposals of the hyperparameters and latent variables.

Figure 2: A regression problem with Gaussian observations illustrated by 2σ gray bars. The current state of the sampler has a short lengthscale hyperparameter (ℓ = 0.3); a longer lengthscale (ℓ = 1.5) is being proposed. The current latent variables do not lie on a straight enough line for the long lengthscale to be plausible. Whitening the prior (Section 2.1) updates the latent variables to a straighter line, but ignores the observations. A proposal using surrogate data (Section 3, with S_θ set to the observation noise) sets the latent variables to a draw that is plausible for the proposed lengthscale while being close to the current state.

We augment the latent Gaussian model with auxiliary variables, g, a noisy version of the true latent variables:

    P(g | f, θ) = N(g; f, S_θ).        (7)

For now S_θ is an arbitrary free parameter that could be set by hand to either a fixed value or a value that depends on the current hyperparameters θ. We will discuss how to automatically set the auxiliary noise covariance S_θ in Section 3.2.

The original model, f ∼ N(0, Σ_θ), and (7) define a joint auxiliary distribution P(f, g | θ) given the hyperparameters. It is possible to sample from this distribution in the opposite order, by first drawing the auxiliary values from their marginal distribution

    P(g | θ) = N(g; 0, Σ_θ + S_θ),        (8)

and then sampling the model's latent values conditioned on the auxiliary values from P(f | g, θ) = N(f; m_{θ,g}, R_θ), where some standard manipulations give:

    R_θ = (Σ_θ^{−1} + S_θ^{−1})^{−1} = Σ_θ − Σ_θ(Σ_θ + S_θ)^{−1}Σ_θ = S_θ − S_θ(S_θ + Σ_θ)^{−1}S_θ,
    m_{θ,g} = Σ_θ(Σ_θ + S_θ)^{−1} g = R_θ S_θ^{−1} g.        (9)

That is, under the auxiliary model the latent variables of interest are drawn from their posterior given the surrogate data g. Again we can describe the sampling process via a draw from a spherical Gaussian:

    η ∼ N(0, I),    f = L_{R_θ} η + m_{θ,g},    where L_{R_θ} L_{R_θ}ᵀ = R_θ.        (10)

We then condition on the "whitened" variables η and the surrogate data g while updating the hyperparameters θ. The implied latent variables f(θ, η, g) will remain a plausible draw from the surrogate posterior for the current hyperparameters. This is illustrated in Figure 2.

We can leave the joint distribution (3) invariant by updating the following conditional distribution derived from the above generative model:

    P(θ | η, g, data) ∝ P(θ, η, g, data) ∝ L(f(θ, η, g)) N(g; 0, Σ_θ + S_θ) p_h(θ).        (11)

The Metropolis–Hastings Algorithm 3 contains a ratio of these terms in the acceptance rule.

3.1 Slice sampling

The Metropolis–Hastings algorithms discussed so far have a proposal distribution q(θ′; θ) that must be set and tuned. 
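The equivalent expressions for R_θ and m_{θ,g} in (8)–(9) can be checked numerically in a few lines. This is our own sanity-check sketch with an arbitrary random SPD Σ and diagonal S, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # stand-in for Sigma_theta (symmetric positive definite)
S = np.diag(rng.uniform(0.5, 2.0, n))    # stand-in auxiliary noise covariance S_theta
g = rng.standard_normal(n)               # stand-in surrogate data

inv = np.linalg.inv
# Three equivalent forms of R_theta from (9):
R1 = inv(inv(Sigma) + inv(S))
R2 = Sigma - Sigma @ inv(Sigma + S) @ Sigma
R3 = S - S @ inv(S + Sigma) @ S
assert np.allclose(R1, R2) and np.allclose(R1, R3)

# Two equivalent forms of the surrogate-posterior mean m_{theta,g} from (9):
m1 = Sigma @ inv(Sigma + S) @ g
m2 = R1 @ inv(S) @ g
assert np.allclose(m1, m2)
```

The second and third forms of R_θ follow from the matrix inversion lemma; in practice one would compute these quantities via Cholesky factorizations rather than explicit inverses.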
The efficiency of the algorithms depends crucially on careful choice of the scale σ of the proposal distribution. Slice sampling [2] is a family of adaptive search procedures that are much more robust to the choice of scale parameter.

Algorithm 3 Surrogate data M–H
  Input: θ, f; prop. dist. q; model of Sec. 3.
  Output: next θ, f
  1: Draw surrogate data: g ∼ N(f, S_θ)
  2: Compute implied latent variates: η = L_{R_θ}^{−1}(f − m_{θ,g})
  3: Propose θ′ ∼ q(θ′; θ)
  4: Compute function f′ = L_{R_θ′} η + m_{θ′,g}
  5: Draw u ∼ Uniform(0, 1)
  6: if u < [L(f′) N(g; 0, Σ_θ′ + S_θ′) p_h(θ′) q(θ; θ′)] / [L(f) N(g; 0, Σ_θ + S_θ) p_h(θ) q(θ′; θ)] then
  7:   return θ′, f′   ▷ accept new state
  8: else
  9:   return θ, f     ▷ keep current state

Algorithm 4 Surrogate data slice sampling
  Input: θ, f; scale σ; model of Sec. 3.
  Output: next f, θ
  1: Draw surrogate data: g ∼ N(f, S_θ)
  2: Compute implied latent variates: η = L_{R_θ}^{−1}(f − m_{θ,g})
  3: Randomly center a bracket: v ∼ Uniform(0, σ), θ_min = θ − v, θ_max = θ_min + σ
  4: Draw u ∼ Uniform(0, 1)
  5: Determine threshold: y = u L(f) N(g; 0, Σ_θ + S_θ) p_h(θ)
  6: Draw proposal: θ′ ∼ Uniform(θ_min, θ_max)
  7: Compute function f′ = L_{R_θ′} η + m_{θ′,g}
  8: if L(f′) N(g; 0, Σ_θ′ + S_θ′) p_h(θ′) > y then
  9:   return f′, θ′
  10: else if θ′ < θ then
  11:   Shrink bracket minimum: θ_min = θ′
  12: else
  13:   Shrink bracket maximum: θ_max = θ′
  14: goto 6

Algorithm 4 applies one possible slice sampling algorithm to a scalar hyperparameter θ in the surrogate data model of this section. It has a free parameter σ, the scale of the initial proposal distribution. However, careful tuning of this parameter is not required. If the initial scale is set to a large value, such as the width of the prior, then the width of the proposals will shrink to an acceptable range exponentially quickly. Stepping-out procedures [2] could be used to adapt initial scales that are too small. We assume that axis-aligned hyperparameter moves will be effective, although reparameterizations could improve performance [e.g. 9].

3.2 The auxiliary noise covariance S_θ

The surrogate data g and noise covariance S_θ define a pseudo-posterior distribution that softly specifies a plausible region within which the latent variables f are updated. The noise covariance determines the size of this region. 
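Algorithm 4 instantiates the generic randomly-centred, shrinking-bracket slice sampler. Stripped of the surrogate-data specifics, one update for a scalar under an arbitrary unnormalized log-density looks like the following sketch (ours, demonstrated on a standard normal target):

```python
import numpy as np

def slice_sample_1d(x, log_f, sigma, rng):
    """One slice-sampling update of scalar x under unnormalized log density log_f.
    A bracket of width sigma is randomly positioned around x and shrinks toward x
    on each rejection, so no careful tuning of sigma is needed."""
    log_y = log_f(x) + np.log(rng.uniform())   # log of the slice height threshold
    v = rng.uniform(0.0, sigma)                # randomly centre the bracket
    x_min, x_max = x - v, x - v + sigma
    while True:
        x_prop = rng.uniform(x_min, x_max)
        if log_f(x_prop) > log_y:
            return x_prop                      # accepted point on the slice
        if x_prop < x:                         # shrink the bracket towards x
            x_min = x_prop
        else:
            x_max = x_prop

rng = np.random.default_rng(3)
log_f = lambda x: -0.5 * x**2                  # standard normal target, unnormalized
xs = [0.0]
for _ in range(5000):
    xs.append(slice_sample_1d(xs[-1], log_f, sigma=10.0, rng=rng))
```

Even with an initial bracket width of 10, far wider than the target's unit scale, the shrinkage steps home in on the slice in a handful of evaluations, which is the robustness property the text describes.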
The first two baseline algorithms of Section 2 result from limiting cases of S_θ = αI: 1) if α = 0 the surrogate data and the current latent variables are equal and the acceptance ratio reduces to that of Algorithm 1; 2) as α → ∞ the observations are uninformative about the current state and the pseudo-posterior tends to the prior. In the limit, the acceptance ratio reduces to that of Algorithm 2. One could choose α based on preliminary runs, but such tuning would be burdensome.

For likelihood terms that factorize, L(f) = ∏_i L_i(f_i), we can measure how much the likelihood restricts each variable individually:

    P(f_i | L_i, θ) ∝ L_i(f_i) N(f_i; 0, (Σ_θ)_{ii}).        (12)

A Gaussian can be fitted by moment matching or a Laplace approximation (matching second derivatives at the mode). Such fits, or close approximations, are often possible analytically and can always be performed numerically as the distribution is only one-dimensional. Given a Gaussian fit to the site-posterior (12) with variance v_i, we can set the auxiliary noise to a level that would result in the same posterior variance at that site alone: (S_θ)_{ii} = (v_i^{−1} − (Σ_θ)_{ii}^{−1})^{−1}. (Any negative (S_θ)_{ii} must be thresholded.) The moment matching procedure is a grossly simplified first step of "assumed density filtering" or "expectation propagation" [10], which are too expensive for our use in the inner-loop of a Markov chain.

4 Related work

We have discussed samplers that jointly update strongly-coupled latent variables and hyperparameters. The hyperparameters can move further in joint moves than their narrow conditional posteriors (e.g., Figure 1b) would allow. A generic way of jointly sampling real-valued variables is Hamiltonian/Hybrid Monte Carlo (HMC) [7, 8]. 
However, this method is cumbersome to implement and tune, and using HMC to jointly update latent variables and hyperparameters in hierarchical models does not itself seem to improve sampling [11].

Christensen et al. [9] have also proposed a robust representation for sampling in latent Gaussian models. They use an approximation to the target posterior distribution to construct a reparameterization where the unknown variables are close to independent. The approximation replaces the likelihood with a Gaussian form proportional to N(f; f̂, Λ(f̂)):

    f̂ = argmax_f L(f),    Λ_{ij}(f̂) = ∂² log L(f) / ∂f_i ∂f_j |_{f̂},        (13)

where Λ is often diagonal, or it was suggested one would only take the diagonal part. This Taylor approximation looks like a Laplace approximation, except that the likelihood function is not a probability density in f. This likelihood fit results in an approximate Gaussian posterior N(f; m_{θ,g=f̂}, R_θ) as found in (9), with noise S_θ = Λ(f̂)^{−1} and data g = f̂. Thinking of the current latent variables as a draw from this approximate posterior, ω ∼ N(0, I), f = L_{R_θ} ω + m_{θ,f̂}, suggests using the reparameterization ω = L_{R_θ}^{−1}(f − m_{θ,f̂}). We can then fix the new variables and update the hyperparameters under

    P(θ | ω, data) ∝ L(f(ω, θ)) N(f(ω, θ); 0, Σ_θ) p_h(θ) |L_{R_θ}|.        (14)

When the likelihood is Gaussian, the reparameterized variables ω are independent of each other and the hyperparameters. The hope is that approximating non-Gaussian likelihoods will result in nearly-independent parameterizations on which Markov chains will mix rapidly.

Taylor expanding some common log-likelihoods around the maximum is not well defined, for example approximating probit or logistic likelihoods for binary classification, or Poisson observations with zero counts. These Taylor expansions could be seen as giving flat or undefined Gaussian approximations that do not reweight the prior. When all of the likelihood terms are flat the reparameterization approach reduces to that of Section 2.1. The alternative S_θ auxiliary covariances that we have proposed could be used instead.

The surrogate data samplers of Section 3 can also be viewed as using reparameterizations, by treating η = L_{R_θ}^{−1}(f − m_{θ,g}) as an arbitrary random reparameterization for making proposals. A proposal density q(η′, θ′; η, θ) in the reparameterized space must be multiplied by the Jacobian |L_{R_θ′}^{−1}| to give a proposal density in the original parameterization. The probability of proposing the reparameterization must also be included in the Metropolis–Hastings acceptance probability:

    min(1, [P(θ′, f′ | data) · P(g | f′, S_θ′) · q(θ; θ′) · |L_{R_θ}^{−1}|] / [P(θ, f | data) · P(g | f, S_θ) · q(θ′; θ) · |L_{R_θ′}^{−1}|]).        (15)

A few lines of linear algebra confirms that, as it must do, the same acceptance ratio results as before. Alternatively, substituting (3) into (15) shows that the acceptance probability is very similar to that obtained by applying Metropolis–Hastings to (14) as proposed by Christensen et al. [9]. 
The differences are that the new latent variables f′ are computed using different pseudo-posterior means and the surrogate data method has an extra term for the random, rather than fixed, choice of reparameterization.

The surrogate data sampler is easier to implement than the previous reparameterization work because the surrogate posterior is centred around the current latent variables. This means that 1) no point estimate, such as the maximum likelihood f̂, is required; 2) picking the noise covariance S_θ poorly may still produce a workable method, whereas a fixed reparameterization can work badly if the true posterior distribution is in the tails of the Gaussian approximation. Christensen et al. [9] pointed out that centering the approximate Gaussian likelihood in their reparameterization around the current state is tempting, but that computing the Jacobian of the transformation is then intractable. By construction, the surrogate data model centers the reparameterization near to the current state.

5 Experiments

We empirically compare the performance of the various approaches to GP hyperparameter sampling on four data sets: one regression, one classification, and two Cox process inference problems. Further details are in the rest of this section, with full code as supplementary material. The results are summarized in Figure 3 followed by a discussion section.

In each of the experimental configurations, we ran ten independent chains with different random seeds, burning in for 1000 iterations and sampling for 5000 iterations. 
We quantify the mixing of the chain by estimating the effective number of samples of the complete data likelihood trace using R-CODA [12], and compare that with three cost metrics: the number of hyperparameter settings considered (each requiring a small number of covariance decompositions with O(n³) time complexity), the number of likelihood evaluations, and the total elapsed time on a single core of an Intel Xeon 3GHz CPU.

The experiments are designed to test the mixing of hyperparameters θ while sampling from the joint posterior (3). All of the discussed approaches except Algorithm 1 update the latent variables f as a side-effect. However, further transition operators for the latent variables for fixed hyperparameters are required. In Algorithm 2 the "whitened" variables ν remain fixed; the latent variables and hyperparameters are constrained to satisfy f = L_{Σ_θ} ν. The surrogate data samplers are ergodic: the full joint posterior distribution will eventually be explored. However, each update changes the hyperparameters and requires expensive computations involving covariances. After computing the covariances for one set of hyperparameters, it makes sense to apply several cheap updates to the latent variables. For every method we applied ten updates of elliptical slice sampling [1] to the latent variables f between each hyperparameter update. One could also consider applying elliptical slice sampling to a reparameterized representation; for simplicity of comparison we do not. Independently of our work, Titsias [13] has used surrogate-data-like reparameterizations to update latent variables for fixed hyperparameters.

Methods We implemented six methods for updating Gaussian covariance hyperparameters. Each method used the same slice sampler, as in Algorithm 4, applied to the following model representations. fixed: fixing the latent function f [14]. 
prior-white: whitening with the prior. surr-site: using surrogate data with the noise level set to match the site posterior (12). We used Laplace approximations for the Poisson likelihood. For classification problems we used moment matching, because Laplace approximations do not work well [15]. surr-taylor: using surrogate data with noise variance set via Taylor expansion of the log-likelihood (13). Infinite variances were truncated to a large value. post-taylor and post-site: as for the surr- methods but a fixed reparameterization based on a posterior approximation (14).

Binary Classification (Ionosphere) We evaluated four different methods for performing binary GP classification: fixed, prior-white, surr-site and post-site. We applied these methods to the Ionosphere dataset [16], using 200 training data and 34 dimensions. We used a logistic likelihood with zero-mean prior, inferring lengthscales as well as signal variance. The -taylor methods reduce to other methods or don't apply because the maximum of the log-likelihood is at plus or minus infinity.

Gaussian Regression (Synthetic) When the observations have Gaussian noise the post-taylor reparameterization of Christensen et al. [9] makes the hyperparameters and latent variables exactly independent. The random centering of the surrogate data model will be less effective. We used a Gaussian regression problem to assess how much worse the surrogate data method is compared to an ideal reparameterization. The synthetic data set had 200 input points in 10-D drawn uniformly within a unit hypercube. The GP had zero mean, unit signal variance and its ten lengthscales in (2) drawn from Uniform(0, √10). Observation noise had variance 0.09. We applied the fixed, prior-white, surr-site/surr-taylor, and post-site/post-taylor methods. 
For Gaussian likelihoods the -site and -taylor methods coincide: the auxiliary noise matches the observation noise (S_θ = 0.09 I).

Cox process inference We tested all six methods on an inhomogeneous Poisson process with a Gaussian process prior for the log-rate. We sampled the hyperparameters in (2) and a mean offset to the log-rate. The model was applied to two point process datasets: 1) a record of mining disasters [17] with 191 events in 112 bins of 365 days; 2) 195 redwood tree locations in a region scaled to the unit square [18] split into 25×25 = 625 bins. The results for the mining problem were initially highly variable. As the mining experiments were also the quickest, we re-ran each chain for 20,000 iterations.

Figure 3: The results of experimental comparisons of six MCMC methods for GP hyperparameter inference on four data sets. Each figure shows four groups of bars (one for each experiment) and the vertical axis shows the effective number of samples of the complete data likelihood per unit cost. The costs are per likelihood evaluation (left), per covariance construction (center), and per second (right). Means and standard errors for 10 runs are shown. Each group of bars has been rescaled for readability: the number beneath each group gives the effective samples for the surr-site method, which always has bars of height 1. Bars are missing where methods are inapplicable (see text).

6 Discussion

On the Ionosphere classification problem both of the -site methods worked much better than the two baselines. We slightly prefer surr-site as it involves less problem-specific derivations than post-site.

On the synthetic test the post- and surr- methods perform very similarly. We had expected the existing post- method to have an advantage of perhaps up to 2–3×, but that was not realized on this particular dataset. 
The post- methods had a slight time advantage, but this is down to implementation details and is not notable.

On the mining problem the Poisson likelihoods are often close to Gaussian, so the existing post-taylor approximation works well, as do all of our new proposed methods. The Gaussian approximations to the Poisson likelihood fit most poorly to sites with zero counts. The redwood dataset discretizes two-dimensional space, leading to a large number of bins. The majority of these bins have zero counts, many more than the mining dataset. Taylor expanding the likelihood gives no likelihood contribution for bins with zero counts, so it is unsurprising that post-taylor performs similarly to prior-white. While surr-taylor works better, the best results here come from using approximations to the site-posterior (12). For unreasonably fine discretizations the results can be different again: the site- reparameterizations do not always work well.

Our empirical investigation used slice sampling because it is easy to implement and use. However, all of the representations we discuss could be combined with any other MCMC method, such as [19] recently used for Cox processes. The new surrogate data and post-site representations offer state-of-the-art performance and are the first such advanced methods to be applicable to Gaussian process classification.

An important message from our results is that fixing the latent variables and updating hyperparameters according to the conditional posterior — as commonly used by GP practitioners — can work exceedingly poorly. Even the simple reparameterization of "whitening the prior" discussed in Section 2.1 works much better on problems where smoothness is important in the posterior. 
Even if site approximations are difficult and the more advanced methods presented are inapplicable, the simple whitening reparameterization should be given serious consideration when performing MCMC inference of hyperparameters.

Acknowledgements

We thank an anonymous reviewer for useful comments. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views. RPA is a junior fellow of the Canadian Institute for Advanced Research.

References

[1] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. Journal of Machine Learning Research: W&CP, 9:541–548, 2010. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).

[2] Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

[3] Deepak K. Agarwal and Alan E. Gelfand. Slice sampling for simulation based fitting of spatial data models. Statistics and Computing, 15(1):61–69, 2005.

[4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[5] Luke Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728, 1994.

[6] Michalis Titsias, Neil D. Lawrence, and Magnus Rattray.
E\ufb03cient sampling for Gaussian\nprocess inference using control variables. In Advances in Neural Information Processing\nSystems 21, pages 1681\u20131688. MIT Press, 2009.\n\n[7] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte\n\nCarlo. Physics Letters B, 195(2):216\u2013222, September 1987.\n\n[8] Radford M. Neal. MCMC using Hamiltonian dynamics. To appear in the Handbook\n\nof Markov Chain Monte Carlo, Chapman & Hall / CRC Press, 2011.\nhttp://www.cs.toronto.edu/~radford/ftp/ham-mcmc.pdf.\n\n[9] Ole F. Christensen, Gareth O. Roberts, and Martin Sk\u02dcald. Robust Markov chain Monte\nCarlo methods for spatial generalized linear mixed models. Journal of Computational\nand Graphical Statistics, 15(1):1\u201317, 2006.\n\n[10] Thomas Minka. Expectation propagation for approximate Bayesian inference. In Pro-\nceedings of the 17th Annual Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI),\npages 362\u2013369, 2001. Corrected version available from\nhttp://research.microsoft.com/~minka/papers/ep/.\n\n[11] Kiam Choo. Learning hyperparameters for neural network models using Hamiltonian\ndynamics. Master\u2019s thesis, Department of Computer Science, University of Toronto,\n2000. Available from http://www.cs.toronto.edu/~radford/ftp/kiam-thesis.ps.\n[12] Mary Kathryn Cowles, Nicky Best, Karen Vines, and Martyn Plummer. R-CODA\n\n0.10-5, 2006. http://www-fis.iarc.fr/coda/.\n\n[13] Michalis Titsias. Auxiliary sampling using imaginary data, 2010. Unpublished.\n\n[14] Radford M. Neal. Regression and classi\ufb01cation using Gaussian process priors. In J. M.\n\nBernardo et al., editors, Bayesian Statistics 6, pages 475\u2013501. OU Press, 1999.\n\n[15] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary\nGaussian process classi\ufb01cation. Journal of Machine Learning Research, 6:1679\u20131704,\n2005.\n\n[16] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker. 
Classi\ufb01cation of radar\nreturns from the ionosphere using neural networks. Johns Hopkins APL Technical\nDigest, 10:262\u2013266, 1989.\n\n[17] R. G. Jarrett. A note on the intervals between coal-mining disasters. Biometrika, 66\n\n(1):191\u2013193, 1979.\n\n[18] Brian D. Ripley. Modelling spatial patterns. Journal of the Royal Statistical Society,\n\nSeries B, 39:172\u2013212, 1977.\n\n[19] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian\nMonte Carlo methods. Journal of the Royal Statistical Society. Series B (Methodologi-\ncal), 2011. To appear.\n\n9\n\n\f", "award": [], "sourceid": 835, "authors": [{"given_name": "Iain", "family_name": "Murray", "institution": null}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}]}